The Wolfram Institute recently received a grant from the Templeton World Charity Foundation for “Computational Metaphysics”. I wrote this piece in part as a launching point for discussions with experts in traditional philosophy.

“What ultimately is there?” has always been seen as a fundamental—if thorny—question for philosophy, or perhaps theology. But despite a couple of millennia of discussion, I think it’s fair to say that only modest progress has been made with it. But maybe, just maybe, this is the moment where that’s going to change—and on the basis of surprising new ideas and new results from our latest efforts in science, it’s finally going to be possible to make real progress, and in the end to build what amounts to a formal, scientific approach to metaphysics.
It all centers around the ultimate foundational construct that I call the ruliad—and how observers like us, embedded within it, must perceive it. And it’s a story of how—for observers like us—fundamental concepts like space, time, mathematics, laws of nature, and indeed, objective reality, must inevitably emerge.
Traditional philosophical thinking about metaphysical questions has often become polarized into strongly opposing views. But one of the remarkable things we’ll see here is that with what we learn from science we’ll often be able to bring together these opposing views—typically in rather unexpected ways.
I should emphasize that my goal here is to summarize what we can now say about metaphysics on the basis of our recent progress in science. It’ll be very valuable to connect this to historical positions and historical thinking in philosophy and theology—but that’s not something I’m going to attempt to do here. I should also say that I’m going to concentrate on the major intellectual arc of what one can think of as a new scientific approach to metaphysics; the technical details of the science I’ve mostly already discussed elsewhere.

We’re going to begin our journey by talking about the traditional objective of physics: to find abstract theories that describe what we observe and measure in the physical world. From the history of physics we’ve come to expect that such theories will always end up being at best successive approximations. But the new possibility raised by our Physics Project is that we may now finally have reached the end: a truly fundamental theory of physics, that provides a complete description of the lowest-level “machine code” of our universe.
Already in antiquity the question arose of whether the universe is ultimately a continuum or is made of discrete atomic elements. By the end of the nineteenth century it was finally established that matter, at least, consists of discrete elements. And soon it became clear that light could be thought of in the same way. But what about space? Ever since Euclid, it had been assumed that space was a continuum. And efforts in the early twentieth century to see whether it, like matter, might be discrete did not work out.
But a century later, building on new, computationally inspired ideas, our Physics Project starts from the concept that space is not just a simple continuum. Instead, it’s a complicated discrete structure that in fact represents every aspect of our universe—both what we normally think of as space, and everything in it. There are many ways one can imagine describing this structure. A convenient one is to say that it consists of a very large number of discrete intrinsically identical “atoms of space”—that one can think of as being like disembodied geometrical points—whose only property (other than being distinct) is how they’re abstractly related to other atoms of space. In other words, we imagine describing the whole structure of the universe in terms of the pattern of relations between the atoms of space. And it’s convenient to represent this as a hypergraph whose nodes are atoms of space, and whose hyperedges define the relations between them. (If relations are only between pairs of nodes, this becomes an ordinary graph.)
An important piece of intuition that comes from our practical experience with computers is that it’s possible to represent everything we deal with in terms of bits. But when we also want to represent the structure of space it’s better to think not in terms of bits in some predetermined arrangement, but instead in terms of the lower level and more flexible “data structure” defined by a hypergraph.
So how can the universe as we normally perceive it emerge from this? It’s very much analogous to what happens with matter. For example, even though something like water consists of discrete molecules, the aggregate effect of them is to produce seemingly continuous fluid behavior. But then—still made up of the same underlying molecules—we can have discrete eddies in the fluid, analogous in the case of space to particles like electrons (or, for that matter, black holes).

If there’s a hypergraph that’s the ultimate “data structure” of the universe, what are the algorithms that get applied to it? Just as we imagine the data structure to consist of discrete elements, so also we imagine that changes to it occur by discrete events. And for now we can imagine that there’s some fixed rule that determines these elementary events. For example, the rule might be that whenever a piece of the hypergraph has some specified form, it should be replaced by a piece of hypergraph with some other specified form.
We can think of the application of such a rule as corresponding to the computation of the “next state” of the universe from the previous one. And if the rule is repeatedly applied, it will generate a whole sequence of updated states of the universe. And we can then identify the progression of these states as corresponding to the progression of time in the universe.
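As a minimal illustration of this kind of updating (written here in Python purely for the purposes of this piece, with a toy rule chosen for simplicity rather than taken from the Physics Project itself), one can store the hypergraph as a list of binary relations between integer atoms, and repeatedly apply the rule {x, y} → {x, z}, {z, y}, where z is a newly created atom. For simplicity the sketch rewrites every match in one synchronized step, whereas in the full setup the elementary events are applied asynchronously:

```python
# Toy hypergraph rewriting: the hypergraph is stored as a list of binary
# relations (ordered pairs of integer "atoms of space"). The illustrative rule
# says: wherever a relation {x, y} exists, replace it with {x, z} and {z, y},
# where z is a newly created atom.

def step(edges, next_atom):
    """Apply the toy rule to every relation present at the start of the step."""
    new_edges = []
    for x, y in edges:
        z = next_atom              # create a fresh atom of space
        next_atom += 1
        new_edges += [(x, z), (z, y)]
    return new_edges, next_atom

edges = [(0, 1)]                   # initial state: one relation between two atoms
next_atom = 2

for t in range(6):
    print(f"t = {t}: {len(edges)} relations, first few: {edges[:3]}")
    edges, next_atom = step(edges, next_atom)
```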
It’s notable that in this setup space and time are, at least at the outset, different kinds of things. Space is associated with the structure of the hypergraph, yet time is associated with computation on it.
Still, just as the hypergraph defines relations between atoms of space, we can imagine a causal graph that defines “causal relations” between events. Any particular event can be thought of as taking some collection of atoms of space as “inputs”, and producing some other collection of atoms of space as “outputs”. But this then implies a causal relation between events: any event that uses as input an atom of space that was generated as output by another event can be thought of as “causally dependent” on that other event.
And the whole pattern of these causal relations ultimately defines a causal graph for all events in the universe—that in a sense encodes the structure of the universe in both space and time.
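Here is a sketch of how such a causal graph can be read off, continuing the same toy {x, y} → {x, z}, {z, y} rule: each event records which relation it consumed and which relations it produced, and a causal edge is added whenever an event consumes a relation produced by an earlier event. (This is only an illustrative reconstruction, not the machinery actually used in the Physics Project.)

```python
# Building a causal graph for the toy rule {x, y} -> {x, z}, {z, y}.
# Each event consumes one relation and produces two; event B is causally
# dependent on event A if B consumes a relation that A produced.

edges = {0: (0, 1)}        # relation id -> (atom, atom); one initial relation
producer = {0: None}       # relation id -> id of the event that produced it
next_edge, next_atom, next_event = 1, 2, 0
causal = []                # list of (earlier event, later event) pairs

for _ in range(3):                          # a few rounds of updating
    for edge_id in list(edges):             # snapshot of the current relations
        x, y = edges.pop(edge_id)
        event = next_event; next_event += 1
        if producer[edge_id] is not None:   # depends on an earlier event
            causal.append((producer[edge_id], event))
        z = next_atom; next_atom += 1
        for rel in [(x, z), (z, y)]:
            edges[next_edge] = rel
            producer[next_edge] = event
            next_edge += 1

print("causal graph edges:", causal)
```

For this simple one-relation-in, two-relations-out rule the causal graph comes out as a plain binary tree; more complicated rules produce far more intricate causal structures.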
But given such a causal graph, can we reconstruct a series of hypergraphs from it? We can think of such hypergraphs as representing successive “instantaneous states of space”. And—just like in relativity—it turns out that there isn’t a unique possible such sequence of states. Instead, there are many different sequences, all consistent with the underlying causal graph—and corresponding in traditional physics terms to different relativistic reference frames.
In effect, therefore, we can think of the causal graph as being the “true representation” of information about the universe. Any particular “reconstructed” sequence of hypergraphs inevitably involves arbitrary choices.
When we introduced the causal graph, we talked about building it by starting from a particular hypergraph, and then looking at the effect of applying rules to it. It turns out there's a lot of choice in both the initial hypergraph and how we apply the rules. But the point is that (as a result of the phenomenon of causal invariance) essentially all these choices lead to the same causal graph.
We might have imagined that given a fundamental theory of physics we should be able to ask what the universe in some sense “statically is”. But what we’re discovering is that we should instead be talking about the processes that happen in the universe—as represented by the causal graph.
We can identify the passage of time as the progression of events in the causal graph. But why is there even something like space? Ultimately it turns out to be a reflection of the “entanglement” of different sequences of events in the causal graph—and its structure is in effect a map of the relations between these sequences of events (a structure which can conveniently be represented by a hypergraph).
Imagine starting from one event in the causal graph, then tracing a sequence of events that depend on it. We can think of the successive events as occurring progressively later in time. But what about two events that are both immediate successors of a given event? What is their relationship? The key idea is that even though these “sibling” events occur “at the same time”, they are still separated—in what we can think of as space.
But how then is “space as a whole” formed? Ultimately it’s something very dynamic. And indeed it’s the continual occurrence of events in the universe that “knits together” the structure of space. Without such “activity”, there would be nothing we could coherently consider as “space”.
At the level of atoms of space there is nothing permanent in the universe; in every elementary event, atoms of space are destroyed, and new ones created. But somehow at an aggregate level there is a certain stability to what emerges. It’s again a little like with fluids, where the microscopic motions of huge numbers of underlying molecules lead in the aggregate to the laws of fluid mechanics.
But what then are the aggregate laws that emerge from large numbers of hypergraph updates? Remarkably enough, they almost inevitably turn out to be exactly the Einstein equations: the equations that seem to govern the large-scale structure of spacetime. So even though what’s “there underneath” is just what we might think of as “abstract” atoms of space and rules for rewriting relations between them, what emerges is something that reproduces familiar elements of what we think of as “physical reality”.

If there’s a rule that can ultimately reproduce the behavior of the universe, how complicated a rule does that need to be? Our traditional intuition—say from experience from engineering—is that one needs a complicated rule if one wants to produce complicated behavior. But my big discovery from the early 1980s is that this isn’t the case—and that in fact it’s perfectly possible even for extremely simple underlying rules (like my favorite “rule 30”) to produce behavior of immense complexity.
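For concreteness, here is rule 30 itself, written as a few lines of Python (any language would do): each cell's new value is read off from the binary digits of the number 30, indexed by the previous values of the cell and its two neighbors, with cyclic boundary conditions. Starting from a single black cell, the center of the pattern quickly comes to look effectively random:

```python
# Rule 30 elementary cellular automaton: the new value of a cell is the bit of
# the number 30 indexed by the previous values of the cell and its two
# neighbors (cyclic boundary conditions).

RULE, WIDTH, STEPS = 30, 61, 30

cells = [0] * WIDTH
cells[WIDTH // 2] = 1                      # start from a single black cell

for _ in range(STEPS):
    print("".join("#" if c else " " for c in cells))
    cells = [(RULE >> (4 * cells[i - 1] + 2 * cells[i] + cells[(i + 1) % WIDTH])) & 1
             for i in range(WIDTH)]
```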
But why ultimately does this happen? We can think of running a rule as being like running a program, or, in other words, like doing a computation. But how sophisticated is that computation? We might have thought that different rules would do incomparably different computations. But the existence of universal computation—discovered a century ago—implies that in fact there’s a class of universal rules that can effectively emulate any other rule (and this is why, for example, software is possible).
But actually there’s a lot more that can be said. And in particular my Principle of Computational Equivalence implies that essentially whenever one sees a system whose behavior is not obviously simple, the system will actually be doing a computation that is in some sense as sophisticated as it can be. In other words, sophisticated computation isn’t just a feature of specially set up “computer-like” systems; it’s ubiquitous, even among systems with simple underlying rules.
So what does this mean? It’s often considered a goal of science to be able to predict what systems will do. But to make such a prediction requires in a sense being able to “jump ahead” of the behavior of the system itself. But the Principle of Computational Equivalence tells us that this won’t in general be possible—because the system we’re trying to predict will typically be just as computationally sophisticated as whatever system we’re using to predict it. And the result of this is the phenomenon of computational irreducibility.
You can always find out what a system will do just by explicitly running its rules step by step. But if the system is computationally irreducible there’ll be no general way to shortcut this, and to find the result with reduced computational effort.
Computational irreducibility is what irreducibly separates underlying rules from the behavior they produce. And it’s what causes even simple rules to be able to generate behavior that cannot be “decoded” except by irreducibly great computational effort—and therefore will be considered random by an observer with bounded computational capabilities.
Computational irreducibility is also what in a sense makes time something “real”. We discussed above that the passage of time corresponds to the progressive application of computational rules. Computational irreducibility is what makes that process “add up to something”. And the Principle of Computational Equivalence is what tells us that there’s something we can think of as time that is in effect “pure, irreducible computation” independent of the system in which we’re studying it.
It’s very much the same story with space. Computational irreducibility in general leads to a certain “uniform effective randomness” in the structure of hypergraphs, which is what allows us to imagine that there’s a definite “substrate independent” concept of space.
There’s a close analogy here to what happens in something like a fluid. At a molecular level there are lots of molecular collisions going on. But the point is that this is a computationally irreducible process—whose end result is enough “uniform effective randomness” that we can meaningfully talk about the properties of the fluid “in bulk”, as a thing in itself, without having to mention that it’s made of molecules.
So how does all this relate to our original metaphysical question of what there ultimately is? Computational irreducibility introduces the idea that there’s something robust and invariant about “pure computation”—something that doesn’t depend on the details of what’s “implementing” that computation. Or, in other words, that there’s a sense in which it’s meaningful to talk about things simply being “made of computation”.

In talking about things like a hypergraph representing space and everything in it, we’re giving in a sense an objective description of the universe “from the outside”. But what ultimately matters to us is not what’s “in principle out there”, but rather what we actually perceive. And indeed we can think of science as being first and foremost a way to find narrative descriptions of certain aspects of what’s out there, descriptions compact enough to fit in our minds.
But given computational irreducibility, why is this even possible? Why are there ever, for example, “laws of nature” which let us make predictions about things, even with the bounded amount of computation that our finite minds can do?
The answer is related to an inevitable and fundamental feature of computational irreducibility: that within any computationally irreducible process there must always be an infinite number of pockets of computational reducibility. In other words, even though computational irreducibility makes it irreducibly difficult to say everything about what a system will do, there will always be pockets of reducibility which allow one to say certain things about it. And it’s such pockets of reducibility that our processes of perception—and our science—make use of.
Once again we can use fluid dynamics as an example. Even though the detailed pattern of underlying molecular motions in a fluid is computationally irreducible, there are still computationally simple overall laws of fluid flow—that we can think of as being associated with pockets of computational reducibility. And from our point of view as computationally bounded observers, we tend to think of these as the laws of the fluid.
In other words, the laws we attribute to a system depend on our capabilities as observers. Consider the Second Law of thermodynamics, and imagine starting from some simple configuration, say of gas molecules. The dynamics of these molecules will generically correspond to a computationally irreducible process—whose outcome to a computationally bounded observer like us will seem “increasingly random”. Of course, if we were not computationally bounded, then we’d be able to “decode” the whole underlying computationally irreducible process, and we wouldn’t believe in the presence of seemingly increasing randomness, or, for that matter, the Second Law. But—regardless of any details—as soon as we’re computationally bounded, we’ll immediately perceive the Second Law.
We might have assumed that the Second Law was some kind of intrinsic law of nature—directly related to what there ultimately is. But what we see is that the Second Law is something that emerges because of us, and our characteristics as observers, and in particular our computational boundedness.
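As a deliberately crude illustration of this observer dependence, here is a toy in Python: particles all start in one corner of a box, and a coarse-grained observer records only the entropy of how they are distributed across a handful of bins. (The microscopic dynamics here is just a random walk, not the deterministic, computationally irreducible dynamics of real molecules, so this illustrates only the coarse-graining side of the story.)

```python
import math
import random

# A toy "gas": particles on a line of cells, all starting in the leftmost cell.
# The microscopic state is the exact position of every particle; a
# coarse-grained observer keeps track only of how many particles fall in each
# of a few bins, and of the Shannon entropy of that bin distribution.

random.seed(0)
N_PARTICLES, N_CELLS, N_BINS, STEPS = 200, 50, 5, 5000

positions = [0] * N_PARTICLES              # simple initial configuration

def coarse_entropy(positions):
    counts = [0] * N_BINS
    for p in positions:
        counts[p * N_BINS // N_CELLS] += 1
    return -sum(c / N_PARTICLES * math.log(c / N_PARTICLES) for c in counts if c)

for step in range(STEPS + 1):
    if step % 1000 == 0:
        print(f"step {step:5d}   coarse-grained entropy = {coarse_entropy(positions):.3f}")
    positions = [min(N_CELLS - 1, max(0, p + random.choice((-1, 1))))
                 for p in positions]       # each particle hops left or right
```

Running this prints a coarse-grained entropy that starts at zero and climbs toward log 5 ≈ 1.61, the maximum for five equal bins, even though nothing in the full microscopic description is ever “lost”.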
There are other things that also work this way—for example, our belief in a coherent notion of space. At the lowest level we imagine that there’s a discrete hypergraph being updated through what’s ultimately a computationally irreducible process. But as computationally bounded observers we only perceive certain aggregate features—that correspond in effect to a pocket of computational reducibility associated with our simple, continuous perception of space.
When we think about spacetime—and for example about deriving its relativistic properties—there’s another feature of us as observers that also turns out to be important: the fact that we assume that we are persistent in time, and that—even though we might be made of different atoms of space at every successive moment of time—we can still successfully knit together perceptions at successive moments of time to form a single thread of experience. In a sense this is a “simplification” forced upon us by our computational boundedness. But it’s also in many ways at the core of what we think of as our notion of consciousness (which is something I’ve written about at some length elsewhere).
The Principle of Computational Equivalence implies that sophisticated computation is ubiquitous—and certainly not something special to brains. And indeed it seems that brains actually concentrate on a specific—and in many ways limited—form of computation. They take in large amounts of sensory data, and in effect compress it to derive what’s ultimately a thin stream of actions for us to take. At a biological level, there’s always all sorts of activity going on across the billions of neurons in our brains. But our brains are, it seems, specially constructed to concentrate all that activity down to what’s essentially a single thread of thought, action and “experience”. And it’s the fact that this is a single thread that seems to give us our sense of coherent existence, and in effect, of consciousness.

Traditional classical physics talks about definite things happening in the universe—say a projectile following a definite path, determined by its laws of motion. But quantum mechanics instead talks about many paths being followed—specifying only probabilities for their various outcomes.
In the history of physics, quantum mechanics was a kind of “add-on”. But in our Physics Project it’s immediately essential, and unavoidable. Because the rules that we define simply say that whenever there is a piece of a hypergraph that matches a particular pattern, it should be transformed. But in general there will be many such matches—each one producing a different transformation, and each one in effect initiating what we can think of as a different path of history. And in addition to such branching, there can also be merging—when different transformations end up producing the same hypergraph.
We can represent all these branching and merging paths of history by what I call a multiway graph. And we can think of such a multiway graph as giving a complete description of “what happens” in the universe.
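Here is a minimal sketch of this branching and merging, using a toy string substitution system (with a couple of rules chosen arbitrarily for illustration) in place of hypergraph rewriting: each state is a string, every possible application of a rule at every possible position gives a branch, and identical resulting strings are merged into a single node of the multiway graph:

```python
# Toy multiway system: string rewrite rules applied in all possible ways.
# Each possible application gives a branch; identical resulting strings are
# merged into a single node of the multiway graph.

RULES = [("A", "AB"), ("BB", "A")]      # arbitrary toy rules for illustration

def successors(state):
    """All states reachable by applying one rule at one position."""
    result = set()
    for lhs, rhs in RULES:
        start = 0
        while (i := state.find(lhs, start)) != -1:
            result.add(state[:i] + rhs + state[i + len(lhs):])
            start = i + 1
    return result

def multiway(initial, steps):
    """Nodes and directed edges of the multiway graph after a number of steps."""
    nodes, links, frontier = {initial}, set(), {initial}
    for _ in range(steps):
        next_frontier = set()
        for state in frontier:
            for nxt in successors(state):
                links.add((state, nxt))
                if nxt not in nodes:
                    nodes.add(nxt)
                    next_frontier.add(nxt)
        frontier = next_frontier
    return nodes, links

nodes, links = multiway("A", 5)
print(len(nodes), "states,", len(links), "multiway edges")
```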
But as we discussed above, observers like us maintain just a single thread of experience. And that means we can’t directly perceive a whole multiway graph. Instead, we have to effectively pick out just one path from it. But which path will it be? At the level of the formalism of quantum mechanics—or of our Physics Project—the only thing we talk about is the whole collection of all paths. So something else must determine the path.
In physical space, we’re used to the idea that we as observers are localized at a particular position, and only get to directly perceive what’s around where we are. Across all of physical space, there are lots of things going on. But because of where we happen to be, we only get to directly perceive a tiny sample of them.
So is something similar going on in picking paths of history from the multiway graph? It seems that it is. If we take a slice across the multiway graph at any particular time, we’ll have lots of “dangling ends” of paths of history, each associated with a different state of the universe. But inevitably there are lots of relations between these states. (For example, two states might have an immediate common ancestor.) And it turns out that we can think of the states as being laid out in what we can call “branchial space”.
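Continuing the same toy string system, one crude way to sketch such a “branchial slice” is to evolve the multiway system for a fixed number of steps and then join any two states on the final slice that share an immediate common ancestor on the previous slice. (The actual construction is more careful about foliations and ancestry; this is just meant to convey the flavor.)

```python
from itertools import combinations

# Sketch of a "branchial slice" for the same toy string-rewrite system:
# evolve for a few steps, then connect any two states on the final slice
# that share an immediate common ancestor on the previous slice.

RULES = [("A", "AB"), ("BB", "A")]      # same arbitrary toy rules as before

def successors(state):
    out = set()
    for lhs, rhs in RULES:
        start = 0
        while (i := state.find(lhs, start)) != -1:
            out.add(state[:i] + rhs + state[i + len(lhs):])
            start = i + 1
    return out

slices = [{"A"}]                        # slice t = states reached at step t
for _ in range(5):
    slices.append({s for state in slices[-1] for s in successors(state)})

branchial = set()                       # pairs of states related in branchial space
for parent in slices[-2]:
    siblings = sorted(successors(parent) & slices[-1])
    branchial.update(combinations(siblings, 2))

print("states on the final slice:", sorted(slices[-1]))
print("branchial edges:", sorted(branchial))
```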
And just like in physical space, we can expect that we as observers are localized in branchial space. So that means that even though there are at some level many different paths of history, we only get to perceive ones that are around “where we are”. And just like there’s no “theory” that tells us where we find ourselves in physical space (which planet, which galaxy, etc.), the same is true in branchial space. One day we might have some way to describe our location in branchial space, but for now the best we can do is say that it’s “random”.
And this, I believe, is why outcomes in quantum mechanics seem to us random. The whole multiway graph is completely determined (as wave functions etc. are even in the standard formalism of quantum mechanics). But which part of the multiway graph we as observers sample depends on where we are in branchial space.
And we can expect that just as we humans are all close together in physical space, so are we in branchial space. And this means that even though in the abstract the result of, say, some particular quantum measurement might seem “random”, all human observers—being nearby in branchial space—will tend to agree on what that result is, and at least among them, there’ll be something they can consider “objective reality”.

The remarkable implication of our Physics Project is that our whole universe, in all its richness, can emerge just from the repeated application of a simple underlying rule. But which rule? How would it be selected?
The idea of the ruliad is to imagine that no selection is needed—because all rules are being used. And the ruliad is what comes out: the entangled limit of all possible computational processes.
We discussed in the context of quantum mechanics the idea that a given rule can get applied in multiple ways, leading to multiple paths of history. The ruliad takes this idea to the limit, applying not just one rule in all possible ways, but all possible rules in all possible ways.
We can imagine representing the ruliad by a giant multiway graph—in which there is a path that represents any conceivable specific computation. And what fundamentally gives the ruliad structure is that these paths can not only branch but also merge—with mergers happening whenever different paths lead to equivalent states.
At first we can think of the ruliad as being built from all possible hypergraph rules in our Physics Project. But the Principle of Computational Equivalence implies that actually we can use any type of rule as our basis: since the ruliad contains all possible computational processes its final form will be the same.
In other words, however we end up representing it, the intrinsic form of the ruliad is still the same. Once we have the concept of computation (or of following rules), the ruliad is an inevitable consequence. In some sense it is the ultimate closure of the concept of computation: the unique object that encapsulates all possible computational processes and the inevitable relations between them.
We got to the ruliad by thinking about physics, and about the ultimate infrastructure of our physical universe. But the ruliad is something much more general than that. It’s an abstract object that captures everything that is computationally formalizable, along with the elaborate structure of relations between such things.
Of course, the idea that the ruliad can describe our actual physical universe is ultimately just a hypothesis—though one that’s strongly encouraged by the success of our Physics Project.
How could it be wrong? Well, our universe could involve hypercomputation—which is not captured by the ruliad as we’ve defined it. And we might have to consider a whole hierarchy of possible hyperruliads. (Though as we’ll see, any effects from this would likely be beyond anything observers like us could perceive.)
But assuming that the ruliad is the ultimate infrastructure for everything we can then ask what it’s made of. At some level we could just say it’s made of abstract computational processes. But what are those processes operating on? Again, abstract things. But we can imagine decomposing those abstract things. And while inevitably there will be different ways to do this, it’ll often be convenient to imagine that they consist of relations between ultimate, indivisible objects—which we can describe as “atoms of existence”, or what I’ve called “emes”.
In our Physics Project, we identified emes with atoms of space. But in talking about the ruliad in general, we can think of them just as the “ultimate raw material for existence”. Emes have no structure of their own. And indeed the only intrinsic thing one can say about them is that they are distinct: in a sense they are elementary units of identity. And we can then think of it being the relations between them that build up the ruliad—and everything it underlies.

Our original metaphysical question was: “What ultimately is there?” And at some level our science has now led us to an answer: the ruliad is everything there ultimately is.
But what about what there is for us? In other words, what about what there ultimately is in what we perceive and experience? Inevitably, we as observers must be part of the ruliad. And our “inner experiences” must similarly be represented within the ruliad. But in and of itself that’s not enough to tell us much about what those experiences might be. And we might imagine that to work this out, we’d need to know a lot of the particular details of our construction, and our place in the ruliad.
But what’s emerged in the last few years is that in many important ways, we don’t. And instead just knowing certain coarse features of us as observers already implies a lot about what we must experience. In particular, if we assume that we are observers who are computationally bounded, and believe we are persistent in time, then, we argue, it is inevitable that we will perceive certain laws to be operating—and those laws turn out to be exactly the three central laws of twentieth-century physics: general relativity, quantum mechanics, and the Second Law of thermodynamics.
It’s a remarkable claim: the laws of physics we observe don’t just happen to be the way they are; they are inevitable for observers with the general characteristics we have. At the level of the underlying ruliad the laws of physics that we might observe are not determined. But as soon as we know something about what we’re like as observers, then we necessarily end up with our familiar laws of physics.
In a sense, therefore, the laws of physics that we experience are the way they are because we are observers that are the way we are. We already discussed this above in the case of the Second Law. And although we don’t yet know all the details, the basic conclusion is that by combining the abstract structure of the ruliad with our assumptions about what we’re like as observers, we are able to derive all three of the familiar core laws of physics from the twentieth century.
It’s worth emphasizing that what we can immediately derive are in a sense “general laws”. We know that spacetime has a certain overall structure, and its dynamics satisfy the Einstein equations. But we don’t, for example, know why the universe as we perceive it has (at least approximately) 3 dimensions of space—though my guess is that many such features of observed physics can ultimately be traced to features of the way we are as observers.
So what about observers not like us? They’re still part of the ruliad. But in a sense they’re sampling it in a different way. And they’ll potentially perceive quite different laws of physics.
It’s a very fundamental observation about our universe that we perceive it to follow fairly simple laws. But in a sense this too is just a feature of our nature as observers. Because given our computational boundedness we couldn’t really make use of—or even identify—any laws that were not in some sense simple.
The fact that simple laws are possible can be viewed as a reflection of the inevitable presence of pockets of computational reducibility within any computationally irreducible process. But it’s our computational boundedness as observers that causes us to pick them out. If we were not computationally bounded, then we could operate at the level of raw computational irreducibility, and there would be no need to pick out simple laws.
At the outset, we might have imagined that the laws of physics would somehow fundamentally be at the root of the question of “what ultimately is there?” But what we’re seeing is that actually these laws are in a sense higher-level constructs, whose form depends on our characteristics as observers. And to get to our original metaphysical question, we have to “drill down” beyond our perceived laws of physics to their “computational infrastructure”, and ultimately all the way to the ruliad.

When we ask what there ultimately is, we’re in some sense implicitly assuming that there actually is something definite—or in effect that there’s a single ultimate “objective reality”. But is that actually how things work, or does every observer, for example, in effect “have their own reality”?
In our approach, there’s a quite nuanced answer. At the very lowest level there is a single ultimate objective reality that knits everything together—and it’s the ruliad. But meanwhile, different observers can in principle experience different things. But as soon as we’re dealing with observers even vaguely like us (in the sense that they share our computational boundedness, and our belief in our own persistence) we’ve argued that it’s inevitable that they’ll always experience the core laws of physics as we know them. In other words, these laws in effect represent a single objective reality—at least across observers even vaguely like us.
But what about more detailed features of our experience? No doubt some we’ll be able to “objectively derive” on the basis of characteristics we identify as shared across all “observers like us”. But at some level, different observers will always have different experiences—not least because, for example, they are typically operating at different places in space, and indeed in general at different places in the ruliad.
Still, our everyday impression is that even though the detailed experiences, say, of different people looking at the same scene may be different, those experiences can nevertheless reasonably be thought of as all derived from the same “underlying objective reality”. So why is this? Essentially I think it’s because human observers are all very nearby in the ruliad—so they’re in a sense all sampling the same tiny part of the ruliad.
Observers at different places in the ruliad in effect sample different threads of history, that operate according to different rules. But the Principle of Computational Equivalence tells us that—just as it’s always possible to translate from one universal computational system to another—it’ll always in the end be possible to translate between what observers get by sampling at different places in the ruliad.
The difficulty of translation depends, though, on how far one is trying to go in the ruliad. Human minds exposed to similar knowledge, culture, etc. are nearby and fairly easy to translate between. Animal minds are further away, and more difficult to translate to. And when it comes to something like the weather, then even though in principle it’s computationally equivalent, the distance one has to go in the ruliad to reach it is sufficiently great that translation is very difficult.
Translation between places in the ruliad is in a sense just a generalization of translation in physical space. And the process of moving in physical space is what we describe as motion. But what actually is motion? In effect it’s having something move to a different place in space while still “being the same thing”. In our Physics Project, though, something must be made of different atoms of space if it’s at a different place in space. But somehow there must be some pattern of atoms of space that—a bit like an eddy in a fluid—one can say represents “the same thing” at different places in space.
And potentially one can think of particles—like electrons or photons—as being in a sense “elementary carriers of pure motion”: minimal objects that can move without changing.
But how does this work more generally in the ruliad? What is it that can “move” between observers, or between minds, without changing? Essentially it seems to be concepts (often in basic form represented by words). Within for example one human brain a thought corresponds to some complicated pattern of neural activity. But what allows it to be “moved” to another brain is “packaging it up” into a “concept” that can be unpacked by another brain. And at some level it’s this kind of communication that “aligns observers” to have similar inner experiences—to the point where they can be viewed as reflecting a common objective reality.
But all of this somehow presupposes that there are many observers—whose experiences can be thought of as “triangulating” to a common objective reality. If there were just one observer, though, there’s no triangulation to do, and one might imagine that all that would matter is the inner experience of that one observer.
So in a sense the very notion that we can usefully talk about objective reality is a consequence of there being many similar observers. And of course in the specific case of us humans there are indeed billions of us.
But from a fundamental point of view, why should there be many similar observers, or even any observers at all? As we discussed above, the core abstract characteristic of an observer is its ability to equivalence many possible inputs to produce a small set of possible outputs. And—although we don’t yet know how to do it—we can imagine that it would be possible to derive the fact that there must be a certain density of structures that do this within the ruliad. Could there inevitably be enough similar observers to be able to reasonably triangulate to an objective reality?
If we take the example of biology (or modern technology), it seems like what’s critical in generating large numbers of similar observers is some form of replication. And so, surprising as it might seem for something as apparently fundamental as this, it appears that our impression of the existence of objective reality is actually intimately tied up with the rather practical biological phenomenon of self-replication.
OK, so what should we in the end think about objective reality? We might have imagined that having a scientific theory of the universe would immediately imply a certain objective reality. And indeed at the level of the ruliad that’s true. But what we’ve seen is that even to get our familiar laws of physics we need an observer “parsing” the raw ruliad. In other words, without the observer we can’t even talk about fundamental concepts in physics. But the point is that for a very wide range of observers even vaguely like us, many details of the observer don’t matter; certain things—like core laws of physics—inevitably and “objectively” emerge.
But the laws of physics don’t determine everything an observer perceives. Some things are inevitably determined by the particular circumstances of the observer: their position in space, in the ruliad, etc. But now the point is that observers—like us humans—are nearby enough in space, the ruliad, etc. that our perceptions will be to a large extent aligned, so that we can again usefully attribute them to what we can think of as an external objective reality.

In thinking about what there ultimately is, an obvious question is whether whatever there is has always been there—and will always be—or whether instead there’s in effect a beginning—and end—to time.
As we discussed above, in our computational paradigm, the passage of time is associated with the progressive computation of successive states of the universe. But the important point is that these states embody everything—including any potential observers. So there can never be a situation where an observer could say “the universe hasn’t started yet”—because if the universe hasn’t started, nor will the observer have.
But why does the universe start at all? We’ll say more about that later. But suffice it to say here that the ruliad in effect contains all possible abstract computations, each consisting of some chain of steps that follow from each other. There’s nothing that has to “actively start” these chains: they are just abstract constructs that inevitably follow from the definition of the ruliad.
There’ll be a beginning to each chain, though. But the ruliad contains all possible beginnings, or in other words, all possible initial states for computations. One might wonder, given all of this, how the ruliad can still have any kind of coherent structure. The answer, as we discussed above, is the entanglement of different threads of computation: the threads are not independent, but are related by the merging of equivalent states.
Of course one can then ask why equivalent states are in fact merged. And this is immediately a story about observers. One can imagine a raw construction of the ruliad in which every different thread of computation independently branches. But anytime states generated in different threads are identical, any observer will equivalence them. So this means that to any observer, these threads will be merged—and there will effectively be entanglement in the ruliad.
We’ve said that the passage of time corresponds to the progression of computation. And given this, we can imagine that the ruliad is built up “through time”, by progressively applying appropriate rules. But actually we don’t need to think of it this way. Because once a procedure for the construction of the ruliad is defined, it’s inevitable that the whole structure of the ruliad is, at least in principle, immediately determined.
In other words, we can imagine building up the ruliad step by step through time. Or we can imagine that the ruliad in some sense all immediately “just exists”. But the point is that to computationally bounded observers these are basically equivalent. In the first case the observer is “pulled along” by the irreducible computation that’s “happening anyway” to move forward the “frontier” of the ruliad. In the second case, the observer in a sense has to actively explore the “already-formed” ruliad, but because of the observer’s computational boundedness, can do so only at a certain limited rate—so that once again there is something corresponding to the passage of time, again tied to computational irreducibility.
But what happens if one includes the fact that there are threads of computation in the ruliad starting from all possible initial states? Well, a bounded observer will only be able to probe all these states and their behaviors at some limited rate. So even if “from outside the ruliad” (if one could be there) one might see infinitely many different beginnings of the universe, any computationally bounded observer embedded in the ruliad would perceive only a limited set. And, indeed, depending a bit on the scales involved, the observer might well be able to conflate those limited possibilities into the perception of just a single, finite beginning of the universe—even though, underneath, there’s much more going on in the whole ruliad.
(One thing one might wonder is this: since the ruliad contains every possible rule, why can’t there just be a single, very complicated rule that creates our whole universe in a single step? The answer is that in principle there can be. But computationally bounded observers like us will never perceive it; our “narrative about the universe” can only be built from computationally bounded steps.)
Another subtlety concerns the relationship of time to the equivalencing of states. Imagine that in the ruliad (or indeed just in a causal graph) a particular state is generated repeatedly when rules are applied. We can expect an observer to equivalence these different instances of the state—thus in effect forming a loop in the progression of states. Quite possibly such loops are associated with phenomena in quantum field theory. But to an observer like us such loops will be happening at a level “below” perceived time.
We talked about the beginning of time. What about the end? If the passage of time is the progression of computation, can the computation simply halt? The answer for any specific computation is yes. A particular rule might, for example, simply not apply anywhere in a given hypergraph. And that means that in effect time stops for that hypergraph. And indeed this is what presumably happens at the center of a black hole (at least in the simplest case).
But what about the whole ruliad? Inevitably parts of it will “keep running”, even if some threads of computation in it stop. But the question is what an observer will perceive of that.
Normally we’ve just taken it for granted that an observer does whatever they do forever. But in reality, as something embedded in the ruliad, an observer will at some level have to “navigate computational irreducibility” to maintain itself. And whether it’s because a biological observer dies, or because an observer ends up in a black hole, we can expect the actual span of experience of individual observers to be limited, in effect defining an end of time for the observer, even if not for the whole ruliad.

In our discussion of what there ultimately is, an obvious question is why there’s ultimately anything at all. Or, more specifically, why does our universe exist? Why is there something rather than nothing?
One might have imagined that there’d be nothing one could say about such questions in the framework of science. But it turns out that in the context of the ruliad there’s actually quite a lot one can say.
The key point is that the ruliad can be thought of as a necessary, abstract object. Given a definition of its elements it inevitably has the structure it has. There’s no choice about it. It’s like in mathematics: given the definitions of 1, +, etc.,
And so it is with the ruliad. The ruliad has to be the way it is. Every detail of it is abstractly determined. Or, in other words, at least as an abstract object, it necessarily exists.
But why, we might ask, is it actualized? We can imagine all sorts of formal systems with all sorts of structure. But why is the ruliad what is actualized to give us the physical world we experience?
The key here is to think about what we operationally mean by actualized. And the point is that it’s not something absolute; it’s something that depends on us as observers. After all, the only thing we can ever ultimately know about is our own inner experience. And for us something is then “actualized” if we can—as we discussed above—successfully “triangulate our experiences” to let us consider it to have an objective reality.
At some level, the ruliad is an abstract thing. And our inner experiences are abstract things. And we’re saying that there’s a certain abstract necessity to the way these things are linked. With what amounts to a description of our physical world being a necessary intermediate step in that linking.
Given its definition, it’s immediately inevitable that the ruliad must exist as an abstract object. But what about observers like us? We know ourselves that we exist from the inner experiences we have. But is it necessary that we exist? Or, put another way, is it inevitable that somewhere in the ruliad there must be structures that correspond to observers like us?
Well, that’s a question we can now study as a matter of science. And ultimately we can imagine an abstract derivation of the density of different levels of observers in the ruliad. To get to observers like us requires—as we discussed above—all sorts of details, probably including features from biology, like self-replication. But if we require only the features of computational boundedness and a belief in persistence, there are no doubt many more “observer structures” in the ruliad.
It’s interesting to consider those “alien minds” distributed across the ruliad. In traditional searches for extraterrestrial intelligence one is seeking to bridge distances in physical space. But likely the distances across the ruliad—in rulial space—are vastly greater. And in effect the “minds” are more alien (like the “mind” of the weather)—and to “communicate” with them will require, in effect, an effort of translation that involves an immense amount of irreducible computation.
But for us, and our science, what matters is our own experience. And given our knowledge that we exist, the existence of the ruliad seems to make it in effect inevitable that we must consider the universe to exist.
One wrinkle to mention concerns generalizations of the ruliad. We’ve said that the ruliad encapsulates all possible computational processes. But by this we mean processes that can be implemented on one of our traditional models of computation—like Turing machines. But what about hypercomputations that would require an infinite number of steps for a Turing machine? One can imagine a whole hierarchy of hyperruliads based on these. And one could imagine observers embedded not in the ordinary ruliad, but in some hyperruliad. So what would be the experience of such observers? They’d never be able to perceive anything outside their hyperruliad, and in fact one can expect that through their own hypercomputations their perception of their own hyperruliad would be essentially equivalent to our perception of the ordinary ruliad—so that in the end there’s no perceptible distinction between being in the ruliad and in a hyperruliad: the “same universe”, with the same laws of physics we know, exists in both.
When one starts talking about the universe operating at the lowest level according to computational rules people sometimes seem to think that means our universe must ultimately be “running on a computer”. But to imagine that is essentially to misunderstand the whole concept of theoretical science. For the idea in theoretical science is to construct abstract models that allow one to reproduce certain aspects of what systems do. It’s not that the systems themselves mechanistically implement the models; it’s just that the models abstractly reproduce aspects of what the systems do.
And so it is with our model for physics. It’s not that somewhere “inside the universe” there’s a computer moving bits around to rearrange hypergraphs. Instead, it’s just that abstractly rearranging hypergraphs is a way (and, no doubt, not the only one) of representing what’s happening in the universe.
Typically the models one makes in science only aim to be approximate: they capture certain aspects one cares about in a system, and idealize away all others. But our Physics Project is different, because its goal is to make a model that—at least in principle—can reproduce in perfect detail what happens in the universe, without approximation or idealization. But what we have is still just a model: in effect, a way of making a bridge from what actually happens in the universe to what we can describe in essentially human terms.
There’s a little more subtlety when it comes to the whole ruliad. Because while the ruliad is precise and complete, the sampling of it that determines what we experience depends on our characteristics as observers, about which we’ll never be able to be completely precise. And what’s more, as we’ve discussed, while the ruliad is defined in an abstract way, it is what is in effect actualized for observers like us—to provide our physics and what we consider to be our objective reality.
But could all of that somehow still be a “simulation” running on some lower-level infrastructure? Not in any meaningful sense. In talking about “simulation” we’re implicitly imagining that, in effect, the ruliad is running in one place, and other things are running elsewhere. But the ruliad encapsulates the totality of all computational processes. So in a sense there’s no room for anything outside the ruliad—and the only thing the ruliad can “run on” is itself.
Still, when it comes to observers like us, we sample only some tiny part of the ruliad—and in some sense there’s a choice about what part that is. Indeed, insofar as we view ourselves as having free will and choosing freely what observations to make (and perhaps what our own structure should be), we are in control of that choice. Our basic nature as observers will nevertheless determine some of what we experience—most notably core laws of physics. But beyond that it’s our choices as observers that effectively determine “which possible program” we’re “running” in the ruliad. So if we think of such programs as “simulations” running on the ruliad, then it’s not as if there’s some outside entity that’s picking the programs; it’s our nature and our choices that are doing it.

We’ve talked a lot about what there ultimately is in the “concrete” physical world. But what about the “abstract” mathematical world? One of the surprising things about the ruliad is that it implies a remarkably close connection between the ultimate foundations of physics and of mathematics.
In our Physics Project we had to start by inventing a “machine-code-level” representation of the physical world in terms of hypergraphs, rewriting rules, etc. In mathematics it turns out that there’s already a well-established “machine-code-level” representation: networks of theorems stated as symbolic expressions, and transformed into each other according to (essentially structural) laws of inference.
At this level of description, any particular field of mathematics can be thought of as starting from certain axioms, then building up a whole multiway graph of all possible theorems they imply. The paths in this graph correspond to proofs—with the phenomenon of undecidability manifesting itself in the presence of arbitrarily long paths—and the theorems that human mathematicians find interesting are dotted around the graph.
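As an illustration of proofs as paths, here is a small Python sketch: a toy “axiom system” given by a few two-way string rewrite rules (chosen arbitrarily for the example), with breadth-first search over the corresponding multiway graph used to find a shortest “proof” that one expression can be transformed into another:

```python
from collections import deque

# Toy "proof search": rewrite rules define a multiway graph of expressions,
# and a proof that one expression can be transformed into another is a path
# in that graph. Breadth-first search finds a shortest such path.

RULES = [("A", "AB"), ("AB", "A"), ("BB", "A"), ("A", "BB")]   # toy two-way rules

def successors(state):
    out = set()
    for lhs, rhs in RULES:
        start = 0
        while (i := state.find(lhs, start)) != -1:
            out.add(state[:i] + rhs + state[i + len(lhs):])
            start = i + 1
    return out

def prove(start, goal, max_states=100_000):
    """Return a shortest rewrite path from start to goal, or None if not found."""
    parents, queue = {start: None}, deque([start])
    while queue and len(parents) < max_states:
        state = queue.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parents[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parents:
                parents[nxt] = state
                queue.append(nxt)
    return None

print(prove("A", "BBB"))     # prints a short "proof", e.g. ['A', 'AB', 'BBB']
```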
So what happens if instead of looking at a single axiom system we look at all possible axiom systems? What we’ll get is a structure corresponding to the entangled limit of all possible proofs—based on rules derived from all possible axiom systems. But we’ve seen an equivalent structure before: it’s just the ruliad!
But now instead of interpreting emes as atoms of space we interpret them as “atoms of mathematics”, or the lowest level elements of mathematical expressions. And instead of interpreting slices of the ruliad as corresponding in the limit to physical space, we interpret them as defining metamathematical space.
So in addition to encapsulating all possible computational processes, the ruliad also encapsulates all possible mathematical ones. But how do human mathematicians—or what we can call mathematical observers—“perceive” this? How do they extract what they consider meaningful mathematics?
Physical observers get their “view of the world” in essence by building up a thread of experience through time. Mathematical observers get their “view of the world” by starting with some set of theorems (or axioms) they choose to assume, then “moving outwards” to build up a larger collection of mathematical results.
The raw ruliad is full of computational irreducibility. But in both physics and mathematics the goal is in effect to find pockets of reducibility that let observers like us get summaries that we can fit in our finite minds. In physics this manifests in identifying concepts like space, and then identifying laws that must hold about them.
So what is the analog for mathematics? In principle one could operate at the level of axioms (or even below). But in doing Euclidean geometry, for example, it’s perfectly reasonable to talk in terms of the Pythagorean theorem, without always going down to the lowest level of definitions, say for real numbers. It’s very much like in physics, where for many purposes one can talk about something like the flow of a fluid, without having to worry about what’s going on at the level of molecular dynamics.
And indeed, just like in physics, the fact that mathematics is done by observers like us has immediate implications for what mathematics is like, or in effect, for the “laws of mathematics”. What are these laws? The most important is that higher-level mathematics is possible: in other words, that mathematicians can in fact successfully do mathematics at the “fluid dynamics” level, without always having to drop down to the raw “molecular dynamics” level of axioms and below.
There are other laws of mathematics one can expect. For example, the homogeneity of metamathematical space implied by the structure of the ruliad has the consequence that “pure metamathematical motion” should be possible, so that there must be “dualities” that allow one to translate from one field of mathematics to another. As another example, there should be analogs of general relativity in metamathematical space, with the analog of black holes (in which “time stops”) being decidable mathematical theories in which proofs “always stop” (in the sense that they are of bounded length).
But—just like for physics—we’re in a sense getting from the ruliad the mathematics we get because we are observers of the kind we are. We might have imagined that we could just invent whatever mathematics we want just by setting up an appropriate axiom system. But the point is that only some axiom systems—or in effect some slices of the ruliad—will allow observers with our characteristics to coherently do mathematics.
Our everyday experience of the physical world gives us the impression that we have a kind of “direct access” to many foundational features of physics, like the existence of space and the phenomenon of motion. But our Physics Project implies that these are not concepts that are in any sense “intrinsically there”; they are just things that emerge from the raw ruliad when you “parse” it in the kinds of ways physical observers like us do.
In mathematics it’s less obvious (at least to anyone except perhaps experienced pure mathematicians) that there’s “direct access” to anything. But in our view of mathematics here, it’s ultimately just like physics—and ultimately also rooted in the ruliad, but sampled not by physical observers but by mathematical ones.
So from this point of view there’s just as much that’s “real” underneath mathematics as there is underneath physics. The mathematics is sampled slightly differently—but we should not in any sense consider it “fundamentally more abstract”.
When we think of ourselves as entities within the ruliad, we can build up what we might consider a “fully abstract” description of how we get our “experience” of physics. And we can basically do the same for mathematics. So if we take the commonsense point of view that the physical world exists “for real”, we’re forced into the same point of view for mathematics. In other words, if we say that the physical world exists, so must we also say that in some fundamental sense, mathematics also exists.
Underneath mathematics, just like underneath physics, is the ruliad. And so, in a sense, what is ultimately there in mathematics is the same as what is ultimately there in physics. Mathematics is not something we humans “just make”; it’s something that comes from the ruliad, through our particular way of observing it, defined by our particular characteristics as observers.

One of the things we’ve learned over the past few centuries is just how small we are compared to the universe. But now we realize that compared to the whole ruliad we’re vastly smaller still. We’re certainly not as small as we might be, though. And indeed at some level we’re actually quite large, being composed not of just a few atoms of space or, for that matter, emes, but of an immense number of them.
We are in effect intermediate in scale: huge compared to emes, but tiny compared to the whole ruliad. And the fact that we as observers are at this scale is crucial to how we experience the universe and the ruliad.
We’re large enough that we can in some sense persistently exist and form solid, stable experiences, not subject to constantly changing microscopic details. Yet we’re small enough that we can exist as coherent, independent entities in the ruliad. We’re large enough that we can have a certain amount of “inner life”; yet we’re small enough that there’s also plenty of “external stimuli” impinging on us from elsewhere in the ruliad.
We’re also large enough that we can typically think in terms of continuous space, not atoms of space. And we can think in terms of continuous quantum amplitudes, not discrete multiway threads. But we’re small enough that we can have a consistent view of “where we are” in physical space and in branchial space. And all of us human observers are tightly enough packed in physical and branchial space that we basically agree about “what’s happening” around us—and this forms the basis for what we consider to be “objective reality”.
And it’s the same basic story when we think about the whole ruliad. But now “where we are” determines in effect what rules we attribute to the universe. And our scale is what makes those rules both fairly consistent and fairly definite. Some features of the universe—like the basic phenomena of general relativity and quantum mechanics—depend only on our general characteristics as observers. But others—likely including the masses of particles—depend on our “place in the ruliad”. (And, for example, we already know from traditional physics that something like the perceived mass of an electron depends on the momentum we use to probe it.)
When we ask what there ultimately is, one of the most striking things is how much there ultimately is. We don’t yet know the scale of the discreteness of space, but conceivably it’s around 10^-90 meters—implying that at any given moment there might be 10^400 atoms of space in the universe, and about 10^500 in the history of the universe so far. What about the whole ruliad? The total number of emes is exponentially larger, conceivably of order (10^500)^(10^500). It’s a huge number—but the fact that we can even guess at it gives us a sense that we can begin to think concretely about what there ultimately is.
So how does this all relate to human scales? We know that the universe is about 10^80 times larger in volume than a human in physical space. And within one human at any given time there might be about 10^300 atoms of space; our existence through our lives might be defined by perhaps 10^400 emes. We can also guess at our extent in branchial space—and in the whole ruliad. There are huge numbers involved—and they give us a sense of why we observe so much that seems so definite in the universe. In effect, it’s that we’re “big enough” that our “averaged” perceptions are very precise, yet we’re “small enough” that we’re essentially at a precise location within the ruliad.
(A remarkable feature of thinking in terms of emes and the ruliad is that we can do things like compare the “sizes” of human-scale physics and mathematics. And a rough estimate might be that all the mathematics done in human history has involved perhaps 10^100 emes—vastly less than the number of emes involved in our physical existence.)
We can think of our “big but small” scale as being what allows us to be observers who can be treated as persistent in time. And it’s very much the same story for “persistence in space”. For us to be capable of “pure motion”, where we move from one place to another and are still “persistently ourselves”, we have to be large compared to the scale of emes, and tiny compared to the scale of the ruliad.
When it comes to moving in the physical universe, we know that to “actually move ourselves” (say with a spacecraft) takes time. But to imagine what it’s like to have moved is something that can be done abstractly, and quickly. And it’s the same at the level of the ruliad. To “move ourselves” in the ruliad in effect requires an explicit computational translation from one set of rules to another, which takes (typically irreducible) computational effort, and therefore time. But we can still just abstractly jump anywhere we want in the ruliad. And that is in effect what we do in ruliology—studying rules that we can, for example, just pick at random, or find by enumeration. We can discover all sorts of interesting things that way. But in a sense they’re—at least at first—alien things, not immediately connected to anything familiar to us from our normal location in the ruliad.
We see something similar in mathematics. We can start enumerating a huge network of possible theorems. But unless we can find a way to transport ourselves as mathematical observers more or less wholesale in metamathematical space, we won’t be able to contextualize most of those theorems; they’ll seem alien to us.
It’s not easy to get intuition for the “alien” things out there in the ruliad. One approach is to use generative AI, say to make pictures, and to ask what happens if the parameters of the AI are changed, in effect moving to different rules, and a different part of the ruliad. Sometimes one gets to recognizable pictures that are described by some concept, say associated with a word in human language. But in the vast majority of cases one finds oneself in “interconcept space”—in a place for which no existing human concept has yet been invented. And indeed in experiments with practical neural nets the fraction of this tiny corner of the ruliad spanned by our familiar concepts can easily be just 10^-600 of the total. In other words, what we have ways to describe, say in human language, represents an absolutely tiny fraction of what there ultimately is.
But what happens if we invent more concepts? In some sense we then grow in rulial space—so that we as observers span a larger part of the ruliad. And perhaps we might see it as some kind of ultimate goal for science and for knowledge to expand ourselves throughout the ruliad.
But there’s a catch. The fact that we can view ourselves as definite, individual observers depends on us being small compared to the ruliad. If we could expand to fill the ruliad we would in some sense be everything—but we would also be nothing, and would no longer exist as coherent entities.

Metaphysics has historically been viewed as a branch of philosophy. But what I’ve argued here is that with our new results and new insights from science we can start to discuss metaphysics not just as philosophy but also as science—and we can begin to tell an actual scientific story of what there ultimately is, and how we fit into it.
Metaphysics has in the past basically always had to be built purely on arguments made with words—and indeed perhaps this is why it’s often been considered somewhat slippery and fragile. But now, with the kind of things we’ve discussed here, we’re beginning to have what we need to set up a solid, formal structure for metaphysics, in which we can progressively build a rich tower of definite conclusions.
Some of what there is to say relates to the ruliad, and is, in a sense, purely abstract and inevitable. But other parts relate to the “subjective” experience of observers—and, for us, basically human observers. So does metaphysics somehow need to involve itself with all the details of biology or, for that matter, psychology? The big surprise is that it doesn’t. Because the science says that knowing only very coarse things about observers (like that they are computationally bounded) already makes it possible to come to precise conclusions about certain features and laws they must perceive. There is in effect an emergent metaphysics.
The ruliad provides a form of answer to what there abstractly ultimately is. But to connect it to what for us there “really” is we need to know the essence of what we are like.
Investigating features of the ruliad is a lot about doing pure ruliology—and empirically studying what abstract simple programs do. Talking about observers is much more an exercise in metamodeling—and taking largely known models of the world, and trying to extract from them the abstract essence of what’s going on. To make a science of metaphysics somehow requires both of these.
But the exciting thing is that building on the computational paradigm and intuition from studying the computational universe, we’re getting to the point where we can begin to give definite scientific answers to questions of metaphysics that for millennia have seemed like things about which one could just make arguments, but never come to conclusions. And like so many branches of philosophy before it, metaphysics now seems destined to make the transition from being a matter purely of philosophy to being one of science—finally giving us answers to the age-old question of what there ultimately is.
For other discussions of ideas explored here, see my recent Philosophical Writings »
2026-01-31 03:26:48

“Could there be a faster program for that?” It’s a fundamental type of question in theoretical computer science. But except in special cases, such a question has proved fiendishly difficult to answer. And, for example, in half a century, almost no progress has been made even on the rather coarse (though very famous) P vs. NP question—essentially of whether for any nondeterministic program there will always be a deterministic one that is comparably fast (i.e. slower by at most a polynomial factor). From a purely theoretical point of view, it’s never been very clear how to even start addressing such a question. But what if one were to look at the question empirically, say in effect just by enumerating possible programs and explicitly seeing how fast they are, etc.?
One might imagine that any programs one could realistically enumerate would be too small to be interesting. But what I discovered in the early 1980s is that this is absolutely not the case—and that in fact it’s very common for programs even small enough to be easily enumerated to show extremely rich and complex behavior. With this intuition I already in the 1990s began some empirical exploration of things like the fastest ways to compute functions with Turing machines. But now—particularly with the concept of the ruliad—we have a framework for thinking more systematically about the space of possible programs, and so I’ve decided to look again at what can be discovered by ruliological investigations of the computational universe about questions of computational complexity theory that have arisen in theoretical computer science—including the P vs. NP question.
We won’t resolve the P vs. NP question. But we will get a host of definite, more restricted results. And by looking “underneath the general theory” at explicit, concrete cases we’ll get a sense of some of the fundamental issues and subtleties of the P vs. NP question, and why, for example, proofs about it are likely to be so difficult.
Along the way, we’ll also see lots of evidence of the phenomenon of computational irreducibility—and the general pattern of the difficulty of computation. We’ll see that there are computations that can be “reduced”, and done more quickly. But there are also others where we’ll be able to see with absolute explicitness that—at least within the class of programs we’re studying—there’s simply no faster way to get the computations done. In effect this is going to give us lots of proofs of restricted forms of computational irreducibility. And seeing these will give us ways to further build our intuition about the ever-more-central phenomenon of computational irreducibility—as well as to see how in general we can use the methodology of ruliology to explore questions of theoretical computer science.
Click any diagram to get Wolfram Language code to reproduce it.
How can we enumerate possible programs? We could pick any model of computation. But to help connect with traditional theoretical computer science, I’ll use a classic one: Turing machines.
Often in theoretical computer science one concentrates on yes/no decision problems. But here it’ll typically be convenient instead to think (more “mathematically”) about Turing machines that compute integer functions. The setup we’ll use is as follows. Start the Turing machine with the digits of some integer n on its tape. Then run the Turing machine, stopping if the Turing machine head goes further to the right than where it started. The value of the function with input n is then read off from the binary digits that remain on its tape when the Turing machine stops. (There are many other “halting” criteria we could use, but this is a particularly robust and convenient one.)
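Here is a minimal sketch of this setup in Wolfram Language. It is not the code behind the pictures below, and it makes assumptions the text leaves implicit: the head starts on the least-significant (rightmost) digit of the input in state 1, blank cells are 0, and the rule is given explicitly as an association from {state, color} to {new state, new color, offset} rather than by a rule number. It returns {value, runtime}, or Missing if the machine has not halted within a step limit:

```
(* A sketch of the setup described above (not the code behind these pictures).
   Assumed conventions: head starts on the least-significant digit of the input,
   in state 1; blank cells are 0; the rule covers every {state, color} case. *)
runTM[rule_Association, n_Integer, maxSteps_: 1000000] := Module[
  {digits = IntegerDigits[n, 2], tape, start, pos, state = 1, t = 0, step},
  tape = Association@MapIndexed[First[#2] -> #1 &, digits];
  start = Length[digits]; pos = start;
  While[pos <= start && t < maxSteps,
   step = rule[{state, Lookup[tape, pos, 0]}];
   tape[pos] = step[[2]];      (* write the new color *)
   state = step[[1]];          (* switch to the new state *)
   pos += step[[3]];           (* move the head *)
   t++];
  If[pos > start,
   {FromDigits[Values[KeySort[tape]], 2], t},   (* {f[n], runtime} *)
   Missing["DidNotHalt"]]]

(* a purely hypothetical 2-state rule, just to exercise the function *)
exampleRule = <|
   {1, 0} -> {2, 1, -1}, {1, 1} -> {1, 0, -1},
   {2, 0} -> {1, 1, 1},  {2, 1} -> {2, 1, 1}|>;
runTM[exampleRule, 6]   (* -> {7, 3} under these conventions *)
```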
So for example, given a Turing machine with rule
we can feed it successive integers as input, then run the machine to find the successive values it computes:
In this case, the function that the Turing machine computes is
or in graphical form:
For each input, the Turing machine takes a certain number of steps to stop and give its output (i.e. the value of the function):
But this particular Turing machine isn’t the only one that can compute this function. Here are two more:
The outputs are the same as before, but the runtimes are different:
Plotting the runtimes of these machines together, we see that there are definite trends—but no clear winner for “fastest program”:
In computational complexity theory, it’s common to discuss how runtime varies with input size—which here means taking each block of inputs with a given number of digits, and just finding its maximum:
And what we see is that in this case the first Turing machine shown is “systematically faster” than the other two—and in fact provides the fastest way to compute this particular function among Turing machines of the size we’re using.
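As a minimal sketch of how such a worst-case profile might be computed, using the hypothetical runTM above and assuming the machine halts on all the inputs tried:

```
(* worst-case runtime over all inputs with exactly len binary digits,
   for len up to maxLen; assumes the hypothetical runTM defined above *)
worstCaseRuntimes[rule_, maxLen_] :=
 Table[Max[Last[runTM[rule, i]] & /@ Range[2^(len - 1), 2^len - 1]], {len, maxLen}]
```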
Since we’ll be dealing with lots of Turing machines here, it’s convenient to be able to specify them just with numbers—and we’ll do it the way TuringMachine in the Wolfram Language does. And with this setup, the machines we’ve just considered have numbers 261, 3333 and 1285.
In thinking about functions computed by Turing machines, there is one immediate subtlety to consider. We’ve said that we find the output by reading off what’s on the Turing machine tape when the Turing machine stops. But what if the machine never stops? (Or in our case, what if the head of the Turing machine never reaches the right-hand end?) Well, then there’s no output value defined. And in general, the functions our Turing machines compute will only be partial functions—in the sense that for some of their inputs, there may be no output value defined (as here for machine 2189):
When we plot such partial functions, we’ll just have a gap where there are undefined values:
In what follows, we’ll be exploring Turing machines of different “sizes”. We’ll assume that there are two possible colors for each position on the tape—and that there are s possible states for the head. The total number of possible Turing machines with k = 2 colors and s states is (2ks)^(ks)—which grows rapidly with s:
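As a quick check of this count for the k = 2 case used here:

```
(* number of s-state, k-color machines is (2 k s)^(k s); here with k = 2 *)
With[{k = 2}, Table[(2 k s)^(k s), {s, 1, 4}]]
(* -> {16, 4096, 2985984, 4294967296} *)
```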
For any given function we’ll then be able to ask what machine (or machines) up to a given size compute it the fastest. In other words, by explicitly studying possible Turing machines, we’ll be able to establish an absolute lower bound on the computational difficulty of computing a function, at least when that computation is done by a Turing machine of at most a given size. (And, yes, the size of the Turing machine can be thought of as characterizing its “algorithmic information content”.)
In traditional computational complexity theory, it’s usually been very difficult to establish lower bounds. But our ruliological approach here will allow us to systematically do it (at least relative to machines of a given size, i.e. with given algorithmic information content). (It’s worth pointing out that if a machine is big enough, it can include a lookup table for any number of cases of any given function—making questions about the difficulty of computing at least those cases rather moot.)
To begin our systematic investigation of possible programs, let’s consider what is essentially the simplest possible case: Turing machines with one state and two possible colors of cells on their tape
Here’s what each of these machines does for successive integer inputs:
Looking at the outputs in each case, we can plot the functions these compute:
And here are the corresponding runtimes:
Out of all 16 machines, 8 compute total functions (i.e. the machines always terminate, so the values of the functions are defined for every input), and 8 don’t. Four machines produce “complicated-looking” functions; an example is machine 14, which computes the function:
There are a variety of representations for this function, including
and:
The way the function is computed by the Turing machine is
and the runtime is given by
which is simply:
For input of size n, this implies the worst-case time complexity for computing this function is
Each one of the 1-state machines works at least slightly differently. But in the end, all of them are simple enough in their behavior that one can readily give a “closed-form formula” for the value of f[i] for any given i:
One thing that’s notable is that—except in the trivial case where all values are undefined—there are no examples among
There are a total of 4096 possible 2-state, 2-color Turing machines. Running all these machines, we find that they compute a total of 350 distinct functions—of which 189 are total. Here are plots of these distinct total functions—together with a count of how many machines generate them (altogether 2017 of the 4096 machines always terminate, and therefore compute total functions):
Plotting the values of all these functions in 3D, we see that the vast majority have values f[i] that are close to their inputs i—indicating that in a sense the Turing machines usually “don’t do much” to their input:
To see more clearly what the machines “actually do”, we can look at the quantity
though in 6 cases it is 2, and in 3 cases (which include the “most popular” case
Dropping periodic cases, the remaining distinct
Some of what we see here is similar to the 1-state case. An example of different behavior occurs for machine 2223
which gives for
In this case f[i] turns out to be expressible simply as
or:
Another example is machine 2079
which gives for
This function once again turns out to be expressible in “closed form”:
Some functions grow rapidly. For example, machine 3239
has values:
These have the property that
There are many subtleties even in dealing with 2-state Turing machines. For example, different machines may “look like” they’re generating the same function f[i] up to a certain value of i, and only then deviate. The most extreme example of such a “surprise” among machines generating total functions occurs among:
Up to
What about partial functions? At least for 2-state machines, if undefined values in f[i] are ever going to occur, they always already occur for small i. The “longest holdouts” are machines 1960 and 2972, which are both first undefined for input 8
but which “become undefined” in different ways: in machine 1960, the head systematically moves to the left, while in machine 2972, it moves periodically back and forth forever, without ever reaching the right-hand end. (Despite their different mechanisms, both rules share the feature of being undefined for all inputs that are multiples of 8.)
What about runtimes? If a function f[i] is computed by several different Turing machines, the details of how it’s computed by each machine will normally be at least slightly different. Still, in many cases the mechanisms are similar enough that their runtimes are the same. And in the end, among all the 2017 machines that compute our 189 distinct total functions, there are only 103 distinct “profiles” of runtime vs. input (and indeed many of these are very similar):
The picture gets simpler if, rather than plotting runtimes for each specific input value, we instead plot the worst-case runtime for all inputs of a given size. (In effect we’re plotting against IntegerLength[i, 2] or Ceiling[Log2[i + 1]].) There turn out to be just 71 distinct profiles for such worst-case time complexity
and indeed all of these have fairly simple closed forms—which for even n are (with directly analogous forms for odd n):
If we consider the behavior of these worst-case runtimes for large input lengths n, we find that fairly few distinct growth rates occur—notably with linear, quadratic and exponential cases, but nothing in between:
The machines with the fastest growth
For a size-n input, the maximum value of the function is just the maximum integer with n digits, or 2^n – 1.
And at these maxima, the machine is effectively operating like a binary counter, generating all the states it can, with the head moving in a very regular nested pattern:
It turns out that for
Of the 8 machines with runtimes growing like
The two machines with asymptotic runtime growth
Here’s the actual behavior of these machines when given inputs 1 through 10:
(The lack of runtimes intermediate between quadratic and exponential is notable—and perhaps reminiscent of the rarity of “intermediate growth” seen for example in the cases of finitely generated groups and multiway systems.)
Our emphasis so far has been on worst-case runtimes: the largest runtimes required for inputs of any given size. But we can also ask about the distribution of runtimes within inputs of a given size.
So, for example, here are the runtimes for all size-11 inputs for a particular, fairly typical Turing machine (
The maximum (“worst-case”) value here is 43—but the median is only 13. In other words, while some computations take a while, most run much faster—so that the runtime distribution is peaked at small values:
(The way our Turing machines are set up, they always run for an even number of steps before the final terminating move—since the head must eventually move one position back to the right for every position it moved to the left.)
If we increase the size of the inputs, we see that the distribution, at least in this case, is close to exponential:
It turns out that this kind of exponential distribution is typical of what we see in almost all Turing machines. (It’s notable that this is rather different from the t^(–1/2) “stopping time” distribution we’d expect if the Turing machine head was “on average” executing a random walk with an absorbing boundary.) There are nevertheless machines whose distributions deviate significantly from exponential, examples being:
Some simply have long tails to their exponentials. Others, however, have an overall non-exponential form.
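As a minimal sketch of how such a runtime distribution might be tabulated, again with the hypothetical runTM above, and assuming the machine halts on all inputs of the given size:

```
(* tally of {runtime, count} pairs over all inputs with exactly len binary digits *)
runtimeDistribution[rule_, len_] :=
 Sort@Tally[Last[runTM[rule, #]] & /@ Range[2^(len - 1), 2^len - 1]]
```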
We’ve now seen lots of functions—and runtime profiles—that
We’ve seen that there are machines that compute functions quite slowly—like in exponential time. But are these machines the fastest that compute those particular functions? It turns out the answer is no.
And if we look across all 189 total functions computed by
In other words, there are 8 functions that are the “most difficult to compute” for
What are these functions? Here’s one of them (computed by machine 1511):
If we plot this function, it seems to have a nested form
which becomes somewhat more obvious on a log-log plot:
As it turns out, there’s what amounts to a “closed form” for this function
though unlike the closed forms we saw above, this one involves Nest, and effectively computes its results recursively:
How about machine 1511? Well, here’s how it computes this function—in effect visibly using recursion:
The runtimes are
giving worst-case runtimes for inputs of size n of the form:
It turns out all 8 functions with minimum runtimes growing like
For the functions with fastest computation times
So what can we conclude? Well, we now know some functions that cannot be computed by
We now know the fastest that certain functions can be computed by
And in fact it’s common for there to be only one machine that computes a given function. Out of the 189 total functions that can be computed by
OK, but if multiple machines compute the same function, we can then ask how their speeds compare. Well, it turns out that for 145 of our 189 total functions all the different machines that compute the same function do so with the same “runtime profile” (i.e. with the same runtime for each input i). But that leaves 44 functions for which there are multiple runtime profiles:
Here are all these 44 functions, together with the distinct runtime profiles for machines that compute them:
Much of the time we see that the possible runtime profiles for computing a given function differ only very little. But sometimes the difference is more significant. For example, for the identity function
Within these 10 profiles, there are 3 distinct rates of growth for the worst-case runtime by input size: constant, linear, and exponential
exemplified by machines 3197, 3589 and 3626 respectively:
Of course, there’s a trivial way to compute this particular function—just by having a Turing machine that doesn’t change its input. And, needless to say, such a machine has runtime 1 for all inputs:
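Under the conventions of the hypothetical runTM above (and quite apart from the specific machines just mentioned), such a “do nothing” machine is simply a rule that moves right whatever it sees, and it does indeed give runtime 1 for every input:

```
(* a trivial machine that leaves its input unchanged and halts immediately *)
identityRule = <|{1, 0} -> {1, 0, 1}, {1, 1} -> {1, 1, 1}|>;
runTM[identityRule, 6]   (* -> {6, 1} *)
```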
It turns out that for
But although there are not different “orders of growth” for worst-case runtimes among any other (total) functions computed by
by slightly different methods
with different worst-case runtime profiles
or:
By the way, if we consider partial instead of total functions, nothing particularly different happens, at least with
that are again essentially computing the identity function.
Another question is how
But how fast are the computations? This compares the possible worst-case runtimes for
There must always be
, as in:
But can
We’ve seen that different Turing machines can take different times to compute particular functions. But how fast can any conceivable Turing machine—even in principle—compute a given function?
There’s an obvious absolute lower bound to the runtime: with the way we’ve set things up, if a Turing machine is going to take input i and generate output j, its head has to at least be able to go far enough to the left to reach all the bits that need to change in going from i to j—as well as making it back to the right-hand end so that the machine halts. The number of steps required for this is
which for values of i and j up to 8 bits is:
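The exact formula and table are not reproduced here, but a rough sketch of one way to formulate such a bound, under the runTM conventions above and counting the final halting move, is twice the position of the highest bit at which i and j differ, minus one; the precise offsets depend on the conventions chosen and may differ from the formula used in the text:

```
(* rough lower-bound estimate: the head must at least reach the highest bit at
   which i and j differ, and then get back past its starting cell *)
lowerBoundEstimate[i_, j_] := Max[1, 2 BitLength[BitXor[i, j]] - 1]
```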
So how do the runtimes of actual Turing machine computations compare with these absolute lower bounds?
Here’s the behavior of s = 1, k = 2 machines 1 and 3, where for each input we’re giving the actual runtime along with the absolute lower bound:
In the second case, the machine is always as efficient as it absolutely can be; in the first case, it only sometimes is—though the maximum slowdown is only 2 steps.
For s = 2, k = 2 machines, the differences can be much larger. For example, machine 378 can take exponential time—even though the absolute lower bound in this case is just 1 step, since this machine computes the identity function:
Here’s another example (machine 1447) in which the actual runtime is always roughly twice the absolute lower bound:
But how does the smallest (worst-case) runtime for any s = 2 Turing machine to compute a given function compare to the absolute lower bound? Well, in a result that presages what we’ll see later in discussing the P vs. NP question, the difference can be increasingly large:
The functions being computed here are
and the fastest s = 2 Turing machines that do this are (machines 2205, 3555 and 2977):
Our absolute lower bound determines how fast a Turing machine can possibly generate a given output. But one can also think of it as something that measures how much a Turing machine has “achieved” when it generates a given output. If the output is exactly the same as the input, the Turing machine has effectively “achieved nothing”. The more they differ, the more one can think of the machine having “achieved”.
So now a question one can ask is: are there functions where little is achieved in the transformation from input to output, but where the minimum runtime to perform this transformation is still long? One might wonder about the identity function—where in effect “nothing is achieved”. And indeed we’ve seen that there are Turing machines that compute this function, but only slowly. However, there are also machines that compute it quickly—so in a sense its computation doesn’t need to be slow.
The function above computed by machine 2205 is a somewhat better example. The (worst-case) “distance” between input and output grows like 2n with the input size n, but the fastest the function can be computed is what machine 2205 does, with a runtime that grows like 10n. Yes, these are still both linear in n. But at least to some extent this is an example of a function that “doesn’t need to be slow to compute”, but is at least somewhat slow to compute—at least for any
How difficult is it to compute the value of a function, say with a Turing machine? One measure of that is the time it takes, or, more specifically, how many Turing machine steps it takes. But another measure is how much “space” it takes, or, more specifically, with our setup, how far to the left the Turing machine head goes—which determines how much “Turing machine memory” or “tape” has to be present.
Here’s a typical example of the comparison between “space” and “time” used in a particular Turing machine:
If we look at all possible space usage profiles as a function of input size we see that—at least for
(One could also consider different measures of “complexity”—perhaps appropriate for different kinds of idealized hardware. Examples include seeing the total length of path traversed by the head, the total area of the region delimited by the head, the number of times 1 is written to the tape during the computation, etc.)
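As a minimal sketch, the hypothetical runTM above can be extended to track a couple of such measures, say how many tape cells the head reaches and how many times a 1 is written:

```
(* a variant of runTM that also records how far left the head went and how
   many times a 1 was written; same (assumed) conventions as runTM above *)
runTMStats[rule_Association, n_Integer, maxSteps_: 1000000] := Module[
  {digits = IntegerDigits[n, 2], tape, start, pos, state = 1, t = 0,
   leftmost, ones = 0, step},
  tape = Association@MapIndexed[First[#2] -> #1 &, digits];
  start = Length[digits]; pos = start; leftmost = pos;
  While[pos <= start && t < maxSteps,
   step = rule[{state, Lookup[tape, pos, 0]}];
   If[step[[2]] == 1, ones++];
   tape[pos] = step[[2]]; state = step[[1]]; pos += step[[3]];
   leftmost = Min[leftmost, pos]; t++];
  If[pos > start,
   <|"Value" -> FromDigits[Values[KeySort[tape]], 2], "Runtime" -> t,
     "CellsVisited" -> start - leftmost + 1, (* leftmost cell reached up to the start cell *)
     "OnesWritten" -> ones|>,
   Missing["DidNotHalt"]]]
```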
We’ve talked quite a lot about how runtime varies with input (or input size) for a particular machine. But what about the complementary question: given a particular input, how does runtime vary across different machines? Consider, for example, the
The runtimes for these machines are:
Here’s what we see if we continue to larger inputs:
The maximum (finite) runtime across all
or in closed form:
For s = 2, k = 2 machines, the distribution of runtimes with input 1 is
where the maximum value of 17 is achieved for machine 1447. For larger inputs the maximum runtimes are:
Plotting these maximum runtimes
we see a big peak at input 127, corresponding to runtime 509 (achieved by machines 378 and 1351). And, yes, plotting the distribution for input 127 of runtimes for all machines, we see that this is a significant outlier:
If one computes runtimes maximized over all machines and all inputs for successively larger sizes of inputs, one gets (once again dominated by machines 378 and 1351):
By the way, one can compute not only runtimes but also values and widths maximized across machines:
And, no, the maximum value isn’t always of the form
We’ve so far looked at
The issue—as so often—is computational irreducibility. Let’s say you have a machine and you’re trying to figure out if it computes a particular function. Or you’re even just trying to figure out if for input i it gives output j. Well, you might say, why not just run the machine? And of course you can do that. But the problem is: how long should you run it for? Let’s say the machine has been running for a million steps, and still hasn’t generated any output. Will the machine eventually stop, producing either output j or some other output? Or will the machine just keep running forever, and never generate any output at all?
If the behavior of the machine was computationally reducible, then you could expect to be able to “jump ahead” and figure out what it would do, without following all the steps. But if it’s computationally irreducible, then you can’t expect to do that. It’s a classic halting problem situation. And you have to conclude that the general problem of determining whether the machine will generate, say, output j is undecidable.
Of course, in lots of particular cases (say, for lots of particular inputs) it may be easy enough to tell what’s going to happen, either just by running for some number of steps, or by using some kind of proof or other abstract derivation. But the point is that—because of computational irreducibility—there’s no upper bound on the amount of computational effort that could be needed. And so the problem of “always getting an answer” has to be considered formally undecidable.
But what happens in practice? Let’s say we look at the behavior of all
And we then conclude that a bit more than half the machines halt—with the largest finite runtime being the fairly modest 53, achieved by machine 630283 (essentially equivalent to 718804):
But is this actually correct? Or do some of the machines we think don’t halt based on running for a million steps actually eventually halt—but only after more steps?
Here are a few examples of what happens:
And, yes, in all these cases we can readily see that the machines will never halt—and instead, potentially after some transient, their heads just move essentially periodically forever. Here’s the distribution of periods one finds
with the longest-period cases being:
And here’s the distribution of transients
with the longest-transient cases being:
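In the simplest cases this kind of non-halting behavior can be certified explicitly: if an exact configuration (state, head position, tape contents) ever recurs, the machine is provably looping. Here is a minimal sketch under the runTM conventions above; note that it only catches exactly repeating configurations, so a head that drifts steadily leftward (repeating only up to translation) needs a slightly more elaborate check:

```
(* run a machine on input n and report if an exact configuration recurs,
   which certifies that it never halts *)
certifyNonHalting[rule_Association, n_Integer, maxSteps_: 10000] := Catch@Module[
  {digits = IntegerDigits[n, 2], tape, start, pos, state = 1, t = 0,
   seen = <||>, config, step},
  tape = Association@MapIndexed[First[#2] -> #1 &, digits];
  start = Length[digits]; pos = start;
  While[pos <= start && t < maxSteps,
   config = {state, pos, Normal[KeySort[tape]]};
   If[KeyExistsQ[seen, config],
    Throw[<|"NonHalting" -> True,
      "Period" -> t - seen[config], "Transient" -> seen[config]|>]];
   seen[config] = t;
   step = rule[{state, Lookup[tape, pos, 0]}];
   tape[pos] = step[[2]]; state = step[[1]]; pos += step[[3]]; t++];
  If[pos > start, <|"Halts" -> True, "Runtime" -> t|>, <|"Undetermined" -> True|>]]
```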
But this doesn’t quite account for all the machines that don’t halt after a million steps: there are still 1938 left over. There are 91 distinct patterns of growth—and here are samples of what happens:
All of these eventually have a fundamentally nested structure. The patterns grow at different rates—but always in a regular succession of steps. Sometimes the spacings between these steps are polynomials, sometimes exponentials—implying either fractional power or logarithmic growth of the corresponding pattern. But the important point for our purposes here is that we can be confident that—at least with input 1—we know which
But what happens if we increase the input value we provide? Here are the first 20 maximum finite lifetimes we get:
In the “peak case” of input 10, the distribution of runtimes is
with, yes, the maximum value being a somewhat strange outlier.
What is that outlier? It’s machine 600720 (along with the related machine 670559)—and we’ll be discussing it in more depth in the next section. But suffice it to say now that 600720 shows up repeatedly as the
What about for larger inputs? Well, things get wilder then. Like, for example, consider the case of machine 1955095. For all inputs up to 41, the machine halts after a modest number of steps:
But then, at input 42, there’s suddenly a surprise—and the machine never halts:
And, yes, we can immediately tell it never halts, because we can readily see that the same pattern of growth repeats periodically—every 24 steps. (A more extreme example is
And, yes, things like this are the “long arm” of undecidability reaching in. But by successively investigating both larger inputs and longer runtimes, one can develop reasonable confidence that—at least most of the time—one is correctly identifying both cases that lead to halting, and ones that do not. And from this one can estimate that of all the 2,985,984 possible
Summarizing our results we find that—somewhat surprisingly—the halting fraction is quite similar for different numbers of states, and always close to 1/2:
And based on our census of halting machines, we can then conclude that the number of distinct total functions computed by
In looking at the runtimes of
I actually first noticed this machine in the 1990s as part of my work on A New Kind of Science—and with considerable effort was able to give a rather elaborate analysis of at least some of its behavior:
The first remarkable thing about the machine is the dramatic peaks it exhibits in the output values it generates:
These peaks are accompanied by corresponding (somewhat less dramatic) peaks in runtime:
The first of the peaks shown here occurs at input i = 34—with runtime 315,391, and output
but the basic point is that the machine seems to behave in a very “deliberate” way that one might imagine could be analyzed.
It turns out, though, that the analysis is surprisingly complicated. Here’s a table of maximum (worst-case) runtimes (and corresponding inputs and outputs):
For odd n > 3, the maximum runtime occurs when the input value i is:
The corresponding initial states for the Turing machine are of the form:
The output value with such an input (for odd n > 3) is then
while the runtime—derived effectively by “mathematicizing” what the Turing machine does for these inputs—is given by the bizarrely complex formula:
What is the asymptotic behavior? It’s roughly 6αn where α varies with n according to:
So this is how long it can take the Turing machine to compute its output. But can we find that output faster, say just by finding a “mathematical formula” for it? For inputs i with some particular forms (like the one above) it is indeed possible to find such formulas:
But in the vast majority of cases there doesn’t seem to be any simple mathematical-style formula. And indeed one can expect that this Turing machine is a typical computationally irreducible system: you can always find its output (here the value f[i]) by explicitly running the machine, but there’s no general way to shortcut this, and to systematically get to the answer by some reduced, shorter computation.
We discussed above that out of the 2.99 million possible
There are machines that give asymptotically constant runtime
with all odd asymptotic runtime values up to 21 (along with 25) being possible:
Then there are machines that give asymptotically linear runtimes, with even coefficients from 2 to 20 (along with 24)—for example:
By the way, note that (as we mentioned before) some machines realize their worst-case runtimes for many specific inputs, while in other machines such runtimes are rare (here illustrated for machines with asymptotic runtimes 24n):
Sometimes there are machines whose worst-case runtimes increase linearly, but in effect with fractional slopes:
There are many machines whose worst-case runtimes increase in an ultimately linear way—but with “oscillations”:
Averaging out the oscillations gives an overall growth rate of the form αn, where α is an integer or rational number with (as it turns out) denominator 2 or 3; the possible values for α are:
There are also machines with worst-case runtimes growing like αn^2, with α an integer from 1 to 10 (though missing 7):
And then there are a few machines (such as 129559 and 1166261) with cubic growth rates.
The next—and, in fact, single largest—group of machines have worst-case runtimes that asymptotically grow exponentially, following linear recurrences. The possible asymptotic growth rates seem to be (ϕ is the golden ratio
):
Some particular examples of machines with these growth rates include (we’ll see 5^(n/2) and 4^n examples in the next section):
The first of these is machine 1020827, and the exact worst-case runtime for input size n in this case is:
The second case shown (machine 117245) has exact worst-case runtime
which satisfies the linear recurrence:
The third case (machine 1007039) has exact worst-case runtime:
It’s notable that in all of these cases, the maximum runtime for input size n occurs for input
Continuing and squashing the results, it becomes clear that there’s a nested structure to these patterns:
By the way, it’s certainly not necessary that the worst-case runtime must occur at the largest input of a given size. Here’s an example (machine 888388) where that’s not what happens
and where in the end the 2^(n/2) growth is achieved by having the same worst-case runtime for input sizes n and n + 1 for all even n:
One feature of everything we’ve seen here is that the runtimes we’ve deduced are either asymptotically powers or asymptotically exponentials. There’s nothing in between—for example nothing like n^Log[n] or 4^Sqrt[n]:
No doubt there are Turing machines with such intermediate growth, but apparently none with
As we discussed above, out of the 2.99 million possible
The functions computed by the most machines are (where, not surprisingly, the identity function
The minimum number of machines that can compute a given function is always 2—because there’s always one machine with a
transition, and another with a
transition, as in:
But altogether there are about 13,000 of these “isolate” machines, where no other
So what are these functions—and how long do they take to compute? And remember, these are functions that are computed by isolate machines—so whatever the runtime of those machines is, this can be thought of as defining a lower bound on the runtime to compute that function, at least by any
So what are the functions with the longest runtimes computed by isolate machines? The overall winner seems to be the function computed by machine 600720 that we discussed above.
Next appears to come machine 589111
with its asymptotically 4^n runtime:
And although the values here, say for
Next appear to come machines like 599063
with asymptotic
Despite the seemingly somewhat regular pattern of values for this function, the machine that computes it is an isolate, so we know that at least among
What about the other machines with asymptotically exponential runtimes that we saw in the previous section? Well, the particular machines we used as examples there aren’t even close to isolates. But there are other machines that have the same exponentially growing runtimes, and that are isolates. And, just for once, there’s a surprise.
For asymptotic runtime 2^n, it turns out that there is just a single isolate machine: 1342057:
But look at how simple the function this machine computes is. In fact,
But despite the simplicity of this, it still takes the Turing machine a worst-case runtime that grows asymptotically like 2^n.
And, yes, after a transient at the beginning, all the machine is ultimately doing is to compute
Going on to asymptotic runtimes of the form 3^(n/2), it turns out there’s only one function for which there’s a machine (1007039) with this asymptotic runtime—and this function can be computed by over a hundred machines, many with faster runtimes, though some with slower (2^n) runtimes (e.g. 879123).
What about asymptotic runtimes of order
? It’s more or less the same story as with 3^(n/2). There are 48 functions which can be computed by machines with this worst-case runtime. But in all cases there are also many other machines, with many other runtimes, that compute the same functions.
But now there’s another surprise. For asymptotic runtime 2^(n/2) there are two functions computed only by isolate machines (889249 and 1073017):
So, once again, these functions have the feature that they can’t be computed any faster by any other
When we looked at
Among
There are, of course, many more
Isolate machines immediately define lower bounds on runtime for the functions they compute. But in general (as we saw above) there can be many machines that compute a given function. For example, as mentioned above, there are 210,792
with asymptotic runtimes ranging from constant to linear, quadratic and exponential. (The most rapidly increasing runtime is ~2^n.)
For each function that can be computed, there’s a slightly different collection of runtime profiles; here are the ones for the functions computed by the next largest numbers of machines:
We saw above that there are functions which cannot be computed asymptotically faster than particular bounds by, say, any
The first thing to say is that (as we discussed before for
Among
where the exact worst-case runtime is:
But now we can ask whether this function can be computed faster by any
An example is machine 1069163:
We can think of what’s happening as being that we start from the
and in effect optimize this by using a slightly more complicated “instruction set”:
In looking at
As an example, consider the function computed by the isolate machine 1342057:
This has asymptotic runtime 4^n. But now if we look at
There are also machines with linearly and quadratically increasing runtimes—though, confusingly, for the first few input sizes, they seem to be increasing just as fast as our original
Here are the underlying rules for these particular Turing machines:
And here’s the full spectrum of runtime profiles achieved by
There are runtimes that are easy to recognize as exponentials—though with bases like 2 and 3/2 that are smaller than 4. Then there are linear and polynomial runtimes of the kind we just saw. And there’s some slightly exotic “oscillatory” behavior, like with machine 1418699063, which seems to settle down to a periodic sequence of ratios, growing asymptotically like 2^(n/4).
What about other functions that are difficult to compute by
One of these follows exactly the runtimes of 600720; the other is not the same, but is very close, with about half the runtimes being the same, and the other half having maximal differences that grow linearly with n.
And what this means is that—unlike the function computed by
Looking at other functions that are “hard to compute” with
We’ve been talking so far about very small Turing machines—with at most a handful of distinct cases in their rules. But what if we consider much larger Turing machines? Would these allow us to systematically do computations much faster?
Given a particular (finite) mapping from input to output values, say
it’s quite straightforward to construct a Turing machine
whose state transitions in effect just “immediately look up” these values:
(If we try to compute a value that hasn’t been defined, the Turing machine will simply not halt.)
If we stay with a fixed value of k, then for a “function lookup table” of size v, the number of states we need for a “direct representation” of the lookup table is directly proportional to v. Meanwhile, the runtime is just equal to the absolute lower bound we discussed above, which is linearly proportional to the sizes of input and output.
Of course, with this setup, as we increase v we increase the size of the Turing machine. And we can’t guarantee to encode a function defined, say, for all integers, with anything less than an infinite Turing machine.
But by the time we’re dealing with an infinite Turing machine we don’t really need to be computing anything; we can just be looking everything up. And indeed computation theory always in effect assumes that we’re limiting the size of our machines. And as soon as we do this, there starts to be all sorts of richness in questions like which functions are computable, and what runtime is required to compute them.
In the past, we might just have assumed that there is some arbitrary bound on the size of Turing machines, or, in effect, a bound on their “algorithmic information content” or “program size”. But the point of what we’re doing here is to explore what happens not with arbitrary bounds, but with bounds that are small enough to allow us to do exhaustive empirical investigations.
In other words, we’re restricting ourselves to low algorithmic (or program) complexity and asking what then happens with time complexity, space complexity, etc. And what we find is that even in that domain, there’s remarkable richness in the behavior we’re able to see. And from the Principle of Computational Equivalence we can expect that this richness is already characteristic of what we’d see even with much larger Turing machines, and thus larger algorithmic complexity.
In everything we’ve done so far, we’ve been looking at “binary” (i.e. k = 2) Turing machines. But what happens if we allow k = 3 possible colors for each cell on the tape?
The setup we’ve been using translates immediately:
The simplest case is
Of these machines, 88 always halt—and compute 77 distinct functions. The possible runtimes are:
And unlike what we saw even for
For s = 2, k = 3, we have
In both cases, of the machines that don’t halt, the vast majority become periodic. For
Just as for
or in squashed form:
If we look beyond input 1, we find that about 1.12 million
A notable feature is that the tail consists of functions computed by only one machine. In the
and
means that there are always at least two machines computing a given function. But with
What about runtimes? The results for
In a sense it should not be surprising that there is so much similarity between the behavior of
However, if we look not at the kind of “one-sided” (or “halt if you go to the right”) Turing machines we are considering here, but instead at Turing machines where the head can go freely in either direction, then one difference emerges. Starting with a blank tape, all
And this fact provides a clue that such a machine (or, actually, the 14 essentially equivalent machines of which this is one example) might be capable of universal computation. And indeed it can be shown that, at least with appropriate (infinite) initial conditions, the machine can successfully be “programmed” to emulate systems that are known to be universal, thereby proving that it is itself universal.
How does this machine fare with our one-sided setup? Here’s what it does with the first few inputs:
And what one finds is that for any input, the head of the machine eventually goes to the right, so with our one-sided setup we consider the machine to halt:
It turns out that the worst-case runtime for input of size n grows according to:
But if we look at the function computed by this machine we can ask whether there are ways to compute it faster. And it turns out there are 11 other s = 2, k = 3 machines (though, for example, no
one might think they would be simple enough to have shorter runtimes. But in fact in the one-sided setup their behavior is basically identical to our original machine.
OK, but what about
We’ve been talking a lot about how fast Turing machines can compute functions. But what can we say about what functions they compute? With appropriate encoding of inputs and decoding of outputs, we know that (essentially by definition) any computable function can be computed by some Turing machine. But what about the simple Turing machines we’ve been using here? And what about “without encodings”?
The way we’ve set things up, we’re taking both the input and the output to our Turing machines to be the sequences of values on their tapes—and we’re interpreting these values as digits of integers. So that means we can think of our Turing machines as defining functions from integers to integers. But what functions are they?
Here are two
There are a total of 17
Still, for
If we restrict ourselves to even inputs, then we can compute
Similarly, there are
What about other “mathematically simple” functions, say
We’ve already seen a variety of examples where our Turing machines can be interpreted as evaluating bitwise functions of their inputs. A more minimal case would be something like a single bitflip—and indeed there is an
To be able to flip a higher-order digit, one needs a Turing machine with more states. There are two
And in general—as these pictures suggest—flipping the mth bit can be done with a Turing machine with at least
What about Turing machines that compute periodic functions? Strict (nontrivial) periodicity seems difficult to achieve. But here, for example, is an
With both
Another thing one might ask is whether one Turing machine can emulate another. And indeed that’s what we see happening—very directly—whenever one Turing machine computes the same function as another.
(We also know that there exist universal Turing machines—the simplest having
Computational irreducibility has been central to much of the science I’ve done in the past four decades or so. And indeed it’s guided our intuition in much of what we’ve been exploring here. But the things we’ve discussed now also allow us to take an empirical look at the core phenomenon of computational irreducibility itself.
Computational irreducibility is ultimately about the idea that there can be computations where in effect there is no shortcut: there is no way to systematically find their results except by running each of their steps. In other words, given an irreducible computation, there’s basically no way to come up with another computation that gives the same result, but in fewer steps. Needless to say, if one wants to tighten up this intuitive idea, there are lots of detailed issues to consider. For example, what about just using a computational system that has “bigger primitives”? Like many other foundational concepts in theoretical science, it’s difficult to pin down exactly how one should set things up—so that one doesn’t either implicitly assume what one’s trying to explain, or so restrict things that everything becomes essentially trivial.
But using what we’ve done here, we can explore a definite—if restricted—version of computational irreducibility in a very explicit way. Imagine we’re computing a function using a Turing machine. What would it mean to say that that function—and the underlying behavior of the Turing machine that computes it—is computationally irreducible? Essentially it’s that there’s no other faster way to compute that function.
But if we restrict ourselves to computation by a certain size of Turing machine, that’s exactly what we’ve studied at great length here. And, for example, whenever we have what we’ve called an “isolate” Turing machine, we know that no other Turing machine of the same size can compute the same function. So that means one can say that the function is computationally irreducible with respect to Turing machines of the given size.
How robust is such a notion? We’ve seen examples above where a given function can be computed, say, only in exponential time by an
But the important point here is that we can already see a restricted version of computational irreducibility just by looking explicitly at Turing machines of a given size. And this allows us to get concrete results about computational irreducibility, or at least about this restricted version of it.
One of the remarkable discoveries in looking at lots of kinds of systems over the years has been just how common the phenomenon of computational irreducibility seems to be. But usually we haven’t had a way to rigorously say that we’re seeing computational irreducibility in any particular case. All we typically know is that we can’t “visually decode” what’s going on, nor can particular methods we try. (And, yes, the fact that a wide variety of different methods almost always agree about what’s “compressible” and what’s not encourages our conclusions about the presence of computational irreducibility.)
In looking at Turing machines here, we’re often seeing “visual complexity”, not so much in the detailed—often ponderous—behavior with a particular initial condition, but more, for example, in what we get by plotting function values against inputs. But now we have a more rigorous—if restricted—test for computational irreducibility: we can ask whether the function that’s being computed is irreducible with respect to this size of Turing machine, or, typically equivalently, whether the Turing machine we’re looking at is an isolate.
So now we can, for example, explore how common irreducibility defined in this way might be. Here are results for some of the classes of small Turing machines we’ve studied above:
And what we see is that—much like our impression from computational systems like cellular automata—computational irreducibility is indeed very common among small Turing machines, where now we’re using our rigorous, if restricted, notion of computational irreducibility.
(It’s worth commenting that while “global” features of Turing machines—like the functions they compute—may be computationally irreducible, there can still be lots of computational reducibility in their more detailed properties. And indeed what we’ve seen here is that there are plenty of features of the behavior of Turing machines—like the back-and-forth motion of their heads—that look visually simple, and that we can expect to compute in dramatically faster ways than just running the Turing machine itself.)
So far, we’ve made a fairly extensive study of ordinary, deterministic (“single-way”) Turing machines. But the P vs. NP question is about comparing the capabilities of such deterministic Turing machines with the capabilities of nondeterministic—or multiway—Turing machines.
An ordinary (deterministic) Turing machine has a rule such as
that specifies a unique sequence of successive configurations for the Turing machine
which we can represent as:
A multiway Turing machine, on the other hand, can have multiple rules, such as
which are applied in all possible ways to generate a whole multiway graph of successive configurations for the Turing machine
where we have indicated edges in the multiway graph associated with the application of each rule respectively by
and
, and where identical Turing machine configurations are merged.
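To make this setup concrete, here's a minimal Wolfram Language sketch of a multiway Turing machine step and its multiway graph. It's deliberately simplified (in particular it ignores the halting and input/output conventions described below), and multiwayStep, multiwayGraph and the {state, head position, tape} configuration format are just illustrative choices, not the machinery actually used for the pictures here:

```wolfram
(* One multiway step: apply every rule case that matches the current
   {state, color} in every possible way. Rule cases have the form
   {state, color} -> {newState, newColor, headOffset}; blank cells are 0.
   KeySort keeps the tape in a canonical form so identical configurations merge. *)
multiwayStep[rules_List, {s_, pos_, tape_Association}] :=
  Cases[rules, ({s, Lookup[tape, pos, 0]} -> {s2_, c2_, d_}) :>
    {s2, pos + d, KeySort @ Append[tape, pos -> c2]}];

(* The multiway graph of configurations reached in t steps; NestGraph
   automatically merges identical configurations. *)
multiwayGraph[rules_, init_, t_Integer] :=
  NestGraph[multiwayStep[rules, #] &, init, t]
```

So, for example, multiwayGraph[{{1, 0} -> {1, 1, 1}, {1, 0} -> {1, 1, -1}}, {1, 0, <||>}, 4] traces out a few steps of a toy two-case machine, with branchings wherever both cases apply.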
Just as we have done for ordinary (deterministic) Turing machines, we take multiway Turing machines to reach a halting configuration whenever the head goes further to the right than it started—though now this may happen on multiple branches—so that the Turing machine in effect can generate multiple outputs.
With the way we have set things up, we can think of an ordinary (deterministic) Turing machine as taking an input i and giving as output some value f[i] (where that value might be undefined if the Turing machine doesn’t halt for a given i). In direct analogy, we can think of a multiway Turing machine as taking an input i and giving potentially a whole collection of corresponding outputs:
Among the immediate complications is the fact that the machine may not halt, at least on some branches—as happens for input 3 here, indicated by a red dot in the plot above:
(In addition, we see that there can be multiple paths that lead to a given output, in effect defining multiple runtimes for that output. There can also be cycles, but in defining “runtimes” we ignore these.)
When we construct a multiway graph we are effectively setting up a representation for all possible paths in the evolution of a (multiway) system. But when we talk about nondeterministic evolution we are typically imagining that just a single path is going to be followed, though we don’t know which.
Let’s say that we have a multiway Turing machine that for every given input generates a certain set of outputs. If we were to pick just one of the outputs from each of these sets, we would effectively in each case be picking one path in the multiway Turing machine. Or, in other words, we would be “doing a nondeterministic computation”, or in effect getting output from a nondeterministic Turing machine.
As an example, let’s take our multiway Turing machine from above. Here is an example of how this machine—thought of as a nondeterministic Turing machine—can generate a certain sequence of output values:
Each of these output values is achieved by following a certain path in the multiway graph obtained with each input:
Keeping only the path taken (and including the underlying Turing machine configuration) this represents how each output value was “derived”:
The length of the path can then be thought of as the runtime required for the nondeterministic Turing machine to reach the output value. (When there are multiple paths to a given output value, we’ll typically consider “the runtime” to be the length of the shortest of these paths.) So now we can summarize the runtimes from our example as follows:
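In the terms of the sketch above, this notion of runtime is just a graph distance in the multiway graph. Here's an illustrative way to compute it (the helper names, and the haltingQ test standing in for the halting convention, are hypothetical):

```wolfram
(* Nondeterministic runtime to reach a particular configuration: the length of
   the shortest path from the initial configuration in the multiway graph. *)
nondeterministicRuntime[g_Graph, init_, final_] :=
  GraphDistance[g, init, final];

(* All configurations passing a halting test, with their runtimes. *)
nondeterministicRuntimes[g_Graph, init_, haltingQ_] :=
  AssociationMap[GraphDistance[g, init, #] &,
    Select[VertexList[g], haltingQ]]
```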
The core of the P vs. NP problem is to compare the runtime for a particular function obtained by deterministic and nondeterministic Turing machines.
So, for example, given a deterministic Turing machine that computes a certain function, we can ask whether there is a nondeterministic Turing machine which—if you picked the right branch—can compute that same function, but faster.
In the case of the example above, there are two possible underlying Turing machine rules indicated by
and
. For each input we can choose at each step a different rule to apply in order to get to the output:
The possibility of using different rules at different steps in effect allows much more freedom in how our computation can be done. The P vs. NP question concerns whether this freedom allows one to fundamentally speed up the computation of a given function.
But before we explore that question further, let’s take a look at what multiway (nondeterministic) Turing machines typically do; in other words, let’s study their ruliology.
As our first example of doing ruliology for multiway Turing machines, let’s consider the case of pairs of
Sometimes, as in machine {1,9}, there turns out to be a unique output value for every input:
Sometimes, as in machine {5,9}, there is usually a unique value, but sometimes not:
Something similar happens with {3,7}:
There are cases—like {1,3}—where for some inputs there’s a “burst” of possible outputs:
There are also plenty of cases where for some inputs
or for all inputs, there are branches that do not halt:
What about runtimes? Well, for each possible output in a nondeterministic Turing machine, we can see how many steps it takes to reach that output on any branch of the multiway graph, and we can consider that minimum number to be the “nondeterministic runtime” needed to compute that output.
It’s the quintessential setup for NP computations: if you can successfully guess what branch to follow, you can potentially get to an answer quickly. But if you have to explicitly check each branch in turn, that can be a slow process.
Here’s an example showing possible outputs and possible runtimes for a sequence of inputs for the {3,7} nondeterministic machine
or, combined in 3D:
So what functions can a nondeterministic machine like this “nondeterministically” generate? For each input we have to pick one of the possible corresponding (“multiway”) outputs. And in effect the possible functions correspond to possible “threadings” through these values
or:
To each function one can then associate a “nondeterministic runtime” for each output, here:
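One concrete way to think about this: if for each input we have the list of possible {output, runtime} pairs, then the functions the machine can generate correspond to all ways of picking one pair per input. Here's a minimal sketch (the name and data format are just illustrative):

```wolfram
(* outcomes: an association from each input to its list of possible
   {output, nondeterministic runtime} pairs. Each "threading" picks one pair
   per input, giving one of the functions the machine can generate. *)
nondeterministicFunctions[outcomes_Association] :=
  AssociationThread[Keys[outcomes], #] & /@ Tuples[Values[outcomes]]
```

(Of course, the number of such threadings grows combinatorially with the number of inputs that have multiple possible outputs.)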
We’ve seen how a nondeterministic machine can in general generate multiple functions, with each output from the function being associated with a minimum (“nondeterministic”) runtime. But how do the functions that a particular nondeterministic machine can generate compare with the functions that deterministic machines can generate? Or, put another way, given a function that a nondeterministic machine can generate (or “compute”), what deterministic machine is required to compute the same function?
Let’s look at the
Can we find a deterministic
) branch in the multiway system
so that it inevitably gives the same results as deterministic machine 3.
But (apart from the other trivial case based on following “machine 7” branches) none of the other functions we can generate from this nondeterministic machine can be reproduced by any
What about
And here are the paths through the multiway graphs for that machine that get to these values
with the “paths on their own” being
yielding “nondeterministic runtimes”:
This is how the deterministic
Here are the pair of underlying rules for the nondeterministic machine
and here is the deterministic machine that reproduces a particular function it can generate:
This example is rather simple, and has the feature that even the deterministic machine always has a very small runtime. But now the question we can ask is whether a function that takes a deterministic machine of a certain class a certain time to compute can be computed in a smaller time if its results are “picked out of” a nondeterministic machine.
We saw above that
But what about a nondeterministic machine? How fast can this be?
It turns out that there are 15 nondeterministic machines based on pairs of
Here are the paths within the multiway graph for the nondeterministic machine that are sampled to generate the deterministic Turing machine result:
And here are these “paths on their own”:
We can compare these with the computations needed in the deterministic machine:
With our rendering, the lengths of the nondeterministic paths might look longer. But in fact they are considerably shorter, as we see by plotting them (in orange) along with the deterministic runtimes (in gray):
Looking now at the worst-case runtimes for inputs of size n, we get:
For the deterministic machine we found above, this worst-case runtime for input size n is given by:
But now the runtime in the nondeterministic machine turns out to be:
In other words, we’re seeing that nondeterminism makes it substantially faster to compute this particular function—at least by small Turing machines.
In a deterministic machine, it’s always the same underlying rule that’s applied at each step. But in a nondeterministic machine with the setup we’re using, we’re independently choosing one of two different rules to apply at each step. The result is that for every function value we compute, we’re making a sequence of choices:
And the core question that underlies things like the P vs. NP problem is how much advantage the freedom to make these choices conveys—and whether, for example, it allows us to “nondeterministically” compute in polynomial time what takes more than polynomial (say, exponential) time to compute deterministically.
As a first example, let’s look at the function computed by the
– 1
Well, it turns out that the
And indeed, while the deterministic machine takes exponentially increasing runtime, the nondeterministic machine has a runtime that quickly approaches the fixed constant value of 5:
But is this somehow trivial? As the plot above suggests, the nondeterministic machine (at least eventually) generates all possible odd output values (and for even input i, also generates
What makes the runtime end up being constant, however, is that in this particular case, the output f[i] is always close to i (in fact,
There are actually no fewer than
And while all of them are in a sense straightforward in their operation, they illustrate the point that even when a function requires exponential time for a deterministic Turing machine, it can require much less time for a nondeterministic machine—and even a nondeterministic machine that has a much smaller rule.
What about other cases of functions that require exponential time for deterministic machines? The functions computed by the
Something slightly different happens with
A deterministic Turing machine has a single, definite rule that’s applied at each step. In the previous sections we’ve explored what’s in a sense a minimal case of nondeterminism in Turing machines—where we allow not just one, but two different possible rules to be applied at each step. But what if we increase the nondeterminism—say by allowing more possible rules at each step?
We’ve seen that there’s a big difference between determinism—with one rule—and even our minimal case of nondeterminism, with two rules. But if we add in, say, a third rule, it doesn’t seem to typically make any qualitative difference. So what about the limiting case of adding in all conceivable rules?
We can think of what we get as an “everything machine”—a machine that has every possible rule case for any possible Turing machine, say for
Running this “everything machine” for one step starting with input 1 we get:
Four of the rule cases just lead back to the initial state. Then of the other four, two lead to halting states, and two do not. Dropping self-loops, going another couple of steps, and using a different graph rendering, we see that outputs 2 and 3 now appear:
Here are the results for input 2:
So where can the “everything machine” reach, and how long does it take? The answer is that from any input i it can eventually reach absolutely any output value j. The minimum number of steps required (i.e. the minimum path length in the multiway graph) is just the absolute lower bound that we found for runtimes in deterministic machines above:
Starting with input 1, the nondeterministic runtime to reach output j is then
which grows logarithmically with output value, or linearly with output size.
So what this means is that the “everything machine” lets one nondeterministically go from a given input to a given output in the absolutely minimum number of steps structurally possible. In other words, with enough nondeterminism every function becomes nondeterministically “easy to compute”.
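In the terms of the earlier sketch, the “everything machine” just corresponds to feeding in every possible rule case at once. Here's an illustrative way to generate those rule cases (not tied to the specific machine sizes shown in the pictures here):

```wolfram
(* Every possible rule case for an s-state, k-color Turing machine whose head
   moves one cell left or right at each step; feeding this full list to the
   multiwayGraph sketch above gives an "everything machine". *)
allRuleCases[s_Integer, k_Integer] :=
  Flatten @ Table[{st, c} -> {st2, c2, d},
    {st, s}, {c, 0, k - 1}, {st2, s}, {c2, 0, k - 1}, {d, {-1, 1}}]
```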
An important feature of the “everything machine” is that we can think of it as being a fragment of the ruliad. The full ruliad—which appears at the foundations of physics, mathematics and much more—is the entangled limit of all possible computations. There are many possible bases for the ruliad; Turing machines are one. In the full ruliad, we’d have to consider all possible Turing machines, with all possible sizes. The “everything machine” we’ve been discussing here gives us just part of that, corresponding to all possible Turing machine rules with a specific number of states and colors.
In representing all possible computations, the ruliad—like the “everything machine”—is maximally nondeterministic, so that it in effect includes all possible computational paths. But when we apply the ruliad in science (and even mathematics) we are interested not so much in its overall form as in particular slices of it which are sampled by observers that, like us, are computationally bounded. And indeed in the past few years it’s become clear that there’s a lot to say about the foundations of many fields by thinking in this way.
And one feature of computationally bounded observers is that they’re not maximally nondeterministic. Instead of following all possible paths in the multiway system, they tend to follow specific paths or bundles of paths—for example reflecting the single thread of experience that characterizes our human perception of things. So—when it comes to observers—the “everything machine” is somehow too nondeterministic. An actual (computationally bounded) observer will be concerned with one or just a few “threads of history”. In other words, if we’re interested in slices of the ruliad that observers will sample, what will be relevant is not so much the “everything machine” but rather deterministic machines, or at most machines with the kind of limited nondeterminism that we’ve studied in the past few sections.
But just how does what the “everything machine” can do compare with what all possible deterministic machines can do? In some ways, this is a core question in the comparison between determinism and nondeterminism. And it’s straightforward to start studying it empirically.
For example, here are successive steps in the multiway graph for the (
In a sense these pictures illustrate the “reach” of deterministic vs. nondeterministic computation. In this particular case, with
For
and the values that can be reached by deterministic machines are:
But how long does it take to reach these values? This shows as dots the possible (deterministic) runtimes; the filling represents the minimum (nondeterministic) runtimes for the “everything machine”:
The most dramatic outlier occurs with value 31, which is reached deterministically only by machine 1447, in 15 steps, but which can be reached in 9 (nondeterministic) steps by the “everything machine”:
For
The P vs. NP question asks whether every computation that can be done by any nondeterministic Turing machine with a runtime that increases at most polynomially with input size can also be done by some deterministic Turing machine with a runtime that also increases at most polynomially. Or, put more informally, it asks whether introducing nondeterminism can fundamentally speed up computation.
In its full form, this is an infinite question that talks about limiting behavior over all possible inputs, across all possible Turing machines. But within this infinite question there are definite, finite subquestions we can ask. And one of the things we’ve done here is in effect to explore some of these subquestions in an explicit, ruliological way. Looking at such finite subquestions won’t in any direct way resolve the full P vs. NP question.
But it can give us important intuition about the P vs. NP question, and what some of the difficulties and subtleties involved in it are. When one analyzes specific, constructed algorithms, it’s common to see that their runtimes vary quite smoothly with input size. But one of the things we’ve seen here is that for arbitrary Turing machines “in the wild”, it’s very typical for the runtimes to jump around in complicated ways. It’s also not uncommon to see dramatic outliers that occur only for very specific inputs.
If there were just one outlier, then in the limit of arbitrarily large input size it would eventually become irrelevant. But what if there were an unending sequence of outliers, of unpredictable sizes at unpredictable positions? Ultimately we expect all sorts of computational irreducibility, which can make it arbitrarily difficult to determine in any particular case the limiting behavior of the runtime—and, for example, to find out whether it’s growing like a polynomial or not.
One might imagine, though, that if one looked at enough inputs, enough Turing machines, etc. then somehow any wildness would get in some way averaged out. But our ruliological results don’t encourage that idea. And indeed they tend to show that “there’s always more wildness”, and it’s somehow ubiquitous. One might have imagined that computational irreducibility—or undecidability—would be sufficiently rare that it wouldn’t affect investigations of “global” questions like the P vs. NP one. But our results suggest that, to the contrary, there are all sorts of complicated details and “exceptions” that seem to get in the way of general conclusions.
Indeed, there seem to be issues at every turn. Some are related to unexpected behavior and outliers in runtimes. Some are related to the question of whether a particular machine ever even halts at all for certain inputs. And yet others are related to taking limits of sizes of inputs versus sizes of Turing machines, or amounts of nondeterminism. What our ruliological explorations have shown is that such issues are not obscure corner cases; rather they are generic and ubiquitous.
One has the impression, though, that they are more pronounced in deterministic than in nondeterministic machines. Nondeterministic machines in some sense “aggregate” over paths, and in doing so, wash out the “computational coincidences” which seem ubiquitous in determining the behavior of deterministic machines.
Certainly the specific experiments we’ve done on machines of limited size do seem to support the idea that there are computations that can be done quickly by a nondeterministic machine, but for which deterministic machines show at least occasional large runtime outliers, implying longer runtimes in general.
I had always suspected that the P vs. NP question would ultimately get ensnared in issues of computational irreducibility and undecidability. But from our explicit ruliological explorations we get an explicit sense of how this can happen. Will it nevertheless ultimately be possible to resolve the P vs. NP question with a finite mathematical-style proof based, say, on standard mathematical axioms? The results here make me doubt it.
Yes, it will be possible to get at least certain restricted global results—in effect by “mining” pockets of computational reducibility. And, as we have seen repeatedly here, it’s also possible to get definite results for, say, specific (ultimately finite) classes of Turing machines.
I’ve only scratched the surface here of the ruliological results that can be found. In some cases to find more just requires expending more computer time. In other cases, though, we can expect that new methodologies, particularly around “bulk” automated theorem proving, will be needed.
But what we’ve seen here already makes it clear that there is much to be learned by ruliological methods about questions of theoretical computer science—P vs. NP among them. In effect, we’re seeing that theoretical computer science can be done not only “purely theoretically”—say with methods from traditional mathematics—but also “empirically”, finding results and developing intuition by doing explicit computational experiments and enumerations.
My efforts on what I now call ruliology started at the beginning of the 1980s, and in the early years I almost exclusively studied cellular automata. A large part of the reason was just that these were the first types of simple programs I’d investigated, and in them I had made a series of discoveries. I was certainly aware of Turing machines, but viewed them as less connected than cellular automata to my goal of studying actual systems in nature and elsewhere—though ultimately theoretically equivalent.
It wasn’t until 1991, when I started systematically studying different types of simple programs as I embarked on my book A New Kind of Science that I actually began to do simulations of Turing machines. (Despite their widespread use in theoretical science for more than half a century, I think almost nobody else—from Alan Turing on—had ever actually simulated them either.) At first I wasn’t particularly enamored of Turing machines. They seemed a little less elegant than mobile automata, and had much less propensity to show interesting and complex behavior than cellular automata.
Towards the end of the 1990s, though, I was working to connect my discoveries in what became A New Kind of Science to existing results in theoretical computer science—and Turing machines emerged as a useful bridge. In particular, as part of the final chapter of A New Kind of Science—“The Principle of Computational Equivalence”—I had a section entitled “Undecidability and Intractability”. And in that section I used Turing machines as a way to explore the relation of my results to existing results on computational complexity theory.
And it was in the process of that effort that I invented the kind of one-sided Turing machines I’ve used here:
I concentrated on the s = 2, k = 2 machines (for some reason I never looked at s = 1, k = 2), and found classes of machines that compute the same function—sometimes at different speeds:

And even though the computers I was using at the time were much slower than the ones I use today, I managed to extend what I was doing to s = 3, k = 2. At every turn, though, I came face to face with computational irreducibility and undecidability. I tried quite hard to do things like resolve the exact number of distinct functions for
Nearly three decades later, I think I finally have the exact number. (Note that some of the details from A New Kind of Science are also different from what I have here, because in A New Kind of Science I included partial functions in my enumeration; here I’m mostly insisting on total functions, i.e. ones that halt and give a definite result for all inputs.)
After A New Kind of Science was released in 2002, I made another foray into Turing machines in 2007, putting up a prize on the fifth anniversary of the book for a proof (or refutation) of my suspicion that s = 2, k = 3 machine 596440 was capable of universal computation. The prize was soon won, establishing this machine as the very simplest universal Turing machine:
Many years passed. I occasionally suggested projects on Turing machines to students at the summer research program we started in 2003 (more on that later…). And I participated in celebrations of Alan Turing’s centenary in 2012. Then in 2020 we announced the Wolfram Physics Project—and I looked at Turing machines again, now as an example of a computational system that could be encoded with hypergraph rewriting, and studied using physics-inspired causal graphs, etc.:
Less than two months after the launch of our Physics Project I was studying what I now call the ruliad—and I decided to use Turing machines as a model for it:
A crucial part of this was the idea of multiway Turing machines:
I’d introduced multiway systems in A New Kind of Science, and had examples close to multiway Turing machines in the book. But now multiway Turing machines were more central to what I was doing—and in fact I started studying essentially what I’ve here called the “everything machine” (though the details were different, because I wasn’t considering Turing machines that can halt):
I also started looking at the comparison between what can be reached deterministically and nondeterministically—and discussed the potential relation of this to the P vs. NP question:
By the next year, I was expanding my study of multiway systems, and exploring many different examples—with one of them being multiway Turing machines:
Soon I realized that the general approach I was taking could be applied not only to the foundations of physics, but also to foundations of other fields. I studied the foundations of mathematics, of thermodynamics, of machine learning and of biology. But what about the foundations of theoretical computer science?
Over the years, I’d explored the ruliology of many kinds of systems studied in theoretical computer science—doing deep dives into combinators for their centenary in 2020, as well as (last year) into lambdas. In all these investigations, I was constantly seeing concrete versions of phenomena discussed in theoretical computer science—even though my emphasis tended to be different. But I was always curious what one might be able to say about central questions in theoretical computer science—like P vs. NP.
I had imagined that the principal problem in doing an empirical investigation of something like P vs. NP would just be to enumerate enough cases. But when I got into it, I realized that the shadow of computational irreducibility loomed even larger than I’d imagined—and that even within particular cases it could be irreducibly difficult to figure out what one needed to know about their behavior.
Fairly late in the project I was trying to look up some “conventional wisdom” about NP problems. Most of it was couched in rather traditional mathematical terms, and didn’t seem likely to have too much to say about what I was doing. But then I found a paper entitled “Program-size versus Time Complexity: Slowdown and Speed-up Phenomena in the Micro-cosmos of Small Turing Machines”—and I was excited to see that it was following up on what I’d done in A New Kind of Science, and doing ruliology. But then I realized: the lead author of the paper, Joost Joosten, had been an (already-a-professor) student at our summer program in 2009, and I’d in fact suggested the original version of the project (though the paper had taken it further, and in some slightly different directions than I’d anticipated).
Needless to say, what I’ve now done here raises a host of new questions, which can now be addressed by future projects done at our summer programs, and beyond….
Note: For general historical background see my related writings from 2002 and 2021.
Thanks to Willem Nielsen, Nik Murzin and Brian Mboya of the Wolfram Institute for extensive help. Thanks also to Wolfram Institute affiliate Anneline Daggelinckx, as well as to Richard Assar and Pavel Hajek of the Wolfram Institute for additional help. Work at the Wolfram Institute on this project was supported in part by the John Templeton Foundation.
Additional input on the project was provided by Lenore & Manuel Blum, Christopher Gilbert, Josh Grochow, Don Knuth and Michael Sun. Matthew Szudzik also contributed relevant work in 1999 during the development of A New Kind of Science.
2026-01-13 07:05:20
![]()
Ruliology is taking off! And more and more people are talking about it. But what is ruliology? Since I invented the term, I decided I should write something to explain it. But then I realized: I actually already wrote something back in 2021 when I first invented the term. What I wrote back then was part of something longer. But here now is the part that explains ruliology:

If one sets up a system to follow a particular set of simple rules, what will the system do? Or, put another way, how do all those simple programs out there in the computational universe of possible programs behave?
These are pure, abstract questions of basic science. They’re questions one’s led to ask when one’s operating in the computational paradigm that I describe in A New Kind of Science. But at some level they’re questions about the specific science of what abstract rules (that we can describe as programs) do.
What is that science? It’s not computer science, because that would be about programs we construct for particular purposes, rather than ones that are just “out there in the wilds of the computational universe”. It’s not (as such) mathematics, because it’s all about “seeing what rules do” rather than finding frameworks in which things can be proved. And in the end, it’s clear it’s actually a new science—that’s rich and broad, and that I, at least, have had the pleasure of practicing for forty years.
But what should this science be called? I’ve wondered about this for decades. I’ve filled so many pages with possible names. Could it be based on Greek or Latin words associated with rules? Those are arch- and reg-: very well-trafficked roots. What about words associated with computation? That’d be logis- or calc-. None of these seem to work. But—in something akin to the process of metamodeling—we can ask: What is the essence of what we want to communicate in the word?
It’s all about studying rules, and what their consequences are. So why not the simple and obvious “ruliology”? Yes, it’s a new and slightly unusual-sounding word. But I think it does well at communicating what this science that I’ve enjoyed for so long is about. And I, for one, will be pleased to call myself a “ruliologist”.
But what is ruliology really about? It’s a pure, basic science—and a very clean and precise one. It’s about setting up abstract rules, and then seeing what they do. There’s no “wiggle room”. No issue with “reproducibility”. You run a rule, and it does what it does. The same every time.
What does the rule 73 cellular automaton starting from a single black cell do? What does some particular Turing machine do? What about some particular multiway string substitution system? These are specific questions of ruliology.
At first you might just do the computation, and visualize the result. But maybe you notice some particular feature. And then you can use whatever methods it takes to get a specific ruliological result—and to establish, for example, that in the rule 73 pattern, black cells appear only in odd-length blocks.
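As a tiny example of what such a check can look like in Wolfram Language, here's a finite empirical test of that statement over the first few hundred steps (an experiment, not a proof):

```wolfram
(* Lengths of the runs of black (1) cells in a cellular automaton row. *)
blackRunLengths[row_] := Length /@ Select[Split[row], First[#] == 1 &];

(* Check that every black block in the first 300 steps of rule 73, grown from
   a single black cell, has odd length. *)
AllTrue[Flatten[blackRunLengths /@ CellularAutomaton[73, {{1}, 0}, 300]], OddQ]
```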
Ruliology tends to start with specific cases of specific rules. But then it generalizes, looking at broader ranges of cases for a particular rule, or whole classes of rules. And it always has concrete things to do—visualizing behavior, measuring specific features, and so on.
But ruliology quickly comes face to face with computational irreducibility. What does some particular case of some particular rule eventually do? That may require an irreducible amount of computational effort to find out—and if one insists on knowing what amounts to a general truly infinite-time result, it may be formally undecidable. It’s the same story with looking at different cases of a rule, or different rules. Is there any case that does this? Or any rule that does it?
What’s remarkable to me—even after 40 years of ruliology—is how many surprises there end up being. You have some particular kind of rule. And it looks as if it’s only going to behave in some particular way. But no, eventually you find a case where it does something completely different, and unexpected. And, yes, this is in effect computational irreducibility reaching into what one’s seeing.
Sometimes I’ve thought of ruliology as being at first a bit like natural history. You’re exploring the world of simple programs, finding what strange creatures exist in it—and capturing them for study. (And, yes, in actual biological natural history, the diversity of what one sees is presumably at its core exactly the same computational phenomenon we see in abstract ruliology.)
So how does ruliology relate to complexity? It’s a core part—and in fact the most fundamental part—of studying the foundations of complexity. Ruliology is like studying complexity at its ultimate source. And about seeing just how complexity is generated from its simplest origins.
Ruliology is what builds raw material—and intuition—for making models. It’s what shows us what’s possible in the computational universe, and what we can use to model—and understand—the systems we study.
In metamodeling we’re going from models that have been constructed, and drilling down to see what’s underneath them. In ruliology we’re in a sense going the other way, building up from the minimal foundations to see what can happen.
In some ways, ruliology is like natural science. It’s taking the computational universe as an abstracted analog of nature, and studying how things work in it. But in other ways, ruliology is something more generative than natural science: because within the science itself, it’s thinking not only about what is, but also about what can abstractly be generated.
Ruliology in some ways starts as an experimental science, and in some ways is abstract and theoretical from the beginning. It’s experimental because it’s often concerned with just running simple programs and seeing what they do (and in general, computational irreducibility suggests you often can’t do better). But it’s abstract and theoretical in the sense that what’s being run is not some actual thing in the natural world, with all its details and approximations, but something completely precise, defined and computational.
Like natural science, ruliology starts from observations—but then builds up to theories and principles. Long ago I found a simple classification of cellular automata (starting from random initial conditions)—somehow reminiscent of identifying solids, liquids and gases, or different kingdoms of organisms. But beyond such classifications, there are also much broader principles—with the most important, I believe, being the Principle of Computational Equivalence.
The everyday course of doing ruliology doesn’t require engaging directly with the whole Principle of Computational Equivalence. But throughout ruliology, the principle is crucial in guiding intuition, and having an idea of what to expect. And, by the way, it’s from ruliology that we can get evidence (like the universality of rule 110, and of the 2,3 Turing machine) for the broad validity of the principle.
I’ve been doing ruliology (though not by that name) for forty years. And I’ve done a lot of it. In fact, it’s probably been my top methodology in everything I’ve done in science. It’s what led me to understand the origins of complexity, first in cellular automata. It’s what led me to formulate the general ideas in A New Kind of Science. And it’s what gave me the intuition and impetus to launch our new Physics Project.
I find ruliology deeply elegant, and satisfying. There’s something very aesthetic—at least to me—about the purity of just seeing what simple rules do. (And it doesn’t hurt that they often make very pleasing images.) It’s also satisfying when one can go from so little and get so much—and do so automatically, just by running something on a computer.
And I also like the fundamental permanence of ruliology. If one’s dealing with the simplest rules of some type, they’re going to be foundational not only now, but forever. It’s like simple mathematical constructs—like the icosahedron. There were icosahedral dice in ancient Egypt. But when we find them today, their shapes still seem completely modern—because the icosahedron is something fundamental and timeless. Just like the rule 30 pattern or countless other discoveries in ruliology.
In a sense perhaps one of the biggest surprises is that ruliology is such a comparatively new activity. But as I cataloged in A New Kind of Science, it has precursors going back hundreds and perhaps thousands of years. Without the whole paradigm of A New Kind of Science, though, there wasn’t a context for understanding why ruliology is so significant.
So what constitutes a good piece of ruliology? I think it’s all about simplicity and minimality. The best ruliology happens after metamodeling is finished—and one’s really dealing with the simplest, most minimal class of rules of some particular type. In my efforts to do ruliology, for example in A New Kind of Science, I like to be able to “explain” the rules I’m using just by an explicit diagram, if possible with no words needed.
Then it’s important to show what the rules do—as explicitly as possible. Sometimes—as in cellular automata—there’s a very obvious visual representation that can be used. But in other cases it’s important to do the work to find some scheme for visualization that’s as explicit as possible, and that both shows the whole of what’s going on and doesn’t introduce distracting or arbitrary additional elements.
It’s amazing how often in doing ruliology I’ll end up making an array of thumbnail images of how certain rules behave. And, again, the explicitness of this is important. Yes, one often wants to do various kinds of filtering, say of rules. But in the end I’ve found that one needs to just look at what happens. Because that’s the only way to successfully notice the unexpected, and to get a sense of the irreducible complexity of what’s out there in the computational universe of possible rules.
When I see papers that report what amounts to ruliology, I always like it when there are explicit pictures. I’m disappointed if all I see are formal definitions, or plots with curves on them. It’s an inevitable consequence of computational irreducibility that in doing good ruliology, one has to look at things more explicitly.
One of the great things about ruliology as a field of study is how easy it is to explore new territory. The computational universe contains an infinite number of possible rules. And even among ones that one might consider “simple”, there are inevitably astronomically many on any human scale. But, OK, if one explores some particular ruliological system, what of it?
It’s a bit like chemistry where one explores properties of some particular molecule. Exploring some particular class of rules, you may be lucky enough to come upon some new phenomenon, or understand some new general principle. But what you know you’ll be doing is systematically adding to the body of knowledge in ruliology.
Why is that important? For a start, ruliology is what provides the raw material for making models, so you’re in effect creating a template for some potential future model. And in addition, when it comes to technology, an important approach that I’ve discussed (and used) quite extensively involves “mining” the computational universe for “technologically useful” programs. And good ruliology is crucial in helping to make that feasible.
It’s a bit like creating technology in the physical universe. It was crucial, for example, that good physics and chemistry had been done on liquid crystals. Because that’s what allowed them to be identified—and used—in making displays.
Beyond its “pragmatic” value for models and for technology, another thing ruliology does is to provide “empirical raw material” for making broader theories about the computational universe. When I discovered the Principle of Computational Equivalence, it was as a result of several years of detailed ruliology on particular types of rules. And good ruliology is what prepares and catalogs examples from which theoretical advances can be made.
It’s worth mentioning that there’s a certain tendency to want to “nail down ruliology” using, for example, mathematics. And sometimes it’s possible to derive a nice summary of ruliological results using, say, some piece of discrete mathematics. But it’s remarkable how quickly the mathematics tends to get out of hand, with even a very simple rule having behavior that can only be captured by large amounts of obscure mathematics. But of course that’s in a sense just computational irreducibility rearing its head. And showing that mathematics is not the methodology to use—and that instead something new is needed. Which is precisely where ruliology comes in.
I’ve spent many years defining the character and subject matter of what I’m now calling ruliology. But there’s something else I’ve done too, which is to build a large tower of practical technology for actually doing ruliology. It’s taken more than forty years to build up to what’s now the full-scale computational language that is the Wolfram Language. But all that time, I was using what we were building to do ruliology.
The Wolfram Language is great and important for many things. But when it comes to ruliology, it’s simply a perfect fit. Of course it’s got lots of relevant built-in features. Like visualization, graph manipulation, etc., as well as immediate support for systems like cellular automata, substitution systems and Turing machines. But what’s even more important is that its fundamental symbolic structure gives it an explicit way to represent—and run—essentially any computational rule.
In doing practical ruliological explorations—and for example searching the computational universe—it’s also useful to have immediate support for things like parallel computation. But another crucial aspect of the Wolfram Language for doing practical ruliology is the concept of notebooks and computable documents. Notebooks let one organize both the process of research and the presentation of its results.
I’ve been accumulating research notebooks about ruliology for more than 30 years now—with textual notes, images of behavior, and code. And it’s a great thing. Because the stability of the Wolfram Language (and its notebook format) means that I can immediately go back to something I did 30 years ago, run the code, and build on it. And when it comes to presenting results, I can do it as a computational essay, created in a notebook—in which the task of exposition is shared between text, pictures and computational language code.
In a traditional technical paper based on the mathematical paradigm, the formal part of the presentation will normally use mathematical notation. But for ruliology (as for “computational X” fields) what one needs instead is computational notation, or rather computational language—which is exactly what the Wolfram Language provides. And in a good piece of ruliology—and ruliology presentation—the notation should be simple, clear and elegant. And because it’s in computational language, it’s not just something people read; it’s also something that can immediately be executed or integrated somewhere else.
What should the future of ruliology be? It’s a huge, wide-open field. In which there are many careers to be made, and immense numbers of papers and theses and books that can be written—that will build up a body of knowledge that advances not just the pure, basic science of the computational universe but also all the science and technology that flows from it.


2025-12-03 03:23:40
![]()

To immediately enable Wolfram Compute Services in Version 14.3 Wolfram Desktop systems, run
(The functionality is automatically available in the Wolfram Cloud.)
Let’s say you’ve done a computation in Wolfram Language. And now you want to scale it up. Maybe 1000x or more. Well, today we’ve released an extremely streamlined way to do that. Just wrap the scaled up computation in RemoteBatchSubmit and off it’ll go to our new Wolfram Compute Services system. Then—in a minute, an hour, a day, or whatever—it’ll let you know it’s finished, and you can get its results.
For decades I’ve often needed to do big, crunchy calculations (usually for science). With large volumes of data, millions of cases, rampant computational irreducibility, etc. I probably have more compute lying around my house than most people—these days about 200 cores worth. But many nights I’ll leave all of that compute running, all night—and I still want much more. Well, as of today, there’s an easy solution—for everyone: just seamlessly send your computation off to Wolfram Compute Services to be done, at basically any scale.
For nearly 20 years we’ve had built-in functions like ParallelMap and ParallelTable in Wolfram Language that make it immediate to parallelize subcomputations. But for this to really let you scale up, you have to have the compute. Which now—thanks to our new Wolfram Compute Services—everyone can immediately get.
The underlying tools that make Wolfram Compute Services possible have existed in the Wolfram Language for several years. But what Wolfram Compute Services now does is to pull everything together to provide an extremely streamlined all-in-one experience. For example, let’s say you’re working in a notebook and building up a computation. And finally you give the input that you want to scale up. Typically that input will have lots of dependencies on earlier parts of your computation. But you don’t have to worry about any of that. Just take the input you want to scale up, and feed it to RemoteBatchSubmit. Wolfram Compute Services will automatically take care of all the dependencies, etc.
And another thing: RemoteBatchSubmit, like every function in Wolfram Language, is dealing with symbolic expressions, which can represent anything—from numerical tables to images to graphs to user interfaces to videos, etc. So that means that the results you get can immediately be used, say in your Wolfram Notebook, without any importing, etc.
OK, so what kinds of machines can you run on? Well, Wolfram Compute Services gives you a bunch of options, suitable for different computations, and different budgets. There’s the most basic 1 core, 8 GB option—which you can use to just “get a computation off your own machine”. You can pick a machine with larger memory—currently up to about 1500 GB. Or you can pick a machine with more cores—currently up to 192. But if you’re looking for even larger-scale parallelism, Wolfram Compute Services can deal with that too. Because RemoteBatchMapSubmit can map a function across any number of elements, running on any number of cores, across multiple machines.
OK, so here’s a very simple example—that happens to come from some science I did a little while ago. Define a function PentagonTiling that randomly adds nonoverlapping pentagons to a cluster:
For 20 pentagons I can run this quickly on my machine:
But what about for 500 pentagons? Well, the computational geometry gets difficult and it would take long enough that I wouldn’t want to tie up my own machine doing it. But now there’s another option: use Wolfram Compute Services!
And all I have to do is feed my computation to RemoteBatchSubmit:
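Here's a sketch of the kind of call involved (the exact arguments used aren't reproduced; PentagonTiling[500] assumes the function above takes the number of pentagons, and giving the job a name with RemoteJobName, as described later, is just for convenience):

```wolfram
(* Submit the 500-pentagon computation as a batch job; the returned job object
   can later be queried for its status and its result. *)
job = RemoteBatchSubmit[PentagonTiling[500], RemoteJobName -> "pentagons-500"]
```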
Immediately, a job is created (with all necessary dependencies automatically handled). And the job is queued for execution. And then, a couple of minutes later, I get an email:

Not knowing how long it’s going to take, I go off and do something else. But a while later, I’m curious to check how my job is doing. So I click the link in the email and it takes me to a dashboard—and I can see that my job is successfully running:

I go off and do other things. Then, suddenly, I get an email:

It finished! And in the mail is a preview of the result. To get the result as an expression in a Wolfram Language session I just evaluate a line from the email:
And this is now a computable object that I can work with, say computing areas
or counting holes:
One of the great strengths of Wolfram Compute Services is that it makes it easy to use large-scale parallelism. You want to run your computation in parallel on hundreds of cores? Well, just use Wolfram Compute Services!
Here’s an example that came up in some recent work of mine. I’m searching for a cellular automaton rule that generates a pattern with a “lifetime” of exactly 100 steps. Here I’m testing 10,000 random rules—which takes a couple of seconds, and doesn’t find anything:
To test 100,000 rules I can use ParallelSelect and run in parallel, say across the 16 cores in my laptop:
Still nothing. OK, so what about testing 100 million rules? Well, then it’s time for Wolfram Compute Services. The simplest thing to do is just to submit a job requesting a machine with lots of cores (here 192, the maximum currently offered):
A few minutes later I get mail telling me the job is starting. After a while I check on my job and it’s still running:

I go off and do other things. Then, after a couple of hours I get mail telling me my job is finished. And there’s a preview in the email that shows, yes, it found some things:

I get the result:
And here they are—rules plucked from the hundred million tests we did in the computational universe:
But what if we wanted to get this result in less than a couple of hours? Well, then we’d need even more parallelism. And, actually, Wolfram Compute Services lets us get that too—using RemoteBatchMapSubmit. You can think of RemoteBatchMapSubmit as a souped-up analog of ParallelMap—mapping a function across a list of any length, splitting up the necessary computations across cores that can be on different machines, and handling the data and communications involved in a scalable way.
Because RemoteBatchMapSubmit is a “pure Map” we have to rearrange our computation a little—making it run 100,000 cases of selecting from 1000 random instances:
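Schematically, the rearranged computation looks something like this (testRandomRules is a hypothetical stand-in for the actual rule-testing code above, and the single-argument form of RemoteBatchMapSubmit is assumed to use the default Wolfram Compute Services environment):

```wolfram
(* Placeholder for the actual "select from 1000 random rules" code. *)
testRandomRules[_] := {};

(* Map the test over 100,000 cases; Wolfram Compute Services splits the cases
   into child jobs that can run on many cores across many machines. *)
job = RemoteBatchMapSubmit[testRandomRules, Range[100000]]
```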
The system decided to distribute my 100,000 cases across 316 separate “child jobs”, here each running on its own core. How is the job doing? I can get a dynamic visualization of what’s happening:
And it doesn’t take many minutes before I’m getting mail that the job is finished:

And, yes, even though I only had to wait for 3 minutes to get this result, the total amount of computer time used—across all the cores—is about 8 hours.
Now I can retrieve all the results, using Catenate to combine all the separate pieces I generated:
And, yes, if I wanted to spend a little more, I could run a bigger search, increasing the 100,000 to a larger number; RemoteBatchMapSubmit and Wolfram Compute Services would seamlessly scale up.
Like everything around Wolfram Language, Wolfram Compute Services is fully programmable. When you submit a job, there are lots of options you can set. We already saw the option RemoteMachineClass which lets you choose the type of machine to use. Currently the choices range from "Basic1x8" (1 core, 8 GB) through "Basic4x16" (4 cores, 16 GB) to “parallel compute” "Compute192x384" (192 cores, 384 GB) and “large memory” "Memory192x1536" (192 cores, 1536 GB).
Different classes of machine cost different numbers of credits to run. And to make sure things don’t go out of control, you can set the options TimeConstraint (maximum time in seconds) and CreditConstraint (maximum number of credits to use).
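So a submission with explicit caps might look something like this (the option names are as described here; the values, and the bigSearch function, are purely illustrative):

```wolfram
(* Run on a 4-core machine, capped at one hour of runtime and 50 credits. *)
RemoteBatchSubmit[bigSearch[],
  RemoteMachineClass -> "Basic4x16",
  TimeConstraint -> 3600,
  CreditConstraint -> 50]
```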
Then there’s notification. The default is to send one email when the job is starting, and one when it’s finished. There’s an option RemoteJobName that lets you give a name to each job, so you can more easily tell which job a particular piece of email is about, or where the job is on the web dashboard. (If you don’t give a name to a job, it’ll be referred to by the UUID it’s been assigned.)
The option RemoteJobNotifications lets you say what notifications you want, and how you want to receive them. There can be notifications whenever the status of a job changes, or at specific time intervals, or when specific numbers of credits have been used. You can get notifications either by email, or by text message. And, yes, if you get notified that your job is going to run out of credits, you can always go to the Wolfram Account portal to top up your credits.
There are many properties of jobs that you can query. A central one is "EvaluationResult". But, for example, "EvaluationData" gives you a whole association of related information:
If your job succeeds, it’s pretty likely "EvaluationResult" will be all you need. But if something goes wrong, you can easily drill down to study the details of what happened with the job, for example by looking at "JobLogTabular".
If you want to know all the jobs you’ve initiated, you can always look at the web dashboard, but you can also get symbolic representations of the jobs from:
For any of these job objects, you can ask for properties, and you can for example also apply RemoteBatchJobAbort to abort them.
Once a job has completed, its result will be stored in Wolfram Compute Services—but only for a limited time (currently two weeks). Of course, once you’ve got the result, it’s very easy to store it permanently, for example, by putting it into the Wolfram Cloud using CloudPut[expr]. (If you know you’re going to want to store the result permanently, you can also do the CloudPut right inside your RemoteBatchSubmit.)
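For example, something like this (the cloud object name is just illustrative):

```wolfram
(* Compute remotely, and store the result permanently in the Wolfram Cloud as
   part of the job itself. *)
RemoteBatchSubmit[CloudPut[PentagonTiling[500], "pentagon-tiling-500"]]
```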
Talking about programmatic uses of Wolfram Compute Services, here’s another example: let’s say you want to generate a compute-intensive report once a week. Well, then you can put together several very high-level Wolfram Language functions to deploy a scheduled task that will run in the Wolfram Cloud to initiate jobs for Wolfram Compute Services:
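Schematically, that can look something like this (weeklyReport is a hypothetical function that generates the report; the exact combination used for the original example isn't reproduced):

```wolfram
(* Deploy a cloud task that, once a week, submits a compute-intensive report
   job to Wolfram Compute Services. *)
CloudDeploy[
  ScheduledTask[RemoteBatchSubmit[weeklyReport[]], Quantity[1, "Weeks"]],
  "weekly-report-task"]
```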
And, yes, you can initiate a Wolfram Compute Services job from any Wolfram Language system, whether on the desktop or in the cloud.
Wolfram Compute Services is going to be very useful to many people. But actually it’s just part of a much larger constellation of capabilities aimed at broadening the ways Wolfram Language can be used.
Mathematica and the Wolfram Language started—back in 1988—as desktop systems. But even at the very beginning, there was a capability to run the notebook front end on one machine, and then have a “remote kernel” on another machine. (In those days we supported, among other things, communication via phone line!) In 2008 we introduced built-in parallel computation capabilities like ParallelMap and ParallelTable. Then in 2014 we introduced the Wolfram Cloud—both replicating the core functionality of Wolfram Notebooks on the web, and providing services such as instant APIs and scheduled tasks. Soon thereafter, we introduced the Enterprise Private Cloud—a private version of Wolfram Cloud. In 2021 we introduced Wolfram Application Server to deliver high-performance APIs (and it’s what we now use, for example, for Wolfram|Alpha). Along the way, in 2019, we introduced Wolfram Engine as a streamlined server and command-line deployment of Wolfram Language. Around Wolfram Engine we built WSTPServer to serve Wolfram Engine capabilities on local networks, and we introduced WolframScript to provide a deployment-agnostic way to run command-line-style Wolfram Language code. In 2020 we then introduced the first version of RemoteBatchSubmit, to be used with cloud services such as AWS and Azure. But unlike with Wolfram Compute Services, this required “do it yourself” provisioning and licensing with the cloud services. And, finally, now, that’s what we’ve automated in Wolfram Compute Services.
OK, so what’s next? An important direction is the forthcoming Wolfram HPCKit—for organizations with their own large-scale compute facilities to set up their own back ends to RemoteBatchSubmit, etc. RemoteBatchSubmit is built in a very general way that allows different “batch computation providers” to be plugged in. Wolfram Compute Services is initially set up to support just one standard batch computation provider: "WolframBatch". HPCKit will allow organizations to configure their own compute facilities (often with our help) to serve as batch computation providers, extending the streamlined experience of Wolfram Compute Services to on-premise or organizational compute facilities, and automating what is often a rather fiddly process of job submission (which, I must say, personally reminds me a lot of the mainframe job control systems I used in the 1970s).
Wolfram Compute Services is currently set up purely as a batch computation environment. But within the Wolfram System, we have the capability to support synchronous remote computation, and we’re planning to extend Wolfram Compute Services to offer this—allowing one, for example, to seamlessly run a remote kernel on a large or exotic remote machine.
But this is for the future. Today we’re launching the first version of Wolfram Compute Services. Which makes “supercomputer power” immediately available for any Wolfram Language computation. I think it’s going to be very useful to a broad range of users of Wolfram Language. I know I’m going to be using it a lot.
2025-11-11 23:19:17
![]()

It’s a key feature of living systems, perhaps even in some ways the key feature: that even right down to a molecular scale, things are orchestrated. Molecules (or at least large ones) don’t just move around randomly, like in a liquid or a gel. Instead, what molecular biology has discovered is that there are endless active mechanisms that in effect orchestrate what even individual molecules in living systems do. But what is the result of all that orchestration? And could there perhaps be a general characterization of what happens in systems that exhibit such “bulk orchestration”? I’ve been wondering about these questions for some time. But finally now I think I may have the beginnings of some answers.
The central idea is to consider the effect that “being adapted for an overall purpose” has on the underlying operation of a system. At the outset, one might imagine that there’d be no general answer to this, and that it would always depend on the specifics of the system and the purpose. But what we’ll discover is that there is in fact typically a certain universality in what happens. Its ultimate origin is the Principle of Computational Equivalence and certain universal features of the phenomenon of computational irreducibility that it implies. But the point is that so long as a purpose is somehow “computationally simple” then—more or less regardless of what in detail the purpose is—a system that achieves it will show certain features in its behavior.
Whenever there’s computational irreducibility we can think of it as exerting a powerful force towards unpredictability and randomness (as it does, for example, in the Second Law of thermodynamics). So for a system to achieve an overall “computationally simple purpose” this computational irreducibility must in some sense be tamed, or at least contained. And in fact it’s an inevitable feature of computational irreducibility that within it there must be “pockets of computational reducibility” where simpler behavior occurs. And at some level the way computationally simple purposes must be achieved is by tapping into those pockets of reducibility.
When there’s computational irreducibility it means that there’s no simple narrative one can expect to give of how a system behaves, and no overall “mechanism” one can expect to identify for it. But one can think of pockets of computational reducibility as corresponding to at least small-scale “identifiable mechanisms”. And what we’ll discover is that when there’s a “simple overall purpose” being achieved these mechanisms tend to become more manifest. And this means that when a system is achieving an overall purpose there’s a trace of this even down in the detailed operation of the system. And that trace is what I’ve called “mechanoidal behavior”—behavior in which there are at least small-scale “mechanism-like phenomena” that we can think of as acting together through “bulk orchestration” to achieve a certain overall purpose.
The concept that there might be universal features associated with the interplay between some kind of overall simplicity and underlying computational irreducibility is not an unfamiliar one. Indeed, in various forms it’s the ultimate key to our recent progress in the foundations of physics, mathematics and, in fact, biology.
In our effort to get a general understanding of bulk orchestration and the behavior of systems that “achieve purposes” there’s an analogy we can make to statistical mechanics. One might have imagined that to reach conclusions about, say, gases, we’d have to have detailed information about the motion of molecules. But in fact we know that if we just consider the whole ensemble of possible configurations of molecules, then by taking statistical averages we can deduce all sorts of properties of gases. (And, yes, the foundational reason this works we can now understand in terms of computational irreducibility, etc.)
So could we perhaps do something similar for bulk orchestration? Is there some ensemble we can identify in which wherever we look there will with overwhelming probability be certain properties? In the statistical mechanics of gases we imagine that the underlying laws of mechanics are fixed, but there’s a whole ensemble of possible initial configurations for the molecules—almost all of which turn out to have the same limiting features. But in biology, for example, we can think of different genomes as defining different rules for the development and operation of organisms. And so now what we want is a new kind of ensemble—that we can call a rulial ensemble: an ensemble of possible rules.
But in something like biology, it’s not all possible rules we want; rather, it’s rules that have been selected to “achieve the purpose” of making a successful organism. We don’t have a general way to characterize what defines biological fitness. But the key point here is that at a fundamental level we don’t need that. Instead it seems that just knowing that our fitness function is somehow computationally simple tells us enough to be able to deduce properties of our “rulial ensemble”.
But are the fitness functions of biology in fact computationally simple? I’ve recently argued that their simplicity is precisely what makes biological evolution possible. Like so many other foundational phenomena, it seems that biological evolution is a reflection of the interplay between computational simplicity—in the case of fitness functions—and underlying computational irreducibility. (In effect, organisms have a chance only because the problems they have to solve aren’t too computationally difficult.) But now we can use the simplicity of fitness functions—without knowing any more details about them—to make conclusions about the relevant rulial ensemble, and from there to begin to derive general principles associated with bulk orchestration.
When one describes why a system does what it does, there are two different approaches one can take. One can go from the bottom up and describe the underlying rules by which the system operates (in effect, its “mechanism”). Or one can go from the top down and describe what the system “achieves” (in effect, its “goal” or “purpose”). It tends to be difficult to mix these two approaches. But what we’re going to find here is that by thinking in terms of the rulial ensemble we will be able to see the general pattern of both the “upward” effect of underlying rules, and the “downward” effect of overall purposes.
What I’ll do here is just a beginning—a first exploration, both computational and conceptual, of the rulial ensemble and its consequences. But already there seem to be strong indications that by thinking in terms of the rulial ensemble one may be able to develop what amounts to a foundational theory of bulk orchestration, and with it a foundational theory of certain aspects of adaptive systems, and most notably biology.
To begin our explorations and start developing our intuition let’s look at a few simple examples. We’ll use the same basic framework as in my recent work on biological evolution. The idea is to have cellular automata whose rules serve as an idealization of genotypes, and whose behavior serves as an idealization of the development of phenotypes. For our fitness function we’re going to start with something very specific: that after 50 steps (starting from a single-cell “seed”) our cellular automaton should generate “output” that consists of three equal blocks of cells colored red, blue and yellow.
Here are some examples of rules that “solve this particular problem”, in various different ways:
How can such rules be found? Well, we can use an idealization of biological evolution. Let’s consider the first rule above:
We start, say, from a null rule, then make successive random point mutations in the rule
keeping those mutations that don’t take us further from achieving our goal (and dropping those that do, each indicated here by a red dot):
The result is that we “progressively adapt” over the course of a few thousand mutations to successively reduce the number of “errors” and eventually (and in this case, perfectly) achieve our goal:
Early in this sequence there’s lots of computational irreducibility in evidence. But in the process of adapting to achieve our goal the computational irreducibility progressively gets “contained” and in effect squeezed out, until eventually the final solution in this case has an almost completely simple structure.
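To make the setup concrete, here is a minimal Wolfram Language sketch of this kind of single-point-mutation adaptive evolution for k = 4, r = 1 rules. The particular target encoding, the color coding and the acceptance rule are illustrative assumptions, not necessarily the exact setup used for the pictures here.

```wolfram
k = 4; steps = 50;  (* 4-color, range-1 rules, run for 50 steps from a single cell *)

(* illustrative target: three roughly equal blocks of colors 1, 2, 3 across the width-101 output *)
target = Flatten[{ConstantArray[1, 34], ConstantArray[2, 34], ConstantArray[3, 33]}];

(* a rule is a list of k^3 = 64 outcomes; the last entry is the 0,0,0 case *)
phenotype[rule_] := Last[CellularAutomaton[{FromDigits[rule, k], k, 1}, {{1}, 0}, steps]]
errors[rule_] := HammingDistance[phenotype[rule], target]

(* point mutation: change one outcome, keeping the 0,0,0 case fixed so the background stays blank *)
mutate[rule_] := ReplacePart[rule,
   RandomInteger[{1, Length[rule] - 1}] -> RandomInteger[{0, k - 1}]]

(* accept a mutation only if it doesn't increase the number of errors *)
adapt[nSteps_] := Module[{rule = ConstantArray[0, k^3], new},
  Do[new = mutate[rule]; If[errors[new] <= errors[rule], rule = new], {nSteps}];
  rule]
```

Something like ArrayPlot[CellularAutomaton[{FromDigits[adapt[10000], 4], 4, 1}, {{1}, 0}, 50]] then displays the kind of adaptively evolved pattern pictured above.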
The example we’ve just seen succeeds in exactly achieving the objective we defined—though it takes about 4000 steps of adaptive evolution to do so. But if we limit the number of steps of adaptive evolution then in general we won’t be able to reach the exact objective we’ve defined. Here, for example, are some results we get with 10,000 steps of adaptive evolution (sorted by how close they get):
In all cases there’s a certain amount of “identifiable mechanism” to what these rules do. Yes, there can be patches of complex—and presumably computationally irreducible—behavior. But particularly the rules that do better at achieving our exact objective tend to “keep this computational irreducibility at bay”, and emphasize their “simple mechanisms”.
So what we see is that among all possible rules, those that get even close to achieving the “purpose” we have set in effect show at least “patches of mechanism”. In other words, the imposition of a purpose selects out rules that “exhibit mechanism”, and show what we’ve called mechanoidal behavior.
In the last section we saw that adaptive evolution can find (4-color) cellular automaton rules that generate the particular
output we specified. But what about other kinds of output?
What we manage to get will depend on how much “effort” of adaptive evolution we put in. If we limit ourselves to 10,000 steps of adaptive evolution here are some examples of what happens:
At a qualitative level the main takeaway here is that it seems that the “simpler” the objective is, the more closely it’s likely to be achieved by a given number of steps of adaptive evolution.
Looking at the first (i.e. “ultimately most successful”) examples above, here’s how they’re reached in the course of adaptive evolution:
And we see that the “simpler” sequences are reached both more successfully and more quickly; in effect they seem to be “easier” for adaptive evolution to generate.
But what do we mean by “simpler” here? In qualitative terms we can think of the “simplicity” of a sequence as being characterized by how short a description we can find of it. We might try to compress the sequence, say using some standard practical compression technique, like run-length encoding, block encoding or dictionary encoding. And for the sequences we’re using above, these will (mostly) agree about what’s simpler and what’s not. And then what we find is that sequences that are “simpler” in this kind of characterization tend to be ones that are easier for adaptive evolution to produce.
But, actually, our study of adaptive evolution itself gives us a way to characterize the simplicity—or complexity—of a sequence: we can consider a sequence more complex if the typical number of mutations it takes to come up with a rule to generate the sequence is larger. And we can define this number of mutations to be what we can call the “mutational complexity” of a sequence.
There are lots of details in tightening up this definition. But in some sense what we’re saying is that we can characterize the complexity of a sequence by how hard it is for adaptive evolution to get it generated.
To get more quantitative we have to address the issue that if we run adaptive evolution multiple times, it’ll generally take different numbers of steps to be able to get a particular sequence generated, or, say, to get to the point where there are fewer than m “errors” in the generated sequence. And, sometimes, by the way, a particular run of adaptive evolution might “get stuck” and never be able to generate a particular sequence—at least (as we’ll discuss below) with the kind of rules and single point mutations we’re using here.
But we can still compute the probability—across many runs of adaptive evolution—to have reached a specified sequence within m errors after a certain number of steps. And this shows how that probability builds up for the sequences we saw above:
And we immediately see more quantitatively that some sequences are faster and easier to reach than others.
We can go even further by computing in each case the median number of adaptive steps needed to get “within m errors” of each sequence:
Picking a certain “desired fidelity” (say allowing a maximum of 20 errors) we then get at least one estimate of mutational complexity for our sequences:
Needless to say, these kinds of numerical measures are at best a very coarse way to characterize the difficulty of being able to generate a given sequence through rules produced by adaptive evolution. And, for example, instead of just looking at the probability to reach a given “fidelity” we could be looking at all sorts of distributions and correlations. But in developing our intuition about the rulial ensemble it’s useful to see how we can derive even an admittedly coarse specific numerical measure of the “difficulty of reaching a sequence” through adaptive evolution.
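For reference, here is one rough way such a measure could be computed, reusing the errors and mutate functions sketched earlier. The handling of runs that never reach the target is just one of several possible conventions.

```wolfram
(* number of adaptive steps taken before getting within m errors of the target,
   capped at maxSteps; gives Missing if the run never gets there *)
stepsToReach[m_, maxSteps_] := Module[{rule = ConstantArray[0, 64], n = 0},
  While[errors[rule] > m && n < maxSteps,
   With[{new = mutate[rule]}, If[errors[new] <= errors[rule], rule = new]]; n++];
  If[errors[rule] <= m, n, Missing["NotReached"]]]

(* a coarse mutational-complexity estimate: the median over many runs *)
Median[DeleteMissing[Table[stepsToReach[20, 10000], {100}]]]
```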
We’ve seen above that some sequences are easier to reach by adaptive evolution than others. But can any sequence we might look for actually be found at all? In other words—regardless of whether adaptive evolution can find it—is there in fact any cellular automaton rule at all (say a 4-color one) that successfully generates any given sequence?
It’s easy to see that in the end there must be sequences that can’t be generated in this way. There are 4^(4^3) = 4^64 ≈ 3 × 10^38 possible 4-color cellular automaton rules. But even though that’s a large number, the number of possible 4-color length-101 sequences is still much larger: 4^101 ≈ 10^61. So that means it’s inevitable that some of these sequences will not appear as the output from any 4-color cellular automaton rule (run from a single-cell initial condition for 50 steps). (We can think of such sequences as having too high an “algorithmic complexity” to be generated from a “program” as short as a 4-color cellular automaton rule.)
But what about sequences that are “simple” with respect to our qualitative criteria above? Whenever we succeeded above in finding them by adaptive evolution then we obviously know they can be generated. But in general this is a quintessentially computationally irreducible question—so that in effect the only way to know for sure whether there’s any rule that can generate a particular sequence is just to explicitly search through all ≈ 10^38 possible rules.
We can get some intuition, however, by looking at much simpler cases. Consider, for example, the 128 (quiescent) “elementary” cellular automata (with k = 2, r = 1):
The number of distinct sequences they can generate after t steps quickly stabilizes (the maximum is 32)
but this is soon much smaller than the total number of possible sequences of the same length (2^(2t+1)):
So what are the sequences that get “left out” by cellular automata? Already by step 2 a quiescent elementary cellular automaton can only produce 14 of the 32 possible sequences—with sequences such as
and
being among those excluded. One might think that one would be able to characterize excluded sequences by saying that certain fixed blocks of cells could not occur at any step in any rule. And indeed that happens for the r = 1/2 rules. But for the quiescent elementary cellular automata—with r = 1—it seems as if every block of any given size will eventually occur, presumably courtesy of the likes of rule 30.
What about, say, periodic sequences? Here are some examples that no quiescent elementary cellular automaton can generate:
And, yes, these are, by most standards, quite “simple” sequences. But they just happen not to be “simple” for elementary cellular automata. And indeed we can expect that there will be plenty of such “coincidentally unreachable” but “seemingly simple” sequences even for our 4-color rules. But we can also expect that even if we can’t precisely reach some objective sequence, we’ll still be able to get to a sequence that is close. (The minimum “error” is, for example, 4 cells out of 15 for the first sequence above, and 2 for the last sequence).
But there still remains the question of whether adaptive evolution will be able to find such sequences. For the very simple case of quiescent elementary cellular automata we can readily map out the complete multiway graph of all possible mutations between rules. Here’s what we get if we run all possible rules for 3 steps, then show possible outcomes as nodes, and possible mutations between rules as edges (the edges are undirected, because every mutation can go either way):
That this graph has only 18 nodes reflects the fact that quiescent elementary cellular automata can produce only 18 of the 128 possible length-7 sequences. But even within these 18 sequences there are ones that cannot be reached through the adaptive evolution process we are using.
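Here, for what it’s worth, is a sketch of how such a graph can be reconstructed; the details of which mutations are allowed are assumptions based on the description above.

```wolfram
quiescent = Select[Range[0, 255], EvenQ];            (* elementary rules with 0,0,0 -> 0 *)
out[r_] := Last[CellularAutomaton[r, {{1}, 0}, 3]]   (* length-7 output after 3 steps *)
mutants[r_] := Table[BitXor[r, 2^j], {j, 1, 7}]      (* flip one rule bit, preserving quiescence *)

edges = DeleteCases[
   Flatten[Table[UndirectedEdge[out[r], out[m]], {r, quiescent}, {m, mutants[r]}]],
   UndirectedEdge[x_, x_]];

g = Graph[DeleteDuplicates[out /@ quiescent], DeleteDuplicates[edges]];
VertexCount[g]   (* 18, per the discussion above *)
```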
For example, let’s say our goal is to generate the sequence
(or, rather, to find a rule that will do so). If we start from the null rule—which generates
—then our adaptive evolution process defines a foliated version of the multiway graph above:
Starting from
some paths (i.e. sequences of mutations) successfully reach
. But this only happens about 25% of the time. The rest of the time the adaptive process gets stuck at
or
and never reaches
:
So what happens if we look at larger cellular automaton rule spaces? In such cases we can’t expect to trace the full multiway graph of possible mutations. And if we pick a sequence at random as our target, then for a long sequence the overwhelming probability is that it won’t be reachable at all by any cellular automaton with a rule of a given type. But if we start from a rule—say picked at random—and use its output as our target, then this guarantees that there’s at least one rule that produces this sequence. And then we can ask how difficult it is for adaptive evolution to find a rule that works (most likely not the original one).
Here are some examples—with the original rule on the left, and the best results found from 10,000 steps of adaptive evolution on the right:
What we see is fairly clear: when the pattern generated by the original rule looks simple, adaptive evolution can readily find rules that successfully produce the same output, albeit sometimes in quite different ways. But when the pattern generated by the original rule is more complicated, adaptive evolution typically won’t be able to find a rule that exactly reproduces its output. And so, for example, in the cases shown here many errors remain even in the “best results” after 10,000 steps of adaptive evolution.
Ultimately this is not surprising. When we pick a cellular automaton rule at random, it’ll often show computational irreducibility. And in a sense all we’re seeing here is that adaptive evolution can’t “break” computational irreducibility. Or, put another way, computationally irreducible processes generate mutational complexity.
In everything we’ve done so far we’ve been considering a particular type of “goal”: to have a cellular automaton produce a specified arrangement of cells after a certain number of steps. But what about other types of goals? We’ll look at several here. The general features of what will happen with them follow what we’ve already seen, but each will show some new effects and will provide some new perspectives.
As a first example, let’s consider trying to match not the horizontal arrangement of cells, but the vertical one—in particular the sequence of colors in the center column of the cellular automaton pattern. Here’s what we get with the goal of having a block of red cells followed by an equal block of blue:
The results are quite diverse and “creative”. But it’s notable that in all cases there’s definite “mechanism” to be seen “right around the center column”. There’s all sorts of complexity away from the center column, but it’s sufficiently “contained” that the center column itself can achieve its “simple goal”.
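In this setup the fitness function looks only at the center column of the pattern. Here is a minimal sketch of extracting it, reusing the k = 4, r = 1 rule encoding from the earlier sketches; the fitness definition and the color coding are illustrative assumptions.

```wolfram
(* the center column of the pattern generated over t steps from a single cell *)
centerColumn[rule_, t_] := With[
  {evol = CellularAutomaton[{FromDigits[rule, 4], 4, 1}, {{1}, 0}, t]},
  evol[[All, t + 1]]]   (* the middle column of the width-(2t+1) array *)

(* illustrative fitness for the "red block then blue block" goal, with red = 1, blue = 2 *)
centerErrors[rule_, t_] := HammingDistance[centerColumn[rule, t],
   Join[ConstantArray[1, Ceiling[(t + 1)/2]], ConstantArray[2, Floor[(t + 1)/2]]]]
```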
Things are similar if we ask to get three blocks of color rather than two—though this goal turns out to be somewhat more difficult to achieve:
It’s also possible to get
:
And in general the difficulty of getting a particular vertical sequence of blocks tends to track the difficulty of getting the corresponding horizontal sequence of blocks. Or, in other words, the pattern of mutational complexity seems to be similar for sequences associated with horizontal and vertical goals.
This also seems to be true for periodic sequences. Alternating colors are easy to achieve, with many “tricks” being possible in this case:
A sequence with period 5 is pretty much the same story:
When the period gets more comparable to the number of cellular automaton steps that we’re sampling it for, the “solutions” get wilder:
And some of them are quite “fragile”, and don’t “generalize” beyond the original number of steps for which they were adapted:
The types of goals we’ve considered so far have all involved trying to get exact matches to specified sequences. But what if we just ask for an average result? For example, what if we ask for all 4 colors to occur with equal frequency in our output, but allow them to be arranged in any way? With the same adaptive evolution setup as before we rather quickly find “solutions” (where because we’re running for 50 steps and getting patterns of width 101, we’re always “off by at least 1” relative to the exact 25:25:25:25 result):
A few of these solutions seem to have “figured out a mechanism” to get all colors equally. But others seem like they just “happen to work”. And indeed, taking the first three cases here, this shows the relative numbers of cells of different colors obtained at successive steps in running the cellular automaton:
The pattern that looks simple consistently has equal numbers of each color at every step. The others just “happen to hit equality” after running for exactly 50 steps, but on either side don’t achieve equality.
And, actually, it turns out that all these solutions are in some sense quite fragile. Change the color of just one cell and one typically gets an expanding region of change—that takes the output far from color equality:
So how can we get rules that more robustly achieve our color equality objective? One approach is to force the adaptive evolution to “take account of possible perturbations” by applying a few perturbations at every adaptive evolution step, and keeping a particular mutation only if neither it, nor any of its perturbed versions, has lower fitness than before.
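Here is a minimal sketch of this kind of perturbation-robust acceptance test. The number of perturbations, the exact fitness measure and the way perturbations are applied are all illustrative assumptions.

```wolfram
k = 4; steps = 50;
run[rule_, init_, t_] := CellularAutomaton[{FromDigits[rule, k], k, 1}, init, t]
output[rule_] := Last[run[rule, {{1}, 0}, steps]]

(* fitness measure: total deviation from equal color frequencies in the final row *)
imbalance[row_] := Total[Abs[Table[Count[row, c], {c, 0, k - 1}] - Length[row]/k]]

(* flip one randomly chosen cell at a random intermediate step, then keep running *)
perturbedOutput[rule_] := Module[{t1 = RandomInteger[{1, steps - 1}], row},
  row = Last[run[rule, {{1}, 0}, t1]];
  row[[RandomInteger[{1, Length[row]}]]] = RandomInteger[{0, k - 1}];
  Last[run[rule, {row, 0}, steps - t1]]]

(* keep a mutated rule only if it, and a few perturbed versions of it, are all no worse *)
robustlyAcceptableQ[new_, oldScore_, nPerturb_: 3] :=
 imbalance[output[new]] <= oldScore &&
  AllTrue[Table[imbalance[perturbedOutput[new]], {nPerturb}], # <= oldScore &]
```

A test like this would simply be used in place of the plain errors[new] <= errors[rule] comparison in the adaptive loop sketched earlier.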
Here’s an example of one particular sequence of successive rules obtained in this way:
And now if we apply perturbations to the final result, it doesn’t change much:
It’s notable that this robust solution looks simple. And indeed that’s common, with a few other examples of robust, exact solutions being:
In effect it seems that requiring a robust, exact solution “forces out” computational irreducibility, leaving only readily reducible patterns. If we relax the constraint of being an exact solution even a little, though, more complex behavior quickly creeps in:
We’ve just looked at trying to achieve particular frequencies of colors in the “output” of a cellular automaton (after running for 50 steps). But what if we try to achieve certain frequencies of colors throughout the pattern produced by the cellular automaton?
For example, let’s say that our goal is to have color frequencies in the ratios:
. Adaptive evolution fairly easily finds good “solutions” for this:
And in fact it does so for any “relative red level”:
But if we plot the median number of adaptive evolution steps needed to achieve these results (i.e. our approximation to mutational complexity) we see that there’s a systematic increase with “red level”:
In effect, the higher the red level the “more stringent” the constraints we’re trying to satisfy are—and the more steps of adaptive evolution it takes to do that. But looking at the actual patterns obtained at different red levels, we also see something else: that as the constraints get more stringent, the patterns seem to have computational irreducibility progressively “squeezed out” of them—leaving behavior that seems more and more mechanoidal.
As another example along the same lines, consider goals of the form
. Here are results one gets varying the “white level”:
We see two rather different approaches being taken to the “problem of having more white”. When the white level isn’t too large, the pattern just gets “airier”, with more white inside. But eventually the pattern tends to contract, “leaving room” for white outside.
In most of what we’ve done so far, the overall “shapes” of our cellular automaton patterns have ended up always just being simple triangles that expand by one cell on each side at each step—though we just saw that with sufficiently stringent constraints on colors they’re forced to be different shapes. But what if our actual goal is to achieve a certain shape? For example, let’s say we try to get triangular patterns that grow at a particular rate on each side.
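One way to turn such a growth-rate goal into something adaptive evolution can score is sketched below; measuring the extent of the nonzero region of the final row is an illustrative convention, not necessarily the one used for the pictures here.

```wolfram
(* width of the nonzero region in a row *)
ext[row_] := With[{p = Flatten[Position[row, _?Positive]]},
  If[p === {}, 0, Max[p] - Min[p] + 1]]

(* average one-sided growth rate of the pattern over t steps *)
growthRate[rule_, t_] :=
 N[(ext[Last[CellularAutomaton[{FromDigits[rule, 4], 4, 1}, {{1}, 0}, t]]] - 1)/(2 t)]

(* illustrative fitness for a target rate of 1/n: how far the measured rate is from it *)
rateError[rule_, t_, n_] := Abs[growthRate[rule, t] - 1/n]
```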
Here are some results for growth rates of the form 1/n (i.e. growing by an average of 1 cell every n steps):
This is the sequence of adaptive evolution steps that led to the first example of growth rate ![]()
and this is the corresponding sequence for growth rate
:
And although the interiors of most of the final patterns here are complicated, their outer boundaries tend to be simple—at least for small n—and in a sense “very mechanically” generate the exact target growth rate:
For larger n, things get more complicated—and adaptive evolution doesn’t typically “find an exact solution”. And if we run our “solutions” longer we see that—particularly for larger n—they often don’t “generalize” very well, and soon start deviating from their target growth rates:
As we increase n we typically see that more steps of adaptive evolution are needed to achieve our goal, reflecting the idea that “larger n growth” has more mutational complexity:
For rational growth rates like
fairly simple exact solutions are possible. But for irrational growth rates, that’s no longer true. Still, it turns out that adaptive evolution is in a sense strong enough—and our cellular automaton rule space is large enough—that good approximations even to irrational growth rates can often be reached:
The “solutions” typically remain quite consistent when run longer:
But there are still some notable limitations. For example, while a growth rate of exactly 1 is easy to achieve, growth rates close to 1 are in effect “structurally difficult”. In particular, above about 0.741 adaptive evolution tends to “cheat”—adding a “hat” at the top of the pattern instead of actually producing a boundary with consistent slope:
What about other shapes as goals? Here’s what happens with diamond shapes of different widths:
Adaptive evolution is quite constrained by what’s structurally possible in a cellular automaton of this type—and the results are not particularly good. And indeed if one attempts to “generalize” them, it’s clear none of them really “have the idea” of a diamond shape:
In the examples we’ve discussed so far, we’ve focused on what cellular automata do over the course of a fixed number of steps—not worrying about what they might do later. But another goal we might have—which in fact I have discussed at length elsewhere—is just to have our cellular automata produce patterns that go for a certain number of steps and then die out. So, for example, we can use adaptive evolution to find cellular automata whose patterns live for exactly 50 steps, and then die out:
Just like in our other examples, adaptive evolution finds all sorts of diverse and “interesting” solutions to the problem of living for exactly 50 steps. Some (like the last one and the yellow one) have a certain level of “obvious mechanism” to them. But most of them seem to “just work” for no “obvious reason”. Presumably the constraint of living for 50 steps in some sense just isn’t stringent enough to “squeeze out” computational irreducibility—so there is still plenty of computational irreducibility in these results.
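For concreteness, here is a minimal sketch of the kind of lifetime measure involved; the exact counting convention and the cap on the number of steps are illustrative assumptions.

```wolfram
(* number of steps before the pattern from a single cell dies out (capped at tmax) *)
lifetime[rule_, tmax_: 200] := Module[
  {evol = CellularAutomaton[{FromDigits[rule, 4], 4, 1}, {{1}, 0}, tmax]},
  LengthWhile[evol, Max[#] > 0 &]]

(* illustrative fitness for the "live for exactly 50 steps" objective *)
lifetimeError[rule_] := Abs[lifetime[rule] - 50]
```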
What about mutational complexity? Approximating this, as before, by the median of the number of adaptive evolution steps—and sampling a few hundred cases for each lifetime goal (and plotting the quartiles as well as the median)—we see a systematic increase of mutational complexity as we increase the lifetime goal:
In effect this shows us that if we increase the lifetime goal, it becomes more and more difficult for adaptive evolution to reach that goal. (And, as we’ve discussed elsewhere, if we go far enough, we’ll ultimately reach the edge of what’s even in principle possible with, say, the particular type of 4-color cellular automata we’re using here.)
All the objectives we’ve discussed so far have the feature that they are in a sense explicit and fixed: we define what we want (e.g. a cellular automaton pattern that lives for exactly 50 steps) and then we use adaptive evolution to try to get it. But looking at something like lifetime suggests another possibility. Instead of our objective being fixed, our objective can instead be open ended. And so, for example, we might ask not for a specific lifetime, but to get the largest lifetime we can.
I’ve discussed this case at some length elsewhere. But how does it relate to the concept of the rulial ensemble that we’re studying here? When we have rules that are found by adaptive evolution with fixed constraints we end up with something that can be thought of as roughly analogous to things like the canonical (“specified temperature”) ensemble of traditional statistical mechanics. But if instead we look at the “winners” of open-ended adaptive evolution then what we have is more like a collection of extreme values than something we can view as typical of the “bulk of an ensemble”.
We’ve just looked at the goal of having patterns that survive for a certain number of steps. Another goal we can consider is having patterns that periodically repeat after a certain number of steps. (We can think of this as an extremely idealized analog of having multiple generations of a biological organism.)
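One way to make such a periodicity objective concrete is sketched below. Requiring an exact return to the original single-cell row, with no earlier return, is an illustrative convention.

```wolfram
(* True if the pattern returns to its single-cell seed for the first time at exactly step p *)
periodicFromStartQ[rule_, p_] := Module[
  {evol = CellularAutomaton[{FromDigits[rule, 4], 4, 1}, {{1}, 0}, p]},
  Last[evol] === First[evol] && NoneTrue[evol[[2 ;; p]], # === First[evol] &]]
```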
Here are examples of rules found by adaptive evolution that lead to patterns which repeat after exactly 50 steps:
As usual, there’s a distribution in the number of steps of adaptive evolution required to achieve these results:
Looking at the median of the analogous distributions for different possible periods, we can get an estimate of the mutational complexity of different periods—which seems to increase somewhat uniformly with period:
By the way, it’s also possible to restrict our adaptive evolution so that it samples only symmetric rules; here are a few examples of period-50 results found in this way:
In our discussion of periodicity so far, we’ve insisted on “periodicity from the start”—meaning that after each period the pattern we get has to return to the single-cell state we started from. But we can also consider periodicity that “develops after a transient”. Sometimes the transient is short; sometimes it’s much longer. Sometimes the periodic pattern starts from a small “seed”; sometimes its seed is quite large. Here are some examples of patterns found by adaptive evolution that have ultimate period 50:
By the way, in all these cases the periodic pattern seems like the “main event” of the cellular automaton evolution. But there are other cases where it seems more like a “residue” from other behavior—and indeed that “other behavior” can in principle go on for arbitrarily long before finally giving way to periodicity:
We’ve been talking so far about the objective of finding cellular automaton rules that yield patterns with specific periodicity. But just like for lifetime, we can also consider the “open-ended objective” of finding rules with the longest periods we can. And here are the best results found with a few runs of 10,000 steps of adaptive evolution (here we’re looking for periodicity without transients):
We’ve now seen lots of examples of cellular automata found by adaptive evolution. And a key question is: “what’s special about them?” If we look at cellular automata with rules chosen purely at random, here are typical examples of what we get:
Some of these patterns are simple. But many are complicated and in fact look quite random—though often with regions or patches of regularity. But what’s striking is how visually different they look from what we’ve mostly seen above in cellular automata that were adaptively evolved “for a purpose”.
So how can we characterize—and ultimately measure—that difference? Our randomly picked cellular automata seem to show either almost total computational reducibility or “unchecked computational irreducibility” (albeit usually with regions or patches of computational reducibility). But cellular automata that were successfully “evolved for a purpose” tend to look different. They tend to show what we’re calling mechanoidal behavior: behavior in which there are identifiable “mechanism-like features”, albeit usually mixed in with at least some—typically highly contained—“sparks of computational irreducibility”.
At a visual level there are typically some clear characteristics to mechanoidal behavior. For example, there are usually repeated motifs that appear throughout a system. And there’s also usually a certain degree of modularity, with different parts of the system operating at least somewhat independently. And, yes, there’s no doubt a rich phenomenology of mechanoidal behavior to be studied (closely related to the study of class 4 behavior). But at a coarse and potentially more immediately quantitative level a key feature of mechanoidal behavior is that it involves a certain amount of regularity, and computational reducibility.
So how can one measure that? Whenever there’s regularity in a system it means there’s a way to summarize what the system does more succinctly than by just specifying every state of every element of the system. Or, in other words, there’s a way to compress our description of the system.
Perhaps, one might think, one could use modularity to do this compression, say by keeping only the modular parts of a system, and eliding away the “filler” in between—much like run-length encoding. But—like run-length encoding—the most obvious version of this runs into trouble when the “filler” consists, say, of alternating colors of cells. One can also think of using block-based encoding, or dictionary encoding, say leveraging repeated motifs that appear. But it’s an inevitable feature of computational irreducibility that in the end there can never be one universally best method of compression.
But as an example, let’s just use the Wolfram Language Compress function. (Using GZIP, BZIP2, etc. gives essentially identical results.) Feed in a cellular automaton pattern, and Compress will give us a (losslessly) compressed version of it. We can then use the length of this as a measure of what’s left over after we “compress out” the regularities in the pattern. Here’s a plot of the “compressed description length” (in “Compress output bytes”) for some of the patterns we’ve seen above:
And what’s immediately striking is that the patterns “evolved for a purpose” tend to have compressed sizes in between those of the two kinds of patterns that come from randomly chosen rules.
(And, yes, the question of what’s compressed by what is a complicated, somewhat circular story. Compression is ultimately about making a model for things one wants to compress. And, for example, the appropriate model will change depending on what those things are. But for our purposes here, we’ll just use Compress—which is set up to do well at compressing “typical human-relevant content”.)
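For reference, the measure being plotted here amounts to something like the following minimal sketch:

```wolfram
(* compressed description length of a cellular automaton pattern, in Compress output bytes *)
compressedSize[pattern_] := StringLength[Compress[pattern]]

(* e.g. for a randomly chosen 4-color, range-1 rule run for 50 steps from a single cell *)
compressedSize[CellularAutomaton[{RandomInteger[4^64 - 1], 4, 1}, {{1}, 0}, 50]]
```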
OK, so how does adaptive evolution relate to our “compressed size” measure? Here’s an example of the typical progression of an adaptive evolution process—in this case based on the goal of generating the
sequence:
Given our way of starting with the null rule, everything is simple at the beginning—yielding a small compressed size. But soon the system starts developing computational irreducibility, and the compressed size goes up. Still, as the adaptive evolution proceeds, the computational irreducibility is progressively “squeezed out”—and the compressed size settles down to a smaller value.
The plot above is based on a particular (randomly chosen) sequence of mutations in the underlying rule. But if we look at the average from a large collection of mutation sequences we see very much the same thing:
Even though the “error rate” on average goes down monotonically, the compressed size of our candidate patterns has a definite peak before settling to its final value. In effect, it seems that the system needs to “explore the computational universe” a bit before figuring out how to achieve its goal.
But how general is this? If we don’t insist on reaching zero error we get a curve of the same general shape, but slightly smoothed out (here for 20 or fewer errors):
What about for other target sequences? Here are results for all the target sequences we considered before (with the number of errors allowed in each case indicated):
In all cases we get final compressed sizes that are much smaller than those for all-but-very-simple randomly chosen rules—indicating that our adaptive evolution process has indeed generated regularity, and our compression method has successfully picked this up.
Looking at the overall shape of the curves we’ve generated here, there seems to be a general phenomenon in evidence: the process of adaptive evolution seems to pass through a “computationally irreducible period” before getting to its final mechanoidal state. Even as it gets progressively closer to its goal, the adaptive evolution process ends up still “playing the field” before homing in on its “final solution”. (And, yes, phenomena like this are seen both in the fossil record of life on Earth and in the development of engineering systems.)
We began by asking the question “what consequences does being ‘evolved for a purpose’ have on a system?” We’ve now seen many examples of different “purposes”, and how adaptive evolution can achieve them. The details are different in different cases. But we’ve seen that there are general features that occur very broadly. Perhaps most notable is that in systems that were “adaptively evolved for a purpose” one tends to see what we can call mechanoidal behavior: behavior in which there are “mechanism-like” elements.
Rules that are picked purely at random—in effect uniformly from rulial space—rarely show mechanoidal behavior. But in rules that have been adaptively evolved for a purpose mechanoidal behavior is the norm. And what we’ve found here is that this is true essentially regardless of what the specific purpose involved is.
We can think of rules that have been adaptively evolved for a purpose as forming what we can call a rulial ensemble: a subset of the space of all possible rules, concentrated by the process of adaptive evolution. And what we’ve seen here is that there are in effect generic features of the rulial ensemble—features that are generically seen in rules that have been adaptively evolved for a purpose.
Given a particular purpose (like “produce this specific sequence as output”) there are typically many ways it can be achieved. It could be that one can engineer a solution that shows a “human-recognizable mechanism” all the way through. Or it could be a “lump of computationally irreducible behavior” that somehow “just happens” to have the result of doing what we want. But adaptive evolution seems to produce solutions with what amount to intermediate characteristics. There are elements of mechanism to be seen. And there are also usually certain “sparks of computational irreducibility”. Typically what we see is that in the process of adaptive evolution all sorts of computational irreducibility is generated. But to achieve whatever purpose has been defined requires “taming” that irreducibility. And, it seems, introducing the kind of “patches of mechanism” that are characteristic of the mechanoidal behavior we have seen.
That those “patches of mechanism” fit together to achieve an overall purpose is often a surprising thing. But it’s the essence of the kind of “bulk orchestration” that we see in the systems we’ve studied here, and that seems also to be characteristic of biological systems.
Having specified a particular purpose it’s often completely unclear how a given kind of system could possibly achieve it. And indeed what we’ve seen adaptive evolution do here often seems like magic. But in a sense what we’re seeing is just a reflection of the immense power that is available in the computational universe, and manifest in the phenomenon of computational irreducibility. Without computational irreducibility we’d somehow “get what we expect”; computational irreducibility is what adds what is ultimately an infinite element of surprise. And what we’ve seen is that adaptive evolution manages to successfully harness that “element of surprise” to achieve particular purposes.
Let’s say our goal is to generate a particular sequence of values. One might imagine just operating on a space of possible sequences, and gradually adaptively evolving to the sequence we want. But that’s not the setup we’re using here, and it’s not the setup biology has either. Instead what’s happening both in our idealized cellular automaton systems—and in biology—is that the adaptive evolution process is operating at the level of underlying rules, but the purposes are achieved by the results of running those rules. And it’s because of this distinction—which in biology is associated with genotypes vs. phenotypes—that computational irreducibility has a way to insert itself. In some sense both the underlying rules and overall patterns of purposes can be thought of as defining certain “syntactic structures”. The actual running of the rules then represents what one can think of as the “semantics” of the system. And it’s there that the power of computation is injected.
But in the end, just how powerful is adaptive evolution, with its ability to tap into computational irreducibility, and to “mine” the computational universe? We’ve seen that some goals are easier to reach than others. And indeed we introduced the concept of mutational complexity to characterize just how much “effort of adaptive evolution” (and, ultimately, how many mutations) is needed to achieve a given objective.
If adaptive evolution is able to “fully run its course” then we can expect it to achieve its objective through rules that show mechanoidal behavior and clear “evidence of mechanism”. But if the amount of adaptive evolution stops short of what’s defined by the mutational complexity of the objective, then one’s likely to see more of the “untamed computational irreducibility” that’s characteristic of intermediate stages of adaptive evolution. In other words, if we “get all the way to a solution” there’ll be mechanism to be seen. But if we stop short, there’s likely to be all sorts of “gratuitous complexity” associated with computational irreducibility.
We can think of our final rulial ensemble as consisting of rules that “successfully achieve their purpose”; when we stop short we end up with another, “intermediate” rulial ensemble consisting of rules that are merely on a path to achieve their purpose. In the final ensemble the forces of adaptive evolution have in a sense tamed the forces of computational irreducibility. But in the intermediate ensemble the forces of computational irreducibility are still strong, bringing the various universal features of computational irreducibility to the fore.
It may be useful to contrast all of this with what happens in traditional statistical mechanics, say of molecules in a gas—in which I’ve argued (elsewhere) that there’s in a sense almost always “unchecked computational irreducibility”. And it’s this unchecked computational irreducibility that leads to the Second Law—by producing configurations of the system that we (as computationally bounded observers) can’t distinguish from random, and can therefore reasonably model statistically just as being “typical of the ensemble”, where now the ensemble consists of all possible configurations of the system that, for example, have the same energy or the same temperature. Adaptive evolution is a different story. First, it’s operating not directly on the configurations of a system, but rather on the underlying rules for the system. And second, if one takes the process of adaptive evolution far enough (so that, for example, the number of steps of adaptive evolution is large compared to the mutational complexity of the goal one’s pursuing), then adaptive evolution will “tame” the computational irreducibility. But there’s still an ensemble involved—though now it’s an ensemble of rules rather than an ensemble of configurations. And it’s not an ensemble that somehow “covers all rules”; rather, it’s an ensemble that is “sculpted” by the constraint of “achieving a purpose”.
I’ve argued that it’s the characteristics of “observers like us” that ultimately lead to the perceived validity of the Second Law. So is there a role for the notion of an observer in our discussion here of bulk orchestration and the rulial ensemble? Well, yes. And in particular it’s critical in grounding our concept of “purpose”. We might ask: what possible “purposes” are there? Well, for something to be a reasonable “purpose” there has to be some way to decide whether or not it’s been achieved. And to make such a decision we need something that’s in effect an “observer”.
In the case of statistical mechanics and the Second Law—and, in fact, in all our recent successes in deriving foundational principles in both physics and mathematics—we want to consider observers that are somehow “like us”, because in the end what matters for us is how we as observers perceive things. I’ve argued that the most critical feature of observers like us is that we’re computationally bounded (and also, somewhat relatedly, that we assume we’re persistent in time). And it then turns out that the interplay of these features with underlying computational irreducibility is what seems to lead to the core principles of physics (and mathematics) that we’re familiar with.
But what should we assume about the “observer”—or the “determiner of purpose”—in adaptive evolution, and in particular in biology? It turns out that once again it seems as if computational boundedness is the key. In biological evolution, the implicit “goal” is to have a successful (or “fit”) organism that will, for example, reproduce well in its environment. But the point is that this is a coarse constraint—that we can think of at a computational level as being computationally bounded.
And indeed I’ve recently argued that it’s this computational boundedness that’s ultimately responsible for the fact that biological evolution can work at all. If to be successful an organism always immediately had to satisfy some computationally very complex constraint, adaptive evolution wouldn’t typically ever be able to find the necessary “solution”. Or, put another way, biological evolution works because the objectives it ends up having to achieve are of limited mutational complexity.
But why should it be that the environment in which biological evolution occurs can be “navigated” by satisfying computationally bounded constraints? In the end, it’s a consequence of the inevitable presence of pockets of computational reducibility in any ultimately computationally irreducible system. Whatever the underlying structure of things, there will always be pockets of computational reducibility to be found. And it seems that that’s what biology and biological evolution rely on. Specific types of biological organisms are often thought of as populating particular “niches” in the environment defined by our planet; what we’re saying here is that all possible “evolved entities” populate an abstract “meta niche” associated with possible pockets of computational reducibility.
But actually there’s more. Because the presence of pockets of computational reducibility is also what ultimately makes it possible for there to be observers like us at all. As I’ve argued elsewhere, it’s an abstract necessity that there must exist a unique object—that I call the ruliad—that is the entangled limit of all possible computational processes. And everything that exists must somehow be within the ruliad. But where then are we? It’s not immediately obvious that observers like us—with a coherent existence—would be possible within the ruliad. But if we are to exist there, we must in effect exist in some computationally reducible slice of the ruliad. And, yes, for us to be the way we are, we must in effect be in such a slice.
But we can still ask why and how we got there. And that’s something that’s potentially informed by the notion of adaptive evolution that we’ve discussed here. Indeed, we’ve argued that for adaptive evolution to be successful its objectives must in effect be computationally reducible. So as soon as we know that adaptive evolution is operating it becomes in a sense inevitable that it will lead to pockets of computational reducibility. That doesn’t in and of itself explain why adaptive evolution happens—but in effect it shows that if it does, it will lead to observers that at some level have characteristics like us. So then it’s a matter of abstract scientific investigation to show—as we in effect have here—that within the ruliad it’s at least possible to have adaptive evolution.
Any phenomenon can potentially be explained both in terms of purpose and in terms of mechanism. Why does a projectile follow that trajectory? One can explain it as following the mechanism defined by its laws of motion. Or one can explain it as following the purpose of satisfying some overall variational principle (say, extremizing the action associated with the trajectory). Sometimes phenomena are more conveniently explained in terms of mechanism; sometimes in terms of purpose. And one can imagine that the choice could be determined by which is somehow the “computationally simpler” explanation.
But what about the systems and processes we’ve discussed here? If we just run a system like a cellular automaton we always know its “mechanism”—it’s just its underlying rules. But we don’t immediately know its “purpose”, and indeed if we pick an arbitrary rule there’s no reason to think it will have a “computationally simple” purpose. In fact, insofar as the running of the cellular automaton is a computationally irreducible process, we can expect that it won’t “achieve” any such computationally simple purpose that we can identify.
But what if the rule is determined by some process of adaptive evolution? Well, then the objective of that adaptive evolution can be seen as defining a purpose that we can use to describe at least certain aspects of the behavior of the system. But what exactly are the implications of the “presence of purpose” on how the system operates? The key point that’s emerged here is that when there’s a computationally simple purpose that’s been achieved through a process of adaptive evolution, then the rules for the system will be part of what we’ve called the rulial ensemble. And then what we’ve argued is that there are generic features of the rulial ensemble—features that don’t depend on what specific purpose the adaptive evolution might have achieved, only that its purpose was computationally simple. And foremost among these generic features is the presence of mechanoidal behavior.
In other words, so long as there is an overall computationally simple purpose, what we’ve found is that—whatever in detail that purpose might have been—the presence of purpose tends to “push itself down” to produce behavior that locally is mechanoidal, in the sense that it shows evidence of “visible mechanism”. What do we mean by “visible mechanism”? Operationally, it tends to be mechanism that’s readily amenable to “human-level narrative explanation”.
It’s worth remembering that at the lowest level the systems we’ve studied are set up to have simple “mechanisms” in the sense that they have simple underlying rules. But once these rules run they generically produce computationally irreducible behavior that doesn’t have a simple “narrative-like” description. But when we’re looking at the results of adaptive evolution we’re dealing with a subset of rules that are part of the rulial ensemble—and so we end up seeing mechanoidal behavior with at least local “narrative descriptions”.
As we’ve discussed, though, if the adaptive evolution hasn’t “entirely run its course”, in the sense that the mutational complexity of the objective is higher than the actual amount of adaptive evolution that’s been done, then there’ll still be computational irreducibility that hasn’t been “squeezed out”.
So how does all this relate to biology? The first key element of biology as far as we are concerned here is that there’s a separate genotype and phenotype—related by what’s presumably in general the computationally irreducible process of biological growth, etc. The second key element of biology for our purposes here is the phenomenon of self reproduction—in which new organisms are produced with genotypes that are identical up to small mutations. Both these elements are immediately captured by the simple cellular automaton model we’ve used here.
And given them, we seem to be led inexorably to our conclusions here about the rulial ensemble, mechanoidal behavior, and bulk orchestration.
It’s often been seen as mysterious how there ends up being so much apparent complexity in biology. But once one knows about computational irreducibility, one realizes that actually complexity is quite ubiquitous. And instead what’s in many respects more surprising is the presence of any “explainable mechanism”. But what we’ve seen here through the rulial ensemble is that such explainable mechanism is in effect a shadow of overall “simple computational purposes”.
At some level there’s computational irreducibility everywhere. And indeed it’s the driver for rich behavior. But what happens is that “in the presence of overall purpose”, patches of computational irreducibility have to be fitted together to achieve that purpose. Or, in other words, there’s inevitably a certain “bulk orchestration” of all those patches of computation. And that’s what we see so often in actual biological systems.
So what is it in the end that’s special about life—and biological systems? I think—more than anything—it’s that it’s involved so much adaptive evolution. All those ~10^40 individual organisms in the history of life on Earth have been links in chains of adaptive evolution. It might have been that all that “effort of adaptive evolution” would have the effect of just “solving the problem”—and producing some “simple mechanistic solution”.
But that’s not what we’ve seen here, and that’s not how we can expect things to work. Instead, what we have is an elaborate interplay of “lumps of computational irreducibility” being “harnessed” by “simple mechanisms”. It matters that there’s been so much adaptive evolution. And what we’re now seeing throughout biology—down to the smallest features—is the consequences of that adaptive evolution. Yes, there are “frozen accidents of history” to be seen. But the point is not those individual items, but rather the whole aggregate consequence of adaptive evolution: mechanoidal behavior and bulk orchestration.
And it’s because these features are generic that we can hope they can form the basis for what amounts to a robust “bulk” theory of biological systems and biological behavior: something that has roughly the same “bulk theory” character as the gas laws, or fluid dynamics. But that’s now talking about systems that have been subject to large-scale adaptive evolution at the level of their rules.
What I’ve done here is very much just a beginning. But I believe the ruliological investigations I’ve shown, and the general framework I’ve described, provide the raw material for something we’ve never had before: a well defined and general “fundamental theory” for the operation of living systems. It won’t describe all the “historical accident” details. But I’m hopeful that it will provide a useful global view of what’s going on, that can potentially be harvested for answers to all sorts of useful questions about the remarkable phenomenon we call life.
In our explorations of the rulial ensemble—and of mutational complexity—we’ve looked at a range of possible objective functions. But we’ve always considered just one adaptive evolution strategy: accepting or rejecting the results of single-point random mutations in the rule at each step. So what would happen if we were to adopt a different strategy? The main conclusion is: it doesn’t seem to matter much. For a particular objective function, there are adaptive evolution strategies that get to a given result in fewer adaptive steps, or that ultimately get further than our usual strategy ever does. But in the end the basic story of the rulial ensemble—and of mutational complexity—is robust relative to detailed changes in our adaptive evolution process.
As an example, consider allowing more than one random mutation in the rule at each adaptive evolution step. Let’s say—as at the very beginning above—that our objective is to generate
after 50 steps. With just one mutation at each step it’s only a small fraction of adaptive evolution “runs” that reach “perfect solutions”, but with more mutations at each step, more progress is made, at least in this case—as illustrated by histograms of the “number of errors” remaining after 10,000 adaptive evolution steps:
Looked at in terms of multiway systems, having fewer mutations at each step leads to fewer paths between possible rules, and a greater possibility of “getting stuck”. With more mutations at each step there are more paths overall, and a lower possibility of getting stuck.
So what about mutational complexity? If we still say that each adaptive evolution step accounts for a single unit of mutational complexity (even if it involves multiple underlying mutations in the rule), this shows how the mutational complexity (as computed above) for different objectives is affected by having different numbers of underlying mutations at each step (the height of each “tick” indicates the number of mutations):
So, yes, different numbers of mutations at each step lead to mutational complexities that are different in detail, but in most cases, surprisingly similar overall.
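As a concrete illustration of the kind of accept/reject adaptive evolution being discussed, here is a minimal Wolfram Language sketch (not the actual code behind the experiments above). Purely for brevity it assumes elementary (k = 2, r = 1) cellular automaton rules and a simple “lifetime” fitness, and it takes a parameter m for the number of point mutations applied at each step:

    (* fitness: a crude "lifetime"—how many of the first 200 steps of evolution from a
       single black cell still contain any black cells *)
    fitness[rule_] := LengthWhile[
      CellularAutomaton[rule, {{1}, 0}, 200],
      ! MatchQ[#, {0 ..}] &]

    (* apply m random point mutations to an 8-bit elementary rule number *)
    mutate[rule_, m_] := Fold[BitXor[#1, 2^#2] &, rule, RandomInteger[{0, 7}, m]]

    (* accept a mutated rule only if the fitness does not decrease *)
    adaptiveEvolve[m_, steps_] := NestList[
      With[{new = mutate[#, m]},
        If[fitness[new] >= fitness[#], new, #]] &,
      RandomInteger[255], steps]

Here adaptiveEvolve[1, 1000] follows the single-mutation strategy, while adaptiveEvolve[3, 1000] allows three point mutations per step; the fitness function, rule space and acceptance criterion are all illustrative stand-ins for the ones actually used in the experiments above.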
What about other adaptive evolution strategies? There are many one could consider. For example, a “gradient descent” approach where at each step we examine all possible mutations, and pick the “best one”—i.e. the one that increases fitness the most. (We can extend this by keeping not just the top rule at each step, but, say, the top 5—in an analog of “beam search”.) There’s also a “collaborative” approach, where multiple different “paths” of random mutations are followed, but where every so often all are reset to be the best found so far. And indeed, we can consider all sorts of techniques from reinforcement learning.
In particular cases, any of these approaches can have significant effects. But in general the phenomena we’re discussing here seem robust enough that the details of how adaptive evolution is done don’t matter much, and the single-mutation strategy we’ve mostly used here can be considered adequately representative.
It’s been obvious since antiquity that living organisms contain some kind of “gooey stuff” that’s different from what one finds elsewhere in nature. But what is it? And how universal might it be? In Victorian times it had various names, the most notable being “protoplasm”. The increasing effectiveness of microscopy made it clear that there was actually a lot of structure in “living matter”—as, for example, reflected in the presence of different organelles. But until the later part of the twentieth century there was a general belief that fundamentally what was going on in life was chemistry—or at least biochemistry—in which all sorts of different kinds of molecules were randomly moving and undergoing chemical reactions at certain rates.
The digital nature of DNA discovered in 1953 slowly began to erode this picture, adding in ideas about molecular-scale mechanisms and “machinery” that could be described in mechanical or informational—rather than “bulk statistical”—ways. And indeed a notable trend in molecular biology over the past several decades has been the discovery of more and more ways in which molecular processes in living systems are “actively orchestrated”, and not just the result of “random statistical behavior”.
When a single component can be identified (cell membranes, microtubules, biomolecular condensates, etc.) there’s quite often been “bulk theory” developed, typically based on ideas from statistical mechanics. But when it comes to constellations of different kinds of components, most of what’s been done has been to collect the material to fill many volumes of biology texts with what amount to narrative descriptions of how in detail things operate—with no attempt at any kind of “general picture” independent of particular details.
My own foundational interest in biology goes back about half a century. And when I first started studying cellular automata at the beginning of the 1980s I certainly wondered (as others had before) whether they might be relevant to biology. My interest in that question was then greatly heightened when I discovered that even with simple rules, cellular automata could generate immensely complex behavior, which often looked visually surprisingly “organic”.
In the 1990s I put quite some effort into what amount to macroscopic questions in biology: how do things grow into different shapes, produce different pigmentation patterns, etc.? But somehow in everything I studied there was a certain assumed uniformity: many elements might be involved, but they were all somehow essentially the same. I was well aware of the complex reaction networks—and, later, pieces of “molecular machinery”—that had been discovered. But they felt more like systems from engineering—with lots of different detailed components—and not like systems where there could be a “broad theory”.
Still, within the type of engineering I knew best—namely software engineering—I kept wondering whether there might perhaps be general things one could say, particularly about the overall structure and operation of large software systems. I made measurements. I constructed dependency graphs. I thought about analogies between bugs and diseases. But beyond a few power laws and the like, I never really found anything.
A lot changed, however, with the advent of our Physics Project in 2020. Because, among other things, there was now the idea that everything—even the structure of spacetime—was dynamic, and in effect emerged from a graph of causal relationships between events. So what about biology? Could it be that what mattered in molecular biology was the causal graph of interactions between individual molecules? Perhaps there needed to be a “subchemistry” that tracked specific molecules, rather than just molecular species. And perhaps to imagine that ordinary chemistry could be the basis for biology was as wide of the mark as thinking that studying the physics of electron gases would let one understand microprocessors.
Back in the early 1980s I had identified that the typical behavior of systems like cellular automata could be divided into four classes—with the fourth class having the highest level of obvious complexity. And, yes, the visual patterns produced in class 4 systems often had a very “life like” appearance. But was there a foundational connection? In 2023, as part of closing off a 50-year personal journey, I had been studying the Second Law of thermodynamics, and identified class 4 systems as ones that—like living systems—aren’t usefully described just in terms of a tendency to randomization. Reversible class 4 systems made it particularly clear that in class 4 there really was a distinct form of behavior—that I called “mechanoidal”—in which a certain amount of definite structure and “mechanism” are visible, albeit embedded in great overall complexity.
At first, far from thinking about molecules, I made some surprising progress in 2024 on biological evolution. In the mid-1980s I had considered modeling evolution in terms of successive small mutations to things like cellular automaton rules. But at the time it didn’t work. And it was only with new intuition from the success of machine learning that in 2024 I tried again—and this time it worked. Given an objective like “make a finite pattern that lives as long as possible”, adaptive evolution by successive mutation of rules (say in a cellular automaton) seemed to almost magically find—typically very elaborate—solutions. Soon I had understood that what I was seeing was essentially “raw computational irreducibility” that just “happened” to fit the fairly coarse fitness criteria I was using. And I then argued that the success of biological evolution was the result of the interplay between computationally bounded fitness and underlying computational irreducibility.
But, OK, what did that mean for the “innards” of biological systems? Exploring an idealized version of medicine on the basis of my minimal models of biological evolution led me to look in a bit more detail at the spectrum of behavior in adaptively evolved systems, and in variants produced by perturbations.
But was there something general—something potentially universal—that one could say? In the Second Law one often talks about how behavior one sees is somehow overwhelmingly likely to be “typical of the ensemble of all possibilities”. In the usual Second Law, the possibilities are just different initial configurations of molecules, etc. But in thinking about a system that can change its rules, the ensemble of all possibilities is—at least at the outset—much bigger: in effect it’s the whole ruliad. But the crucial point I realized a few months ago is that the changes in rules that are relevant for biology must be ones that can somehow be achieved by a process of adaptive evolution. But what should the objective of that adaptive evolution be?
One of the key lessons from our Physics Project and the many things informed by it is that, yes, the character of the observer matters—but knowing just a little about an observer can be sufficient to deduce a lot about the laws of behavior that observer will perceive. So that led to the idea that this kind of thing might be true about objective functions—and that just knowing that, for example, an objective function is computationally simple might be enough to tell one things about the “rulial ensemble” of possible rules.
But is that really true? Well, as I have done so many times, I turned to computer experiments to find out. In my work on the foundations of biological evolution I had used what are in a sense very “generic” fitness functions (like overall lifetime). But now I needed a whole spectrum of possible fitness functions—some of them by many measures in fact even simpler to set up than something like lifetime.
The results I’ve reported here I consider very encouraging. The underlying details don’t seem to matter much. But a broad range of computationally simple fitness functions seem to “worm their way down” to have the same kind of effect on small-scale behavior—and to make it in some sense mechanoidal.
Biology has tended to be a field in which one doesn’t expect much in the way of formal theoretical underpinning. And indeed—even after all the data that’s been collected—there are remarkably few “big theories” in biology: natural selection and the digital/informational nature of DNA have been pretty much the only examples. But now, from the type of thinking introduced by our Physics Project, we have something new and different: the concept of the rulial ensemble, and the idea that essentially just from the fact of biological evolution we can talk about certain features of how adaptively evolved systems in biology must work.
Traditional mathematical methods have never gotten very far with the broad foundations of biology. And nor, in the end, has specific computational modeling. But by going to a higher level—and in a sense thinking about the space of all possible computations—I believe we can begin to see a path to a powerful general framework not only for the foundations of biology, but for any system that shows bulk orchestration produced by a process of adaptive evolution.
Thanks to Willem Nielsen of the Wolfram Institute for extensive help, as well as to Wolfram Institute affiliates Júlia Campolim and Jesse Angeles Lopez for their additional help. The ideas here have ultimately come together rather quickly, but in both the recent and the distant past discussions with a number of people have provided useful ideas and background—including with Richard Assar, Charles Bennett, Greg Chaitin, Paul Davies, Walter Fontana, Nigel Goldenfeld, Greg Huber, Stuart Kauffman, Chris Langton, Pedro Márquez-Zacarías and Elizabeth Wolfram.
2025-09-16 03:20:59

It’s a story of pure, abstract computation. In fact, historically, one of the very first. But even though it’s something I for one have used in practice for nearly half a century, it’s not something that in all my years of exploring simple computational systems and ruliology I’ve ever specifically studied. And, yes, it involves some fiddly technical details. But it’ll turn out that lambdas—like so many systems I’ve explored—have a rich ruliology, made particularly significant by their connection to practical computing.
In Wolfram Language it’s the Function function. Back when Alonzo Church first discussed it in the 1930s he called it λ (lambda). The idea is to have something that serves as a “pure function”—which can be applied to an argument to give a value. For example, in the Wolfram Language one might have:
On its own it’s just a symbolic expression that evaluates to itself. But if we apply it to an argument then it substitutes that argument into its body, and then evaluates that body:
In the Wolfram Language we can also write this as:
Or, with λ = Function, we can write
which is close to Church’s original notation (λx .(1 + x)) 8.
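For reference, here is how the forms just described look as runnable Wolfram Language input, keeping the explicit Function head (so that λ stays free to be used as an inert symbol in the sketches below); these are standard equivalents, not necessarily the exact forms shown in the original figures:

    Function[x, 1 + x]        (* a pure function; on its own it evaluates to itself *)
    Function[x, 1 + x][8]     (* applied to 8, it gives 9 *)
    (1 + #) &[8]              (* a standard equivalent "slot" form, also giving 9 *)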
But what if we want to “go even purer”, and effectively just “use λ for everything”—without any pre-defined functions like Plus (+)? To set up lambdas at all, we have to have a notion of function application (as in f[x] meaning “apply f to x”). So, for example, here’s a lambda (“pure function”) that just has a structure defined by function application:
Can we do familiar operations with this kind of setup? Let’s imagine we have a symbol z that represents 0, and another symbol s that represents the operation of computing a successor. Then here’s a representation of the integers 0 through 5:
But let’s look for example at s[s[s[s[z]]]]. We can “factor out” the s and z in this expression and write the “purely structural” part just in terms of lambdas:
But in a sense we don’t need the s and z here; we can perfectly well set up a representation for integers just in terms of lambdas, say as (often known as “Church numerals”):
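As a small sketch of this “factoring out”, one can let the Wolfram Language’s own evaluator do the work, with s and z left as inert symbols; church here is just a hypothetical helper name:

    (* church[n]: a pure function that applies its first argument n times to its second *)
    church[n_] := Function[f, Function[x, Nest[f, x, n]]]

    church[4][s][z]     (* gives s[s[s[s[z]]]] *)
    church[0][s][z]     (* gives z *)

These are ordinary Wolfram Language pure functions rather than the inert λ expressions studied below, but structurally they are the Church numerals just mentioned.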
It’s all very pure—and abstract. And, at least at first, it seems very clean. But pretty soon one runs into a serious problem. If we were to write
what would it mean? In x[x] are the x’s associated with the x in the inner λ or the outer one? Typing such an expression into the Wolfram Language, the x’s turn red to indicate that there’s a problem:
And at this level the only way to resolve the problem is to use different names for the two x’s—for example:
But ultimately the particular names one uses don’t matter. For example, swapping x and y
represents exactly the same function as before.
How can one deal with this? One approach—that was actually invented more than a decade before lambdas, and that I in fact wrote a whole book about a few years ago—is to use so-called “combinators”: constructs S and K that in effect abstractly define how to rearrange symbolic expressions, without having to name anything, or, for that matter, introduce variables at all. So, for example, λ[x, λ[y, y[x]]] can be written
as one can confirm using:
At some level this is very elegant. But, yes, it’s extremely difficult to read. So is there a way to preserve at least some of the comparative readability of lambdas without having to explicitly introduce variables with names, etc.? It turns out there is. And it starts from the observation that a named variable (say x) is basically just a way of referring back to the particular lambda (say λ[x, …]) that introduced that variable. And the idea is that rather than making that reference using a named variable, we make it instead by saying “structurally” where the lambda is relative to whatever is referring to it.
So, for example, given
we can write this in the “de Bruijn index” (pronounced “de broin”) form
where the 1 says that the variable in that position is associated with the λ that’s “one λ back” in the expression, while the 2 says that the variable in that position is “two λ’s back” in the expression:
So, for example, our version of the integers as lambdas can now be written as:
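Here is a rough sketch (assuming λ is treated as an inert, undefined symbol, and considering only closed expressions) of how one might convert a named-variable lambda into this de Bruijn index form; the helper names are just for illustration:

    toDeBruijn[expr_] := db[expr, {}]
    db[λ[v_Symbol, body_], vars_] := λ[db[body, Prepend[vars, v]]]    (* entering a binder *)
    db[f_[arg_], vars_] := db[f, vars][db[arg, vars]]                 (* an application *)
    db[v_Symbol, vars_] := First[FirstPosition[vars, v]]              (* count λ's back to the binder *)

    toDeBruijn[λ[x, λ[y, y[x]]]]     (* gives λ[λ[1[2]]] *)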
OK, so lambdas can represent things (like integers). But can one “compute with lambdas”? Or, put another way, what can pure lambdas do?
Fundamentally, there’s just one operation—usually called beta reduction—that takes whatever argument a lambda is given, and explicitly “injects it” into the body of the lambda, in effect implementing the “structural” part of applying a function to its argument. A very simple example of beta reduction is
which is equivalent to the standard Wolfram Language evaluation:
In terms of de Bruijn indices this becomes
which can again be “evaluated” (using beta reduction) to be:
This seems like a very simple operation. But, as we’ll see, when applied repeatedly, it can lead to all sorts of complex behavior and interesting structure. And the key to this turns out to be the possibility of lambdas appearing as arguments of other lambdas, and through beta reduction being injected into their bodies.
But as soon as one injects one lambda into another, there are tricky issues that arise. If one writes everything explicitly with variables, there’s usually variable renaming that has to be done, with, for example
evaluating to:
In terms of de Bruijn indices there’s instead renumbering to be done, so that
evaluates to:
It’s easy to describe informally what beta reduction is doing. In Wolfram Language terms it’s basically just the transformation:
In other words, given λ[x, body][y], beta reduction is replacing every instance of x in body with y, and returning the result. But it’s not quite as simple as that. For example, if y contains a lambda—as in the example above—then we may have to do a certain amount of renaming of variables (usually called “alpha conversion”). And, yes, there can potentially be a cascade of renamings. And you might have to come up with an arbitrary number of distinct new names to use in these renamings.
And once you’ve done the substitution (“beta reduction”) of λ[x, body][y] this can “expose” another lambda in which you have to do more substitution, and so on. This might sound like a detail. But actually it’s quite fundamental—and it’s what lets lambdas represent computations that “keep running”, and that give rise to the rich ruliology we’ll discuss here.
Is it somehow easier to do beta reduction with de Bruijn indices? No, not really. Yes, you don’t have to invent names for variables. But instead you have to do rather fiddly recursive realignment of indices whenever you’re in effect inserting or deleting lambdas.
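To make the basic operation concrete, here is a deliberately naive sketch of beta reduction in Wolfram Language terms, with λ as an inert symbol. It sidesteps the renaming and index-shifting subtleties just described, so it is only safe when variable names are distinct, but it is enough for simple evaluations like the one above:

    (* one pass of naive beta reduction: rewrite every outermost λ[x, body][y] by substituting
       y for x throughout body; inner redexes exposed by this get handled on later passes *)
    betaPass[expr_] := expr /. λ[x_, body_][y_] :> (body /. x -> y)

    (* repeat until nothing changes; this will of course not terminate for lambdas
       that never reach a fixed point *)
    betaNormalForm[expr_] := FixedPoint[betaPass, expr]

    betaNormalForm[λ[x, x[x]][a]]     (* gives a[a] *)

Note that a single betaPass can perform several beta reductions at once, so its pass counts need not match the one-reduction-per-step convention used in this piece.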
There’s another issue as well. As we’ll explore in detail later, there are often multiple beta reductions that can be done on a given lambda. And if we trace all the possibilities we get a whole multiway graph of possible “paths of lambda evaluation”. But a fundamental fact about lambdas (known as confluence or the Church–Rosser property, and related to causal invariance) is that if a sequence of beta reductions for a given lambda eventually reaches a fixed point (which, as we’ll see, it doesn’t always), that fixed point will be unique: in other words, evaluating the lambda will always give a definite result (or “normal form”).
So what should one call the things one does when one’s working with lambdas? In Wolfram Language we talk of the process of, say, turning Function[x, x[x]][a] (or λ[x, x[x]][a]) into a[a] as “evaluation”. But sometimes it’s instead called, for example, reduction or conversion or substitution or rewriting. In Wolfram Language we might also call λ[…] a “λ expression”, but it’s also sometimes called a λ term. What about what’s inside it? In λ[x, body] the x is usually called a “bound variable”. (If there’s a variable—say y—that appears in body and that hasn’t been “bound” (or “scoped”) by a λ, it’s called a “free variable”. The λ expressions that we’ll be discussing here normally have no free variables; such λ expressions are often called “closed λ terms”. )
Putting a lambda around something—as in λ[x, thing]—is usually called “λ abstraction”, or just “abstraction”, and the λ itself is sometimes called the “abstractor”. Within the “body” of a lambda, applying x to y to form x[y] is—not surprisingly—usually called an “application” (and, as we’ll discuss later, it can be represented as x@y or x•y). What about operations on lambdas? Historically three are identified: α reduction (or α conversion), β reduction (or β conversion) and η reduction (or η conversion). β reduction—which we discussed above—is really the core “evaluation-style” operation (e.g. replacing λ[x, f[x]][y] with f[y]). α and η reduction are, in a sense, “symbolic housekeeping operations”; α reduction is the renaming of variables, while η reduction is the reduction of λ[x, f[x]] to f alone. (In principle one could also imagine other reductions—say transformations directly defined for λ[_][_][_]—though these aren’t traditional for lambdas.)
The actual part of a lambda on which a reduction is done is often called a “redex”; the rest of the lambda is then the “context”—or “an expression with a hole”. In what we’ll be doing here, we’ll mostly be concerned with the—fundamentally computational—process of progressively evaluating/reducing/converting/… lambdas. But a lot of the formal work done in the past on lambdas has concentrated on the somewhat different problem of the “calculus of λ conversion”—or just “λ calculus”—which concentrates particularly on identifying equivalences between lambdas, or, in other words, determining when one λ expression can be converted into another, or has the same normal form as another.
By the way, it’s worth saying that—as I mentioned above—there’s a close relationship between lambdas and the combinators about which I wrote extensively a few years ago. And indeed many of the phenomena that I discussed for combinators will show up, with various modifications, in what I write here about lambdas. In general, combinators tend to be a bit easier to work with than lambdas at a formal and computational level. But lambdas have the distinct advantage that at least at the level of individual beta reductions it’s much easier to interpret what they do. The S combinator works in a braintwisting kind of way. But a beta reduction is effectively just “applying a function to an argument”. Of course, when one ends up doing lots of beta reductions, the results can be very complicated—which is what we’ll be talking about here.
How can one do familiar computations just using lambdas? First, we need a way to represent things purely in terms of lambdas. So, for example, as we discussed above, we can represent integers using successive nesting. But now we can also define an explicit successor function purely in terms of lambdas. There are many possible ways to do this; one example is
where this is equivalent to:
Defining zero as
or
we can now form a representation of the number 3:
But “evaluating” this by applying beta reduction, it turns out to reduce to
which was exactly the representation of 3 that we had above.
Along the same lines, we can define
(where these functions take arguments in “curried” form, say, times[3][2], etc.).
So then for example 3×2 becomes
which evaluates (after 37 steps of beta reduction) to:
Applying this to “unevaluated” s and z we get
which under beta reduction gives
which we can recognize as our representation for the number 6.
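One way to play with these constructions concretely, without implementing beta reduction at all, is to let the Wolfram Language evaluator do the work, writing the standard textbook Church-numeral combinators directly with Function (these are not necessarily the exact lambdas shown above):

    zero  = Function[f, Function[x, x]];
    succ  = Function[n, Function[f, Function[x, f[n[f][x]]]]];
    plus  = Function[m, Function[n, Function[f, Function[x, m[f][n[f][x]]]]]];
    times = Function[m, Function[n, Function[f, m[n[f]]]]];

    (* decode a Church numeral by applying it to "add 1" and then to 0 *)
    toInteger[c_] := c[# + 1 &][0]

    toInteger[times[succ[succ[succ[zero]]]][succ[succ[zero]]]]     (* 3 × 2 gives 6 *)
    toInteger[plus[succ[zero]][succ[succ[zero]]]]                  (* 1 + 2 gives 3 *)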
And, yes, we can go further. For example, here’s a representation (likely not the absolutely minimal one) for the factorial function in terms of pure lambdas:
And with this definition
i.e. 4!, indeed eventually evaluates to 24:
Later on we’ll discuss the process of doing an evaluation like this and we’ll see that, yes, it’s fairly complicated. In the default way we’ll carry out such evaluations, this particular one takes 103 steps (for successive factorials, the number of steps required is 8, 13, 31, 103, 463, 2623, …).
From what we’ve seen so far, one obvious observation about lambdas is that—except in trivial cases—they’re pretty hard to read. But do they need to be? There are lots of ways to write—and render—them. But what we’ll find is that while different ones have different features and advantages (of which we’ll later make use), none of them really “crack the code” of making lambdas universally easy for us to read.
Let’s consider the following lambda, written out in our standard textual way:
This indicates how the variables that appear are related:
And, yes, since each λ “scopes” its bound variable, we could as well “reuse” y in place of z:
One obvious thing that makes an expression like this hard to read is all the brackets it contains. So how can we get rid of these? Well, in the Wolfram Language we can always write f@x in place of f[x]:
And, yes, because the @ operator in the Wolfram Language associates to the right, we can write f@g@x to represent f[g[x]]. So, OK, using @ lets us get rid of quite a few brackets. But instead it introduces parentheses. So what can we do about these? One thing we might try is to use the application operator • (Application) instead of @, where • is defined to associate to the left, so that f•g•x is f[g][x] instead of f[g[x]]. So in terms of • our lambda becomes:
The parentheses moved around, but they didn’t go away.
We can make these expressions look a bit simpler by using spaces instead of explicit @ or •. But to know what the expressions mean we have to decide whether spaces mean @ or •. In the Wolfram Language, @ tends to be much more useful than • (because, for example, f@g@x conveniently represents applying one function, then another). But in the early history of mathematical logic, spaces were effectively used to represent •. Oh, and λ[x, body] was written as λx . body, giving the following for our example lambda:
We’ve been trying to write lambdas out in linear form. But in some sense, they’re always really trees—just like expressions in the Wolfram Language are trees. So here for example is
shown as a tree:
The leaves of the tree are variables. The intermediate nodes are either λ[…, …]’s (“abstractions”) or …[…]’s (“applications”). Everything about lambdas can be done in this tree representation. So, for example, beta reduction can be thought of as a replacement operation on a piece of the tree representing a lambda:
But, OK, as we discussed above, it never matters what specific names have been given to the variables. All that matters is what lambdas they’re associated with. And we can indicate this by overlaying appropriate connections on the tree:
But what if we just route these connections along the branches in the tree?
To find out which λ a given variable goes with, we just have to count how many λ’s we encounter walking up the tree from the leaf representing that variable before we reach “its λ”. So this means we can represent the lambda just by filling in a number at each leaf of the tree:
And these numbers are precisely the de Bruijn indices we discussed earlier. So now we’ve seen that we can replace our “named variables” lambda tree with a de Bruijn index lambda tree.
Writing this particular tree out as an expression we get:
The λ’s now don’t have any explicit variables specified. But, yes, the expression is “hooked up” the same way it was when we were explicitly using variables:
Once again we can get rid of our “application brackets” by using the • (Application) operator:
And actually we don’t really need either the •’s or even the λ’s here: everything can be deduced just from the sequence of brackets. So in the end we can write our lambda out as
where there’s an “implicit λ” before every “[” character, and an implicit • between every pair of digits. (Yes, we’re only dealing with the case where all de Bruijn indices are below 10.)
So now we have a minimal way to write out a lambda as a string of characters. It’s compact, but hard to decode. So how can we do better?
At the very simplest level, we could, for example, color code every character (here with explicit λ’s and brackets included):
And indeed we’ll find this representation useful later.
But any “pure string” makes the nesting structure of the lambda visible only rather implicitly, in the sequence of brackets. So how can we make this more explicit? Well, we could explicitly frame each nested λ:
Or we could just indicate the “depth” of each element—either “typographically”
or graphically (where the depth is TreeDepth):
We can think of this as a kind of one-dimensionally-unrolled version of a tree representation of lambdas. Another approach is in effect to draw the tree on a 2D grid—to form what we can call a Tromp diagram:
To see what’s going on here, we can superimpose our standard tree on the Tromp diagram:
Each vertical line represents an instance of a variable. The top of the line indicates the λ associated with that variable—with the λ being represented by a horizontal line. When one instance of a variable is applied to another, as in u[v], there is a horizontal line spanning from the “u” vertical line to the “v” one.
And with this setup, here are Tromp diagrams for some small lambdas:
Another graphical approach is to use “string diagrams” that directly show how the elements of lambdas are “wired up”. Consider for example the simple lambda:
This can be represented as the string diagram:
The connectivity here is the same as in:
In the string diagram, however, the explicit variables have been “squashed out”, but are indicated implicitly by which λ’s the lines connect back to. At each λ node there is one outgoing edge that represents the argument of the λ, and one incoming edge that connects to its body. Just as when we represent the lambda with a tree, the “content” of the lambda “hangs off” the top of the diagram.
Here are examples of string diagrams for a few small lambdas:
In these diagrams, the edges that “exit the diagram” at the bottom correspond to variables that are unused in the lambda. The “cups” that appear in the diagrams correspond to “variables that are immediately used”, as in λ[x, x]. What about the
in, for example, λ[a,λ[b,b[b]]]? It represents the copying of a variable, which is needed here because b is used twice in the lambda.
Returning to our previous example (now written with explicit variables)
here is the (somewhat complicated) string diagram corresponding to this lambda:
One can think of a string diagram like this as somehow defining the “flow of symbolic content” through “pipes” associated with variables. But is there some purer way to represent such flows, without ever having to introduce variables? Well, yes there is, and it’s to use combinators. And indeed there’s an immediate correspondence between combinator expressions and lambdas. So, for example
can be represented in terms of combinators by:
As is typical of combinators, this is in a sense very pure and uniform, but it’s also very difficult to interpret. And, yes, as we’ll discuss later, small lambda expressions can correspond to large combinator expressions. And the same is true the other way around, so that, for example,
becomes as a lambda expression
or in compact form:
But, OK, this compact form is in some sense a fairly efficient representation. But it has some hacky features. Like, for example, if there are de Bruijn indices whose values are more than one digit long, they’ll need to be separated by some kind of “punctuation”.
But what if we insist on representing a lambda expression purely as a sequence of 0’s and 1’s? Well, one thing we can do is to take our compact form above, represent its “punctuation characters” by fixed strings, and then effectively represent its de Bruijn indices by numbers in unary—giving for our example above:
This procedure can be thought of as providing a way to represent any lambda by an integer (though it’s certainly not true that it associates every integer with a lambda).
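For concreteness, here is a sketch of one standard encoding along these lines (essentially John Tromp’s “binary lambda calculus” convention; the exact convention used for the example above may differ in detail), operating on de-Bruijn-form lambdas with an inert λ head:

    (* λ ↦ "00", application ↦ "01", de Bruijn index n ↦ n 1's followed by a 0;
       the abstraction rule is given first so it takes precedence over the generic application rule *)
    encode[λ[body_]] := "00" <> encode[body]
    encode[n_Integer] := StringRepeat["1", n] <> "0"
    encode[f_[arg_]] := "01" <> encode[f] <> encode[arg]

    encode[λ[λ[1[2]]]]     (* gives "00000110110" *)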
To begin our ruliological investigation of lambdas and what they do, we need to discuss what possible forms of lambdas there are—or, in other words, how to enumerate lambda expressions. An obvious strategy is to look at all lambda expressions that have progressively greater sizes. But what is the appropriate measure of “size” for a lambda expression?
Mostly here we’ll use Wolfram Language LeafCount. So, for example
will be considered to have “size” 10 because in the most direct Wolfram Language tree rendering
there are 10 “leaves”. We’ll usually render lambdas instead in forms like
in which case the size is the number of “de Bruijn index leaves” or “variable leaves” plus the number of λ’s.
(Note that we could consider alternative measures of size in which, for example, we assign different weights to abstractions (λ’s) and to applications, or to de Bruijn index “leaves” with different values.)
Using LeafCount as our measure, there are then, for example, 14 lambdas of size 4:
As trees these become
while as Tromp diagrams they are:
The number of lambdas grows rapidly with size:
These numbers are actually given by c[n,0] computed from the simple recurrence:
For large n, this grows roughly like n!, though apparently slightly slower.
At size n, the maximum depth of the expression tree is n – 1; the mean is roughly 3 + n/4 and the limiting distribution is:
We can also display the actual forms of the lambdas, with a notable feature being the comparative diversity of forms seen:
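As a check on these counts, one can also enumerate lambdas by brute force. Here is a sketch (presumably not the method used above, and practical only for modest sizes) that builds all closed de-Bruijn-form lambdas, taking “size” to be the number of λ’s plus the number of index leaves:

    (* enum[n, d]: all lambdas of size n whose free de Bruijn indices are at most d;
       closed lambdas of size n are then enum[n, 0] *)
    enum[n_, d_] := enum[n, d] = Join[
       If[n == 1, Range[d], {}],                           (* a bare index leaf *)
       If[n >= 2, λ /@ enum[n - 1, d + 1], {}],            (* an abstraction λ[body] *)
       Flatten[Table[                                      (* an application f[x] *)
          Outer[#1[#2] &, enum[k, d], enum[n - k, d], 1],
          {k, 1, n - 1}], 2]]

    Length[enum[4, 0]]     (* 14, as quoted above *)
    Length[enum[5, 0]]     (* 82 *)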
What does it mean to “evaluate” a lambda? The most obvious interpretation is to do a sequence of beta reductions until one reaches a “fixed point” where no further beta reduction can be done.
So, for example, one might have:
If one uses de Bruijn indices, beta reduction becomes a transformation for λ[_][_], and our example becomes:
In compact notation this is
while in terms of trees it is:
By rendering the characters in the de Bruijn index form as color squares, we can also represent this evolution in the “spacetime” form:
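In the sketches here, this kind of picture can be approximated by collecting the whole chain of expressions and plotting their sizes; note again that the naive betaPass from above can do several beta reductions per pass, so its step counts need not line up exactly with the single-reduction-per-step convention used in this piece:

    betaPass[expr_] := expr /. λ[x_, body_][y_] :> (body /. x -> y)

    (* run for at most tmax passes, stopping early if a fixed point is reached *)
    evolution[expr_, tmax_] := FixedPointList[betaPass, expr, tmax]

    (* successive sizes, in the LeafCount sense used above *)
    sizePlot[expr_, tmax_] := ListStepPlot[LeafCount /@ evolution[expr, tmax]]

    evolution[λ[x, x[x]][λ[y, y]], 10]     (* a tiny terminating example *)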
But what about other lambdas? None of the lambdas of size 3
contain the pattern λ[_][_], and so none of them allow beta reduction. At size 4 the same is true of most of the 14 possible lambdas, but there is one exception, which allows a single step of beta reduction:
At size 5, 35% of the lambdas aren’t inert, but they still have short “lifetimes”
with the longest “evaluation chain” being:
At size 6, 44% of lambdas allow beta reductions, and aren’t immediately inert:
A couple of lambdas have length-4 evaluation chains:
And now there’s something new—a period-1 “looping lambda” (or “quine”) that just keeps on transforming into itself under beta reduction, and so never reaches a fixed point where no beta reduction can be applied:
At size 7, just over half of the lambdas aren’t inert:
There are a couple of “self-repeating” lambdas:
The first is a simple extension of the self-repeating lambda at size 6; the second is something at least slightly different.
At size 7 there are also a couple of cases where there’s a small “transient” before a final repeating state is reached (here shown in compact form):
Here’s what happens with the lambda trees in this case—where now we’ve indicated in red the part of each tree involved in each beta reduction:
And there’s also something new: three cases where beta reduction leads to lambdas that grow forever. The simplest case is
which grows in size by 1 at each step:
Then there’s
which, after a single-step transient, grows in size by 4 at each step:
And finally there’s
which grows in a slightly more complicated way
alternating between growing by 4 and by 12 on successive steps (so at step t > 1 it is of size 8t – 4 Mod[t + 1, 2]):
If we look at what’s happening here “underneath”, we see that beta reduction is just repeatedly getting applied to the first λ at each step:
And we can conveniently indicate this when we plot the successive sizes of lambdas:
At size 8, things begin to get more interesting. Of the 43,977 possible lambdas, 55% aren’t inert:
The longest evaluation chain that terminates occurs for
and is of length 12:
Among all 43,977 size-8 lambdas, there are only 164 distinct patterns of growth:
The 81 lambdas that don’t terminate have various patterns of growth:
Some, such as
alternate every other step:
Others, such as
have period 3:
Sometimes, as in
there’s a mixture of growth and periodicity:
There’s simple nested behavior, as in
where the envelope grows like
.
There’s more elaborate nested behavior in
where now the envelope grows linearly with t.
Finally, there’s
which instead shows highly regular
growth:
And the conclusion is that for lambdas up to size 8, nothing more complicated than nested behavior occurs. And in fact in all these cases it’s possible to derive exact formulas for the size at step t.
To see roughly how these formulas work, we can start with the sequence:
This sequence turns out to be given by the nestedly recursive recurrence
and the t-th term here is just:
For the first nested sequence above, the corresponding result is (for t > 1):
For the second one it is (for t > 3):
And for the third one it is (for t > 1):
And, yes, it’s quite remarkable that these formulas exist, yet are as comparatively complicated as they are. In effect, they’re taking us a certain distance to the edge of what can be captured with traditional mathematics, as opposed to being findable only by irreducible computation.
We’ve looked systematically at lambdas up to size 8. So what happens with larger lambdas?
At size 9 there are 454,283 possible lambdas. Now the distribution of lifetimes is:
The longest finite lifetime—of 55 steps—occurs for
which eventually evolves to just:
Here are the sizes of the lambdas obtained at each step:
But why does this particular process terminate? It’s not easy to see. Looking at the sequence of trees generated, there’s some indication of how branches rearrange and then disappear:
The array plot isn’t very illuminating either:
As is so often the case, it doesn’t seem as if there’s any “identifiable mechanism” to what happens; it “just does”.
The runner-up for lifetime among length-9 lambdas—with lifetime 44—is
which eventually evolves to
(which happens to correspond to the integer 16 in the representation we discussed above):
Viewed in “array” form, there’s at least a hint of “mechanism”: we see that over the course of the evolution there’s a steady (linear) increase in the quantity of “pure applications” without any λ’s to “drive things”:
Of the 748 size-9 lambdas that grow forever, most have ultimately periodic patterns of growth. The one with the longest growth period (10 steps) is
which has a rather unremarkable pattern of growth:
Only 35 lambdas have more complicated patterns of growth:
Many have behavior similar to what we saw with size 8. There can be nested patterns of growth, with an envelope that’s linear:
There can be what’s ultimately pure exponential
growth:
There can also be slower but still exponential growth, in this case
:
There’s growth that at first looks like it might be complicated, but eventually ends up linear:
Then there’s what might seem to be irregular growth:
But taking differences reveals a nested structure
even though there’s no obvious nested structure in the detailed pattern of lambdas in this case:
And that’s about it for size 9.
OK, what about size 10? There are now 5,159,441 possible lambdas. Of these, 38% are inert and 0.15% of them don’t terminate. The distribution of lifetimes is:
Some examples of lambdas with long lifetimes are:
Some of these terminate with small lambdas, but not all: the one with lifetime 88 yields a lambda of size 342, while the second one with lifetime 56 yields one of size 386:
All of these work in somewhat different ways. In some cases the behavior looks quite systematic, and it’s at least somewhat clear they’ll eventually halt; in others it’s not at all obvious:
It’s notable that many, though not all, of these evolutions show linearly increasing “dead” regions, consisting purely of applications, without any actual λ’s—so that, for example, the final fixed point in the lifetime-123 case is:
What about other cases? Well, there are surprises. Like consider:
Running this, say, for 2000 steps it seems to be just uniformly growing:
But then, suddenly, after 3779 steps, there’s a surprise: it terminates—leaving a lambda of size 4950:
Looking at successive size differences doesn’t give any obvious “advance warning” of termination:
But if one looks at the actual sequence of lambdas generated (here just for 200 steps) we see what appears to be a somewhat telltale buildup of “dead” applications:
And, indeed, in the next section, we’ll see an even longer-lived example.
Even though it’s sometimes hard to tell whether the evaluation of a lambda will terminate, there are plenty of cases where it’s obvious. An example is when the evaluation chain ultimately becomes periodic. At size 10, there are longer periods than before; here’s an example with period 27:
There’s also growth where the sequence of sizes has obvious nested structure—with a linearly growing envelope:
And then there’s nested structure, but superimposed on
growth:
At size 11, there are 63,782,411 possible lambdas. Most of the behavior one sees looks very much as before. But there are some surprises. One of them is:
And another is:
And, yes, there doesn’t seem to be any overall regularity here, and nor does any show up in the detailed sequence of lambdas produced:
So far as I can tell, like so many other simple computational systems—from rule 30 on—the fairly simple lambda
just keeps generating unpredictable behavior forever.
How do we know whether the evaluation chain for a particular lambda will terminate? Well, we can just run it for a certain number of steps and see if it terminates by then. And perhaps by that point we will see obvious periodic or nested behavior that will make it clear that it won’t ever terminate. But in general we have no way to know how long we might have to run it to be able to determine whether it terminates or not.
It’s a very typical example of the computational irreducibility that occurs throughout the world of even simple computational systems. And in general it leads us to say that the problem of whether a given lambda evaluation will terminate must be considered undecidable, in the sense that there’s no computation of bounded length that can guarantee to answer the question.
But what happens in practice? Well, we’ve seen above that at least for sufficiently small lambdas it’s not too hard to resolve the question of termination. But already at size 10 there are issues. Take for example:
The first 1000 steps of evaluation lead to lambdas with a sequence of sizes that seem like they’re basically just uniformly increasing:
After 10,000 steps it’s the same story—with the size growing roughly like t/3:
But if we look at the structure of the actual underlying lambdas—even for just 200 steps—we see that there’s a “dead region” that’s beginning to build up:
Continuing to 500 steps, the dead region continues to grow more or less linearly:
“Detrending” the sequences of sizes by subtracting t/3 we get:
There’s definite regularity here, which suggests that we might actually be able to identify a “pocket of computational reducibility” in this particular case—and be able to work out what will happen without having to explicitly run each step.
And, yes, in fact that’s true. Let’s look again at the form of the lambda we’re dealing with:
The second part turns out to be the (“Church numeral”) representation we used for the number 2. Calling this
the whole lambda evaluates according to:
Normally you might think of doing things like arithmetic operations on integers. But here we’re seeing something different: we’re seeing integers in effect applied to integers. So what, for example, does
or
evaluate to? Repeatedly doing beta reduction gives:
Or, in other words
gives
. Here are some other examples:
And in general:
So now we can “decode” our original lambda. Since
evaluates to
this means that
is just:
Or, in other words, despite how it might have looked at first, the evaluation of our original lambda will eventually terminate, giving a lambda of size 65539—and it’ll take about 200,000 steps to do this. (Curiously, this whole setup is extremely similar to something I studied about 30 years ago in connection with combinator-like systems of the form e[x_][y_] → x[x[y]].)
It’s somewhat remarkable how long it took our simple lambda to terminate. And actually, given what we’ve seen here, it’s straightforward to construct lambdas that will terminate, but will take even longer to do so. So, for example
evaluates to the n-fold “tetration”
and takes about that many steps to do so.
In the next section we’ll see how to set up lambdas that can compute not just tetration, but the Ackermann function, and any level of hyperoperation, as defined recursively by
(where h[1] is Plus, h[2] is Times, h[3] is Power, h[4] is tetration, etc.).
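For reference, here is a sketch of the standard hyperoperation recurrence being alluded to, written directly in the Wolfram Language rather than as lambdas (conventions for the base cases vary, so this is one common choice rather than necessarily the exact definition intended here):

    h[1][a_][b_] := a + b                                    (* level 1 is addition *)
    h[k_ /; k >= 2][a_][0] := If[k == 2, 0, 1]               (* base cases: a×0 = 0, a^0 = 1, ... *)
    h[k_ /; k >= 2][a_][b_ /; b >= 1] :=
      h[k - 1][a][h[k][a][b - 1]]                            (* each level iterates the one below *)

    h[2][3][4]     (* 3 × 4 = 12 *)
    h[3][2][5]     (* 2^5 = 32 *)
    h[4][2][3]     (* 2^2^2 = 16 *)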
OK, so there are lambdas (and actually even rather simple ones) whose evaluation terminates, but only after a number of steps that has to be described with hyperoperations. But what about lambdas that don’t terminate at all? We can’t explicitly trace what the lambda does over the course of an infinite number of steps. So how can we be sure that a lambda won’t terminate?
Well, we have to construct a proof. Sometimes that’s pretty straightforward because we can easily identify that something fundamentally simple is going on inside the lambda. Consider for example:
The successive steps in its evaluation are (with the elements involved in each beta reduction highlighted):
It’s very obvious what’s going on here, and that it will continue forever. But if we wanted to, we could construct a formal proof of what’s going to happen—say using mathematical induction to relate one step to the next.
And indeed whenever we can identify a (“stem-cell-like”) core “generator” that repeats, we can expect to be able to use induction to prove that there won’t be termination. There are also plenty of other cases that we’ve seen in which we can tell there’s enough regularity—say in the pattern of successive lambdas—that it’s “visually obvious” there won’t be termination. And though there will often be many details to wrestle with, one can expect a straightforward path in such cases to a formal proof based on mathematical induction.
In practice one reason “visual obviousness” can sometimes be difficult to identify is that lambdas can grow too fast to be consistently visualized. The basic source of such rapid growth is that beta reduction ends up repeatedly replicating increasingly large subexpressions. If these replications in some sense form a simple tree, then one can again expect to do induction-based proofs (typically based on various kinds of tree traversals). But things can get more complicated if one starts ascending the hierarchy of hyperoperations, as we know lambdas can. And for example, one can end up with situations where every leaf of every tree is spawning a new tree.
At the outset, it’s certainly not self-evident that the process of induction would be a valid way to construct a proof of a “statement about infinity” (such as that a particular lambda won’t ever terminate, even after an infinite number of steps). But induction has a long history of use in mathematics, and for more than a century has been enshrined as an axiom in Peano arithmetic—the axiom system that’s essentially universally used (at least in principle) in basic mathematics. Induction in a sense says that if you always have a way to take a next step in a sequence (say on a number line of integers), then you’ll be able to go even infinitely far in that sequence. But what if what you’re dealing with can’t readily be “unrolled” into a simple (integer-like) sequence? For example, what if you’re dealing with trees that are growing new trees everywhere? Well, then ordinary induction won’t be enough to “get you to all the leaves”.
But in set theory (which is typically the ultimate axiom system in principle used for current mathematics) there are axioms that go beyond ordinary induction, and that allow the concept of transfinite induction. And with transfinite induction (which can “walk through” all possible ordered sets) one can “reach the leaves” in a “trees always beget trees” system.
So what this means is that while the nontermination of a sufficiently fast-growing lambda may not be something that can be proved in Peano arithmetic (i.e. with ordinary induction) it’s still possible that it’ll be provable in set theory. Inevitably, though, even set theory will be limited, and there’ll be lambdas whose nontermination it can’t prove, and for which one will have to introduce new axioms to be able to produce a proof.
But can we explicitly see lambdas that have these issues? I wouldn’t be surprised if some of the ones we’ve already considered here might actually be examples. But that will most likely be a very difficult thing to prove. And an easier (but still difficult) approach is to explicitly construct, say, a family of lambdas where within Peano arithmetic there can be no proof that they all terminate.
Consider the slightly complicated definitions:
The g[n] are the so-called Goodstein sequences for which it’s known that the statement that they always reach 0 is one that can’t be proved purely using ordinary induction, i.e. in Peano arithmetic.
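Since the definitions themselves aren’t reproduced here, it may help to recall the standard construction: write n in “hereditary base b” form (with the exponents themselves recursively written in base b), bump every b to b + 1, subtract 1, and repeat with ever-increasing bases. Here is a sketch in ordinary Wolfram Language (an assumption about the precise convention behind the g[n] above):

    (* rewrite n from hereditary base b to hereditary base c *)
    bump[n_, b_, c_] := If[n < b, n,
      Total[MapIndexed[#1 c^bump[First[#2] - 1, b, c] &, Reverse[IntegerDigits[n, b]]]]]

    (* the Goodstein sequence starting from n: bump the base, subtract 1, repeat until 0 *)
    goodstein[n_] := Module[{a = n, b = 2, seq = {n}},
      While[a > 0, a = bump[a, b, b + 1] - 1; b++; AppendTo[seq, a]]; seq]

    goodstein[1]     (* {1, 0} *)
    goodstein[3]     (* {3, 3, 3, 2, 1, 0} *)
    (* don't try goodstein[4]: as discussed below, its length is astronomically large *)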
Well, here’s a lambda—that’s even comparatively simple—that has been constructed to compute the lengths of the Goodstein sequences:
Feed in a representation of an integer n, and this lambda will compute a representation of the length of the corresponding sequence g[n] (i.e. how many terms are needed to reach 0). And this length will be finite if (and only if) the evaluation of the lambda terminates. So, in other words, the statement that all lambda evaluations of this kind terminate is equivalent to the statement that all Goodstein sequences eventually reach 0.
But what actually happens if we evaluate lambdas of this kind? Here are the results for the first few values of
:
For n = 1, the process terminates in 18 steps with final result λ[λ[2[2[1]]]]—our representation for the integer 2—reflecting the fact that g[1] = {1, 0}. For n = 2, after 38 steps we get the integer 4, reflecting the fact that g[2] = {2, 2, 1, 0}. For n = 3 many more steps are needed, but eventually the result is 6 (reflecting the fact that g[3] = {3, 3, 3, 2, 1, 0}). But what about n = 4? Well, it’s known that the process must eventually terminate in this case too. But it must take a very long time—since the final result is known to be
. And if we keep going, the sizes—and corresponding times—rapidly get still much, much larger.
This certainly isn’t the only way to “escape provability by Peano arithmetic”, but it’s an example of what this can look like.
We should remember that in general whether the evaluation of a lambda ever terminates is a question that’s fundamentally undecidable—in the sense that no finite computation whatsoever can guarantee to answer it. But the question of proving, say, that the evaluation of a particular lambda doesn’t terminate is something different. If the lambda “wasn’t generating much novelty” in its infinite lifetime (say it was just behaving in a repetitive way) then we’d be able to give a fairly simple proof of its nontermination. But the more fundamentally diverse what the lambda does, the more complicated the proof will be. And the point is that—essentially as a result of computational irreducibility—there must be lambdas whose behavior is arbitrarily complex, and diverse. But in some sense any given finite axiom system can only be expected to capture a certain “level of diversity” and therefore be able to prove nontermination only for a certain set of lambdas.
To be slightly more concrete, we can imagine a construction that will lead us to a lambda whose termination can’t be proved within a given axiom system. It starts from the fact that the computation universality of lambdas implies that any computation can be encoded as the evaluation of some lambda. So in particular there’s a lambda that systematically generates statements in the language of a given axiom system, determining whether each of those statements is a theorem in that axiom system. And we can then set it up so that if ever an inconsistency is found—in the sense that both a theorem and its negation appear—then the process will stop. In other words, the evaluation of the lambda is in a sense systematically looking for inconsistencies in the axiom system, terminating if it ever finds one.
So this means that proving the evaluation of the lambda will never terminate is equivalent to proving that the axiom system is consistent. But it’s a fundamental fact that a given axiom system can never prove its own consistency; it takes a more powerful axiom system to do that. So this means that we’ll never be able to prove that the evaluation of the lambda we’ve set up terminates—at least from within the axiom system that it’s based on.
In effect what’s happened is that we’ve managed to make our lambda in its infinite lifetime scan through everything our axiom system can do, with the result that the axiom system can’t “go further” and say anything “bigger” about what the lambda does. It’s a consequence of the computation universality of lambdas that it’s possible to package up the whole “metamathematics” of our axiom system into the (infinite) behavior of a single lambda. But the result is that for any given axiom system we can in principle produce a lambda where we know that the nontermination of that lambda is unprovable within that axiom system.
Of course, if we extend our axiom system to have the additional axiom “this particular lambda never terminates” then we’ll trivially be able to prove nontermination. But the underlying computation universality of lambdas implies (in an analog of Gödel’s theorem) that no finite collection of axioms can ever successfully let us get finite proofs for all possible lambdas.
And, yes, while the construction we’ve outlined gives us a particular way to come up with lambdas that “escape” a given axiom system, those certainly won’t be the only lambdas that can do it. What will be the smallest lambda whose nontermination is unprovable in, say, Peano arithmetic? The example we gave above based on Goodstein sequences is already quite small. But there are no doubt still much smaller examples. Though in general there’s no upper bound on how difficult it might be to prove that a particular lambda is indeed an example.
If we scan through all possible lambdas, it’s inevitable that we’ll eventually find a lambda that can “escape” from any possible axiom system. And from experience in the computational universe, we can expect that even for quite small lambdas there’ll be rapid acceleration in the size and power of axiom systems needed to “rein them in”. Yes, in principle we can just add a custom axiom to handle any given lambda, but the point is that we can expect that there’ll quickly be in a sense “arbitrarily small leverage”: each axiom added will cover only a vanishingly small fraction of the lambdas we reach.
So far, we’ve discussed the evaluation of “lambdas on their own”. But given a particular lambda, we can always apply it to “input”, and see “what it computes”. Of course, the input itself is just another lambda, so a “lambda applied to input” is just a “lambda applied to a lambda”, which is itself a lambda. And so we’re basically back to looking at “lambdas on their own”.
But what if the input we give is something “interpretable”? Say a lambda corresponding to an integer in the representation we used above. A lambda applied to such a thing will—assuming it terminates—inevitably just give another lambda. And most of the time that resulting lambda won’t be “interpretable”. But sometimes it can be.
Consider for example:
Here’s what happens if we apply this lambda to a sequence of integers represented as lambdas:
In other words, this lambda can be interpreted as implementing a numerical function from integers to integers—in this case the function:
So what kinds of such numerical functions can lambdas represent? Insofar as lambdas are capable of universal computation, they must in the end be capable of representing any integer function. But the necessary lambda could be very large.
Early on above, we saw a constructed representation for the factorial function as a lambda:
We also know that if we take the (“Church numeral”) lambda that represents an integer m and apply it to the lambda that represents n, we get the lambda representing n^m. It’s then straightforward to construct a lambda that represents any function that can be obtained as a composition of powers:
And one can readily go on to get functions based on tetration and higher hyperoperations. For example
gives a diagonal of the Ackermann function whose successive terms are:
But what about “lambdas in the wild”? How often will they be “numerically interpretable”? And what kinds of functions will they represent? Among size-5 lambdas, for example, 10 out of 82—or about 12%—are “interpretable”, and represent the functions 0, 1, n and n². Among size-10 lambdas the fraction that are “interpretable” falls to about 4%, and the frequency with which different functions occur is:
Looking at lambdas of successively greater size, here are some notable “firsts” of where functions appear:
And, yes, we can think of the size of the minimal lambda that gives a particular function as the “algorithmic information content” of that function with respect to lambdas.
Needless to say, there are some other surprises. Like
, which gives 0 when fed any integer as input, but takes a superexponentially increasing number of steps to do so:
We’ve been discussing what “numerical” functions can be computed by what lambdas. But another question we can ask is how much time (i.e. how many beta reductions) such computations will take—or how much “space” they’ll need for intermediate expressions. And indeed, even if the outputs are the same, different lambdas can take different computational approaches and different computational resources to reach those outputs. So, for example, among size-8 lambdas, there are 41 that compute the function n². And here are the sizes of intermediate expressions (all plotted on the same scale) generated by running these lambdas for the case n = 10:
The pictures suggest that these various lambdas take a variety of different approaches to computing n². Some of these approaches use more “memory” (i.e. larger intermediate lambdas) than others; and some take more time than others.
For this particular set of lambdas, the time required always grows essentially just linearly with n, though at several different possible rates:
But in other cases, different behavior is observed, including some very long “run times”. And indeed, in general, one can imagine building up a whole computational complexity theory for lambdas, starting with these kinds of empirical, ruliological investigations.
Everything we’ve discussed here so far is for functions with single arguments. But we can also look at functions that have multiple arguments, which can be fed to lambdas in “curried” form: f[x][y], etc. And with this setup,
, for example, gives x y while
gives x y³.
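As a hedged sketch of what such a curried two-argument numerical lambda can look like, here’s the standard “multiplication” construction, written with the same pure-function stand-ins and the illustrative church and unchurch helpers from above (this is not necessarily the particular lambda pictured here):

```wolfram
(* A curried two-argument function on Church numerals: mult[m][n] composes m copies of n,
   so the result applied to f gives f iterated m*n times *)
mult = Function[m, Function[n, Function[f, m[n[f]]]]];

unchurch[mult[church[3]][church[4]]]  (* 12, i.e. x y with x = 3, y = 4 *)
```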
Let’s say you’re evaluating a lambda expression like:
You want to do a beta reduction. But which one? For this expression, there are 3 possible places where a beta reduction can be done, all giving different results:
In everything we’ve done so far, we’ve always assumed that we use the “first” beta reduction at each step (we’ll discuss what we mean by “first” in a moment). But in general we can look at all possible “paths of evaluation”—and form a multiway graph from them (the double edge reflects the fact that two different beta reductions happen to both yield the same result):
Our standard evaluation path in this case is then:
And we get this path by picking at each step the “leftmost outermost” possible beta reduction, i.e. the one that involves the “highest” relevant λ in the expression tree (and the leftmost one if there’s a choice):
But in the multiway graph above it’s clear we could have picked any path, and still ended up at the same final result. And indeed it’s a general property of lambdas (the so-called confluence or Church–Rosser property) that all evaluations of a given lambda that terminate will terminate with the same result.
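To experiment with these multiway graphs oneself, one needs two ingredients: a way to find all positions at which a beta reduction can be done, and a way to do the reduction at a chosen position. Here’s a minimal sketch under the conventions used here (de Bruijn-style lambdas written λ[body] with 1-based indices, and application written f[x]); the names shift, subst, betaTop, betaAt, redexPositions and successors are all just illustrative, and this is not necessarily the implementation used to produce the pictures in this piece:

```wolfram
(* Shift by d every de Bruijn index that is >= the cutoff c (i.e. free at the current depth) *)
shift[k_Integer, d_, c_] := If[k >= c, k + d, k]
shift[λ[b_], d_, c_] := λ[shift[b, d, c + 1]]
shift[f_[x_], d_, c_] /; f =!= λ := shift[f, d, c][shift[x, d, c]]

(* Substitute s for index j, adjusting indices each time we go under a binder *)
subst[k_Integer, j_, s_] := If[k == j, s, k]
subst[λ[b_], j_, s_] := λ[subst[b, j + 1, shift[s, 1, 1]]]
subst[f_[x_], j_, s_] /; f =!= λ := subst[f, j, s][subst[x, j, s]]

(* Beta-reduce a redex λ[body][arg]; betaAt does this at a given tree position *)
betaTop[λ[b_][a_]] := shift[subst[b, 1, shift[a, 1, 1]], -1, 1]
betaAt[expr_, {}] := betaTop[expr]
betaAt[expr_, pos_List] := ReplacePart[expr, pos -> betaTop[Extract[expr, pos]]]

(* Positions of all redexes, and all one-step successors of an expression *)
redexPositions[expr_] := Position[expr, λ[_][_]]
successors[expr_] := betaAt[expr, #] & /@ redexPositions[expr]
```

A states-only multiway graph out to a given number of steps can then be built with something like NestGraph[successors, λ[1[1]][λ[λ[2][1]]], 5], though no attempt is made here to reproduce the particular layouts or annotations shown in the pictures.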
By the way, here’s what the multiway graph looks like with the lambdas rendered in tree form:
At size 5, all lambdas give trivial multiway graphs that involve no branching, as in:
At size 6, there starts to be branching—and merging:
And something else happens too: the “looping lambda” λ[1[1]][λ[1[1]]] gives a multiway graph that is a loop:
At size 7 many distinct topologies of multiway graphs start to appear. Most always lead to termination:
(The largest of these multiway graphs is associated with λ[1[1]][λ[λ[2][1]]] and has 12 nodes.)
Then there are cases that loop:
And finally there is
where something new happens: the multiway graph has many distinct branches. After 5 steps it is
where the size of each node indicates the size of the lambda it represents, and the pink nodes indicate lambdas that can be further beta reduced. After just one more step the multiway graph becomes
or, after another step, in a hyperbolic embedding:
In this particular case, no branch ever terminates, and the total size of the multiway system on successive steps is:
With size-8 lambdas, the terminating multiway graphs can be larger (the first case has 124 nodes):
And there are all sorts of new topologies for nonterminating cases, such as:
These graphs are what one gets by doing 5 successive beta reductions. The ones that end in loops won’t change if we do more reductions. But the ones with nodes indicated in pink will. In most cases one ends up with a progressively increasing number of reductions that can be done (often exponential, but here just 2t – 1 at step t):
Meanwhile, sometimes, the number of “unresolved reductions” just remains constant:
We know that if there’s a fixed point to beta reductions, it’s always unique. But can there be both a unique fixed point, and branches that continue forever? It turns out that there can. And the first lambda for which this happens is:
With our standard evaluation method, this lambda terminates in 5 steps at λ[1]. But there are other paths that can be followed that don’t terminate (as indicated by pink nodes at the end):
And indeed on step t, the number of such paths is given by a Fibonacci-like recurrence, growing asymptotically like:
With size-9 lambdas there’s a much simpler case where the termination/nontermination phenomenon occurs
where now at each step there’s just a single nonterminating path:
There’s another lambda where there’s alternation between one and two nonterminating paths:
And another where there are asymptotically always exactly 5 nonterminating paths:
OK, so some lambdas have very “bushy” (or infinite) multiway graphs, and some don’t. How does this correlate with the lifetimes we obtain for these lambdas through our standard evaluation process?
Here’s a plot of the size of the multiway graph versus lifetime for all finite-lifetime size-8 lambdas:
In all but the one case discussed above (and indicated here with a red arrow) the multiway graphs are of finite size—and we see some correlation between their size and the lifetime obtained with our standard evaluation process. Although it’s worth noting that even the lambda with the longest standard-evaluation lifetime (12) still has a fairly modest (27-node) multiway graph
while the largest (finite) multiway graph (with 124 nodes) occurs for a lambda with standard-evaluation lifetime 7.
At size 9, here are the multiway graphs for the 10 lambdas with the longest standard-evaluation lifetimes, computed to 5 steps:
In the standard-evaluation lifetime-17 cases the multiway graph is finite:
But in all the other cases shown, the graph is presumably infinite. And for example, in the lifetime-55 case, the number of nodes at step t in the multiway graph grows like:
But somehow, among all these different paths, we know that there’s at least one that terminates—albeit after 55 steps, to give λ[1].
The multiway graph shows us all possible sequences of beta reductions for a lambda. Our “standard evaluation process” picks one particular sequence—by having a certain strategy for choosing at each step what beta reduction to do. But what if we were to use a different strategy?
As a first example, consider:
For this lambda, there turn out to be 4 possible positions at which beta reductions can be done (where the positions are specified here using Wolfram Language conventions):
Our standard evaluation strategy does the beta reduction that’s listed first here. But how exactly does it determine that that’s the reduction it should do? Well, it looks at the lists of 0’s and 1’s that specify the positions of each of the possible reductions in the lambda expression tree. Then it picks the one with the shortest position list, sorting position lists lexicographically if there’s a tie.
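In terms of the redexPositions and betaAt helpers sketched earlier, this selection rule, together with the evaluation chain and resource measurements built on it, can be written roughly as follows (standardRedex, evalChain, lifetime and sizes are illustrative names, and the step counting may differ by convention from the counts quoted in the text):

```wolfram
(* Shortest position list first; canonical order breaks ties among equal lengths,
   which for lists of 0s and 1s is just lexicographic order *)
standardRedex[expr_] := First[SortBy[redexPositions[expr], {Length, Identity}]]

(* One standard-evaluation step (assumes at least one redex), and the full chain;
   this will of course run forever on a nonterminating lambda *)
standardStep[expr_] := betaAt[expr, standardRedex[expr]]
evalChain[expr_] := NestWhileList[standardStep, expr, redexPositions[#] =!= {} &]

(* Number of beta reductions performed, and LeafCount sizes of intermediate expressions *)
lifetime[expr_] := Length[evalChain[expr]] - 1
sizes[expr_] := LeafCount /@ evalChain[expr]
```

Running these on, say, λ[1[1]][λ[λ[2][1]]] gives the kind of lifetime and intermediate-size data discussed above.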
In terms of the structure of the tree, this corresponds to picking the leftmost outermost beta reduction, i.e. the one that in our rendering of the tree is “highest and furthest to the left”:
So what are some other strategies we could use? One way we can specify a strategy is by giving an algorithm that picks one list out of any collection of lists of 0’s and 1’s. (For now we’ll consider only “sequential strategies”, that pick out a single position list—i.e. a single reduction—at a time.) But another way to specify a strategy is to define an ordering on the expression tree, then to say that the beta reduction we’ll do is the one we reach first if we traverse the tree in that order. And so, for example, our leftmost outermost strategy is associated with a breadth-first traversal of the tree. Another evaluation strategy is leftmost innermost, which is associated with a depth-first traversal of the tree, and a purely lexicographic ordering, independent of length, of all tree positions of beta reductions.
In the Wolfram Language, the option TreeTraversalOrder gives detailed parametrizations of possible traversal orders for use in functions like TreeMap—and each of these orders can be used to define an evaluation strategy for lambdas. The standard ReplaceAll (/.) operation applied to expression trees in the Wolfram Language also defines an evaluation strategy—which turns out to be exactly our standard outermost leftmost one. Meanwhile, using standard Wolfram Language evaluation after giving an assignment λ[x_][y_]:=… yields the leftmost innermost evaluation strategy (since at every level the head λ[…] is evaluated before the argument). If the λ had a Hold attribute, however, then we’ll instead get what we can call “applicative” order, in which the arguments are evaluated before the body. And, yes, this whole setup with possible evaluation strategies is effectively identical to what I discussed at some length for combinators a few years ago.
OK, so what happens with different strategies? Here are examples of the evaluation chains generated in various different cases:
Each of these corresponds to following a different path in the multiway graph:
And each, in effect, corresponds to a different way to reach the final result:
Plotting the sizes of lambdas obtained at each step, we see that different strategies have different profiles—notably in this case with the standard strategy taking more steps to reach the final result than, for example, depth first:
OK, but so what exactly do these strategies mean? In terms of Wolfram Language function TreeTraversalOrder they are (“applicative” involves dependence on whether an element is a λ or a •, and works slightly differently):
In terms of tree positions, these correspond to the following orderings
which correspond to these paths through nodes of a tree:
As we mentioned much earlier, it is a general feature of lambdas that if the evaluation of a particular lambda terminates, then the result obtained must always be the same—independent of the evaluation strategy used. In other words, if there are terminating paths in the multiway graph, they must all converge to the same fixed point. Some paths may be shorter than others, representing the fact that some evaluation strategies may be more efficient than others. We saw one example of this above, but the differences can be much more dramatic. For example,
takes 74 steps to terminate with our standard evaluation strategy, but only 10 with depth first:
And indeed it’s fairly common to see cases where standard evaluation and depth first produce very different behavior; sometimes one is more efficient, sometimes it’s the other. But despite differences in “how one gets there”, the final fixed-point results are inevitably always the same.
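One can check this kind of strategy independence directly. Here’s a hedged sketch (reusing the helpers above) that repeatedly evaluates a terminating lambda while choosing redexes at random; every run should reach the same fixed point:

```wolfram
(* Pick a random redex at each step; only sensible for lambdas whose evaluation always terminates *)
randomStep[expr_] := betaAt[expr, RandomChoice[redexPositions[expr]]]
randomEval[expr_] := NestWhile[randomStep, expr, redexPositions[#] =!= {} &]

Table[randomEval[λ[1][λ[1][λ[1]]]], {5}]  (* five copies of the same fixed point, λ[1] *)
```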
But what happens if only some paths in the multiway graph terminate? That’s not something that happens with any lambdas up to size 8. But at size 9, for example,
terminates with standard evaluation but not with depth first:
And an even simpler example is:
In both these cases, standard evaluation leads to termination, but depth-first evaluation does not. And in fact it’s a general result that if termination is ever possible for a particular lambda, then our standard (“leftmost outermost”) evaluation strategy will successfully achieve it.
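Here’s a hedged, hand-constructed illustration of the same phenomenon (not necessarily either of the lambdas shown above): apply the “discard your argument” lambda λ[λ[1]] to the looping lambda λ[1[1]][λ[1[1]]], and look at its one-step successors with the successors function from the earlier sketch:

```wolfram
expr = λ[λ[1]][λ[1[1]][λ[1[1]]]];
successors[expr]
(* One successor is λ[1], obtained by reducing the outer redex and discarding the looping
   argument; the other is expr itself, since reducing inside the looping argument just
   reproduces it. Leftmost outermost takes the terminating route; always reducing the
   innermost redex never escapes. *)
```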
But what if termination is never possible? Different evaluation strategies can lead to different behavior. For example,
can get stuck in a loop of either period 2 or period 1:
It’s also possible for one strategy to give lambdas that grow forever, and another to give lambdas that are (as in this
case), say, periodic in size—in effect showing two different ways of not terminating:
But what if one has a whole collection of lambdas? What generically happens? Within every multiway graph there can be a fixed point, some number of loops and some (tree-like) unbounded growth. Every strategy defines a path in the multiway graph—which can potentially lead to any of these possible outcomes. It’s fairly typical to see one type of outcome be much more likely than others—and for it to require some kind of “special tuning” to achieve a different type of outcome.
For this case
almost all evaluation strategies lead to termination—except the particular strategy based on what we might call “tree layout order”:
So what happens if we use a strategy where at each step we just pick between different possible beta reductions at random, with equal probability? Here are some examples for the [1(11)][[2(211)]] lambda we saw above, that takes 74 steps to terminate with our standard evaluation strategy:
There’s quite a diversity in the shapes of curves generated. But it’s notable that no nonterminating cases are seen—and that termination tends to occur considerably more quickly than with standard evaluation. Indeed, the distribution of termination times only just reaches the value 74 seen in the standard evaluation case:
For lambdas whose evaluation always terminates, we can think of the sequence of sizes of expressions generated in the course of their evaluation as being “pinned at both ends”. When the evaluation doesn’t terminate, the sequences of sizes can be wilder, as in these [11][1(1(1[[3]]))] cases:
Consider the lambdas:
They each have a different form. But they all evaluate to the same thing: λ[1]. And particularly if we’re thinking about lambdas in a mathematically oriented way this means we can think of all these lambdas as somehow “representing the same thing”, and therefore we can consider them “equivalent”.
Out of all 18 lambdas up to size 4, these three are actually the only ones that are equivalent in this sense. So that means there are 15 “evaluation inequivalent” lambdas up to size 4.
Up to size 5, there are a total of 100 forms of lambdas—which evolve to 68 distinct final results—and there are 4 multi-lambda equivalence classes
which we can summarize graphically as:
Up to size 6, there are a total of 679 forms of lambdas—which evolve to 392 distinct final states (along with one “looping lambda”)—and there are 16 multi-lambda equivalence classes:
If we look only at lambdas whose evaluation terminates then we have the following:
For large sizes, the ratio conceivably converges more or less exponentially to a fixed limiting value around 1/4.
But this is only for lambdas that terminate. So what can we say about equivalences between lambdas that don’t terminate? Well, it depends what we mean by “equivalence”. One definition might be that two lambdas are equivalent if somewhere in their evolution they generate identical lambda expressions. Or, in other words, there is some overlap in their multiway graphs.
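Using the successors function from the earlier sketch, a bounded-step version of this overlap test is straightforward to write (though, as discussed below, no bounded check can settle the question in general; reachable and overlapQ are illustrative names):

```wolfram
(* States reachable within t steps, and a bounded test for multiway-graph overlap *)
reachable[expr_, t_] := VertexList[NestGraph[successors, expr, t]]
overlapQ[e1_, e2_, t_] := IntersectingQ[reachable[e1, t], reachable[e2, t]]
```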
If we look up to size 7, there are 8 lambdas that don’t terminate, and there’s a single—rather simple—correspondence between their multiway systems:
Up to size 8, there are 89 lambdas whose evaluation doesn’t terminate. And among these, there are many correspondences. A simple example is:
More elaborate examples include
where now we are showing the combined multiway graphs for pairs of lambdas. In the first two cases, the overlap is at a looping lambda; in the third it is at a series of lambda expressions.
If we look at all 89 nonterminating lambdas, this shows which are the 434 overlapping pairs:
There are several subtleties here. First, whether two lambdas are “seen to be equivalent” can depend on the evaluation strategy one’s using. For example, a lambda’s evaluation could terminate with one strategy, but “miss the fixed point” and fail to terminate with another.
In addition, even if the multiway graphs for two lambdas overlap, there’s no guarantee that that overlap will be “found” with any given evaluation strategy.
And, finally, as so often, there’s an issue of undecidability. Even if a lambda is going to terminate, it can take an arbitrarily long time to do so. And, worse, if one progressively generates the multiway graphs for two lambdas, it can be arbitrarily difficult to see if they overlap. Determining a single evaluation path for a lambda can be computationally irreducible; determining these kinds of overlaps can be multicomputationally irreducible.
We can think of every beta reduction in the evaluation of a lambda as an “event” in which certain input is transformed to certain output. If the input to one event V relies on output from another event U, then we can say that V is causally dependent on U. Or, in other words, the event V can only happen if U happened first.
If we consider a lambda evaluation that involves a chain of many beta reductions, we can then imagine building up a whole “causal graph” of dependencies between beta reduction events.
As a simple example, consider this evaluation chain, where we’ve explicitly shown the beta reduction events, and (for clarity in this simple case) we’ve also avoided renaming the variables that appear:
The blue edges here show how one lambda transforms to another—as a result of an event. The dark red edges show the causal relations between events; each of them can be thought of as being associated with a subexpression that is the “carrier of causality” from one event to another.
Keeping only events and causal edges, the causal graph for this evolution is then:
If we follow a sequence of causal edges we get a “timelike” chain of events that must occur in order. But “transverse” to that we can have (in the language of posets) an “antichain” consisting of what we can think of as “spacelike separated” events that can occur in parallel. In this case the first event (i.e. the first beta reduction) leads to the expression
in which there are two reductions that can be done: for λ[b,…] and for the first λ[c,…]. But these reductions are completely independent of each other; they can be thought of as “spacelike separated”. In our standard evaluation scheme, these events occur in a definite order. But what the causal graph shows is that the result we get doesn’t require that order—and indeed allows these events to occur “in parallel”.
Many lambdas don’t allow parallelization in their evaluation, so have simple, linear causal graphs:
It’s common to have simple branching in the causal graph—associated with pieces of the λ expression that are completely separated as far as beta reduction is concerned, and can be evaluated separately:
One tricky issue that arises in constructing causal graphs for lambdas is figuring out exactly what is really dependent on what. In this case it might seem that the two events are independent, since the first one involves λ[a,…] while the second one involves λ[b,...]. But the second one can’t happen until the λ[b, ...] is “activated” by the reduction of λ[a, ...] in the first event—with the result that second event must be considered causally dependent on the first, as recorded in the causal graph:
While many lambdas have simple, linear or branched causal graphs, plenty have more complicated structures. Here are some examples for lambdas that terminate:
Lambdas that don’t terminate have infinite causal graphs:
But what’s notable here is that even when the evolution—as visualized in terms of the sequence of sizes of lambdas—isn’t simple, the causal graph is still usually quite simple and repetitive. And, yes, in some cases seeing this may well be tantamount to a proof that the evolution of the lambda won’t terminate. But a cautionary point is that even for lambdas that do terminate, we saw above that their causal graphs can still look essentially repetitive—right up to when they terminate.
The causal graphs we showed in the previous section are all based on analyzing causal dependencies in specific lambda evaluation chains, generated with our standard (leftmost outermost) evaluation scheme. But—just like in our Physics Project—it’s also possible to generate multiway causal graphs, in which we show causal connections between all states in the multiway graph representing all possible evaluation sequences.
Here’s the multiway version of the first causal graph we showed above:
Some causal graphs remain linear even in the multiway case, but others split into many paths:
It’s fairly typical for multiway causal graphs to be considerably more complicated than ordinary causal graphs. So, for example, while the ordinary causal graph for [1(11)][[212]] is
the multiway version is
while the multiway graph for states is:
So, yes, it’s possible to make multiway causal graphs for lambdas. But what do they mean? It’s a somewhat complicated story. But I won’t go into it here—because it’s basically the same as the story for combinators, which I discussed several years ago. Though lambdas don’t allow one to set up the kind of shared-subexpression DAGs we considered for combinators; the variables (or de Bruijn indices) in lambdas effectively lead to long-range correlations that prevent the kind of modular treelike approach we used for combinators.
We’ve seen that in general lambdas can do very complicated things. But there’s a particular class of lambdas—the “linear lambdas”—that behave in much simpler ways, allowing them to be rather completely analyzed. A linear lambda is a lambda in which every variable appears only once in the body of the lambda. So this means that λ[a, a] is a linear lambda, but λ[a, a[a]] is not.
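Here’s a minimal sketch of that definition as a predicate, for lambdas written in the named-variable form λ[var, body] used in these examples; it assumes distinct bound-variable names, and linearLambdaQ is just an illustrative name:

```wolfram
(* Each λ's variable must occur exactly once in its body (heads are counted, since
   application is written f[x]); assumes no variable name is reused across binders *)
linearLambdaQ[λ[v_, body_]] :=
  Count[{body}, v, Infinity, Heads -> True] == 1 && linearLambdaQ[body]
linearLambdaQ[f_[x_]] /; f =!= λ := linearLambdaQ[f] && linearLambdaQ[x]
linearLambdaQ[_Symbol] := True

linearLambdaQ[λ[a, a]]     (* True *)
linearLambdaQ[λ[a, a[a]]]  (* False *)
```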
The first few linear lambdas are therefore
or in de Bruijn index form:
Linear lambdas have the important property that under beta reduction they can never increase in size. (Without multiple appearances of one variable, beta reduction just removes subexpressions, and can never lead to the replication of a subexpression.) Indeed, at every step in the evaluation of a linear lambda, all that can ever happen is that at most one λ is removed:
Since linear lambdas always have one variable for each λ, their size is always an even number.
The number of linear lambdas is small compared to the total number of lambdas of a given size:
(Asymptotically, the number of linear lambdas of size 2m grows like:)
Here are the Tromp diagrams for the size-6 linear lambdas:
Evaluation always terminates for a linear lambda; the number of steps it takes is independent of the evaluation scheme, and has a maximum equal to the number of λ’s; the distribution for size-6 linear lambdas is:
The multiway graphs for linear lambdas are rather simple. Since every λ is independent of every other, it never matters in what order they are reduced, and the multiway graph consists of a collection of diamonds, as in for example:
For size 6 the possible multiway graphs (with their multiplicities) are:
For size 8 it’s
where the last graph can be drawn in a more obviously symmetrical way as:
And for size 10 it’s:
In addition to looking at linear lambdas, we can also look at affine lambdas, which are defined to allow zero as well as one instance of each variable. The affine lambdas behave in a very similar way to the linear lambdas, except that now there are some “degenerate diamond structures” in the multiway graph, as in:
So now that we’ve explored all this ruliology for lambdas, how does it compare to what I found several years ago for combinators? Many of the core phenomena are certainly very much the same. And indeed it’s easy to convert between lambdas and combinators—though a lambda of one size may correspond to a combinator of a rather different size.
For example, the size-4 lambdas have these equivalent combinators (where several of the combinators are the same, because the lambdas—at least when fed an appropriate number of arguments—end up evaluating to the same canonical form):
Meanwhile, the size-3 combinators have these equivalent lambdas:
Not surprisingly, the (LeafCount) size of a lambda tends to be slightly smaller than that of its equivalent combinator, because the presence of de Bruijn indices means lambdas effectively have a larger “alphabet” to work with.
But in the end, lambdas and combinators can “do all the same things”, as in these cases:
The Principle of Computational Equivalence implies that the ruliology of systems whose behavior is not obviously simple will, at some fundamental level, be the same.
But lambdas have definitely given us some unique challenges. First, there is the matter of variables and the nonlocality they imply. It’s not like in a cellular automaton or a Turing machine where in a sense every update happens locally. Nor is it even like in combinators, where things are local at least on a tree—or hypergraph rewriting systems where things are local on a hypergraph. Lambdas are both hierarchical and nonlocal. Lambda expressions have a tree structure, but variables knit pieces of this tree together in arbitrarily complex ways. And beta reduction can in effect pull on arbitrarily long threads.
All of this then gets entangled with another issue: lambdas routinely can get very big. And even that wouldn’t be such a problem if it wasn’t for the fact that there’s no obvious way to compress what the lambdas do. With the result that even though the Wolfram Language gives one a very efficient way to deal with symbolic structures, the things I’ve done here have often challenged my computational resources.
Still, in the end, we’ve managed to do at least the basics in exploring the ruliology of lambdas, and have seen that yet another corner of the computational universe is full of remarkable and surprising phenomena. I first wondered what lambdas might be “like in the wild” back at the beginning of the 1980s. And I’m pleased that now—so many years later—I’ve finally been able to get a sense of the world of lambdas and both the uniqueness and the sameness of their ruliology.
An immense amount has been written about lambdas over the past 90 years, though in most cases its focus has been formal and theoretical, not empirical and ruliological. When I wrote about combinators in 2020 I did the scholarship to compile a very extensive bibliography of the (considerably smaller) literature about combinators. For lambdas I have not attempted a similar exercise. But given the overlap—and equivalence—between combinators and lambdas, much of my combinator bibliography also applies to lambdas.
Thanks to Wolfram Institute Fellows Nik Murzin and Willem Nielsen for extensive help. Also thanks to several others affiliated with the Wolfram Institute, especially Richard Assar and Joseph Stocke. In addition, for specific input about what I describe here I thank Alex Archerman, Utkarsh Bajaj, Max Niederman, Matthew Szudzik, and Victor Taelin. Over the years I’ve interacted about lambdas with many people, including Henk Barendregt, Greg Chaitin, Austin Jiang, Jan Willem Klop, Roman Maeder, Dana Scott and John Tromp. And at a practical level I’ve had innumerable discussions since the late 1970s about the Function function in Wolfram Language and Mathematica, as well as its predecessor in SMP.
Note added September 18, 2025: I thank John Tromp for suggesting several significant improvements to the original version of this.