
Foundations of Biological Evolution: More Results & More Surprises

2024-12-06 07:13:27


This is a follow-on to Why Does Biological Evolution Work? A Minimal Model for Biological Evolution and Other Adaptive Processes [May 3, 2024].

Even More from an Extremely Simple Model

A few months ago I introduced an extremely simple “adaptive cellular automaton” model that seems to do remarkably well at capturing the essence of what’s happening in biological evolution. But over the past few months I’ve come to realize that the model is actually even richer and deeper than I’d imagined. And here I’m going to describe some of what I’ve now figured out about the model—and about the often-surprising things it implies for the foundations of biological evolution.

The starting point for the model is to view biological systems in abstract computational terms. We think of an organism as having a genotype that’s represented by a program, which is then run to produce its phenotype. So, for example, the cellular automaton rules on the left correspond to a genotype, which is then run to produce the phenotype on the right (starting from a “seed” of a single red cell):

The key idea in our model is to adaptively evolve the genotype rules—say by making single “point mutations” to the list of outcomes from the rules:

At each step in the adaptive evolution we “accept” a mutation if it leads to a phenotype that has a higher—or at least equal—fitness relative to what we had before. So, for example, taking our fitness function to be the height (i.e. lifetime) of the phenotype pattern (with patterns that are infinite being assigned zero fitness), a sequence of (randomly chosen) adaptive evolution steps that go from the null rule to the rule above might be:
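In code terms, a minimal sketch of this adaptive loop might look like the following (in Wolfram Language). It assumes the genotype is the list of 32 outcome bits of a k = 2, r = 2 rule, that a “point mutation” flips a rule case together with its mirror image (keeping the rule symmetric, and leaving the all-white case white so the background stays quiescent), and that any pattern surviving past a cutoff maxT is treated as “infinite” and assigned fitness 0; the names lifetime, mutate and evolveStep are purely illustrative:

lifetime[outcomes_List, maxT_ : 300] := Module[{evol, dead},
  evol = CellularAutomaton[{FromDigits[outcomes, 2], 2, 2}, {{1}, 0}, maxT];
  dead = FirstPosition[Total /@ evol, 0];     (* first all-white step, if any *)
  If[MissingQ[dead], 0, First[dead] - 1]]     (* unbounded growth gets fitness 0 *)

mirrorPosition[i_] := 32 - FromDigits[Reverse[IntegerDigits[32 - i, 2, 5]], 2]

mutate[outcomes_List] := Module[{i = RandomInteger[{1, 31}]},
  MapAt[1 - # &, outcomes, List /@ Union[{i, mirrorPosition[i]}]]]

evolveStep[outcomes_List] := Module[{cand = mutate[outcomes]},
  If[lifetime[cand] >= lifetime[outcomes], cand, outcomes]]   (* accept if fitness doesn't drop *)

history = NestList[evolveStep, ConstantArray[0, 32], 1000];   (* one random history from the null rule *)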

What if we make a different sequence of randomly chosen adaptive evolution steps? Here are a few examples of what happens—each in a sense “using a different idea” for how to achieve high fitness:

And, yes, one can’t help but be struck by how “lifelike” this all looks—both in the complexity of these patterns, and in their diversity. But what is ultimately responsible for what we’re seeing? It’s long been a core question about biological evolution. Are the forms it produces the result of careful “sculpting” by the environment (and by the fitness functions it implies)—or are their most important features somehow instead a consequence of something more intrinsic and fundamental that doesn’t depend on details of fitness functions?

Well, let’s say we pick a different fitness function—for example, not the height of a phenotype pattern, but instead its width (or, more specifically, the width of its bounding box). Here are some results of adaptive evolution in this case:

And, yes, the patterns we get are now ones that achieve larger “bounding box width”. But somehow there’s still a remarkable similarity to what we saw with a rather different fitness function above. And, for example, in both cases, high fitness, it seems, is normally achieved in a complicated and hard-to-understand way. (The last pattern is a bit of an exception; as can also happen in biology, this is a case where for once there’s a “mechanism” in evidence that we can understand.)

So what in the end is going on? As I discussed when I introduced the model a few months ago, it seems that the “dominant force” is not selection according to fitness functions, but instead the fundamental computational phenomenon of computational irreducibility. And what we’ll find here is that in fact what we see is, more than anything, the result of an interplay between the computational irreducibility of the process by which our phenotypes develop, and the computational boundedness of typical forms of fitness functions.

The importance of such an interplay is something that’s very much come into focus as a result of our Physics Project. And indeed it now seems that the foundations of both physics and mathematics are—more than anything—reflections of this interplay. And now it seems that’s true of biological evolution as well.

In studying our model, there are many detailed phenomena we’ll encounter—most of which seem to have surprisingly direct analogs in actual biological evolution. For example, here’s what happens if we plot the behavior of the fitness function for our first example above over the course of the adaptive evolution process:

We see a sequence of “plateaus”, punctuated by jumps in fitness that reflect some “breakthrough” being made. In the picture, each red dot represents the fitness associated with a genotype that was tried. Many fall below the line of “best results so far”. But there are also plenty of red dots that lie right on the line. And these correspond to genotypes that yield the same fitness that’s already been achieved. But here—as in actual biological evolution—it’s important that there can be “fitness-neutral evolution”, where genotypes change, but the fitness does not. Usually such changes of genotype yield not just the same fitness, but also the exact same phenotype. Sometimes, however, there can be multiple phenotypes with the same fitness—and indeed this happens at one stage in the example here

and at multiple stages in the second example we showed above:

The Multiway Graph of All Possible Evolutions

In the previous section we saw examples of the results of a few particular random sequences of mutations. But what if we were to look at all possible sequences of mutations? As I discussed when I introduced the model, it’s possible to construct a multiway graph that represents all possible mutation paths. Here’s what one gets for symmetric k = 2, r = 2 rules—starting from the null rule, and using height as a fitness function:

The way this graph is constructed, there are arrows from a given phenotype to all phenotypes with larger (finite) height that can be reached by a single mutation.
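In code terms (and reusing the lifetime, mirrorPosition and mutation conventions from the sketch above), the construction might be sketched as follows; exhaustively enumerating all 2^19 symmetric rules is expensive, so this just shows the construction for a given list of genotypes:

phenotype[outcomes_List, maxT_ : 300] := CellularAutomaton[
  {FromDigits[outcomes, 2], 2, 2}, {{1}, 0}, lifetime[outcomes, maxT]]

allMutations[outcomes_List] :=
 Table[MapAt[1 - # &, outcomes, List /@ Union[{i, mirrorPosition[i]}]], {i, 31}]

multiwayEdges[genotypes_List] := Union @ Flatten @ Table[
   With[{fg = lifetime[g]},
    Table[With[{fm = lifetime[m]},
      If[fg > 0 && fm > 0 && fm > fg,           (* larger, but still finite, height *)
       DirectedEdge[phenotype[g], phenotype[m]], Nothing]],
     {m, allMutations[g]}]],
   {g, genotypes}]

(* e.g. evolutionGraph = Graph[multiwayEdges[listOfGenotypes]] *)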

But what if our fitness function is width rather than height? Well, then we get a different multiway graph in which arrows go to phenotypes not with larger height but instead with larger width:

So what’s really going on here? Ultimately one can think of there being an underlying graph (that one might call the “mutation graph”) in which every edge represents a transformation between two phenotypes that can be achieved by a single mutation in the underlying genotype:

At this level, the transformations can go either way, so this graph is undirected. But the crucial point is that as soon as one imposes a fitness function, it defines a particular direction for each transformation (at least, each transformation that isn’t fitness neutral for this fitness function). And then if one starts, say, from the null rule, one will pick out a certain “evolution cone” subgraph of the original mutation graph.

So, for example, with width as the fitness function, the subgraph one gets is what’s highlighted here:

There are several subtleties here. First, we simplified the multiway graph by doing transitive reduction and drawing only the minimal edges necessary to define the connectivity of the graph. If we want to see all possible single-mutation transformations between phenotypes we need to do transitive completion, in which case for the width fitness function the multiway graph we get is:
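In Wolfram Language terms (and assuming the full single-mutation graph has been built as in the sketch above), the simplified view is just the transitive reduction of the full graph:

full = Graph[multiwayEdges[genotypes]];       (* every single-mutation, fitness-increasing edge *)
reduced = TransitiveReductionGraph[full];     (* only the minimal edges needed for the same reachability *)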

But now there’s another subtlety. The edges in the multiway graph represent fitness-changing transformations. But there are also fitness-neutral transformations. And occasionally these can even lead to different (though equal-fitness) phenotypes, so that really each node in the graph above (say, the transitively reduced one) should sometimes be associated with multiple phenotypes

which can “fitness neutrally” transform into each other, as in:

But even this isn’t the end of the subtleties. Fitness-neutral sets typically contain many genotypes differing by changes of rule cases that don’t affect the phenotype they produce. But it may be that just one or a few of these genotypes are “primed” to be able to generate another phenotype with just one additional mutation. Or, in other words, each node in the multiway graph above represents a whole class of genotypes “equivalent under fitness-neutral transformations”, and when we draw an arrow it indicates that some genotype in that class can be transformed by a single mutation to some genotype in the class associated with a different phenotype:

But beyond the subtleties, the key point is that particular fitness functions in effect just define particular orderings on the underlying mutation graph. It’s somewhat like choices of reference frames or families of simultaneity surfaces in physics. Different choices of fitness function in effect define different ways in which the underlying mutation graph can be “navigated” by evolution over the course of time.

As it happens, the results are not so different between height and width fitness functions. Here’s a combined multiway graph, indicating transformations variously allowed by these different fitness functions:

Homing in on a small part of this graph, we see that there are different “flows” associated with maximizing height and maximizing width:

With a single fitness function that for any two phenotypes systematically treats one phenotype as fitter than another, the multiway graph must always define a definite flow. But as soon as one considers changing fitness functions in the course of evolution, it’s possible to get cycles in the multiway graph, as in the example above—so that, in effect, “evolution can repeat itself”.

Fitness Functions Based on Aspect Ratio

We’ve looked at fitness functions based on maximizing height and on maximizing width. But what if we try to combine these? Here’s a plot of the widths and heights of all phenotypes that occur in the symmetric k = 2, r = 2 case we studied above:

We could imagine a variety of ways to define “fitness frontiers” here. But as a specific example, let’s consider fitness functions that are based on trying to achieve specific aspect ratios—i.e. phenotypes that are as close as possible to a particular constant-aspect-ratio line in the plot above.
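As a sketch of what such a fitness function might look like in code (reusing lifetime and mutate from above, assuming “aspect ratio” means bounding-box width divided by height, and again disqualifying infinite-lifetime patterns):

boundingBox[outcomes_List, maxT_ : 300] := Module[{t, evol, cells},
  t = lifetime[outcomes, maxT];
  evol = CellularAutomaton[{FromDigits[outcomes, 2], 2, 2},
    ReplacePart[ConstantArray[0, 4 t + 5], 2 t + 3 -> 1], t];
  cells = Position[evol, 1];
  {Max[cells[[All, 1]]] - Min[cells[[All, 1]]] + 1,     (* height *)
   Max[cells[[All, 2]]] - Min[cells[[All, 2]]] + 1}]    (* width *)

aspectRatio[outcomes_List] := N[Divide @@ Reverse[boundingBox[outcomes]]]    (* width/height *)

aspectFitness[target_][outcomes_List] :=
 If[lifetime[outcomes] == 0, -Infinity, -Abs[aspectRatio[outcomes] - target]]

evolveStepAspect[target_][outcomes_List] := Module[{cand = mutate[outcomes]},
  If[aspectFitness[target][cand] >= aspectFitness[target][outcomes], cand, outcomes]]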

With the symmetric k = 2, r = 2 rules we’re using here, only a certain set of aspect ratios can ever be obtained:

The corresponding phenotypes (with their aspect ratios) are:

As we change the aspect ratio that we’re trying to achieve, the evolution multiway graph will change:

In all cases we’re starting from the null rule. For target aspect ratio 1.0 this rule itself already achieves that aspect ratio—so the multiway graph in that case is trivial. But in general, different aspect ratios yield evolution multiway graphs that are different subgraphs of the complete mutation graph we saw above.

So if we follow all possible paths of evolution, how close can we actually get to any given target aspect ratio? This plot shows what final aspect ratios can be achieved as a function of target aspect ratio:

And in a sense this is a summary of the effect of “developmental constraints” for “adaptive cellular automaton organisms” like this. If there were no constraints then for every target aspect ratio it’d be possible to get an “organism” with that aspect ratio—so in the plot there’d be a point lying on the red line. But in actuality the process of cellular automaton growth imposes constraints—that in particular allows only certain phenotypes, with certain aspect ratios, to exist. And beyond that, which phenotypes can actually be reached by adaptive evolution depends on the evolution multiway graph, with “different turns” on the graph leading to different fitness (i.e. different aspect ratio) phenotypes.

But what the plot above shows overall is that for a certain range of target aspect ratios, adaptive evolution is successfully able to get at least close to those aspect ratios. If the target aspect ratio gets out of that range, however, “developmental constraints” come in that prevent the target from being reached.

With “larger genomes”, i.e. rules with larger numbers of cases to specify, it’s possible to do better, and to more accurately achieve particular aspect ratios, over larger ranges of values. And indeed we can see some version of this effect even for symmetric k = 2, r = 2 rules by plotting aspect ratios that can be achieved as a function of the number of cases that need to be specified in the rule:

As an alternative visualization, we can plot the “best convergence to the target” as a function of the number of rule cases—and once again we see that larger numbers of rule cases let us get closer to target aspect ratios:

It’s worth mentioning that—just as we discussed for height and width fitness functions above—there are subtleties here associated with fitness-neutral sets. For example, here are sets of phenotypes that all have the specified aspect ratios—with phenotypes that can be reached by single point mutations being joined:

In the evolution multiway graphs above, we included only one phenotype for each fitness-neutral set. But here’s what we get for target aspect ratio 0.7 if we show all phenotypes with a given fitness:

Note that on the top line, we don’t just get the null rule. Instead, we get four phenotypes, all of which, like the null rule, have aspect ratio 1, and so are equally far from the target aspect ratio 0.7.

The picture above is only the transitively reduced graph. But if we include all possible transformations associated with single point mutations, we get instead:

Based on this graph, we can now make what amounts to a foliation, showing collections of phenotypes reached by a certain minimum number of mutations, progressively approaching our target aspect ratio (here 0.7):
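A simple way to sketch such a foliation in code (assuming the graph g and its initial node init are as constructed above) is to group phenotypes by their graph distance from the initial node, i.e. by the minimum number of non-fitness-neutral mutations needed to reach them:

layers = GroupBy[VertexList[g], GraphDistance[g, init, #] &];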

Here’s what we get from the range of target aspect ratios shown above (where, as above, “terminal phenotypes” are highlighted):

In a sense these sequences show us what phenotypes can appear at progressive stages in the “fossil record” for different (aspect-ratio) fitness functions in our very simple model. The highlighted cases are “evolutionary dead ends”. The others can evolve further.

Unreachable Cases

Our model takes the process of adaptive evolution to never “go backwards”, or, in other words, to never evolve from a particular genotype to one with lower fitness. But this means that starting with a certain genotype (say the null rule) there may be genotypes (and hence phenotypes) that will never be reached.

With height as a fitness function, there are just two (“orphan”) phenotypes that can’t be reached:

And with width as the fitness function, it turns out the very same phenotypes also can’t be reached:

But if we use a fitness function that, for example, tries to achieve aspect ratio 0.7, we get many more phenotypes that can’t be reached starting from the null rule:

In the original mutation graph all the phenotypes appear. But when we foliate (or, more accurately, order) that graph using a particular fitness function, some phenotypes become unreachable by evolutionarily-possible transformations—in a rough analogy to the way some events in physics can become unreachable in the presence of an event horizon.

Multiway Graphs for Larger Rule Spaces

So far we’ve discussed multiway graphs here only for symmetric k = 2, r = 2 rules. There are a total of 524,288 (= 2^19) possible such rules, producing 77 distinct phenotypes. But what about larger classes of rules? As an example, we can consider all k = 2, r = 2 rules, without the constraint of symmetry. There are 2,147,483,648 (= 2^31) possible such rules, and there turn out to be 3137 distinct phenotypes.

For the height fitness function, the complete multiway graph in this case is

or, annotated with actual phenotypes:

If instead we just show bounding boxes, it’s easier to see where long-lifetime phenotypes occur:

With a different graph layout the evolution multiway graph (with initial node indicated) becomes:

One subtlety here is that the null rule has no successors under single point mutations. When we were talking about symmetric k = 2, r = 2 rules, we took a “single point mutation” always to change both a particular rule case and its mirror image. But if we don’t have the symmetry requirement, a single point mutation really can just change a single rule case. And if we start from the null rule and look at the results of changing just one bit (i.e. the output of just one rule case) in all possible ways we find that we either get the same pattern as with the null rule, or we get a pattern that grows without bound:

Or, put another way, we can’t get anywhere with single bit mutations starting purely from the null rule. So what we’ve done is instead to start our multiway graph from k = 2, r = 2 rule 20, which has two bits “on”, and gives phenotype:

But starting from this, just one mutation (together with a sequence of fitness-neutral mutations) is sufficient to give 94 phenotypes—or 49 after removing mirror images:

The total number of new phenotypes we can reach after successively more (non-fitness-neutral) mutations is

while the successive longest-lifetime patterns are:

And what we see here is that it’s in principle possible to achieve long lifetimes even with fairly few mutations. But when the mutations are done at random, it can still take a very large number of steps to successfully “random walk” to long lifetime phenotypes.

And out of a total of 2407 distinct phenotypes, 984 are “dead ends” where no further evolution is possible. Some of these dead ends have long lifetimes

but others have very short lifetimes:

There’s much more to explore in this multiway graph—and we’ll continue a bit below. But for now let’s look at another evolution multiway graph of accessible size: the one for symmetric k = 3, r = 1 rules. There are a total of 129,140,163 (= 3^17) possible such rules, which yield a total of 14,778 distinct phenotypes:

Showing only bounding boxes of patterns this becomes:

Unlike the k = 2, r = 2 case, we can now start this whole graph with the null rule. However, if we look at all possible symmetric k = 3, r = 1 rules, there turn out to be 6 “isolates” that can’t be reached from the null rule by adaptive evolution with the height fitness function:

Starting from the null rule, the number of phenotypes reached after successively more (non-fitness-neutral) mutations is

and the successive longest-lived of these phenotypes are:

Aspect Ratio Fitness

Just as we looked at fitness functions based on aspect ratio above for symmetric k = 2, r = 2 rules, so now we can do this for the whole space of all possible k = 2, r = 2 rules. Here’s a plot of the heights and widths of patterns that can be achieved with these rules:

These are the possible aspect ratios this implies:

And here’s their distribution (on a log scale):

The range of possible values extends much further than for symmetric k = 2, r = 2 rules. The patterns with the largest aspect ratios are now

while those with the smallest aspect ratios are:

Note that just as for symmetric k = 2, r = 2 rules, to reach a wider range of aspect ratios, more cases in the rule have to be specified:

So what happens if we use adaptive evolution to try to reach different possible target aspect ratios? Most of the time (at least up to aspect ratio ≈ 3) there’s some sequence of mutations that will do it—though often we can get stuck at a different aspect ratio:

If we look at the “best convergence” to a given target aspect ratio then we see that this improves as we increase the number of cases specified in the rule:

So what does the multiway graph look like for a fitness function associated with a particular aspect ratio? Here’s the result for aspect ratio 3:

The initial node involves patterns with aspect ratio 1—actually a fitness-neutral set of 263 of them. And as we go through the multiway graph, the aspect ratios get nearer to 3. The very closest they get, though, are for the patterns (whose locations are indicated on the graph):

But actually (as we saw in the lineup above), there is a rule that gives aspect ratio exactly 3:

But it turns out that this rule can’t be reached by adaptive evolution using single point mutations. In effect, adaptive evolution isn’t “strong enough” to achieve the exact aspect ratio we want; we can think of it as being “unpredictably prevented” by computationally irreducible “developmental constraints”.

OK, so what about the symmetric k = 3, r = 1 rules? Here’s how they’re distributed in width and height:

And, yes, in a typical “there are always surprises” story, there’s a strange height 265, width 173 pattern that shows up:

The overall possible aspect ratios are now

and their (log) distribution is:

The phenotypes with the largest aspect ratios are

while those with the smallest aspect ratios are:

Once again, to reach a larger range of aspect ratios, one has to specify more cases in the rule:

If we try to target a certain aspect ratio, there’s somewhat more of a tendency to get stuck than for k = 2, r = 2 rules—perhaps somewhat as a result of there now being fewer total rules (though more phenotypes) available:

Branching in the Multiway Evolution Graph

Looking at a typical multiway evolution graph such as

we see that different phenotypes can be quite separated in the graph—a bit like organisms on different branches of the tree of life in actual biology. But how can we characterize this separation? One approach is to compute the so-called dominator tree of the graph:

We can think of this as a way to provide a map of the least common ancestors of all nodes. The tree is set up so that given two nodes you just trace up the tree to find their common ancestor. Another interpretation of the tree is that it shows you what nodes you have no choice but to pass through in getting from the initial node to any given node—or, in other words, what phenotypes adaptive evolution has to produce on the way to a given phenotype.
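For reference, here's a naive sketch of how such a dominator tree can be computed, using the textbook fixed-point characterization: a node d dominates a node n if every path from the root to n passes through d, and the tree links each node to its immediate dominator. This assumes every phenotype is reachable from the root, isn't optimized, and the names are illustrative:

dominatorTree[g_Graph, root_] := Module[{v, preds, dom, changed, idom},
  v = VertexList[g];
  preds[n_] := Cases[EdgeList[g], DirectedEdge[p_, n] :> p];
  dom = Association[# -> v & /@ v]; dom[root] = {root};     (* start with "everything dominates" *)
  changed = True;
  While[changed, changed = False;
   Do[If[n =!= root,
      With[{new = Union[{n},
          If[preds[n] === {}, {}, Intersection @@ (dom /@ preds[n])]]},
       If[new =!= dom[n], dom[n] = new; changed = True]]],
    {n, v}]];
  idom[n_] := With[{strict = DeleteCases[dom[n], n]},
    SelectFirst[strict, SubsetQ[dom[#], strict] &]];        (* the "lowest" strict dominator *)
  Graph[DeleteCases[DirectedEdge[idom[#], #] & /@ DeleteCases[v, root],
    DirectedEdge[_Missing, _]]]]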

Here’s another rendering of the tree:

We can think of this as the analog of the biological tree of life, with successive branchings picking out finer and finer “taxonomic domains” (analogous to kingdoms, phyla, etc.)

The tree also shows us something else: how significant different links or nodes are—and how much of the tree one would “lop off” if they were removed. Or, put a different way, how much would be achieved by blocking a certain link or node—as one might imagine doing to try to block the evolution of bacteria or tumor cells?

What if we look at larger multiway evolution graphs, like the complete k = 2, r = 2 one? Once again we can construct a dominator tree:

It’s notable that there’s tremendous variance in the “fan out” here, with the phenotypes with largest successor counts being the rather undistinguished:

But what if one’s specifically trying to reach, say, one of the maximum lifetime (length 308) phenotypes? Well, then one has to follow the paths in a particular subgraph of the original multiway evolution graph

corresponding to the phenotype graph:

If one goes off this “narrow path” then one simply can’t reach the length-308 phenotype; one inevitably gets stuck in what amounts to another branch of the analog of the “tree of life”. So if one is trying to “guide evolution” to a particular outcome, this tells one that one needs to block off lots of “exit ramps”.

But what “fraction of the whole graph” is the subgraph that leads to the length-308 phenotype? The whole graph has 2409 vertices and 3878 edges, while the subgraph has 64 vertices and 119 edges, i.e. in both cases about 3%. A different measure is what fraction of all paths through the graph lead to the length-308 phenotype. The total number of paths is 606,081, while the number leading to the length-308 phenotype is 1260, or about 0.2%. Does this tell us what the probability of reaching that phenotype will be if we just make a random sequence of mutations? Not quite, because in the multiway evolution graph many equivalencings have been done, notably for fitness-neutral sets. And if we don’t do such equivalencings, it turns out (as we’ll discuss below) that the corresponding number is significantly smaller—about 0.007%.
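The path counts here can be obtained with a simple memoized recursion over the (acyclic) multiway graph; a sketch, assuming g is the graph and that “total paths” means maximal paths from the initial node to terminal phenotypes:

countPaths[g_Graph, root_] := Module[{preds, c},
  preds[n_] := Cases[EdgeList[g], DirectedEdge[p_, n] :> p];
  c[root] = 1;
  c[n_] := c[n] = Total[c /@ preds[n]];      (* paths to n = sum of paths to its predecessors *)
  Association[# -> c[#] & /@ VertexList[g]]]

(* e.g. with counts = countPaths[g, init]:
   paths reaching a given phenotype:  counts[target]
   total maximal paths:  Total[counts /@ Select[VertexList[g], VertexOutDegree[g, #] == 0 &]] *)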

Exact-Match Fitness Functions

The fitness functions we’ve been considering so far look only at coarse features of phenotype patterns—like their height, width and aspect ratio. But what happens if we have a fitness function that’s maximal only for a phenotype that exactly matches a particular pattern?

As an example, let’s consider k = 2, r = 1 cellular automata with phenotypes grown for a specific number of steps—and with a fitness function that counts the number of cells that agree with ones in a target:
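Here's a sketch of this setup, assuming the phenotype is grown from a single seed for a fixed number of steps on a zero background, and that the target is an array of the same dimensions:

grow[ruleNumber_, steps_] := CellularAutomaton[ruleNumber,
  ReplacePart[ConstantArray[0, 2 steps + 1], steps + 1 -> 1], steps]

matchFitness[target_][ruleNumber_] :=       (* number of cells agreeing with the target *)
 Total[1 - BitXor[grow[ruleNumber, Length[target] - 1], target], 2]

(* e.g. target = grow[30, 20]; then adaptively evolve the 8 outcome bits of a k = 2, r = 1
   rule, accepting single-bit mutations that don't decrease matchFitness[target] *)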

Let’s say we start with the null rule, then adaptively evolve by making single point mutations to the rule (here just 8 bits). With a target of the rule 30 pattern, this is the multiway graph we get:

And what we see is that after a grand tour of nearly a third of all possible rules, we can successfully reach the rule 30 pattern. But we can also get stuck at rule 86 and rule 190 patterns—even though their fitness values are much lower:

If we consider all possible k = 2, r = 1 cellular automaton patterns as targets, it turns out that these can always be reached by adaptive evolution from the null rule—though a little less than half the time there are other possible endpoints (here specified by rule numbers) at which the evolution process can get stuck:

So far we’ve been assuming that we have a fitness function that’s maximized by matching some pattern generated by a cellular automaton pattern. But what if we pick some quite different pattern to match against? Say our pattern is:

With k = 2, r = 1 rules (running with wraparound in a finite-size region), we can construct a multiway graph

and find out that the maximum fitness endpoints are the not-very-good approximations:

We can also get to these by applying random mutations:

But what if we try a larger rule space, say k = 2, r = 2 rules? Our approximations to the “A” image get a bit better:

Going to k = 2, r = 3 leads to slightly better (but not great) final approximations:

If we try to do the same thing with our target instead being

we get for example

while with target

we get (even less convincing) results like:

What’s going on here? Basically it’s that if we try to set up too intricate a fitness function, then our rule spaces won’t contain rules that successfully maximize it, and our adaptive evolution process will end up with a variety of not-very-good approximations.

How Fitness Builds Up

When one looks at an evolution process like

one typically has the impression that successive phenotypes are achieving greater fitness by somehow progressively “building on the ideas” of earlier ones. And to get a more granular sense of this we can highlight cells at each step that are using “newly added cases” in the rule:

We can think of new rule cases as a bit like new genes in biology. So what we’re seeing here is the analog of new genes switching on (or coming into existence) as we progress through the process of biological evolution.

Here’s what happens for some other paths of evolution:

What we see is quite variable. There are a few examples where new rule cases show up only at the end, as if a new “incrementally engineered” pattern was being “grafted on at the end”. But most of the time new rule cases show up sparsely dotted all over the pattern. And somehow those few “tweaks” lead to higher fitness—even though there’s no obvious reason why, and no obvious way to predict where they should be.

It’s interesting to compare this with actual biology, where it’s pretty common to see what appear to be “random gratuitous changes” between apparently very similar organisms. (And, yes, this can lead to all sorts of problems in things like comparing toxicity or drug effectiveness in model animals versus humans.)

There are many ways to consider quantitatively characterizing how “rule utilization” builds up. As just one example, here are plots, for successive phenotypes along the evolution paths shown above, of the stages in growth at which new rule cases show up:

But Is It Explainable?

Here are two “adaptively evolved” long-lifetime rules that we discussed at the beginning:

We can always run these rules and see what patterns they produce. But is there a way to explain what they do? And for example to analyze how they manage to yield such long lifetimes? Or is what we’re seeing in these rules basically “pure computational irreducibility” where the only way to tell what patterns they will generate—and how long they’ll live—is just explicitly to run them step by step?

The second rule here seems to have a bit more regularity than the first, so let’s tackle it first. Let’s look at the “blade” part. Once such an object—of any width—has formed, its behavior will basically be repetitive, and it’s easy to predict what will happen:

The left-hand edge moves by 1 position every 7 steps, and the right-hand edge by 4 positions every 12 steps. And since 1/7 < 4/12, however wide the initial configuration is, it’ll always die out, after a number of steps that’s roughly proportional to the initial width.

But OK, how does a configuration like this get produced? Well, that’s far from obvious. Here’s what happens with a sequence of few-cell initial conditions …:

So, yes, it doesn’t always directly make the “blade”. Sometimes, for example, it instead makes things like these, some of which basically just become repetitive, and live forever:

And even if it starts with a “blade texture” unexpected things can happen:

There are repetitive patterns that can persist—and indeed the “blade” uses one of these:

Starting from a random initial condition one sees various kinds of behavior, with the blade being fairly common:

But none of this really makes much of a dent in “explaining” why with this rule, starting from a single red cell, we get a long-lived pattern. Yes, once the “blade” forms, we know it’ll take a while to come to a point. But beyond this little pocket of computational reducibility we can’t say much in general about what the rule does—or why, for example, a blade forms with this initial condition.

So what about our other rule? There’s no obvious interesting pocket of reducibility there at all. Looking at a sequence of few-cell initial conditions we get:

And, yes, there’s all sorts of different behavior that can occur:

The first of these patterns is basically periodic, simply shifting 2 cells to the left every 56 steps. The third one dies out after 369 steps, and the fourth one becomes basically periodic (with period 56) after 1023 steps:

If we start from a random initial condition we see a few places where things die out in a repeatable pattern. But mostly everything just looks very complicated:

As always happens, the rule supports regions of repetitive behavior, but they don’t normally extend far enough to introduce any significant computational reducibility:

So what’s the conclusion? Basically it’s that these rules—like pretty much all others we’ve seen here—behave in essentially computationally irreducible ways. Why do they have long lifetimes? All we can really say is “because they do”. Yes, we can always run them and see what happens. But we can’t make any kind of “explanatory theory”, for example of the kind we’re used to in mathematical approaches to physics.

Distribution in Morphospace

We can think of the pattern of growth seen in each phenotype as defining what we might call in biology its “morphology”. So what happens if we try to operate as “pure taxonomists”, laying out different phenotypes in “morphospace”? Here’s a result based on using machine learning and FeatureSpacePlot:

And, yes, this tends to group “visually similar” phenotypes together. But how does proximity in morphospace relate to proximity in genotypes? Here is the same arrangement of phenotypes as above, but now indicating the transformations associated with single mutations in genotype:

If for example we consider maximizing for height, only some of the phenotypes are picked out:

For width, a somewhat different set are picked out:

And here is what happens if our fitness function is based on aspect ratio:

In other words, different fitness functions “select out” different regions in morphospace.

We can also construct a morphospace not just for symmetric but for all k = 2, r = 2 rules:

The detailed pattern here is not particularly significant, and, more than anything, just reflects the method of dimension reduction that we’ve used. What is more meaningful, however, is how different fitness functions select out different regions in morphospace. This shows the results for fitness functions based on height and on width—with points colored according to the actual values of height and width for those phenotypes:

Here are the corresponding results for fitness functions based on different aspect ratios, where now the coloring is based on closeness to the target aspect ratio:

What’s the main conclusion here? We might have expected that different fitness functions would cleanly select visibly different parts of morphospace. But at least with our machine-learning-based way of laying out morphospace that’s not what we’re seeing. And it seems likely that this is actually a general result—and that there is no layout procedure that can make any “easy to describe” fitness function “geometrically simple” in morphospace. And once again, this is presumably a consequence of underlying computational irreducibility—and of the fact that we can’t expect any morphospace layout procedure to be able to provide a way to “untangle the irreducibility” that will work for all fitness functions.

Probabilities and the Time Course of Evolution

In what we’ve done so far, we’ve mostly been concerned with things like what sequences of phenotypes can ever be produced by adaptive evolution. But in making analogies to actual biological evolution—and particularly to how it’s captured in the fossil record—it’s also relevant to discuss time, and to ask not only what phenotypes can be produced, but also when, and how frequently.

For example, let’s assume there’s a constant rate of point mutations in time. Then starting from a given rule (like the null rule) there’ll be a certain rate at which transitions to other rules occur. Some of these transitions will lead to rules that are selected out. Others will be kept, but will yield the same phenotype. And still others will lead to transitions to different phenotypes.

We can represent this by a “phenotype transition diagram” in which the thickness of each outgoing edge from a given phenotype indicates the fraction of all possible mutations that lead to the transition associated with that edge:

Gray self-loops in this diagram represent transitions that lead back to the same phenotype (because they change cases in the rule that don’t matter). Pink self-loops correspond to transitions that lead to rules that are selected out. We don’t show rules that have been selected out here; instead we assume that in this case we just “wait at the original phenotype” and don’t make a transition.

We can annotate the whole symmetric k = 2, r = 2 multiway evolution graph with transition probabilities:

Underlying this graph is a matrix of transition probabilities between all 2^19 possible symmetric k = 2, r = 2 rules (where the structure reflects the fact that many rules transform to rules which differ only by one bit):

Keeping only distinct phenotypes and ordering by lifetime, we can then make a matrix of phenotype transition probabilities:

Treating the transitions as a Markov process, this allows us to compute the expected frequency of each phenotype as a function of time (i.e. number of mutations):
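Concretely, if m is the phenotype transition matrix just described (a row-stochastic matrix, with phenotype 1 taken to be the initial single-cell phenotype), then the expected frequencies after t mutations are just the t-step Markov evolution of the initial distribution; a sketch:

init = UnitVector[Length[m], 1];            (* all probability on the initial phenotype *)
freqs[t_] := init . MatrixPower[m, t];      (* expected phenotype frequencies after t mutations *)

timeCourse = NestList[# . m &, init, 200];  (* or accumulate step by step... *)
ListLinePlot[Transpose[timeCourse]]         (* ...to plot the whole time course *)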

What’s basically happening here is that there’s steady evolution away from the single-cell phenotype. There are some intermediate phenotypes that come and go, but in the end, everything “flows” to the final (“leaf”) phenotypes on the multiway evolution graph—leading to a limiting “equilibrium” probability distribution:

Stacking the different curves, we get an alternative visualization of the evolution of phenotype frequencies:

If we were “running evolution” with enough separate individuals, these would be the limiting curves we’d get. If we reduced the number of individuals, we’d start to see fluctuations—and there’d be a certain probability, for example, for a particular phenotype to end up with zero individuals, and effectively go extinct.

So what happens with a different fitness function? Here’s the result using width instead of height:

And here are results for fitness functions based on a sequence of targets for aspect ratio:

And, yes, the fitness function definitely influences the time course of our adaptive evolution process.

So far we’ve been looking only at symmetric k = 2, r = 2 rules. If we look at the space of all possible k = 2, r = 2 rules, the behavior we see is similar. For example, here’s the time evolution of possible phenotypes based on our standard height fitness function:

And this is what we see if we look only at the longest-lifetime (i.e. largest-height) cases:

As the scale here indicates, such long-lived phenotypes are quite rare—though most still occur with nonzero frequency even after arbitrarily large times (which is inevitable given that they appear as “maximal fitness” terminal nodes in the multiway graph).

And indeed if we plot the final frequencies of phenotypes against their lifetimes we see that there are a wide range of different cases:

The phenotypes with the highest “equilibrium” frequencies are

with some having fairly small lifetimes, and others larger.

The Macroscopic Flow of Evolution

In the previous section, we looked at the time course of evolution with various different—but fixed—fitness functions. But what if we had a fitness function that changes with time—say analogous to an environment for biological evolution that changes with time?

Here’s what happens if we have an aspect ratio fitness function whose target value increases linearly with time:

The behavior we see is quite complex, with certain phenotypes “winning for a while” but then dying out, often quite precipitously—with others coming to take their place.

If instead the target aspect ratio decreases with time, we see rather different behavior:

(The discontinuous derivatives here are basically associated with the sudden appearance of new phenotypes at particular target aspect ratio values.)

It’s also possible to give a “shock to the system” by suddenly changing the target aspect ratio:

And what we see is that sometimes this shock leads to fewer surviving phenotypes, and sometimes to more.

We can think of a changing fitness function as being something that applies a “macroscopic driving force” to our system. Things happen quickly down at the level of individual mutation and selection events—but the fitness function defines overall “goals” for the system that in effect change only slowly. (It’s a bit like a fluid where there are fast molecular-scale processes, but typically slow changes of macroscopic parameters like pressure.)

But if the fitness function defines a goal, how well does the system manage to meet it? Here’s a comparison between an aspect ratio goal (here, linearly increasing) and the distribution of actual aspect ratios achieved, with the darker curve indicating the mean aspect ratio obtained by a weighted average over phenotypes, and the lighter blue area indicating the standard deviation:

And, yes, as we might have expected from earlier results, the system doesn’t do particularly well at achieving the goal. Its behavior is ultimately not “well sculpted” by the forces of a fitness function; instead it is mostly dominated by the intrinsic (computationally irreducible) dynamics of the underlying adaptive evolution process.

One important thing to note however is that our results depend on the value of a parameter: essentially the rate at which underlying mutations occur relative to the rate of change of the fitness function. In the picture above 5000 mutations occur over the time the fitness function goes from minimum to maximum value. This is what happens if we change the number of mutations that occur (or, in effect, the “mutation rate”):

Generally—and not surprisingly—adaptive evolution does better at achieving the target when the mutation rate is higher, though in both the cases shown here, nothing gets terribly close to the target.

In their general character our results here seem reminiscent of what one might expect in typical studies of continuum systems, say based on differential equations. And indeed one can imagine that there might be “continuum equations of adaptive evolution” that govern situations like the ones we’ve seen here. But it’s important to understand that it’s far from self-evident that this is possible. Because underneath everything is a multiway evolution graph with a definite and complicated structure. And one might think that the details of this structure would matter to the overall “continuum evolution process”. And indeed sometimes they will.

But—as we have seen throughout our Physics Project—underlying computational irreducibility leads to a certain inevitable simplicity when looking at phenomena perceived by computationally bounded observers. And we can expect that something similar can happen with biological evolution (and indeed adaptive evolution in general). Assuming that our fitness functions (and their process of change) are computationally bounded, then we can expect that their “aggregate effects” will follow comparatively simple laws—which we can perhaps think of as laws for the “flow of evolution” in response to external input.

Can Evolution Be Reversed?

In the previous section we saw that with different fitness functions, different time series of phenotypes appear, with some phenotypes, for example, sometimes “going extinct”. But let’s say evolution has proceeded to a certain point with a particular fitness function—and certain phenotypes are now present. Then one question we can ask is whether it’s possible to “reverse” that evolution, and revert to phenotypes that were present before. In other words, if we change the fitness function, can we make evolution “go backwards”?

We’ve often discussed a fitness function based on maximizing total (finite) lifetime. But what if, after using this fitness function for a while, we “reverse it”, now minimizing total lifetime?

Consider the multiway evolution graph for symmetric k = 2, r = 2 rules starting from the null rule, with the fitness function once again being to maximize lifetime:

But what if we now say the fitness function minimizes lifetime? If we start from the longest-lifetime phenotype we get the “lifetime minimization” multiway graph:

We can compare this “reversed graph” to the “forward graph” based on all paths from the null rule to the maximum-lifetime rule:

And in this case we see that the phenotypes that occur are almost the same, with the exception of one additional phenotype that can appear in the reverse case.

So what happens when we look at all k = 2, r = 2 rules? Here’s the “reverse graph” starting from the longest-lifetime phenotype:

A total of 345 phenotypes appear here, eventually leading all the way back to the single-cell phenotype. In the overall “forward graph” (which, as we discussed above, has to start from the rule 20 phenotype rather than from the null rule) a total of 2409 phenotypes appear, though (as we saw above) only 64 occur in paths that eventually lead to the maximum lifetime phenotype:

And what we see here is that the forward and reverse graphs look quite different. But could we perhaps construct a fitness function for the reverse graph that will successfully corral the evolution process to precisely retrace the steps of the forward graph?

In general, this isn’t something we can expect to be able to do. Because to do so would in effect require “breaking the computational irreducibility” of the system. It would require having a fitness function that can in essence predict every detail of the evolution process—and in so doing be in a position to direct it. But to achieve this, the fitness function would in a sense have to be computationally as sophisticated as the evolution process itself.

It’s a variant of an argument we’ve used several times here. Realistic fitness functions are computationally bounded (and in practice often very coarse). And that means that they can’t expect to match the computational irreducibility of the underlying evolution process.

There’s an analogy to the Second Law of thermodynamics. Just as the microscopic collisions of individual molecules are in principle easy to reverse, so potentially are individual transitions in the evolution graph. But putting many collisions or many transitions together leads to a process that is computationally sophisticated enough that the fairly coarse means at our disposal can’t “decode” and reverse it.

Put another way, there is in practice a certain inevitable irreversibility to both molecular dynamics and biological evolution. Yes, with enough computational effort—say carefully controlling the fitness function for every individual organism—it might in principle be possible to precisely “reverse evolution”. But in practice the kinds of fitness functions that exist in nature—or that one can readily set up in a lab—are computationally much too weak. And as a result one can’t expect to be able to get evolution to precisely retrace its steps.

Random or Selected? Can One Tell?

Given only a genotype, is there a way to tell whether it’s “just random” or whether it’s actually the result of some long and elaborate process of adaptive evolution? From the genotype one can in principle use the rules it defines to “grow” the corresponding phenotype—and then look at whether it has an “unusually large” fitness. But the question is whether it’s possible to tell anything directly from the genotype, without going through the computational effort of generating the phenotype.

At some level it’s like asking whether, say, from a cellular automaton rule, one can predict the ultimate behavior of the cellular automaton. And a core consequence of computational irreducibility is that one can’t in general expect to do this. Still, one might imagine that one could at least make a “reasonable guess” about whether a genotype is “likely” to have been chosen “purely randomly” or to have been “carefully selected”.

To explore this, we can look at the genotypes for symmetric k = 2, r = 2 rules, say ordered by their lifetime-based fitness—with black and white here representing “required” rule cases, and gray representing undetermined ones (which can all independently be either black or white):

On the right is a summary of how many white, black and undetermined (gray) outcomes are present in each genotype. And as we have seen several times, to achieve high fitness all or almost all of the outcomes must be determined—so that in a sense all or almost all of the genome is “being used”. But we still need to ask whether, given a certain actual pattern of outcomes, we can successfully guess whether or not a genotype is the result of selection.

To get more of a sense of this, we can look at plots of the probabilities for different outcomes for each case in the rule, first (trivially) for all combinatorially possible genotypes, then for all genotypes that give viable (i.e. in our case, finite-lifetime) phenotypes, and then for “selected genotypes”:

Certain cases are always completely determined for all viable genomes—but rather trivially so, because, for example, if the case in which only the outermost cell of the neighborhood is black yields a black cell, then the pattern generated will expand at maximum speed forever, and so cannot have a finite lifetime.

So what happens for all k = 2, r = 2 rules? Here are the actual genomes that lead to particular fitness levels:

And now here are the corresponding probabilities for different outcomes for each case in the rule:

And, yes, given a particular setup we could imagine working out from results like these at least an approximation to the likelihood for a given randomly chosen genome to be a selected one. But what’s true in general? Is there something that can be determined with bounded computational effort (i.e. without explicitly computing phenotypes and their fitnesses) that gives a good estimate of whether a genome is selected? There are good reasons to believe that computational irreducibility will make this impossible.

It’s a different story, of course, if one’s given a “fully computed” phenotype. But at the genome level—without that computation—it seems unlikely that one can expect to distinguish random from “selected-somehow” genotypes.

Adaptive Evolution of Initial Conditions

In making our idealized model of biological evolution we’ve focused (as biology seems to) on the adaptive evolution of the genotype—or, in our case, the underlying rule for our cellular automata. But what if instead of changing the underlying rule, we change the initial condition used to “grow each organism”?

For example, let’s say that we start with the “single cell” we’ve been using so far, but then at each step in adaptive evolution we change the value of one cell in the initial condition (say within a certain distance of our original cell)—then keep any initial condition that does not lead to a shorter lifetime:

The sequence of lifetimes (“fitness values”) obtained in this process of adaptive evolution is

and the “breakthrough” initial conditions are:
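Here's a minimal sketch of this variant: the rule is now held fixed (given here by its k = 2, r = 2 rule number), single cells of the initial condition are mutated within a fixed window, and the lifetime fitness (with the same cutoff convention for “infinite” patterns) is as before; the names, window width and cutoff are all illustrative:

icLifetime[ruleNumber_, ic_List, maxT_ : 500] := Module[{evol, dead},
  evol = CellularAutomaton[{ruleNumber, 2, 2}, {ic, 0}, maxT];
  dead = FirstPosition[Total /@ evol, 0];
  If[MissingQ[dead], 0, First[dead] - 1]]

mutateIC[ic_List] := MapAt[1 - # &, ic, RandomInteger[{1, Length[ic]}]]

evolveICStep[ruleNumber_][ic_List] := Module[{cand = mutateIC[ic]},
  If[icLifetime[ruleNumber, cand] >= icLifetime[ruleNumber, ic], cand, ic]]

(* e.g. for some fixed rule number ruleNumber, with a width-21 window and the original seed in the middle: *)
icHistory = NestList[evolveICStep[ruleNumber],
   ReplacePart[ConstantArray[0, 21], 11 -> 1], 1000];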

The basic setup is similar to what we’ve seen repeatedly in the adaptive evolution of rules rather than initial conditions. But one immediate difference is that, at least in the example we’ve just seen, changing initial conditions does not as obviously “introduce new ideas” for how to increase lifetime; instead, it gives more of an impression of just directly extending “existing ideas”.

So what happens more generally? Rules with k = 2, r = 1 tend to show either infinite growth or no growth—with finite lifetimes arising only from direct “erosion” of initial conditions (here for rules 104 and 164):

For k = 2, r = 2 rules the story is more complicated, even in the symmetric case. Here are the sequences of longest lifetime patterns obtained with all possible progressively wider initial conditions with various rules:

Again, there is a certain lack of “fundamentally new ideas” in evidence, though there are definitely “mechanisms” that get progressively extended with larger initial conditions. (One notable regularity is that the maximum lifetimes of patterns often seem roughly proportional to the width of initial condition allowed.)

Can adaptive evolution “discover more”? Typically, when it’s just modifying initial conditions in a fixed region, it doesn’t seem so—again it seems to be more about “extending existing mechanisms” than introducing new ones:

2D Cellular Automata

Everything we’ve done so far has been for 1D cellular automata. So what happens if we go to 2D? In the end, the story is going to be very similar to 1D—except that the rule spaces even for quite minimal neighborhoods are vastly larger.

With k = 2 colors, it turns out that with a 5-cell neighborhood one can’t “escape from the null rule” by single point mutations. The issue is that any single case one adds in the rule will either do nothing, or will lead only to unbounded growth. And even with a 9-cell neighborhood one can’t get rules that show growth that is neither limited nor infinite with a single-cell initial condition. But with a larger initial condition this is possible, and for example here is a sequence of phenotype patterns generated by adaptive evolution using lifetime as a fitness function:

Here’s what these patterns look like when “viewed from above”:

And here’s how the fitness progressively increases in this case:

There are a total of 2^512 ≈ 10^154 possible 9-neighbor rules, and in this vast rule space it’s easy for adaptive evolution to find rules with long finite lifetimes. (By the way, I’ve no idea what the absolute maximum “busy beaver” lifetime in this space is.)

Just as in 1D, there’s a fair amount of variation in the behavior one sees. Here are some examples of the “final rules” for various instances of adaptive evolution:

In a few cases one can readily “see the mechanism” for the lifetime—say associated with collisions between localized structures. But mostly, as in the other examples we’ve seen, there’s no realistic “narrative explanation” for how these rules achieve long yet finite lifetimes.

The Turing Machine Case

OK, so we’ve now looked at 2D as well as 1D cellular automata. But what about systems that aren’t cellular automata at all? Will we still see the same core phenomena of adaptive evolution that we’ve identified in cellular automata? The Principle of Computational Equivalence would certainly lead one to expect that we would. But to check at least one example let’s look at Turing machines.

Here’s a Turing machine with s = 3 states for its head, and k = 2 colors for cells on its tape:

The Turing machine is set up to halt if it ever reaches a case in the rule whose output is a designated “halt” outcome. Starting from a blank initial condition, this particular Turing machine halts after 19 steps.

So what happens if we try to adaptively evolve Turing machines with long lifetimes (i.e. that take many steps to halt)? Say we start from a “null rule” that halts in all cases, and then we make a sequence of single point mutations in the rule, keeping ones that don’t lead the Turing machine to halt in fewer steps than before. Here’s an example where the adaptive evolution eventually reaches a Turing machine that takes 95 steps to halt:

The sequence of (“breakthrough”) mutations involved here is

corresponding to a fitness curve of the form:

And, yes, all of this is very analogous to what we’ve seen in cellular automata. But one difference is that with Turing machines there are routinely much larger jumps in halting times. And the basic reason for this is just that Turing machines have much less going on at any particular step than typical cellular automata do—so it can take them much longer to achieve some particular state, like a halting state.
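To make the Turing machine setup above concrete, here's a minimal sketch. It assumes the rule is represented as an association from {state, color} to either {newState, newColor, offset} or a designated halting outcome, that the tape starts blank, and that machines not halting within a cutoff maxT are treated as non-halting (and given fitness -Infinity); all of this is illustrative rather than the original code:

haltingTime[rule_Association, maxT_ : 10000] :=
 Module[{state = 1, tape = <||>, pos = 0, t = 0, out, halted = False},
  While[t < maxT && ! halted,
   out = rule[{state, Lookup[tape, pos, 0]}];
   If[out === "Halt", halted = True,
    tape[pos] = out[[2]]; state = out[[1]]; pos += out[[3]]; t++]];
  If[halted, t, -Infinity]]                            (* non-halting machines are selected out *)

mutateTM[rule_Association, s_, k_] := Module[{case = RandomChoice[Keys[rule]]},
  Append[rule, case -> RandomChoice[
     Append[Tuples[{Range[s], Range[0, k - 1], {-1, 1}}], "Halt"]]]]

evolveTMStep[s_, k_][rule_Association] := Module[{cand = mutateTM[rule, s, k]},
  If[haltingTime[cand] >= haltingTime[rule], cand, rule]]

nullTM[s_, k_] := Association[# -> "Halt" & /@ Tuples[{Range[s], Range[0, k - 1]}]];

history = NestList[evolveTMStep[3, 2], nullTM[3, 2], 2000];   (* s = 3, k = 2 example *)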

Here’s an example of adaptive evolution in the space of s = 3, k = 3 Turing machines—and in this case the final halting time is long enough that we’ve had to squash the image vertically (by a factor of 5):

The fitness curve in this case is best viewed on a logarithmic scale:

But while the largest-lifetime cellular automata that we saw above typically seemed to have very complex behavior, the largest-lifetime Turing machine here seems, at least on the face of it, to operate in a much more “systematic” and “mechanical” way. And indeed this becomes even more evident if we compress our visualization by looking only at steps on which the Turing machine head reverses its direction:

Long-lifetime Turing machines found by adaptive evolution are not always so simple, though they still tend to show more regularity than long-lifetime cellular automata:

But—presumably because Turing machines are “less efficient” than cellular automata—the very longest possible lifetimes can be very large. It’s not clear whether rules with such lifetimes can be found by adaptive evolution—not least because even to evaluate the fitness function for any particular candidate rule could take an unbounded time. And indeed among s = 3, k = 3 rules the very longest possible is about 10^17 steps—achieved by the rule

with the following “very pedantic behavior”:

So what about multiway evolution graphs? There are a total of 20,736 s = 2, k = 2 Turing machines with halting states allowed. From these there are 37 distinct finite-lifetime phenotypes:

Just as in other cases we’ve investigated, there are fitness-neutral sets such as:

Taking just one representative from each of these 18 sets, we can then construct a multiway evolution graph for 2,2 Turing machines with lifetime as our fitness function:

Here’s the analogous result for 3,2 Turing machines—with 2250 distinct phenotypes, and a maximum lifetime of 21 steps (with the patterns produced by the machines shown here just as “slabs”):

We could pick other fitness functions (like maximum pattern width, number of head reversals, etc.) But the basic structure and consequences of adaptive evolution seem to work very much the same in Turing machines as in cellular automata—much as we expect from the Principle of Computational Equivalence.

Multiway Turing Machines

Ordinary Turing machines (as well as ordinary cellular automata) in effect always follow a single path of history, producing a definite sequence of states based on their underlying rule. But it’s also possible to study multiway Turing machines in which many paths of history can be followed. Consider for example the rule:

One of the cases in this rule has two possible outcomes—so this is a multiway system, and to represent its behavior we need a multiway graph:

From a biological point of view, we can potentially think of such a multiway system as an idealized model for a process of adaptive evolution. So now we can ask: can we evolve this evolution? Or, in other words, can we apply adaptive evolution to systems like multiway Turing machines?

As an example, let’s assume that we make single point mutation changes to just one case in a multiway Turing machine rule:

Many multiway Turing machines won’t halt, or at least won’t halt on all their branches. But for our fitness function let’s assume we require multiway Turing machines to halt on all branches (or at least go into loops that revisit the same states), and then let’s take the fitness to be the total number of nodes in the multiway graph when everything has halted. (And, yes, this is a direct generalization of our lifetime fitness function for ordinary Turing machines.)
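Here’s a sketch of how such a fitness function can be computed (illustrative conventions, not necessarily the setup used for the pictures here): represent a multiway rule as an Association from {state, color} to a list of possible {newState, newColor, offset} outcomes, with newState 0 meaning halt, and scan breadth-first over configurations, merging identical ones as we go:

(* All successor configurations of {state, head position, tape}; halting
   outcomes simply produce no successors. Tapes are stored as associations
   with blank (0) cells dropped, so identical configurations merge properly. *)
successors[rule_, {state_, pos_, tape_}] :=
  Map[{#[[1]], pos + #[[3]],
     KeySort@DeleteCases[Append[tape, pos -> #[[2]]], 0]} &,
   Select[rule[{state, Lookup[tape, pos, 0]}], First[#] =!= 0 &]];

(* Fitness: number of distinct configurations (multiway graph nodes) reached,
   or 0 if some branch is still active after maxSteps generations *)
multiwayFitness[rule_Association, maxSteps_ : 50] :=
  Module[{frontier = {{1, 0, <||>}}, seen, next, step = 0},
   seen = frontier;
   While[frontier =!= {} && step < maxSteps,
    next = Union @@ (successors[rule, #] & /@ frontier);
    frontier = Complement[next, seen];
    seen = Union[seen, frontier];
    step++];
   If[frontier === {}, Length[seen], 0]];

Because identical configurations are merged, branches that go into loops revisiting earlier states eventually contribute nothing new, so the frontier empties just as it does when every branch halts.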

So with this setup here are some examples of sequences of “breakthroughs” in adaptive evolution processes:


But what about looking at all possible paths of evolution for multiway Turing machines? Or, in other words, what about making a multiway graph of the evolution of multiway Turing machines?

Here’s an example of what we get by doing this (showing at each node just a single example of a fitness-neutral set):

So what’s really going on here? We’ve got a multiway graph of multiway graphs. But it’s worth understanding that the inner and outer multiway graphs are a bit different. The outer one is effectively a rulial multiway graph, in which different parts correspond to following different rules. The inner one is effectively a branchial multiway graph, in which different parts correspond to different ways of applying a particular rule. Ultimately, though, we can at least in principle expect to encode branchial transformations as rulial ones, and vice versa.

So we can think of the adaptive evolution of multiway Turing machines as a first step in exploring “higher-order evolution”: the evolution of evolution, etc. And ultimately in exploring inevitable limits of recursive evolution in the ruliad—and how these might relate to the formation of observers in the ruliad.

Some Conclusions

What does all this mean for the foundations of biological evolution? First and foremost, it reinforces the idea of computational irreducibility as a dominant force in biology. One might have imagined that what we see in biology must have been “carefully sculpted” by fitness constraints (say imposed by the environment). But what we’ve found here suggests that instead much of what we see is actually just a direct reflection of computational irreducibility. And in the end, more than anything else, what biological evolution seems to be doing is to “recruit” lumps of irreducible computation, and set them up so as to achieve “fitness objectives”.

It is, as I recently discovered, very similar to what happens in machine learning. And in both cases this picture implies that there’s a limit to the kind of explanations one can expect to get. If one asks why something has the form it does, the answer will often just be: “because that’s the lump of irreducible computation that happened to be picked up”. And there isn’t any reason to think that there’ll be a “narrative explanation” of the kind one might hope for in traditional science.

The simplicity of our models makes it possible to study not just particular possible paths of adaptive evolution, but complete multiway graphs of all possible paths. And what we’ve seen here is that fitness functions in effect define a kind of traversal order or (roughly) foliation for such multiway graphs. If such foliations could be arbitrarily complex, then they could potentially pick out specific outcomes for evolution—in effect successfully “sculpting biology” through the details of natural selection and fitness functions.

But the point is that fitness functions and resulting foliations of multiway evolution graphs don’t get arbitrarily complex. And even as the underlying processes by which phenotypes develop are full of computational irreducibility, the fitness functions that are applied are computationally bounded. And in effect the complexity that is perhaps the single most striking immediate feature of biological systems is therefore a consequence of the interplay between the computational boundedness of selection processes, and the computational irreducibility of underlying processes of growth and development.

All of this relies on the fundamental idea that biological evolution—and biology—are at their core computational phenomena. And given this interpretation, there’s then a remarkable unification that’s emerging.

It begins with the ruliad—the abstract object corresponding to the entangled limit of all possible computational processes. We’ve talked about the ruliad as the ultimate foundation for physics, and for mathematics. And we now see that we can think of it as the ultimate foundation for biology too.

In physics what’s crucial is that observers like us “parse” the ruliad in certain ways—and that these ways lead us to have a perception of the ruliad that follows core known laws of physics. And similarly, when observers like us do mathematics, we can think of ourselves as “extracting that mathematics” from the way we parse the ruliad. And now what we’re seeing is that biology emerges because of the way selection from the environment, etc. “parses” the ruliad.

And what makes this view powerful is that we have to assume surprisingly little about how selection works to still be able to deduce important things about biology. In particular, if we assume that the selection operates in a computationally bounded way, then just from the inevitable underlying computational irreducibility “inherited” from the ruliad, we immediately know that biology must have certain features.

In physics, the Second Law of thermodynamics arises from the interplay of underlying computational irreducibility of mechanical processes involving many particles or other objects, and our computational boundedness as observers. We have the impression that “randomness is increasing” because as computationally bounded observers we can’t “decrypt” the underlying computational irreducibility.

What’s the analog of this in biology? Much as we can’t expect to “say what happens” in a system that follows the Second Law, so we can’t expect to “explain by selection” what happens in a biological system. Or, put another way, much of what we see in biology is just the way it is because of computational irreducibility—and try as we might it won’t be “explainable” by some fitness criterion that we can describe.

But that doesn’t mean that we can’t expect to deduce “general laws of biology”, much as there are general laws about gases whose detailed structure follows the Second Law. And in what we’ve done here we can begin to see some hints of what those general laws might look like.

They’ll be things like bulk statements about possible paths of evolution, and the effect of changing the constraints on them—a bit like laws of fluid mechanics but now applied to the rulial space of possible genotypes. But if there’s one thing that’s clear it’s that the minimal model we’ve developed of biological evolution has remarkable richness and potential. In the past it’s been possible to say things about what amounts to the pure combinatorics of evolution; now we can start talking in a structured way about what evolution actually does. And in doing this we go in the direction of finally giving biology a foundation as a theoretical science.

There’s So Much More to Study!

Even though this is my second long piece about my minimal model of biological evolution, I’ve barely scratched the surface of what can be done with it. First and foremost there are many detailed connections to be made with actual phenomena that have been observed—or could be observed—in biology. But there are also many things to be investigated directly about the model itself—and in effect much ruliology to be done on it. And what’s particularly notable is how accessible a lot of that ruliology is. (And, yes, you can click any picture here to get the Wolfram Language code that generates it.) What are some obvious things to do? Here are a few. Investigate other fitness functions. Other rule spaces. Other initial conditions. Other evolution strategies. Investigate evolving both rules and initial conditions. Investigate different kinds of changes of fitness functions during evolution. Investigate the effect of having a much larger rule space. Investigate robustness (or not) to perturbations.

In what I’ve done here, I’ve effectively aggregated identical genotypes (and phenotypes). But one could also investigate what happens if one in effect “traces every individual organism”. The result will be abstract structures that generalize the multiway systems we’ve shown here—and that are associated with higher levels of abstract formalism capable of describing phenomena that in effect go “below species”.

For historical notes see here »

Thanks

Thanks to Wolfram Institute fellows Richard Assar and Nik Murzin for their help, as well as to the supporters of the new Wolfram Institute initiative in theoretical biology. Thanks also to Brad Klee for his help. Related student projects were done at our Summer Programs this year by Brian Mboya, Tadas Turonis, Ahama Dalmia and Owen Xuan.

Since writing my first piece about biological evolution in March, I’ve had occasion to attend two biology conferences: SynBioBeta and WISE (“Workshop on Information, Selection, and Evolution” at the Carnegie Institution). I thank many attendees at both conferences for their enthusiasm and input. Curiously, before the WISE conference in October 2024 the last conference I had attended on biological evolution was more than 40 years earlier: the June 1984 Mountain Lake Conference on Evolution and Development.

On the Nature of Time

2024-10-09 05:41:58

The Computational View of Time

Time is a central feature of human experience. But what actually is it? In traditional scientific accounts it’s often represented as some kind of coordinate much like space (though a coordinate that for some reason is always systematically increasing for us). But while this may be a useful mathematical description, it’s not telling us anything about what time in a sense “intrinsically is”.

We get closer as soon as we start thinking in computational terms. Because then it’s natural for us to think of successive states of the world as being computed one from the last by the progressive application of some computational rule. And this suggests that we can identify the progress of time with the “progressive doing of computation by the universe”.

But does this just mean that we are replacing a “time coordinate” with a “computational step count”? No. Because of the phenomenon of computational irreducibility. With the traditional mathematical idea of a time coordinate one typically imagines that this coordinate can be “set to any value”, and that then one can immediately calculate the state of the system at that time. But computational irreducibility implies that it’s not that easy. Because it says that there’s often essentially no better way to find what a system will do than by explicitly tracing through each step in its evolution.

In the pictures on the left there’s computational reducibility, and one can readily see what the state will be after any number of steps t. But in the pictures on the right there’s (presumably) computational irreducibility, so that the only way to tell what will happen after t steps is effectively to run all those steps:
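As a concrete illustration in Wolfram Language (the particular rules are a choice made here for illustration, not necessarily the ones pictured): rule 250 settles immediately into a pattern whose form at step t can be written down directly, while for rule 30 there seems to be no way to shortcut running the steps:

ArrayPlot /@ {
  CellularAutomaton[250, {{1}, 0}, 100],   (* computationally reducible *)
  CellularAutomaton[30, {{1}, 0}, 100]}    (* (presumably) computationally irreducible *)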

And what this implies is that there’s a certain robustness to time when viewed in these computational terms. There’s no way to “jump ahead” in time; the only way to find out what will happen in the future is to go through the irreducible computational steps to get there.

There are simple idealized systems (say with purely periodic behavior) where there’s computational reducibility, and where there isn’t any robust notion of the progress of time. But the point is that—as the Principle of Computational Equivalence implies—our universe is inevitably full of computational irreducibility which in effect defines a robust notion of the progress of time.

The Role of the Observer

That time is a reflection of the progress of computation in the universe is an important starting point. But it’s not the end of the story. For example, here’s an immediate issue. If we have a computational rule that determines each successive state of a system it’s at least in principle possible to know the whole future of the system. So given this why then do we have the experience of the future only “unfolding as it happens”?

It’s fundamentally because of the way we are as observers. If the underlying system is computationally irreducible, then to work out its future behavior requires an irreducible amount of computational work. But it’s a core feature of observers like us that we are computationally bounded. So we can’t do all that irreducible computational work to “know the whole future”—and instead we’re effectively stuck just doing computation alongside the system itself, never able to substantially “jump ahead”, and only able to see the future “progressively unfold”.

In essence, therefore, we experience time because of the interplay between our computational boundedness as observers, and the computational irreducibility of underlying processes in the universe. If we were not computationally bounded, we could “perceive the whole of the future in one gulp” and we wouldn’t need a notion of time at all. And if there wasn’t underlying computational irreducibility there wouldn’t be the kind of “progressive revealing of the future” that we associate with our experience of time.

A notable feature of our everyday perception of time is that it seems to “flow only in one direction”—so that for example it’s generally much easier to remember the past than to predict the future. And this is closely related to the Second Law of thermodynamics, which (as I’ve argued at length elsewhere) is once again a result of the interplay between underlying computational irreducibility and our computational boundedness. Yes, the microscopic laws of physics may be reversible (and indeed if our system is simple—and computationally reducible—enough, this reversibility may “shine through”). But the point is that computational irreducibility is in a sense a much stronger force.

Imagine that we prepare a state to have orderly structure. If its evolution is computationally irreducible then this structure will effectively be “encrypted” to the point where a computationally bounded observer can’t recognize the structure. Given underlying reversibility, the structure is in some sense inevitably “still there”—but it can’t be “accessed” by a computationally bounded observer. And as a result such an observer will perceive a definite flow from orderliness in what is prepared to disorderliness in what is observed. (In principle one might think it should be possible to set up a state that will “behave antithermodynamically”—but the point is that to do so would require predicting a computationally irreducible process, which a computationally bounded observer can’t do.)

One of the longstanding confusions about the nature of time has to do with its “mathematical similarity” to space. And indeed ever since the early days of relativity theory it’s seemed convenient to talk about “spacetime” in which notions of space and time are bundled together.

But in our Physics Project that’s not at all how things fundamentally work. At the lowest level the state of the universe is represented by a hypergraph which captures what can be thought of as the “spatial relations” between discrete “atoms of space”. Time then corresponds to the progressive rewriting of this hypergraph.

And in a sense the “atoms of time” are the elementary “rewriting events” that occur. If the “output” from one event is needed to provide “input” to another, then we can think of the first event as preceding the second event in time—and the events as being “timelike separated”. And in general we can construct a causal graph that shows the dependencies between different events.

So how does this relate to time—and spacetime? As we’ll discuss below, our everyday experience of time is that it follows a single thread. And so we tend to want to “parse” the causal graph of elementary events into a series of slices that we can view as corresponding to “successive times”. As in standard relativity theory, there typically isn’t a unique way to assign a sequence of such “simultaneity surfaces”, with the result that there are different “reference frames” in which the identifications of space and time are different.

The complete causal graph bundles together what we usually think of as space with what we usually think of as time. But ultimately the progress of time is always associated with some choice of successive events that “computationally build on each other”. And, yes, it’s more complicated because of the possibilities of different choices. But the basic idea of the progress of time as “the doing of computation” is very much the same. (In a sense time represents “computational progress” in the universe, while space represents the “layout of its data structure”.)

Very much as in the derivation of the Second Law (or of fluid mechanics from molecular dynamics), the derivation of Einstein’s equations for the large-scale behavior of spacetime from the underlying causal graph of hypergraph rewriting depends on the fact that we are computationally bounded observers. But even though we’re computationally bounded, we still have to “have something going on inside”, or we wouldn’t record—or sense—any “progress in time”.

It seems to be the essence of observers like us—as captured in my recent Observer Theory—that we equivalence many different states of the world to derive our internal perception of “what’s going on outside”. And at some rough level we might imagine that we’re sensing time passing by the rate at which we add to those internal perceptions. If we’re not adding to the perceptions, then in effect time will stop for us—as happens if we’re asleep, anesthetized or dead.

It’s worth mentioning that in some extreme situations it’s not the internal structure of the observer that makes perceived time stop; instead it’s the underlying structure of the universe itself. As we’ve mentioned, the “progress of the universe” is associated with successive rewriting of the underlying hypergraph. But when there’s been “too much activity in the hypergraph” (which physically corresponds roughly to too much energy-momentum), one can end up with a situation in which “there are no more rewrites that can be done”—so that in effect some part of the universe can no longer progress, and “time stops” there. It’s analogous to what happens at a spacelike singularity (normally associated with a black hole) in traditional general relativity. But now it has a very direct computational interpretation: one’s reached a “fixed point” at which there’s no more computation to do. And so there’s no progress to make in time.

Multiple Threads of Time

Our strong human experience is that time progresses as a single thread. But now our Physics Project suggests that at an underlying level time is actually in effect multithreaded, or, in other words, that there are many different “paths of history” that the universe follows. And it is only because of the way we as observers sample things that we experience time as a single thread.

At the level of a particular underlying hypergraph the point is that there may be many different updating events that can occur, and each sequence of such updating events defines a different “path of history”. We can summarize all these paths of history in a multiway graph in which we merge identical states that arise:
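Here’s a minimal way to build such a multiway graph for a toy system (a string rewriting system rather than a hypergraph one, with an arbitrarily chosen rule and initial string):

(* Each string's single-replacement successors, with identical states merged *)
multiwayGraph[rules_, init_String, steps_Integer] :=
  Module[{gens, edges},
   gens = NestList[Union[Flatten[StringReplaceList[#, rules] & /@ #]] &, {init}, steps];
   edges = Union@Flatten@Table[
      Thread[s -> StringReplaceList[s, rules]], {gen, Most[gens]}, {s, gen}];
   Graph[edges, VertexLabels -> Automatic]];

multiwayGraph[{"A" -> "AB", "B" -> "A"}, "A", 4]

Identical strings arising on different branches are merged into a single node here, which is what makes this a multiway graph rather than just a tree of separate histories.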

But given this underlying structure, why is it that we as observers believe that time progresses as a single thread? It all has to do with the notion of branchial space, and our presence within branchial space. The presence of many paths of history is what leads to quantum mechanics; the fact that we as observers ultimately perceive just one path is associated with the traditionally-quite-mysterious phenomenon of “measurement” in quantum mechanics.

When we talked about causal graphs above, we said that we could “parse” them as a series of “spacelike” slices corresponding to instantaneous “states of space”—represented by spatial hypergraphs. And by analogy we can similarly imagine breaking multiway graphs into “instantaneous slices”. But now these slices don’t represent states of ordinary space; instead they represent states of what we call branchial space.

Ordinary space is “knitted together” by updating events that have causal effects on other events that can be thought of as “located at different places in space”. (Or, said differently, space is knitted together by the overlaps of the elementary light cones of different events.) Now we can think of branchial space as being “knitted together” by updating events that have effects on events that end up on different branches of history.

(In general there is a close analogy between ordinary space and branchial space, and we can define a multiway causal graph that includes both “spacelike” and “branchlike” directions—with the branchlike direction supporting not light cones but what we can call entanglement cones.)

So how do we as observers parse what’s going on? A key point is that we are inevitably part of the system we’re observing. So the branching (and merging) that’s going on in the system at large is also going on in us. So that means we have to ask how a “branching mind” will perceive a branching universe. Underneath, there are lots of branches, and lots of “threads of history”. And there’s lots of computational irreducibility (and even what we can call multicomputational irreducibility). But computationally bounded observers like us have to equivalence most of those details to wind up with something that “fits in our finite minds”.

We can make an analogy to what happens in a gas. Underneath, there are lots of molecules bouncing around (and behaving in computationally irreducible ways). But observers like us are big compared to molecules, and (being computationally bounded) we don’t get to perceive their individual behavior, but only their aggregate behavior—from which we extract a thin set of computationally reducible “fluid-dynamics-level” features.

And it’s basically the same story with the underlying structure of space. Underneath, there’s an elaborately changing network of discrete atoms of space. But as large, computationally bounded observers we can only sample aggregate features in which many details have been equivalenced, and in which space tends to seem continuous and describable in basically computationally reducible ways.

So what about branchial space? Well, it’s basically the same story. Our minds are “big”, in the sense that they span many individual branches of history. And they’re computationally bounded so they can’t perceive the details of all those branches, but only certain aggregated features. And in a first approximation what then emerges is in effect a single aggregated thread of history.

With sufficiently careful measurements we can sometimes see “quantum effects” in which multiple threads of history are in evidence. But at a direct human level we always seem to aggregate things to the point where what we perceive is just a single thread of history—or in effect a single thread of progression in time.

It’s not immediately obvious that any of these “aggregations” will work. It could be that important effects we perceive in gases would depend on phenomena at the level of individual molecules. Or that to understand the large-scale structure of space we’d continually be having to think about detailed features of atoms of space. Or, similarly, that we’d never be able to maintain a “consistent view of history”, and that instead we’d always be having to trace lots of individual threads of history.

But the key point is that for us to stay as computationally bounded observers we have to pick out only features that are computationally reducible—or in effect boundedly simple to describe.

Closely related to our computational boundedness is the important assumption we make that we as observers have a certain persistence. At every moment in time, we are made from different atoms of space and different branches in the multiway graph. Yet we believe we are still “the same us”. And the crucial physical fact (that has to be derived in our model) is that in ordinary circumstances there’s no inconsistency in doing this.

So the result is that even though there are many “threads of time” at the lowest level—representing many different “quantum branches”—observers like us can (usually) successfully still view there as being a single consistent perceived thread of time.

But there’s another issue here. It’s one thing to say that a single observer (say a single human mind or a single measuring device) can perceive history to follow a single, consistent thread. But what about different human minds, or different measuring devices? Why should they perceive any kind of consistent “objective reality”?

Essentially the answer, I think, is that they’re all sufficiently nearby in branchial space. If we think about physical space, observers in different parts of the universe will clearly “see different things happening”. The “laws of physics” may be the same—but what star (if any) is nearby will be different. Yet (at least for the foreseeable future) for all of us humans it’s always the same star that’s nearby.

And so it is, presumably, in branchial space. There’s some small patch in which we humans—with our shared origins—exist. And it’s presumably because that patch is small relative to all of branchial space that all of us perceive a consistent thread of history and a common objective reality.

There are many subtleties to this, many of which aren’t yet fully worked out. In physical space, we know that effects can in principle spread at the speed of light. And in branchial space the analog is that effects can spread at the maximum entanglement speed (whose value we don’t know, though it’s related by Planck unit conversions to the elementary length and elementary time). But in maintaining our shared “objective” view of the universe it’s crucial that we’re not all going off in different directions at the speed of light. And of course the reason that doesn’t happen is that we don’t have zero mass. And indeed presumably nonzero mass is a critical part of being observers like us.

In our Physics Project it’s roughly the density of events in the hypergraph that determines the density of energy (and mass) in physical space (with their associated gravitational effects). And similarly it’s roughly the density of events in the multiway graph (or in branchial graph slices) that determines the density of action—the relativistically invariant analog of energy—in branchial space (with its associated effects on quantum phase). And though it’s not yet completely clear how this works, it seems likely that once again when there’s mass, effects don’t just “go off at the maximum entanglement speed in all directions”, but instead stay nearby.

There are definitely connections between “staying at the same place”, believing one is persistent, and being computationally bounded. But these are what seem necessary for us to have our typical view of time as a single thread. In principle we can imagine observers very different from us—say with minds (like the inside of an idealized quantum computer) capable of experiencing many different threads of history. But the Principle of Computational Equivalence suggests that there’s a high bar for such observers. They need not only to be able to deal with computational irreducibility but also multicomputational irreducibility, in which one includes both the process of computing new states, and the process of equivalencing states.

And so for observers that are “anything like us” we can expect that once again time will tend to be as we normally experience it, following a single thread, consistent between observers.

(It’s worth mentioning that all of this only works for observers like us “in situations like ours”. For example, at the “entanglement horizon” for a black hole—where branchially-oriented edges in the multiway causal graph get “trapped”—time as we know it in some sense “disintegrates”, because an observer won’t be able to “knit together” the different branches of history to “form a consistent classical thought” about what happens.)

Time in the Ruliad

In what we’ve discussed so far we can think of the progress of time as being associated with the repeated application of rules that progressively “rewrite the state of the universe”. In the previous section we saw that these rules can be applied in many different ways, leading to many different underlying threads of history.

But so far we’ve imagined that the rules that get applied are always the same—leaving us with the mystery of “Why those rules, and not others?” But this is where the ruliad comes in. Because the ruliad involves no such seemingly arbitrary choices: it’s what you get by following all possible computational rules.

One can imagine many bases for the ruliad. One can make it from all possible hypergraph rewritings. Or all possible (multiway) Turing machines. But in the end it’s a single, unique thing: the entangled limit of all possible computational processes. There’s a sense in which “everything can happen somewhere” in the ruliad. But what gives the ruliad structure is that there’s a definite (essentially geometrical) way in which all those different things that can happen are arranged and connected.

So what is our perception of the ruliad? Inevitably we’re part of the ruliad—so we’re observing it “from the inside”. But the crucial point is that what we perceive about it depends on what we are like as observers. And my big surprise in the past few years has been that assuming even just a little about what we’re like as observers immediately implies that what we perceive of the ruliad follows the core laws of physics we know. In other words, by assuming what we’re like as observers, we can in effect derive our laws of physics.

The key to all this is the interplay between the computational irreducibility of underlying behavior in the ruliad, and our computational boundedness as observers (together with our related assumption of our persistence). And it’s this interplay that gives us the Second Law in statistical mechanics, the Einstein equations for the structure of spacetime, and (we think) the path integral in quantum mechanics. In effect what’s happening is that our computational boundedness as observers makes us equivalence things to the point where we are sampling only computationally reducible slices of the ruliad, whose characteristics can be described using recognizable laws of physics.

So where does time fit into all of this? A central feature of the ruliad is that it’s unique—and everything about it is “abstractly necessary”. Much as given the definition of numbers, addition and equality it’s inevitable that one gets 1 + 1 = 2, so similarly given the definition of computation it’s inevitable that one gets the ruliad. Or, in other words, there’s no question about whether the ruliad exists; it’s just an abstract construct that inevitably follows from abstract definitions.

And so at some level this means that the ruliad inevitably just “exists as a complete thing”. And so if one could “view it from outside” one could think of it as just a single timeless object, with no notion of time.

But the crucial point is that we don’t get to “view it from the outside”. We’re embedded within it. And, what’s more, we must view it through the “lens” of our computational boundedness. And this is why we inevitably end up with a notion of time.

We observe the ruliad from some point within it. If we were not computationally bounded then we could immediately compute what the whole ruliad is like. But in actuality we can only discover the ruliad “one computationally bounded step at a time”—in effect progressively applying bounded computations to “move through rulial space”.

So even though in some abstract sense “the whole ruliad is already there” we only get to explore it step by step. And that’s what gives us our notion of time, through which we “progress”.

Inevitably, there are many different paths that we could follow through the ruliad. And indeed every mind (and every observer like us)—with its distinct inner experience—presumably follows a different path. But much as we described for branchial space, the reason we have a shared notion of “objective reality” is presumably that we are all very close together in rulial space; we form in a sense a tight “rulial flock”.

It’s worth pointing out that not every sampling of the ruliad that may be accessible to us conveniently corresponds to exploration of progressive slices of time. Yes, that kind of “progression in time” is characteristic of our physical experience, and our typical way of describing it. But what about our experience, say, of mathematics?

The first point to make is that just as the ruliad contains all possible physics, it also contains all possible mathematics. If we construct the ruliad, say from hypergraphs, the nodes are now not “atoms of space”, but instead abstract elements (that in general we call emes) that form pieces of mathematical expressions and mathematical theorems. We can think of these abstract elements as being laid out now not in physical space, but in some abstract metamathematical space.

In our physical experience, we tend to remain localized in physical space, branchial space, etc. But in “doing mathematics” it’s more as if we’re progressively expanding in metamathematical space, carving out some domain of “theorems we assume are true”. And while we could identify some kind of “path of expansion” to let us define some analog of time, it’s not a necessary feature of the way we explore the ruliad.

Different places in the ruliad in a sense correspond to describing things using different rules. And by analogy to the concept of motion in physical space, we can effectively “move” from one place to another in the ruliad by translating the computations done by one set of rules to computations done by another. (And, yes, it’s nontrivial to even have the possibility of “pure motion”.) But if we indeed remain localized in the ruliad (and can maintain what we can think of as our “coherent identity”) then it’s natural to think of there being a “path of motion” along which we progress “with time”. But when we’re just “expanding our horizons” to encompass more paradigms and to bring more of rulial space into what’s covered by our minds (so that in effect we’re “expanding in rulial space”), it’s not really the same story. We’re not thinking of ourselves as “doing computation in order to move”. Instead, we’re just identifying equivalences and using them to expand our definition of ourselves, which is something that we can at least approximate (much like in “quantum measurement” in traditional physics) as happening “outside of time”. Ultimately, though, everything that happens must be the result of computations that occur. It’s just that we don’t usually “package” these into what we can describe as a definite thread of time.

So What in the End Is Time?

From the paradigm (and Physics Project ideas) that we’ve discussed here, the question “What is time?” is at some level simple: time is what progresses when one applies computational rules. But what’s critical is that time can in effect be defined abstractly, independent of the details of those rules, or the “substrate” to which they’re applied. And what makes this possible is the Principle of Computational Equivalence, and the ubiquitous phenomenon of computational irreducibility that it implies.

To begin with, the fact that time can robustly be thought of as “progressing”, in effect in a linear chain, is a consequence of computational irreducibility—because computational irreducibility is what tells us that computationally bounded observers like us can’t in general ever “jump ahead”; we just have to follow a linear chain of steps.

But there’s something else as well. The Principle of Computational Equivalence implies that there’s in a sense just one (ubiquitous) kind of computational irreducibility. So when we look at different systems following different irreducible computational rules, there’s inevitably a certain universality to what they do. In effect they’re all “accumulating computational effects” in the same way. Or in essence progressing through time in the same way.

There’s a close analogy here with heat. It could be that there’d be detailed molecular motion that even on a large scale worked noticeably differently in different materials. But the fact is that we end up being able to characterize any such motion just by saying that it represents a certain amount of heat, without getting into more details. And that’s very much the same kind of thing as being able to say that such-and-such an amount of time has passed, without having to get into the details of how some clock or other system that reflects the passage of time actually works.

And in fact there’s more than a “conceptual analogy” here. Because the phenomenon of heat is again a consequence of computational irreducibility. And the fact that there’s a uniform, “abstract” characterization of it is a consequence of the universality of computational irreducibility.

It’s worth emphasizing again, though, that just as with heat, a robust concept of time depends on us being computationally bounded observers. If we were not, then we’d be able to break the Second Law by doing detailed computations of molecular processes, and we wouldn’t just describe things in terms of randomness and heat. And similarly, we’d be able to break the linear flow of time, either jumping ahead or following different threads of time.

But as computationally bounded observers of computationally irreducible processes, it’s basically inevitable that—at least to a good approximation—we’ll view time as something that forms a single one-dimensional thread.

In traditional mathematically based science there’s often a feeling that the goal should be to “predict the future”—or in effect to “outrun time”. But computational irreducibility tells us that in general we can’t do this, and that the only way to find out what will happen is just to run the same computation as the system itself, essentially step by step. But while this might seem like a letdown for the power of science, we can also see it as what gives meaning and significance to time. If we could always jump ahead then at some level nothing would ever fundamentally be achieved by the passage of time (or, say, by the living of our lives); we’d always be able to just say what will happen, without “living through” how we got there. But computational irreducibility gives time and the process of it passing a kind of hard, tangible character.

So what does all this imply for the various classic issues (and apparent paradoxes) that arise in the way time is usually discussed?

Let’s start with the question of reversibility. The traditional laws of physics basically apply both forwards and backwards in time. And the ruliad is inevitably symmetrical between “forward” and “backward” rules. So why is it then that in our typical experience time always seems to “run in the same direction”?

This is closely related to the Second Law, and once again it’s a consequence of our computational boundedness interacting with underlying computational irreducibility. In a sense what defines the direction of time for us is that we (typically) find it much easier to remember the past than to predict the future. Of course, we don’t remember every detail of the past. We only remember what amounts to certain “filtered” features that “fit in our finite minds”. And when it comes to predicting the future, we’re limited by our inability to “outrun” computational irreducibility.

Let’s recall how the Second Law works. It basically says that if we set up some state that’s “ordered” or “simple” then this will tend to “degrade” to one that’s “disordered” or “random”. (We can think of the evolution of the system as effectively “encrypting” the specification of our starting state to the point where we—as computationally bounded observers—can no longer recognize its ordered origins.) But because our underlying laws are reversible, this degradation (or “encryption”) must happen when we go both forwards and backwards in time:

But the point is that our “experiential” definition of the direction of time (in which the “past” is what we remember, and the “future” is what we find hard to predict) is inevitably aligned with the “thermodynamic” direction of time we observe in the world at large. And the reason is that in both cases we’re defining the past to be something that’s computationally bounded (while the future can be computationally irreducible). In the experiential case the past is computationally bounded because that’s what we can remember. In the thermodynamic case it’s computationally bounded because those are the states we can prepare. In other words, the “arrows of time” are aligned because in both cases we are in effect “requiring the past to be simpler”.

So what about time travel? It’s a concept that seems natural—and perhaps even inevitable—if one imagines that “time is just like space”. But it becomes a lot less natural when we think of time in the way we’re doing here: as a process of applying computational rules.

Indeed, at the lowest level, these rules are by definition just sequentially applied, producing one state after another—and in effect “progressing in one direction through time”. But things get more complicated if we consider not just the raw, lowest-level rules, but what we might actually observe of their effects. For example, what if the rules lead to a state that’s identical to one they’ve produced before (as happens, for example, in a system with periodic behavior)? If we equivalence the state now and the state before (so we represent both as a single state) then we can end up with a loop in our causal graph (a “closed timelike curve”). And, yes, in terms of the raw sequence of applying rules these states can be considered different. But the point is that if they are identical in every feature then any observer will inevitably consider them the same.

But will such equivalent states ever actually occur? As soon as there’s computational irreducibility it’s basically inevitable that the states will never perfectly match up. And indeed for the states to contain an observer like us (with “memory”, etc.) it’s basically impossible that they can match up.

But can one imagine an observer (or a “timecraft”) that would lead to states that match up? Perhaps somehow it could carefully pick particular sequences of atoms of space (or elementary events) that would lead it to states that have “happened before”. And indeed in a computationally simple system this might be possible. But as soon as there’s computational irreducibility, this simply isn’t something one can expect any computationally bounded observer to be able to do. And, yes, this is directly analogous to why one can’t have a “Maxwell’s demon” observer that “breaks the Second Law”. Or why one can’t have something that carefully navigates the lowest-level structure of space to effectively travel faster than light.

But even if there can’t be time travel in which “time for an observer goes backwards”, there can still be changes in “perceived time”, say as a result of relativistic effects associated with motion. For example, one classic relativistic effect is time dilation, in which “time goes slower” when objects go faster. And, yes, given certain assumptions, there’s a straightforward mathematical derivation of this effect. But in our effort to understand the nature of time we’re led to ask what its physical mechanism might be. And it turns out that in our Physics Project it has a surprisingly direct—and almost “mechanical”—explanation.

One starts from the fact that in our Physics Project space and everything in it is represented by a hypergraph which is continually getting rewritten. And the evolution of any object through time is then defined by these rewritings. But if the object moves, then in effect it has to be “re-created at a different place in space”—and this process takes up a certain number of rewritings, leaving fewer for the intrinsic evolution of the object itself, and thus causing time to effectively “run slower” for it. (And, yes, while this is a qualitative description, one can make it quite formal and precise, and recover the usual formulas for relativistic time dilation.)

Something similar happens with gravitational fields. In our Physics Project, energy-momentum (and thus gravity) is effectively associated with greater activity in the underlying hypergraph. And the presence of this greater activity means that more rewritings are in effect taken up by the region of space itself, leaving fewer for the intrinsic evolution of any object there, and causing “time to run slower” for it (corresponding to traditional gravitational time dilation, and the associated “gravitational redshift”).

More extreme versions of this occur in the context of black holes. (Indeed, one can roughly think of spacelike singularities as places where “time ran so fast that it ended”.) And in general—as we discussed above—there are many “relativistic effects” in which notions of space and time get mixed in various ways.

But even at a much more mundane level there’s a certain crucial relationship between space and time for observers like us. The key point is that observers like us tend to “parse” the world into a sequence of “states of space” at successive “moments in time”. But the fact that we do this depends on some quite specific features of us, and in particular our effective physical scale in space as compared to time.

In our everyday life we’re typically looking at scenes involving objects that are perhaps tens of meters away from us. And given the speed of light that means photons from these objects get to us in less than a microsecond. But it takes our brains milliseconds to register what we’ve seen. And this disparity of timescales is what leads us to view the world as consisting of a sequence of states of space at successive moments in time.

If our brains “ran” a million times faster (i.e. at the speed of digital electronics) we’d perceive photons arriving from different parts of a scene at different times, and we’d presumably no longer view the world in terms of overall states of space existing at successive times.

The same kind of thing would happen if we kept the speed of our brains the same, but dealt with scenes of a much larger scale (as we already do in dealing with spacecraft, astronomy, etc.).

But while this affects what it is that we think time is “acting on”, it doesn’t ultimately affect the nature of time itself. Time remains that computational process by which successive states of the world are produced. Computational irreducibility gives time a certain rigid character, at least for computationally bounded observers like us. And the Principle of Computational Equivalence allows there to be a robust notion of time independent of the “substrate” that’s involved: whether us as observers, the everyday physical world, or, for that matter, the whole universe.

Nestedly Recursive Functions

2024-09-28 01:50:59

Nestedly Recursive Functions

Yet Another Ruliological Surprise

Integers. Addition. Subtraction. Maybe multiplication. Surely that’s not enough to be able to generate any serious complexity. In the early 1980s I had made the very surprising discovery that very simple programs based on cellular automata could generate great complexity. But how widespread was this phenomenon?

At the beginning of the 1990s I had set about exploring this. Over and over I would consider some type of system and be sure it was too simple to “do anything interesting”. And over and over again I would be wrong. And so it was that on the night of August 13, 1993, I thought I should check what could happen with integer functions defined using just addition and subtraction.

I knew, of course, about defining functions by recursion, like Fibonacci:

But could I find something like this that would have complex behavior? I did the analog of what I have done so many times, and just started (symbolically) enumerating possible definitions. And immediately I saw cases with nested functions, like:

(For some reason I wanted to keep the same initial conditions as Fibonacci: f[1] = f[2] = 1.) What would functions like this do? My original notebook records the result in this case:


But a few minutes later I found something very different: a simple nestedly recursive function with what seemed like highly complex behavior:


I remembered seeing a somewhat similarly defined function discussed before. But the behavior I’d seen reported for that function, while intricate, was nested and ultimately highly regular. And, so far as I could tell, much like with rule 30 and all the other systems I’d investigated, nobody had ever seen serious complexity in simple recursive functions.

It was a nice example. But it was one among many. And when I published A New Kind of Science in 2002, I devoted just four pages (and 7 notes) to “recursive sequences”—even though the gallery I made of their behavior became a favorite page of mine:


A year after the book was published we held our first Wolfram Summer School, and as an opening event I decided to do a live computer experiment—in which I would try to make a real-time science discovery. The subject I chose was nestedly recursive functions. It took a few hours. But then, yes, we made a discovery! We found that there was a nestedly recursive function simpler than the ones I’d discussed in A New Kind of Science that already seemed to have very complex behavior:

Over the couple of decades that followed I returned many times to nestedly recursive functions—particularly in explorations I did with high school and other students, or in suggestions I made for student projects. Then recently I used them several times as “intuition-building examples” in various investigations.

I’d always felt my work with nestedly recursive functions was unfinished. Beginning about five years ago—particularly energized by our Physics Project—I started looking at harvesting seeds I’d sown in A New Kind of Science and before. I’ve been on quite a roll, with a few pages or even footnotes repeatedly flowering into rich book-length stories. And finally—particularly after my work last year on “Expression Evaluation and Fundamental Physics”—I decided it was time to try to finish my exploration of nestedly recursive functions.

Our modern Wolfram Language tools—as well as ideas from our Physics Project—provided some new directions to explore. But I still thought I pretty much knew what we’d find. And perhaps after all these years I should have known better. Because somehow in the computational universe—and in the world of ruliology—there are always surprises.

And here, yet again, there was indeed quite a surprise.

The Basic Idea

Consider the definition (later we’ll call this “P312”)

which we can also write as:

The first few values for f[n] generated from this definition are:
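In Wolfram Language terms (a minimal sketch, with memoization just to keep repeated evaluation cheap), the definition, with the initial condition f[n] = 1 for all n ≤ 0, and its first values are:

Clear[f];
f[n_ /; n <= 0] = 1;                      (* initial condition *)
f[n_] := f[n] = 3 + f[n - f[n - 2]];      (* P312 *)
Table[f[n], {n, 1, 12}]
(* {4, 7, 4, 4, 7, 10, 4, 4, 10, 13, 7, 4} *)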

Continuing further we get:

But how are these values actually computed? To see that we can make an “evaluation graph” in which we show how each value of f[n] is computed from ones with smaller values of n, here starting from f[20]:

The gray nodes represent initial conditions: places where f[n] was sampled for n ≤ 0. The two different colors of edges correspond to the two different computations done in evaluating each f[n]:
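One way to construct such an evaluation graph is simply to instrument the recursion so that it records which values each evaluation calls on (an illustrative sketch, not the code behind these pictures): edges n → n − 2 correspond to the inner evaluation of f[n − 2], and edges n → n − f[n − 2] to the outer one:

evaluationGraph[nmax_Integer] :=
  Module[{f, edges = {}},
   f[n_ /; n <= 0] = 1;
   f[n_] := f[n] = Module[{m},
      AppendTo[edges, n -> n - 2];          (* needs f[n - 2] ... *)
      m = n - f[n - 2];
      AppendTo[edges, n -> m];              (* ... and then f[n - f[n - 2]] *)
      3 + f[m]];
   f[nmax];
   Graph[Union[edges], VertexLabels -> Automatic]];

evaluationGraph[20]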

Continuing to f[30] we get:

But what’s the structure of this graph? If we pull out the “red” graph on its own, we can see that it breaks into two path graphs, that consist of the sequences of the f[n] for odd and even n, respectively:

The “blue” graph, on the other hand, breaks into four components—each always a tree—leading respectively to the four different initial conditions:

And for example we can now plot f[n], showing which tree each f[n] ends up being associated with:

We’ll be using this same basic setup throughout, though for different functions. We’ll mostly consider recursive definitions with a single term (i.e. with a single “outermost f”, not two, as in Fibonacci recurrences).

The specific families of recursive functions we’ll be focusing on are:

And with this designation, the function we just introduced is P312.

A Closer Look at P312 ( f[n_] := 3 + f[n – f[n – 2]] )

Let’s start off by looking in more detail at the function we just introduced. Here’s what it does up to n = 500:

It might seem as if it’s going to go on “seemingly randomly” forever. But if we take it further, we get a surprise: it seems to “resolve itself” to something potentially simpler:

What’s going on? Let’s plot this again, but now showing which “blue graph tree” each value is associated with:

And now what we see is that the f[–3] and f[–2] trees stop contributing to f[n] when n is (respectively) 537 and 296, and these trees are finite (and have sizes 53 and 15):

The overall structures of the “remaining” trees—here shown up to f[5000]—eventually start to exhibit some regularity:

We can home in on this regularity by arranging these trees in layers, starting from the root, then plotting the number of nodes in each successive layer:

Looking at these pictures suggests that there should be some kind of more-or-less direct “formula” for f[n], at least for large n. They also suggest that such a formula should have some kind of mod-6 structure. And, yes, there does turn out to be essentially a “formula”. Though the “formula” is quite complicated—and reminiscent of several other “strangely messy” formulas in other ruliological cases—like Turing machine 600720 discussed in A New Kind of Science or combinator s[s[s]][s][s][s][s].

Later on, we’ll see the much simpler recursive function P111 (f[n_] := 1 + f[n – f[n – 1]]). The values for this function form a sequence in which successive blocks of length k have value k:
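For concreteness, here’s the same kind of setup for P111 (again just a sketch, with f[n] = 1 for n ≤ 0; the name p111 is only for illustration):

Clear[p111];
p111[n_ /; n <= 0] = 1;
p111[n_] := p111[n] = 1 + p111[n - p111[n - 1]];
Table[p111[n], {n, 1, 15}]
(* {2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6} *)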

P312 has the same kind of structure, but much embellished. First, it has 6 separate riffled (“mod”) subsequences. Each subsequence then consists of a sequence of blocks. Given a value n, this computes which subsequence this is on, which block for that subsequence it’s in, and where it is within that block:

So, for example, here are results for multiples of 1000:

For n = 1000 we’re not yet in the “simple” regime, we can’t describe the sequence in any simple way, and our “indices” calculation is meaningless. For n = 2000 it so happens that we are at block 0 for the mod-1 subsequence. And the way things are set up, we just start by giving exactly the form of block 0 for each mod. So for mod 1 the block is:

But now n = 2000 has offset 16 within this block, so the final value of f[2000] is simply the 16th value from this list, or 100. f[2001] is then simply the next element within this block, or 109. And so on—until we reach the end of the block.

But what if we’re not dealing with block 0? For example, according to the table above, f[3000] is determined by mod-3 block 1. It turns out there’s a straightforward, if messy, way to compute any block b (for mod m):

So now we have a way to compute the value, say of f[3000], effectively just by “evaluating a formula”:

And what’s notable is that this evaluation doesn’t involve any recursion. In other words, at the cost of “messiness” we’ve—somewhat surprisingly—been able to unravel all the recursion in P312 to arrive at a “direct formula” for the value of f[n] for any n.

So what else can we see about the behavior of f[n] for P312? One notable feature is its overall growth rate. For large n, it turns out that f[n] ≈ √(6n) (as can be seen by substituting this form into the recursive definition and taking a limit).
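Here’s a sketch of the limit argument being alluded to (a reconstruction, not the original derivation): suppose f[n] ≈ c√n for large n. Then

f[n] − f[n − f[n − 2]] ≈ c√n − c√(n − c√n) → c²/2  as n → ∞

and since each application of the recursion adds exactly 3, we need c²/2 = 3, so that c = √6 and f[n] ≈ √(6n).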

One thing this means is that our evaluation graph eventually has a roughly conical form:

This can be compared to the very regular cone generated by P111 (which has asymptotic value ):

If one just looks at the form of the recursive definition for P312 it’s far from obvious “how far back” it will need to probe, or, in other words, what values of f[n] one will need to specify as initial conditions. As it turns out, though, the only values needed are f[–3], f[–2], f[–1] and f[0].

How can one see this? In 3 + f[n – f[n – 2]] it's only the outer f that can probe "far back" values. But how far it actually goes back depends on how much larger f[n – 2] gets compared to n. Plotting f[n – 2] and n together we have:

And the point is that only for very few values of n does f[n – 2] exceed n—and it’s these values that probe back. Meanwhile, for larger n, there can never be additional “lookbacks”, because f[n] only grows like .

So does any P312 recursion always have the same lookback? So far, we’ve considered specifically the initial condition f[n] = 1 for all n ≤ 0. But what if we change the value of f[0]? Here are plots of f[n] for different cases:

And it turns out that with f[0] = z, the lookback goes to –z for z ≥ 3, and to z – 4 for 1 ≤ z ≤ 2.

(If z ≤ 0 the function f[n] is basically not defined, because the recursion is trying to compute f[n] from f[n], f[n + 1], etc., so never “makes progress”.)

The case f[0] = 2 (i.e. z = 2) is the one that involves the least lookback—and a total of 3 initial values. Here is the evaluation graph in this case:

By comparison, here is the evaluation graph for the case f[0] = 5, involving 6 initial values:

If we plot the value of f[n] as a function of f[0] we get the following:

For n < f[0], f[n] always has simple behavior, and is essentially periodic in n with period 3:

And it turns out that for any specified initial configuration of values, there is always only bounded lookback—with the bound apparently being determined by the largest of the specified initial values.

So what about the behavior of f[n] for large n? Just like in our original f[0] = 1 case, we can construct “blue graph trees” rooted at each of the initial conditions. In the case f[0] = 1 we found that of the 4 trees only two continue to grow as n increases. As we vary f[0], the number of “surviving trees” varies quite erratically:

What if instead of just changing f[0], and keeping all other f[–k] = 1, we set f[n] = s for all n ≤ 0? The result is somewhat surprising:

For s ≥ 2, the behavior turns out to be simple—and similar to the behavior of P111.

So what can P312 be made to do if we change its initial conditions? With f[n] = 2 for n < 0 the behavior remains "tame", but as f[0] increases it starts showing its typical complexity:

One question to ask is what set of values f[n] takes on. Given that the initial values have certain residues mod 3, all subsequent values must have the same residues. But apart from this constraint, it seems that all values for f[n] are obtained—which is not surprising given that f[n] grows only like .

The “P Family”: f[n_] := a + f[n – b f[n – c]]

P312 is just one example of the “P family” of sequences defined by:

Here is the behavior of some other Pabc sequences:

And here are their evaluation graphs:

P312 is the first “seriously complex” example.

P111 (as mentioned earlier) has a particularly simple form

which corresponds to the simple formula:

The evaluation graph in this case is just:

Only a single initial condition f[0] = 1 is used, and there is only a single “blue graph tree” with a simple form:

Another interesting case is P123:

Picking out only odd values of n we get:

This might look just like the behavior of P111. But it’s not. The lengths of the successive “plateaus” are now

with differences:

But this turns out to be exactly a nested sequence generated by joining together the successive steps in the evolution of the substitution system:

P123 immediately “gets into its final behavior”, even for small n. But—as we saw rather dramatically with P312—there can be “transient behavior” that doesn’t “resolve” until n is large. A smaller case of this phenomenon occurs with P213. Above n = 68 it shows a simple “square root” pattern of behavior, basically like P111. But for smaller n it’s a bit more complicated:

And in this case the transients aren’t due to “blue graph trees” that stop growing. Instead, there are only two trees (associated with f[0] and f[–1]), but both of them soon end up growing in very regular ways:

The “T Family”: f[n_] := a f[n – b f[n – c]]

What happens if our outermost operation is not addition, but multiplication?

Here are some examples of the behavior one gets. In each case we’re plotting on a log scale—and we’re not including T1xx cases, which are always trivial:

We see that some sequences have regular and readily predictable behavior, but others do not. And this is reflected in the evaluation graphs for these functions:

The first “complicated case” is T212:

The evaluation graph for f[50] in this case has the form:

And something that’s immediately notable is that in addition to “looking back” to the values of f[0] and f[–1], this also looks back to the value of f[24]. Meanwhile, the evaluation graph for f[51] looks back not only to f[0] and f[–1] but also to f[–3] and f[–27]:

How far back does it look in general? Here’s a plot showing which lookbacks are made as a function of n (with the roots of the “blue graph trees” highlighted):

There’s alternation between behaviors for even and odd n. But apart from that, additional lookbacks are just steadily added as n increases—and indeed the total number of lookbacks seems to follow a simple pattern:

But—just for once—if one looks in more detail, it’s not so simple. The lengths of the successive “blocks” are:

So, yes, the lookbacks are quite “unpredictable”. But the main point here is that—unlike for the P family—the number of lookbacks isn’t limited. In a sense, to compute T212 for progressively larger n, progressively more information about its initial conditions is needed.

When one deals with ordinary, unnested recurrence relations, one’s always dealing with a fixed lookback. And the number of initial conditions then just depends on the lookback. (So, for example, the Fibonacci recurrence has lookback 2, so needs two initial conditions, while the standard factorial recurrence has lookback 1, so needs only one initial condition.)

But for the nested recurrence relation T212 we see that this is no longer true; there can be an unboundedly large lookback.

OK, but let’s look back at the actual T212 sequence. Here it is up to larger values of n:

Or, plotting each point as a dot:

Given the recursive definition of f[n], the values of f[n] must always be powers of 2. This shows where each successive power of 2 is first reached as a function of n:

Meanwhile, this shows the accumulated average of f[n] as a function of n:

This is well fit by 0.38 Log[n], implying that, at least with this averaging, f[n] asymptotically approximates n^0.26. And, yes, it is somewhat surprising that what seems like a very "exponential" recursive definition should lead to an f[n] that increases only like a power. But, needless to say, this is the kind of surprise one has to expect in the computational universe.

It’s worth noticing that f[n] fluctuates very intensely as a function of n. The overall distribution of values is very close to exponentially distributed—for example with the distribution of logarithmic values of f[n] for n between 9 million and 10 million being:

What else can we say about this sequence? Let's say we take the exponent of 2 in each f[n] and reduce it mod 2. Then we get a sequence which starts:

This is definitely not "uniformly random". But if one looks at blocks of sequential values, one can plot at what n each of the 2^b possible configurations of a length-b block first appears:

And eventually it seems as if all length-b blocks for any given b will appear.

By the way, whereas in the P family, there were always a limited number of “blue graph trees” (associated with the limited number of initial conditions), for T212 the number of such trees increases with n, as more initial conditions are used. So, for example, here are the trees for f[50] and f[51]:

We’ve so far discussed T212 only with the initial condition f[n] = 1 for n ≤ 0. The fact that f[n] is always a power of 2 relies on every initial value also being a power of 2. But here’s what happens, for example, if f(n) = 2s for n ≤ 0:

In general, one can think of T212 as transforming an ultimately infinite sequence of initial conditions into an infinite sequence of function values, with different forms of initial conditions potentially giving very different sequences of function values:

(Note that not all choices of initial conditions are possible; some lead to "f[n] = f[n]" or "f[n] = f[n + 1]" situations, where the evaluation of the function can't "make progress".)

The “Summer School” Sequence T311 (f[n_] := 3 f[n – f[n – 1]])

Having explored T212, let’s now look at T311—the original one-term nestedly recursive function discovered at the 2003 Wolfram Summer School:

Here’s its basic behavior:

And here is its evaluation graph—which immediately reveals a lot more lookback than T212:

Plotting lookbacks as a function of n we get:

Much as with T212, the total number of lookbacks varies with n in a fairly simple way (~ 0.44 n):

Continuing the T311 sequence further, it looks qualitatively very much like T212:

And indeed T311—despite its larger number of lookbacks—seems to basically behave like T212. In a story typical of the Principle of Computational Equivalence, T212 seems to have already “filled out the computational possibilities”, so T311 “doesn’t have anything to add”.

The “S Family”: f[n_] := n – f[f[n – a] – b]

As another (somewhat historically motivated) example of nestedly recursive functions, consider what we’ll call the “S family”, defined by:

Let’s start with the very minimal case S10 (or “S1”):

Our standard initial condition f[n] = 1 for n ≤ 0 doesn’t work here, because it implies that f[1] = 1 – f[1]. But if we take f[n] = 1 for n ≤ 1 we get:

Meanwhile, with f[n] = 1 for n ≤ 3 we get:

The first obvious feature of both these results is their overall slope: 1/ϕ ≈ 0.618, where ϕ is the golden ratio. It's not too hard to see why one gets this slope. Assume that for large n we can take f[n] = σ n. Then substitute this form into both sides of the recursive definition for the S family to get σ n == n – σ (σ (n – a) – b). For large n all that survives is the condition for the coefficients of n

which has solution σ = 1/ϕ.
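
As a quick check (writing x for σ), the coefficient condition is x == 1 – x^2, and its positive solution is indeed 1/ϕ:

Solve[x == 1 - x^2, x]                  (* x = (-1 ± Sqrt[5])/2 *)
N[{(-1 + Sqrt[5])/2, 1/GoldenRatio}]    (* {0.618034, 0.618034} *)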

Plotting f[n] – n/ϕ for the case f[n] = 1 for n ≤ 1 we get:

The evaluation graph in this case has a fairly simple form

as we can see even more clearly with a different graph layout:

It’s notable that only the initial condition f[1] = 1 is used—leading to a single “blue graph tree” that turns out to have a very simple “Fibonacci tree” form (which, as we’ll discuss below, has been known since the 1970s):

From this it follows that f[n] is related to the "Fibonacci-like" substitution system

and in fact the sequence of values of f[n] can be computed just as:

And indeed it turns out that in this case f[n] is given exactly by:
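
One closed form consistent with this setup, stated here just as an easily checked observation (it may or may not be the exact formula originally given), is f[n] == Floor[(n + 1)/ϕ]:

Clear[f];
f[n_ /; n <= 1] := 1;
f[n_] := f[n] = n - f[f[n - 1]];                                 (* S1 with f[n] = 1 for n <= 1 *)
And @@ Table[f[n] == Floor[(n + 1)/GoldenRatio], {n, 1, 1000}]   (* True *)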

What about when f[n] = 1 not just for n ≤ 1 but beyond? For n ≤ 2 the results are essentially the same as for n ≤ 1. But for n ≤ 3 there’s a surprise: the behavior is considerably more complicated—as we can see if we plot f[n] – n/ϕ:

Looking at the evaluation graph in this case we see that the only initial conditions sampled are f[1] = 1 and f[3] = 1 (with f[2] only being reached if one specifically starts with f[2]):

And continuing the evaluation graph we see a mixture of irregularity and comparative regularity:

The plot of f[n] has a strange "hand-drawn" appearance, with overall regularity but detailed apparent randomness. The most obvious large-scale feature is "bursting" behavior (interspersed in an audio rendering with an annoying hum). The bursts all seem to have approximately (though not exactly) the same structure—and get systematically larger. The lengths of successive "regions of calm" between bursts (characterized by runs in which Abs[f[n] – n/ϕ] stays small) are:

What happens to S1 with other initial conditions? Here are a few examples:

So how does Sa depend on a? Sometimes there’s at least a certain amount of clear regularity; sometimes it’s more complicated:

As is very common, adding the parameter b in the definition doesn’t seem to lead to fundamentally new behavior—though for b > 0 the initial condition f[n] = 1, n ≤ 0 can be used:

In all cases, only a limited number of initial conditions are sampled (bounded by the value of a + b in the original definition). But as we can see, the behavior can either be quite simple, or can be highly complex.

More Complicated Rules

Highly complex behavior arises even from very simple rules. It’s a phenomenon one sees all over the computational universe. And we’re seeing it here in nestedly recursive functions. But if we make the rules (i.e. definitions) for our functions more complicated, will we see fundamentally different behavior, or just more of the same?

The Principle of Computational Equivalence (as well as many empirical observations of other systems) suggests that it’ll be “more of the same”: that once one’s passed a fairly low threshold the computational sophistication—and complexity—of behavior will no longer change.

And indeed this is what one sees in nestedly recursive functions. But below the threshold different kinds of things can happen with different kinds of rules.

There are several directions in which we can make rules more complicated. One that we won’t discuss here is to use operations (conditional, bitwise, etc.) that go beyond arithmetic. Others tend to involve adding more instances of f in our definitions.

An obvious way to do this is to take f[n_] to be given by a sum of terms, “Fibonacci style”. There are various specific forms one can consider. As a first example—that we can call ab—let’s look at:

The value of a doesn’t seem to matter much. But changing b we see:

12 has unbounded lookback (at least starting with f[n] = 1 for n ≤ 0), but for larger b, 1b has bounded lookback. In both 13 and 15 there is continuing large-scale structure (here visible in log plots)

though this does not seem to be reflected in the corresponding evaluation graphs:

As another level of Fibonacci-style definition, we can consider ab:

But the typical behavior here does not seem much different from what we already saw with one-term definitions involving only two f’s:

(Note that aa is equivalent to a. Cases like 13 lead after a transient to pure exponential growth.)

A somewhat more unusual case is what we can call abc:

Subtracting overall linear trends we get:

For 111 using initial conditions f[1] = f[2] = 1 and plotting f[n] – n/2 we get

which has a nested structure that is closely related to the result of concatenating binary digit sequences of successive integers:

But despite the regularity in the sequence of values, the evaluation graph for this function is not particularly simple:

So how else might we come up with more complicated rules? One possibility is that instead of “adding f’s by adding terms” we can add f’s by additional nesting. So, for example, we can consider what we can call S31 (here shown with initial condition f[n] = 1 for n ≤ 3):

We can estimate the overall slope here by solving for x in x == 1 – x^3 to get ≈ 0.682. Subtracting this off we get:

We can also consider deeper nestings. At depth d the slope is the solution to x == 1 – x^d. Somewhat remarkably, in all cases the only initial conditions probed are f[1] = 1 and f[3] = 1:

As another example of “higher nesting” we can consider the class of functions (that we call a):

Subtracting a constant 1/ϕ slope we get:

The evaluation graph for 1 is complicated, but has some definite structure:

What happens if we nest even more deeply, say defining a functions:

With depth-d nesting, we can estimate the overall slope of f[n] by solving for x in

or

so that for the d = 3 case here the overall slope is the real root of the corresponding equation, about 0.544. Subtracting out this overall slope we get:

And, yes, the sine-curve-like form of 5 is very odd. Continuing 10x longer, though, things are “squaring off”:

What happens if we continue nesting deeper? stays fairly tame:

However, already allows for more complicated behavior:

And for different values of a there are different regularities:

There are all sorts of other extensions and generalizations one might consider. Some involve alternate functional forms; others involve introducing additional functions, or allowing multiple arguments to our function f.

An Aside: The Continuous Case

In talking about recursive functions f[n] we’ve been assuming—as one normally does—that n is always an integer. But can we generalize what we’re doing to functions f[x] where x is a continuous real number?

Consider for example a continuous analog of the Fibonacci recurrence:
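
Here's a sketch of what such a continuous analog can look like, assuming for definiteness the initial condition f[x] = 1 for x ≤ 1:

Clear[f];
f[x_ /; x <= 1] := 1;                    (* initial condition given on an interval *)
f[x_?NumericQ] := f[x - 1] + f[x - 2];   (* continuous "Fibonacci delay equation" *)
Plot[f[x], {x, 0, 10}]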

This produces a staircase-like function whose steps correspond to the usual Fibonacci numbers:

Adjusting the initial condition produces a slightly different result:

We can think of these as being solutions to a kind of “Fibonacci delay equation”—where we’ve given initial conditions not at discrete points, but instead on an interval.

So what happens with nestedly recursive functions? We can define an analog of S1 as:

Plotting this along with the discrete result we get:

In more detail, we get

where now the plateaus occur at the "Wythoff numbers".

Changing the initial condition to be x ≤ 3 we get:

Removing the overall slope by subtracting x/ϕ gives:

One feature of the continuous case is that one can continuously change initial conditions—though the behavior one gets typically breaks into "domains" with discontinuous boundaries, as in this case where we're plotting the value of f[x] as a function of x and of the "cutoff" below which the initial condition f[x] = 1 applies:

So what about other rules? A rule like P312 (f[n_] := 3 + f[n – f[n – 2]]) given "constant" initial conditions effectively just copies and translates the initial interval, and gives a simple order-0 interpolation of the discrete case. With initial condition f[x] = x some segments get "tipped":

All the cases we’ve considered here don’t “look back” to negative values, in either the discrete or continuous case. But what about a rule like T212 (f[n_] := 2 f[n – 1 f[n – 2]]) that progressively “looks back further”? With the initial condition f[x] = 1 for x ≤ 0, one gets the same result as in the discrete case:

But if one uses the initial condition f[x] = Abs[x – 1] for x ≤ 0 (the Abs[x – 1] is needed to avoid ending up with f[x] depending on f[y] for y > x) one instead has

yielding the rather different result:

Continuing for larger x (on a log scale) we get:

Successively zooming in on one of the first “regions of noise” we see that it ultimately consists just of a large number of straight segments:

What’s going on here? If we count the number of initial conditions that are used for different values of x we see that this has discontinuous changes, leading to disjoint segments in f[x]:

Plotting over a larger range of x values the number of initial conditions used is:

And plotting the actual values of those initial conditions we get:

If we go to later, "more intense" regions of noise, we see more fragmentation—and presumably in the limit x → ∞ we get the analog of an essential singularity in f[x]:

For the S family, with its overall n/ϕ trend, even constant initial conditions—say for S1—already lead to tipping, here shown compared to the discrete case:

How Do You Actually Compute Recursive Functions?

Let’s say we have a recursive definition—like the standard Fibonacci one:

How do we actually use this to compute the value of, say, f[7]? Well, we can start from f[7], then use the definition to write this as f[6] + f[5], then write f[6] as f[5] + f[4], and so on. And we can represent this using an evaluation graph, in the form:

But this computation is in a sense very wasteful; for example, it's independently computing f[3] five separate times (and of course getting the same answer each time). But what if we just stored each f[n] as soon as we compute it, and then just retrieve that stored ("cached") value whenever we need it again?

In the Wolfram Language, it’s a very simple change to our original definition:

And now our evaluation graph becomes much simpler:

And indeed it’s this kind of minimal evaluation graph that we’ve been using in everything we’ve discussed so far.

What’s the relationship between the “tree” evaluation graph, and this minimal one? The tree graph is basically an “unrolled” version of the minimal graph, in which all the possible paths that can be taken from the root node to the initial condition nodes have been treed out.

In general, the number of edges that come out of a single node in an evaluation graph will be equal to the number of instances of the function f that appear on the right-hand side of the recursive definition we're using (i.e. 2 in the case of the standard Fibonacci definition). So this means that if the maximum length of path from the root to the initial conditions is s, the maximum number of nodes that can appear in the "unrolled" graph is 2^s. And whenever there are a fixed set of initial conditions (i.e. if there's always the same lookback), the maximum path length is essentially n—implying in the end that the maximum possible number of nodes in the unrolled graph will be 2^n.

(In the actual case of the Fibonacci recurrence, the number of nodes in the unrolled graph is 2 Fibonacci[n] – 1, or about 1.6^n.)

But if we actually evaluate f[7]—say in the Wolfram Language—what is the sequence of f[n]’s that we’ll end up computing? Or, in effect, how will the evaluation graph be traversed? Here are the results for the unrolled and minimal evaluation graphs—i.e. without and with caching:

Particularly in the first case this isn't the only conceivable result we could have gotten. It's the way it is here because of the particular "leftmost innermost" evaluation order that the Wolfram Language uses by default. In effect, we're traversing the graph in a depth-first way. In principle we could use other traversal orders, leading to f[n]'s being evaluated in different orders. But unless we allow other operations (like f[3] + f[3] → 2 f[3]) to be interspersed with f evaluations, we'll still always end up with the same number of f evaluations for a given evaluation graph.

But which is the “correct” evaluation graph? The unrolled one? Or the minimal one? Well, it depends on the computational primitives we’re prepared to use. With a pure stack machine, the unrolled graph is the only one possible. But if we allow (random-access) memory, then the minimal graph becomes possible.

OK, so what happens with nestedly recursive functions? Here, for example, are unrolled and minimal graphs for T212:

Here are the sequences of f[n]’s that are computed:

And here’s a comparison of the number of nodes (i.e. f evaluations) from unrolled and minimal evaluation graphs (roughly 1.2n and 0.5 n, respectively):

Different recursive functions lead to different patterns of behavior. The differences are less obvious in evaluation graphs, but can be quite obvious in the actual sequence of f[n]’s that are evaluated:

But although looking at evaluation sequences from unrolled evaluation graphs can be helpful as a way of classifying behavior, the exponentially greater number of steps involved in the unrolled graph typically makes this impractical.

Primitive Recursive or Not?

Recursive functions have a fairly long history, that we’ll be discussing below. And for nearly a hundred years there’s been a distinction made between “primitive recursive functions” and “general recursive functions”. Primitive recursive functions are basically ones where there’s a “known-in-advance” pattern of computation that has to be done; general recursive functions are ones that may in effect make one have to “search arbitrarily far” to get what one needs.

In Wolfram Language terms, primitive recursive functions are roughly ones that can be constructed directly using functions like Nest and Fold (perhaps nested); general recursive functions can also involve functions like NestWhile and FoldWhile.

So, for example, with the Fibonacci definition

the function f[n] is primitive recursive and can be written, say, as:
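
One such form, given here as an illustrative sketch (not necessarily the construction originally shown), uses Fold to build up successive pairs of Fibonacci values:

Clear[f];
f[n_] := Last[Fold[{Last[#1], Total[#1]} &, {1, 1}, Range[n - 2]]]
f /@ Range[10]     (* 1, 1, 2, 3, 5, 8, 13, 21, 34, 55 *)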

Lots of the functions one encounters in practice are similarly primitive recursive—including most “typical mathematical functions” (Plus, Power, GCD, Prime, …). And for example functions that give the results of n steps in the evolution of a Turing machine, cellular automaton, etc. are also primitive recursive. But functions that for example test whether a Turing machine will ever halt (or give the state that it achieves if and when it does halt) are not in general primitive recursive.

On the face of it, our nestedly recursive functions seem like they must be primitive recursive, since they don’t for example appear to be “searching for anything”. But things like the presence of longer and longer lookbacks raise questions. And then there’s the potential confusion of the very first example (dating from the late 1920s) of a recursively defined function known not to be primitive recursive: the Ackermann function.

The Ackermann function has three (or sometimes two) arguments—and, notably, its definition (here given in its classic form) includes nested recursion:

This is what the evaluation graphs look like for some small cases:

Looking at these graphs we can begin to see a pattern. And in fact there's a simple interpretation: f[m, x, y] for successive m is doing progressively more nested iterations of integer successor operations. f[0, x, y] computes x + y; f[1, x, y] does "repeated addition", i.e. computes x × y; f[2, x, y] does "repeated multiplication", i.e. computes x^y; f[3, x, y] does "tetration", i.e. computes the "power tower" Nest[x^# &, 1, y]; etc.
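
Here's a sketch of a three-argument definition with exactly this behavior; the base cases are chosen for simplicity and differ in detail from Ackermann's original ones:

ack[0, x_, y_] := x + y;
ack[1, x_, 0] := 0;
ack[2, x_, 0] := 1;
ack[m_ /; m >= 3, x_, 0] := 1;
ack[m_ /; m >= 1, x_, y_ /; y >= 1] := ack[m - 1, x, ack[m, x, y - 1]];
{ack[1, 3, 4], ack[2, 3, 4], ack[3, 2, 3]}     (* {12, 81, 16}, i.e. 3*4, 3^4, 2^2^2 *)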

Or, alternatively, these can be given explicitly in successively more nested form:

And at least in this form f[m, x, y] involves m nestings. But a given primitive recursive function can involve only a fixed number of nestings. It might be conceivable that we could rewrite f[m, x, y] in certain cases to involve only a fixed number of nestings. But if we look at f[m, m, m] then this turns out to inevitably grow too rapidly to be represented by a fixed number of nestings—and thus cannot be primitive recursive.

But it turns out that the fact that this can happen depends critically on the Ackermann function having more than one argument—so that one can construct the “diagonal” f[m, m, m].

So what about our nestedly recursive functions? Well, at least in the form that we’ve used them, they can all be written in terms of Fold. The key idea is to accumulate a list of values so far (conveniently represented as an association)—sampling whichever parts are needed—and then at the end take the last element. So for example the “Summer School function” T311

can be written:
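
A minimal sketch of what such a Fold-plus-association form can look like (the names here are illustrative):

fT311[nmax_] := Last[Fold[
   Function[{values, n},
    Append[values, n -> 3 Lookup[values, n - Lookup[values, n - 1, 1], 1]]],
   <||>, Range[nmax]]]
fT311 /@ Range[8]     (* 3, 3, 3, 9, 3, 9, 3, 9 *)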

An important feature here is that we’re getting Lookup to give 1 if the value it’s trying to look up hasn’t been filled in yet, implementing the fact that f[n] = 1 for n ≤ 0.

So, yes, our recursive definition might look back further and further. But it always just finds value 1—which is easy for us to represent without, for example, any extra nesting, etc.

The ultimate (historical) definition of primitive recursion, though, doesn’t involve subsets of the Wolfram Language (the definition was given almost exactly 100 years too early!). Instead, it involves a specific set of simple primitives:

(An alternative, equivalent definition for recursion—explicitly involving Fold—is r[g_, h_] := Fold[Function[{u, v}, h[u, v, ##2]], g[##2], Range[0, #1 – 1]] &.)

So can our nestedly recursive functions be written purely in terms of these primitives? The answer is yes, though it's seriously complicated. A simple function like Plus can for example be written as r[p[1], s], so that e.g. r[p[1], s][2, 3] gives 5. Times can be written as r[z, c[Plus, p[1], p[3]]] or r[z, c[r[p[1], s], p[1], p[3]]], while Factorial can be written as r[c[s, z], c[Times, p[1], c[s, p[2]]]]. But even Fibonacci, for example, seems to require a very much longer specification.
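
As a concrete check, here's one possible set of definitions for such primitives (the particular forms for z, s, p and c are illustrative assumptions, and r is written out explicitly with the same argument convention as the Fold form above) under which the combinations just mentioned evaluate as stated:

z = 0 &;                                     (* zero *)
s = #1 + 1 &;                                (* successor *)
p[i_] := Slot[i] &;                          (* projection onto the i-th argument *)
c[fun_, gs__] := fun @@ Through[{gs}[##]] &; (* composition *)
r[g_, h_][0, xs___] := g[xs];                (* primitive recursion, written out explicitly *)
r[g_, h_][n_, xs___] := h[r[g, h][n - 1, xs], n - 1, xs];
{r[p[1], s][2, 3], r[z, c[Plus, p[1], p[3]]][4, 5], r[c[s, z], c[Times, p[1], c[s, p[2]]]][5]}     (* {5, 20, 120} *)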

In writing “primitive-recursive-style” definitions in Wolfram Language we accumulated values in lists and associations. But in the ultimate definition of primitive recursion, there are no such constructs; the only form of “data” is positive integers. But for our definitions of nestedly recursive functions we can use a “tupling function” that “packages up” any list of integer values into a single integer (and an untupling function that unpacks it). And we can do this say based on a pairing (2-element-tupling) function like:

But what about the actual If[n ≤0, 1, ...] lookback test itself? Well, If can be written in primitive recursive form too: for example, r[c[s, z], c[f, c[s, p[2]]]][n] is equivalent to If[n ≤ 0, 1, f[n]].

So our nestedly recursive functions as we’re using them are indeed primitive recursive. Or, more strictly, finding values f[n] is primitive recursive. Asking questions like “For what n does f[n] reach 1000?” might not be primitive recursive. (The obvious way of answering them involves a FoldWhile-style non-primitive-recursive search, but proving that there’s no primitive recursive way to answer the question is likely very much harder.)

By the way, it’s worth commenting that while for primitive recursive functions it’s always possible to compute a value f[n] for any n, that’s not necessarily true for general recursive functions. For example, if we ask “For what n does f[n] reach 1000?” there might simply be no answer to this; f[n] might never reach 1000. And when we look at the computations going on underneath, the key distinction is that in evaluating primitive recursive functions, the computations always halt, while for general recursive functions, they may not.

So, OK. Our nestedly recursive functions can be represented in “official primitive recursive form”, but they’re very complicated in that form. So that raises the question: what functions can be represented simply in this form? In A New Kind of Science I gave some examples, each minimal for the output it produces:

And then there’s the most interesting function I found:

It’s the simplest primitive recursive function whose output has no obvious regularity:

Because it’s primitive recursive, it’s possible to express it in terms of functions like Fold—though it’s two deep in those, making it in some ways more complicated (at least as far as the Grzegorczyk hierarchy that counts “Fold levels” is concerned) than our nestedly recursive functions:

But there’s still an issue to address with nestedly recursive functions and primitive recursion. When we have functions (like T212) that “reach back” progressively further as n increases, there’s a question of what they’ll find. We’ve simply assumed f[n] = 1 for n ≤0. But what if there was something more complicated there? Even if f[–m] was given by some primitive recursive function, say p[m], it seems possible that in computing f[n] one could end up somehow “bouncing back and forth” between positive and negative arguments, and in effect searching for an m for which p[m] has some particular value, and in doing that searching one could find oneself outside the domain of primitive recursive functions.

And this raises yet another question: are all definitions we can give of nestedly recursive functions consistent? Consider for example:

Now ask: what is f[1]? We apply the recursive definition. But it gives us f[1] = 1 – f[f[0]] or f[1] = 1 – f[1] (which no integer value of f[1] can satisfy), or, in other words, an inconsistency. There are many such inconsistencies that seem to "happen instantly" when we apply definitions. But it seems conceivable that there could be "insidious inconsistencies" that show up only after many applications of a recursive definition. And it's also conceivable that one could end up with "loops" like f[i] = f[i]. And things like this could be reasons that f[n] might not be a "total function", defined for all n.

We’ve seen all sorts of complex behavior in nestedly recursive functions. And what the Principle of Computational Equivalence suggests is that whenever one sees complex behavior, one must in some sense be dealing with computations that are “as sophisticated as any computation can be”. And in particular one must be dealing with computations that can somehow support computation universality.

So what would it mean for a nestedly recursive function to be universal? For a start, one would need some way to “program” the function. There seem to be a couple of possibilities. First, one could imagine packing both “code” and “data” into the argument n of f[n]. So, for example, one might use some form of tupling function to take a description of a rule and an initial state for a Turing machine, together with a specification of a step number, then package all these things into an integer n that one feeds into one’s universal nestedly recursive function f. Then the idea would be that the value computed for f[n] could be decoded to give the state of the Turing machine at the specified step. (Such a computation by definition always halts—but much as one computes with Turing machines by successively asking for the next steps in their evolution, one can imagine setting up a “harness” that just keeps asking for values of f[n] at an infinite progression of values n.)

Another possible approach to making a universal nestedly recursive function is to imagine feeding in a “program” through the initial conditions one gives for the function. There might well need to be decoding involved, but in some sense what one might hope is that just by changing its initial conditions one could get a nestedly recursive function with a specific recursive definition to emulate a nestedly recursive function with any other recursive definition (or, say, for a start, any linear recurrence).

Perhaps one could construct a complicated nestedly recursive function that would have this property. But what the Principle of Computational Equivalence suggests is that it should be possible to find the property even in “naturally occurring cases”—like P312 or T212.

The situation is probably going to be quite analogous to what happened with the rule 110 cellular automaton or the s = 2, k = 3 Turing machine 596440. By looking at the actual typical behavior of the system one got some intuition about what was likely to be going on. And then later, with great effort, it became possible to actually prove computation universality.

In the case of nestedly recursive functions, we've seen here examples of just how diverse the behavior generated by changing initial conditions can be. It's not clear how to harness this diversity to extract some form of universality. But it seems likely that the "raw material" is there. And that nestedly recursive functions will show themselves as able to join so many other systems in fitting into the framework defined by the Principle of Computational Equivalence.

Some History

Once one has the concept of functions and the concept of recursion, nestedly recursive functions aren’t in some sense a “complicated idea”. And between this fact and the fact that nestedly recursive functions haven’t historically had a clear place in any major line of mathematical or other development it’s quite difficult to be sure one’s accurately tracing their history. But I’ll describe here at least what I currently know.

The concept of something like recursion is very old. It’s closely related to mathematical induction, which was already being used for proofs by Euclid around 300 BC. And in a quite different vein, around the same time (though not recorded in written form until many centuries later) Fibonacci numbers arose in Indian culture in connection with the enumeration of prosody (“How many different orders are there in which to say the Sanskrit words in this veda?”).

Then in 1202 Leonardo Fibonacci, at the end of his calculational math book Liber Abaci (which was notable for popularizing Hindu-Arabic numerals in the West) stated—more or less as a recreational example—his “rabbit problem” in recursive form, and explicitly listed the Fibonacci numbers up to 377. But despite this early appearance, explicit recursively defined sequences remained largely a curiosity until as late as the latter part of the twentieth century.

The concept of an abstract function began to emerge with calculus in the late 1600s, and became more solidified in the 1700s—but basically always in the context of continuous arguments. A variety of specific examples of recurrence relations—for binomial coefficients, Bernoulli numbers, etc.—were in fairly widespread use. But there didn’t seem to have yet been a sense that there was a general mathematical structure to study.

In the course of the 1800s there had been an increasing emphasis on rigor and abstraction in mathematics, leading by the latter part of the century to a serious effort to axiomatize concepts associated with numbers. Starting with concepts like the recursive definition of integers by repeated application of the successor operation, by the time of Peano’s axioms for arithmetic in 1891 there was a clear general notion (particularly related to the induction axiom) that (integer) functions could be defined recursively. And when David Hilbert’s program of axiomatizing mathematics got underway at the beginning of the 1900s, it was generally assumed that all (integer) functions of interest could actually be defined specifically using primitive recursion.

The notation for recursively specifying functions gradually got cleaner, making it easier to explore more elaborate examples. And in 1927 Wilhelm Ackermann (a student of Hilbert’s) introduced (in completely modern notation) a “reasonable mathematical function” that—as we discussed above—he showed was not primitive recursive. And right there, in his paper, without any particular comment, is a nestedly recursive function definition:

Ackermann nestedly recursive function paper

Ackermann nestedly recursive function definition

In 1931 Kurt Gödel further streamlined the representation of recursion, and solidified the notion of general recursion. There soon developed a whole field of recursion theory—though most of it was concerned with general issues, not with specific, concrete recursive functions. A notable exception was the work of Rózsa Péter (Politzer), beginning in the 1930s, and leading in 1957 to her book Recursive Functions—which contains a chapter on “Nested Recursion” (here in English translation):

Nested recursion book chapter

But despite the many specific (mostly primitive) recursive functions discussed in the rest of the book, this chapter doesn’t stray far from the particular function Ackermann defined (or at least Péter’s variant of it).

What about the recreational mathematics literature? By the late 1800s there were all sorts of publications involving numbers, games, etc. that at least implicitly involved recursion (an example being Édouard Lucas’s 1883 Tower of Hanoi puzzle). But—perhaps because problems tended to be stated in words rather than mathematical notation—it doesn’t seem as if nestedly recursive functions ever showed up.

In the theoretical mathematics literature, a handful of somewhat abstract papers about “nested recursion” did appear, an example being one in 1961 by William Tait, then at Stanford:

Nested recursion paper by William Tait

But, meanwhile, the general idea of recursion was slowly beginning to go from purely theoretical to more practical. John McCarthy—who had coined the term “artificial intelligence”—was designing LISP as “the language for AI” and by 1960 was writing papers with titles like “Recursive Functions of Symbolic Expressions and Their Computation by Machine”.

In 1962 McCarthy came to Stanford to found the AI Lab there, bringing with him enthusiasm for both AI and recursive functions. And by 1968 these two topics had come together in an effort to use “AI methods” to prove properties of programs, and in particular programs involving recursive functions. And in doing this, John McCarthy came up with an example he intended to be awkward—that’s exactly a nestedly recursive function:

John McCarthy nestedly recursive function example

In our notation, it would be:
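
Presumably something like this (the standard statement of the 91-function, written in the f[n_] style used here):

Clear[f];
f[n_] := If[n > 100, n - 10, f[f[n + 11]]]
Table[f[n], {n, 95, 105}]     (* 91, 91, 91, 91, 91, 91, 91, 92, 93, 94, 95 *)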

And it became known as “McCarthy’s 91-function” because, yes, for many n, f[n] = 91. These days it’s trivial to evaluate this function—and to find out that f[n] = 91 only up to n = 102:

But even the evaluation graph is somewhat large

and in pure recursive evaluation the recursion stack can get deep—which back then was a struggle for LISP systems to handle.

There were efforts at theoretical analysis, for example by Zohar Manna, who in 1974 published Mathematical Theory of Computation which—in a section entitled "Fixpoints of Functionals"—presents the 91-function and other nestedly recursive functions, particularly in the context of evaluation-order questions.

In the years that followed, a variety of nestedly recursive functions were considered in connection with proving theorems about programs, and with practical assessments of LISP systems, a notable example being Ikuo Takeuchi’s 1978 triple recursive function:

Ikuo Takeuchi triple recursive function example

But in all these cases the focus was on how these functions would be evaluated, not on what their behavior would be (and it was typically very simple).

But now we have to follow another thread in the story. Back in 1961, right on the Stanford campus, a then-16-year-old Douglas Hofstadter was being led towards nestedly recursive functions. As Doug tells it, it all started with him noticing that between successive squares there are always either 1 or 2 triangular numbers, and then noticing patterns in the sequence of those gaps (and later realizing that they showed nesting). Meanwhile, at Stanford he had access to a computer running Algol, a language which (like LISP and unlike Fortran) supported recursion (though this wasn't particularly advertised, since recursion was still generally considered quite obscure).

And as Doug tells it, within a year or two he was using Algol to do things like recursively create trees representing English sentences. Meanwhile—in a kind of imitation of the Eleusis "guess-a-card-rule" game—Doug was apparently challenging his fellow students to a "function game" based on guessing a math function from specified values. And, as he tells it, he found that functions that were defined recursively were the ones people found hardest to guess.

That was all in the early 1960s, but it wasn’t until the mid-1970s that Doug Hofstadter returned to such pursuits. After various adventures, Doug was back at Stanford—writing what became his book Gödel, Escher, Bach. And in 1977 he sent a letter to Neil Sloane, creator of the 1973 A Handbook of Integer Sequences (and what’s now the Online Encyclopedia of Integer Sequences, or OEIS):

Douglas Hofstadter letter to Neil Sloane

As suggested by the accumulation of “sequence ID” annotations on the letter, Doug’s “eta sequences” had actually been studied in number theory before—in fact, since at least the 1920s (they are now usually called Beatty sequences). But the letter went on, now introducing some related sequences—that had nestedly recursive definitions:

Sequences with nestedly recursive definitions

As Doug pointed out, these particular sequences (which were derived from golden ratio versions of his “eta sequences”) have a very regular form—which we would now call nested. And it was the properties of this form that Doug seemed most concerned about in his letter. But actually, as we saw above, just a small change in initial conditions in what I’m calling S1 would have led to much wilder behavior. But that apparently wasn’t something Doug happened to notice. A bit later in the letter, though, there was another nestedly recursive sequence—that Doug described as a “horse of an entirely nother color”: the “absolutely CRAZY” Q sequence:

Crazy Q sequence

Two years later, Doug’s Gödel, Escher, Bach book was published—and in it, tucked away at the bottom of page 137, a few pages after a discussion of recursive generation of text (with examples such as “the strange bagels that the purple cow without horns gobbled”), there was the Q sequence:

Chaotic Q sequence

Strangely, though, there was no picture of it, and Doug listed only 17 terms (which, until I was writing this, was all I assumed he had computed):

17 Q-sequence terms

So now nestedly recursive sequences were out in the open—in what quickly became a very popular book. But I don’t think many people noticed them there (though, as I’ll discuss, I did). Gödel, Escher, Bach is primarily a playful book focused on exposition—and not the kind of place you’d expect to find a new, mathematical-style result.

Still—quite independent of the book—Neil Sloane showed Doug’s 1977 letter to his Bell Labs colleague Ron Graham, who within a year made a small mention of the Q sequence in a staid academic math publication (in a characteristic “state-it-as-a-problem” Erdös form):

Erdös and Graham math paper

Erdös and Graham math paper continued

There was a small and tight-knit circle of serious mathematicians—essentially all of whom, as it happens, I personally knew—who would chase these kinds of easy-to-state-but-hard-to-solve problems. Another was Richard Guy, who soon included the sequence as part of problem E31 in his Unsolved Problems in Number Theory, and mentioned it again a few years later.

But for most of the 1980s little was heard about the sequence. As it later turns out, a senior British applied mathematician named Brian Conolly (who wasn’t part of the aforementioned tight-knit circle) had—presumably as a kind of hobby project—made some progress, and in 1986 had written to Guy about it. Guy apparently misplaced the letter, but later told Conolly that John Conway and Sol Golomb had worked on similar things.

Conway presumably got the idea from Hofstadter’s work (though he had a habit of obfuscating his sources). But in any case, on July 15, 1988, Conway gave a talk at Bell Labs entitled “Some Crazy Sequences” (note the word “crazy”, just like in Hofstadter’s letter to Sloane) in which he discussed the regular-enough-to-be-mathematically-interesting sequence (which we’re calling G3111 here):

Despite its visual regularity, Conway couldn’t mathematically prove certain features of the wiggles in the sequence—and in his talk offered a $10,000 prize for anyone who could. By August a Bell Labs mathematician named Colin Mallows had done it. Conway claimed (later to be contradicted by video evidence) that he’d only offered $1000—and somehow the whole affair landed as a story in the August 30 New York Times under the heading “Intellectual Duel: Brash Challenge, Swift Response”. But in any case, this particular nestedly recursive sequence became known as “Conway’s Challenge Sequence”.

So what about Sol Golomb? It turns out he’d started writing a paper—though never finished it:

Discrete Chaos paper

Discrete Chaos paper continued

He’d computed 280 terms of the Q sequence (he wasn’t much of a computer user) and noticed a few coincidences. But he also mentioned another kind of nestedly recursive sequence, no doubt inspired by his work on feedback shift registers:

As he noted, the behavior depends greatly on the initial conditions, though is always eventually periodic—with his student Unjeng Cheng having found long-period examples.

OK, so by 1988 nestedly recursive functions had at least some notoriety. So what happened next? Not so much. There’s a modest academic literature that’s emerged over the last few decades, mostly concentrated very specifically around “Conway’s Challenge Sequence”, Hofstadter’s Q function, or very similar “meta Fibonacci” generalizations of them. And so far as I know, even the first published large-scale picture of the Q sequence only appeared in 1998 (though I had pictures of it many years earlier):

Klaus Pinn Q-sequence paper

Klaus Pinn Q-sequence paper continued

Why wasn’t more done with nestedly recursive functions? At some level it’s because they tend to have too much computational irreducibility—making it pretty difficult to say much about them in traditional mathematical ways. But perhaps more important, studying them broadly is really a matter of ruliology: it requires the idea of exploring spaces of rules, and of expecting the kinds of behavior and phenomena that are characteristic of systems in the computational universe. And that’s something that’s still not nearly as widely understood as it should be.

My Personal Story with Nestedly Recursive Functions

I think 1979 was the year when I first took recursion seriously. I’d heard about the Fibonacci sequence (though not under that name) as a young child a decade earlier. I’d implicitly (and sometimes explicitly) encountered recursion (sometimes through error messages!) in computer algebra systems I’d used. In science, I’d studied fractals quite extensively (Benoit Mandelbrot’s book having appeared in 1977), and I’d been exposed to things like iterated maps. And I’d quite extensively studied cascade processes, notably of quarks and gluons in QCD.

As I think about it now, I realize that for several years I’d written programs that made use of recursion (and I had quite a lot of exposure to LISP, and the culture around it). But it was in 1979—having just started using C—that I first remember writing a program (for doing percolation theory) where I explicitly thought “this is using recursion”. But then, in late 1979, I began to design SMP (“Symbolic Manipulation Program”), the forerunner of the modern Wolfram Language. And in doing this I quickly solidified my knowledge of mathematical logic and the (then-fledgling) field of theoretical computer science.

My concept of repeated transformations for symbolic expressions—which is still the core of Wolfram Language today—is somehow fundamentally recursive. And by the time we had the first signs of life for our SMP system, Fibonacci was one of our very first tests. We soon tried the Ackermann function too. And in 1980 I became very interested in the problem of evaluation order, particularly for recursive functions—and the best treatment I found of it (though at the time not very useful to me) was in none other than the book by Zohar Manna that I mentioned above. (In a strange twist, I was at that time also studying gauge choices in physics—and it was only last year that I realized that they’re fundamentally the same thing as evaluation orders.)

It was soon after it came out in 1979 that I first saw Douglas Hofstadter’s book. At the time I wasn’t too interested in its Lewis-Carroll-like aspects, or its exposition; I just wanted to know what the “science meat” in it was. And somehow I found the page about the Q sequence, and filed it away as “something interesting”.

I’m not sure when I first implemented the Q sequence in SMP, but by the time we released Version 1.0 in July 1981, there it was: an external package (hence the “X” prefix) for evaluating “Hofstadter’s recursive function”, elegantly using memoization—with the description I gave saying (presumably because that’s what I’d noticed) that its values “have several properties of randomness”:

Hofstadter recursive function

Firing up a copy of SMP today—running on a virtual machine that still thinks it’s 1986—I can run this code, and easily compute the function:

SMP evaluation

I can even plot it—though without an emulator for a 1980s-period storage-tube display, only the ASCIIfied rendering works:

ASCIIfied rendering

So what did I make of the function back in 1981? I was interested in how complexity and randomness could occur in nature. But at the time, I didn’t have enough of a framework to understand the connection. And, as it was, I was just starting to explore cellular automata, which seemed a lot more “nature like”—and which soon led me to things like rule 30 and the phenomenon of computational irreducibility.

Still, I didn’t forget the Q sequence. And when I was building Mathematica I again used it as a test (the .tq file extension came from the brief period in 1987 when we were trying out “Technique” as the name of the system):

Combinatorial functions

Combinatorial functions continued

When Mathematica 1.0 was released on June 23, 1988, the Q sequence appeared again, this time as an example in the soon-in-every-major-bookstore Mathematica book:

Q sequence in Mathematica book

Q sequence in Mathematica book continued

I don’t think I was aware of Conway’s lecture that occurred just 18 days later. And for a couple of years I was consumed with tending to a young product and a young company. But by 1991, I was getting ready to launch into basic science again. Meanwhile, the number theorist (and today horologist) Ilan Vardi—yet again from Stanford—had been working at Wolfram Research and writing a book entitled Computational Recreations in Mathematica, which included a long section on the analysis of Takeuchi’s nested recursive function (“TAK”). My email archive records an exchange I had with him:

Wolfram–Vardi email

He suggested a “more symmetrical” nested recursive function. I responded, including a picture that made it fairly clear that this particular function would have only nested behavior, and not seem “random”:

Wolfram–Vardi followup email

Nested recursive function graphic

By the summer of 1991 I was in the thick of exploring different kinds of systems with simple rules, discovering the remarkable complexity they could produce, and filling out what became Chapter 3 of A New Kind of Science: “The World of Simple Programs”. But then came Chapter 4: “Systems Based on Numbers”. I had known since the mid-1980s about the randomness in things like digit sequences produced by successive arithmetic operations. But what about randomness in pure sequences of integers? I resolved to find out just what it would take to produce randomness there. And so it was that on August 13, 1993, I came to be enumerating possible symbolic forms for recursive functions—and selecting ones that could generate at least 10 terms:

Symbolic forms for recursive functions

As soon as I plotted the “survivors” I could see that interesting things were going to happen:

Recursive function graphs

Was this complexity somehow going to end? I checked out to 10 million terms. And soon I started collecting my “prize specimens” and making a gallery of them:

Recursive functions gallery

I had some one-term recurrences, and some two-term ones. Somewhat shortsightedly I was always using “Fibonacci-like” initial conditions f[1] = f[2] = 1—and I rejected any recurrence that tried to “look back” to f[0], f[–1], etc. And at the time, with this constraint, I only found “really interesting” behavior in two-term recurrences.

In 1994 I returned briefly to recursive sequences, adding a note “solving” a few of them, and discussing the evaluation graphs of others:

Properties of sequences

Evaluation graphs

When I finally finished A New Kind of Science in 2002, I included a list of historical “Close approaches” to its core discoveries, one of them being Douglas Hofstadter’s work on recursive sequences:

Douglas Hofstadter work on recursive sequences

In retrospect, back in 1981 I should have been able to take the “Q sequence” and recognize in it the essential “rule 30 phenomenon”. But as it was, it took another decade—and many other explorations in the computational universe—for me to build up the right conceptual framework to see this. In A New Kind of Science I studied many kinds of systems, probing them far enough, I hoped, to see their most important features. But recursive functions were an example where I always felt there was more to do; I felt I’d only just scratched the surface.

In June 2003—a year after A New Kind of Science was published—we held our first summer school. And as a way to introduce methodology—and be sure that people knew I was fallible and approachable—I decided on the first day of the summer school to do a “live experiment”, and try to stumble my way to discovering something new, live and in public.

A few minutes before the session started, I picked the subject: recursive functions. I began with some examples I knew. Then it was time to go exploring. At first lots of functions “didn’t work” because they were looking back too far. But then someone piped up “Why not just say that f[n] is 1 whenever n isn’t a positive integer?” Good idea! And very easy to try.

Soon we had the “obvious” functions written (today Apply[Plus, ...] could be just Total[...], but otherwise there’s nothing “out of date” here):
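As a rough indication of what that looked like, here is a minimal sketch (an assumption-based reconstruction, not the original 2003 code; the helper name makeRecurrence and the particular sum-of-nested-look-backs form are just illustrative):

(* a minimal sketch of the setup described above: f[n] is taken to be 1 whenever *)
(* n isn't a positive integer, and otherwise sums k nested "look back" terms     *)
makeRecurrence[k_Integer] := (
  Clear[f];
  f[n_] /; ! (IntegerQ[n] && n > 0) := 1;
  f[n_] := f[n] = Apply[Plus, Table[f[n - f[n - i]], {i, k}]]
)

makeRecurrence[2];
Table[f[n], {n, 1, 20}]

Since every look-back argument n - f[n - i] is smaller than n, and anything that isn't a positive integer just evaluates to 1, recursions like this are guaranteed to bottom out.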

In a typical story of Wolfram-Language-helps-one-think-clearly, the obvious function was also very general, and allowed a recurrence with any number of terms. So why not start with just one term? And immediately, there it was—what we’re now calling T311:

T311

And then a plot (yes, after Version 6 one didn’t need the Show or the trailing “;”):

RSValues plot
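In the same spirit, here is a sketch of the kind of plot being referred to (the list name rsValues just echoes the caption above; in the 2003 notebook, as noted, this would have been wrapped in Show[ListPlot[...]] with a trailing semicolon):

rsValues = Table[f[n], {n, 1, 2000}];
ListPlot[rsValues]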

Of course, as is the nature of computational constructions, this is something timeless—that looks the same today as it did 21 years ago (well, except that now our plots display with color by default).

I thought this was a pretty neat discovery. And I just couldn’t believe that years earlier I’d failed to see the obvious generalization of having “infinite” initial conditions.

The next week I did a followup session, this time talking about how one would write up a discovery like this. We started off with possible titles (including audience suggestions):

Suggested titles

And, yes, the first title listed is exactly the one I’ve now used here. In the notebook I created back then, there were first some notes (some of which should still be explored!):

Title notes

Three hours later (on the afternoon of July 11, 2003) there’s another notebook, with the beginnings of a writeup:

Initial recursive functions writeup

By the way, another thing we came up with at the summer school was the (non-nestedly) recursive function:

Plotting g[n + 1] – g[n] gives:

And, yes, bizarrely (and reminiscent of McCarthy’s 91-function) for n ≥ 396, g[n + 1] – g[n] is always 97, and g[n] = 38606 + 97 (n – 396).

But in any case, a week or so after my “writeups” session, the summer school was over. In January 2004 I briefly picked the project up again, and made some pictures that, yes, show interesting structure that perhaps I should investigate now:

f[n - f[n - 1]]

In the years that followed, I would occasionally bring nestedly recursive functions out again—particularly in interacting with high school and other students. And at our summer programs I suggested nestedly-recursive-function projects to a number of students.

In 2008, they seemed like an “obvious interesting thing” to add to our Demonstrations Project:

NKS summer school live experiment

But mostly, they languished. Until, that is, my burst of “finish this” intellectual energy that followed the launch of our Physics Project in 2020. So here now, finally, after a journey of 43 years, I feel like I’ve been able to do some justice to nestedly recursive functions, and to provide a bit more illumination to yet another corner of the computational universe.

(Needless to say, there are many, many additional questions and issues to explore. Different primitives, e.g. Mod, Floor, etc. Multiple functions that refer to each other. Multiway cases. Functions based on rational numbers. And endless potential approaches to analysis, identifying pockets of regularity and computational reducibility.)

Thanks

Thanks to Brad Klee for extensive help. Thanks also to those who’ve worked on nestedly recursive functions as students at our summer programs over the years, including Roberto Martinez (2003), Eric Rowland (2003), Chris Song (2021) and Thomas Adler (2024). I’ve benefitted from interactions about nestedly recursive functions with Ilan Vardi (1991), Tal Kubo (1993), Robby Villegas (2003), Todd Rowland (2003 etc.), Jim Propp (2004), Matthew Szudzik (2005 etc.), Joseph Stocke (2021 etc.), Christopher Gilbert (2024) and Max Niedermann (2024). Thanks to Doug Hofstadter for extensive answers to questions about history for this piece. It’s perhaps worth noting that I’ve personally known many of the people mentioned in the history section here (with the dates I met them indicated): John Conway (1984), Paul Erdős (1986), Sol Golomb (1981), Ron Graham (1983), Benoit Mandelbrot (1986), John McCarthy (1981) and Neil Sloane (1983).

Bibliography of Nestedly Recursive Functions »

Five Most Productive Years: What Happened and What’s Next

2024-08-30 00:31:46


So… What Happened?

Today is my birthday—for the 65th time. Five years ago, on my 60th birthday, I did a livestream where I talked about some of my plans. So… what happened? Well, what happened was great. And in fact I’ve just had the most productive five years of my life. Nine books. 3939 pages of writings (1,283,267 words). 499 hours of podcasts and 1369 hours of livestreams. 14 software product releases (with our great team). Oh, and a bunch of big—and beautiful—ideas and results.

It’s been wonderful. And unexpected. I’ve spent my life alternating between technology and basic science, progressively building a taller and taller tower of practical capabilities and intellectual concepts (and sharing what I’ve done with the world). Five years ago everything was going well, and making steady progress. But then there were the questions I never got to. Over the years I’d come up with a certain number of big questions. And some of them, within a few years, I’d answered. But others I never managed to get around to.

And five years ago, as I explained in my birthday livestream, I began to think “it’s now or never”. I had no idea how hard the questions were. Yes, I’d spent a lifetime building up tools and knowledge. But would they be enough? Or were the questions just not for our time, but only perhaps for some future century?

At several points before in my life I’d faced such issues—and things had worked out well (A New Kind of Science, Wolfram|Alpha, etc.). And from this, I had gotten a certain confidence about what might be possible. In addition, as a serious student of intellectual history, I had a sense of what kind of boldness was needed. Five years ago there wasn’t really anything that made me need to do something big and new. But I thought: “What the heck. I might as well try. I’ll never know what’s possible unless I try.”

A major theme of my work since the early 1980s had been exploring the consequences of simple computational rules. And I had found the surprising result that even extremely simple rules could lead to immensely complex behavior. So what about the universe? Could it be that at a fundamental level our whole universe is just following some simple computational rule?

I had begun my career in the 1970s as a teenager studying the frontiers of existing physics. And at first I couldn’t see how computational rules could connect to what is known in physics. But in the early 1990s I had an idea, and by the late 1990s I had developed it and gotten some very suggestive results. But when I published these in A New Kind of Science in 2002, even my friends in the physics community didn’t seem to care—and I decided to concentrate my efforts elsewhere (e.g. building Wolfram|Alpha, Wolfram Language, etc.).

But I didn’t stop thinking “one day I need to get back to my physics project”. And in 2019 I decided: “What the heck. Let’s try it now.” It helped that I’d made a piece of technical progress the year before, and that now two young physicists were enthusiastic to work with me on the project.

And so it was, soon after my birthday in 2019, that we embarked on our Physics Project. It was a mixture of computer experiments and big concepts. But before the end of 2019 it was clear: it was going to work! It was an amazing experience. Thing after thing in physics that had always been mysterious I suddenly understood. And it was beautiful—a theory of such strength built on a structure of such incredible simplicity and elegance.

We announced what we’d figured out in April 2020, right when the pandemic was in full swing. There was still much to do (and there still is today). But the overall picture was clear. I later learned that a century earlier many well-known physicists were beginning to think in a similar direction (matter is discrete, light is discrete; space must be too), but back then they hadn’t had the computational paradigm or the other tools needed to move this forward. And now the responsibility had fallen on us to do this. (Pleasantly enough, given our framework, many modern areas of mathematical physics seemed to fit right in.)

And, yes, figuring out the basic “machine code” for our universe was of course pretty exciting. But seeing an old idea of mine blossom like this had another very big effect on me. It made me think: “OK, what about all those other projects I’ve been meaning to do? Maybe it’s time to do those too.”

And something else had happened as well. In doing the Physics Project we’d developed a new way of thinking about things—not just computational, but “multicomputational”. And actually, the core ideas behind this were in A New Kind of Science too. But somehow I’d never taken them seriously enough before, and never extended my intuition to encompass them. But now with the Physics Project I was doing this. And I could see that the ideas could also go much further.

So, yes, I had a new and powerful conceptual framework for doing science. And I had all the technology of the modern Wolfram Language. But in 2020 I had another thing too—in effect, a new distribution channel for my ideas and efforts. Early in my career I had used academic papers as my “channel” (at one point in 1979 even averaging a paper every few weeks). But in the late 1980s I had a very different kind of channel: embodying my ideas in the design and implementation of Mathematica and what’s now the Wolfram Language. Then in the 1990s I had another channel: putting everything together into what became my book A New Kind of Science.

After that was published in 2002 I would occasionally write small posts—for the community site around the science in my book, for our corporate blog, etc. And in 2010 I started my own blog. At first I mostly just wrote small, fun pieces. But by 2015—partly driven by telling historical stories (200th anniversary of George Boole, 200th anniversary of Ada Lovelace, …)—the things I was writing were getting ever meatier. (There’d actually already been some meaty ones about personal analytics in 2012.)

And by 2020 my pattern was set and I would routinely write 50+ page pieces, full of pictures (all with immediately runnable “click-to-copy” code) and intended for anyone who cared to read them. Finally I had a good channel again. And I started using it. As I’d found over the years—whether with language documentation or with A New Kind of Science—the very act of exposition was a critical part of organizing and developing my ideas.

And now I started producing pieces. Some were directly about specific topics around the Physics Project. But within two months I was already writing about a “spinoff”: “Exploring Rulial Space: The Case of Turing Machines”. I had realized that one of the places the ideas of the Physics Project should apply was to the foundations of mathematics, and to metamathematics. In a footnote to A New Kind of Science I had introduced the idea of “empirical metamathematics”. And in the summer of 2020, fuelled by my newfound “finish those old projects” mindset, I ended up writing an 80-page piece on “The Empirical Metamathematics of Euclid and Beyond”.

December 7, 1920 was the date a certain Moses Schönfinkel introduced what we now call combinators: the very first clear foundations for universal computation. I had always found combinators interesting (if hard to understand). I had used ideas from them back around 1980 in the predecessor of what’s now the Wolfram Language. And I had talked about them a bit in A New Kind of Science. But as the centenary approached, I decided to make a more definitive study, in particular using methods from the Physics Project. And, for good measure, even in the middle of the pandemic I tracked down the mysterious history of Moses Schönfinkel.

In March 2021, there was another centenary, this time of Emil Post’s tag system, and again I decided to finish what I’d started in A New Kind of Science, and write a definitive piece, this time running to about 75 pages.

One might have thought that the excursions into empirical metamathematics, combinators, tag systems, rulial and multiway Turing machines would be distractions. But they were not. Instead, they just deepened my understanding and intuition for the new ideas and methods that had come out of the Physics Project. As well as finishing projects that I’d wondered about for decades (and the world had had open for a century).

Perhaps not surprisingly given its fundamental nature, the Physics Project also engaged with some deep philosophical issues. People would ask me about them with some regularity. And in March 2021 I started writing a bit about them, beginning with a piece on consciousness. The next month I wrote “Why Does the Universe Exist? Some Perspectives from Our Physics Project”. (This piece of writing happened to coincide with the few days in my life when I’ve needed to do active cryptocurrency trading—so I was in the amusing position of thinking about a philosophical question about as deep as they come, interspersed with making cryptocurrency trades.)

Everything kept weaving together. These philosophical questions made me internalize just how important the nature of the observer is in our Physics Project. Meanwhile I started thinking about the relationship of methods from the Physics Project to distributed computing, and to economics. And in May 2021 that intersected with practical blockchain questions, which caused me to write about “The Problem of Distributed Consensus”—which would soon show up again in the science and philosophy of observers.

The fall of 2021 involved really leaning into the new multicomputational paradigm, among other things giving a long list of where it might apply: metamathematics, chemistry, molecular biology, evolutionary biology, neuroscience, immunology, linguistics, economics, machine learning, distributed computing. And, yes, in a sense this was my “to do” list. In many ways, half the battle was just defining this. And I’m happy to say that just three years later, we’ve already made a big dent in it.

While all of this was going on, I was also energetically pursuing my “day job” as CEO of Wolfram Research. Version 12.1 of the Wolfram Language had come out less than a month before the Physics Project was announced. Version 12.2 right after the combinator centenary. And in 2021 there were two new versions. In all, 635 new functions, all of which I had carefully reviewed, and many of which I’d been deeply involved in designing.

It’s a pattern in the history of science (as well as technology): some new methodology or some new paradigm is introduced. And suddenly vast new areas are opened up. And there’s lots of juicy “low-hanging fruit” to be picked. Well, that’s what had happened with the ideas from our Physics Project, and the concept of multicomputation. There were many directions to go, and many people wanting to get involved. And in 2021 it was becoming clear that something organizational had to be done: this wasn’t a job for a company (even for one as terrific and innovative as ours is), it was a job for something like an institute. (And, yes, in 2022, we indeed launched what’s now the Wolfram Institute for Computational Foundations of Science.)

But back in 1986, I had started the very first institute concentrating on complexity and how it could arise from simple rules. Running it hadn’t been a good fit for me back then, and very quickly I started our company. In 2002, when A New Kind of Science was published, I’d thought again about starting an institute. But it didn’t happen. But now there really seemed to be no choice. I started reflecting on what had happened to “complexity”, and whether there was something to leverage from the institutional structure that had grown up around it. Nearly 20 years after the publication of A New Kind of Science, what should “complexity” be now?

I wrote “Charting a Course for ‘Complexity’: Metamodeling, Ruliology and More”—and in doing so, finally invented a word for the “pure basic science of what simple rules do”: ruliology.

My original framing of what became our Physics Project had been to try to “find a computational rule that gives our universe”. But I’d always found this unsatisfying. Because even if we had the rule, we’d still be left asking “why this one, and not another?” But in 2020 there’d been a dawning awareness of a possible answer.

Our Physics Project is based on the idea of applying rules to abstract hypergraphs that represent space and everything in it. But given a particular rule, there are in general many ways it can be applied. And a key idea in our Physics Project is that somehow it’s always applied in all these ways—leading to many separate threads of history, that branch and merge—and, importantly, giving us a way to understand quantum mechanics.

We talked about these different threads of history corresponding to different places in branchial space—and about how the laws of quantum mechanics are the direct analogs in branchial space (or branchtime) of the laws of classical mechanics (and gravity) in physical space (or spacetime). But what if instead of just applying a given rule in all possible ways, we applied all possible rules in all possible ways?

What would one get? In November 2021 I came up with a name for it: the ruliad. A year and a half earlier I’d already been starting to talk about rulial space—and the idea of us as observers perceiving the universe in terms of our particular sampling of rulial space. But naming the ruliad really helped to crystallize the concept. And I began to realize that I had come upon a breathtakingly broad intellectual arc.

The ruliad is the biggest computational thing there can be: it’s the entangled limit of all possible computations. It’s abstract and it’s unique—and it’s as inevitable in its structure as 2 + 2 = 4. It encompasses everything computational—including us. So what then is physics? Well, it’s a description of how observers like us embedded in the ruliad perceive the ruliad.

Back in 1984 I’d introduced what I saw as the very central concept of computational irreducibility: the idea that there are many computational processes whose outcomes can be found only by following them step by step—with no possibility of doing what mathematical science was used to doing, namely “jumping ahead” and making predictions without going through each step. At the beginning of the 1990s, when I began to work on A New Kind of Science, I’d invented the Principle of Computational Equivalence—the idea that systems whose behavior isn’t obviously simple will always tend to be equivalent in the sophistication of the computations they do.

Given the Principle of Computational Equivalence, computational irreducibility was inevitable. It followed from the fact that the observer could only be as computationally sophisticated as the system they were observing, and so would never be able to “jump ahead” and shortcut the computation. There’d come to be a belief that eventually science would always let one predict (and control) things. But here—from inside science—was a fundamental limitation on the power of science. All these things I’d known in some form since the 1980s, and with clarity since the 1990s.

But the ruliad took things to another level. For now I could see that the very laws of physics we know were determined by the way we are as observers. I’d always imagined that the laws of physics just are the way they are. But now I realized that we could potentially derive them from the inevitable structure of the ruliad, and very basic features of what we’re like as observers.

I hadn’t seen this philosophical twist coming. But somehow it immediately made sense. We weren’t getting our laws of physics from nothing; we were getting them from being the way we are. Two things seemed to be critical: that as observers we are computationally bounded, and that (somewhat relatedly) we believe we are persistent in time (i.e. we have a continuing thread of experience through time).

But even as I was homing in on the idea of the ruliad as it applied to physics, I was also thinking about another application: the foundations of mathematics. I’d been interested in the foundations of mathematics for a very long time; in fact, in the design of Mathematica (and what’s now the Wolfram Language) and its predecessor SMP, I’d made central use of ideas that I’d developed from thinking about the foundations of mathematics. And in A New Kind of Science, I’d included a long section on the foundations of mathematics, discussing things like the network of all possible theorems, and the space of all possible axiom systems.

But now I was developing a clearer picture. The ruliad represented not only all possible physics, but also all possible mathematics. And the actual mathematics that we perceive—like the actual physics—would be determined by our nature as observers, in this case mathematical observers. There were lots of technical details, and it wasn’t until March 2022 that I published “The Physicalization of Metamathematics and Its Implications for the Foundations of Mathematics”.

In some ways this finished what I’d started in the mid-1990s. But it went much further than I expected, in particular in providing a sweeping unification of the foundations of physics and mathematics. It talked about what the ultimate limit of mathematics would be like. And it talked about how “human-level mathematics”—where we can discuss things like the Pythagorean theorem rather than just the microdetails of underlying axioms—emerges for observers like us just like our human-level impression of physical space emerges from the underlying network of atoms of space.

One of the things I’d discovered in computational systems is how common computational irreducibility is, along with undecidability. And I had always wondered why undecidability wasn’t more common in typical mathematics. But now I had an answer: it just isn’t what mathematical observers like us “see” in the ruliad. At some level, this was a very philosophical result. But for me it also had practical implications, notably greatly validating the idea of using higher-level computational language to represent useful human-level mathematics, rather than trying to drill down to “axiomatic machine code”.

October 22, 2021 had marked a third of a century of Mathematica. And May 14, 2022 was the 20th anniversary of A New Kind of Science. And in contextualizing my activities, and planning for the future, I’ve increasingly found it useful to reflect on what I’ve done before, and how it’s worked out. And in both these cases I could see that seeds I’d planted many years earlier had blossomed, sometimes in ways I’d suspected they might, and sometimes in ways that far exceeded what I’d imagined.

What had I done right? The key, it seemed, was drilling down to find the essence of things, and then developing that. Even if I hadn’t been able to imagine quite what could be built on them, I’d been able to construct solid foundations, that successfully encapsulated things in the cleanest and simplest ways.

In talking about observers and the ruliad—and in fact our Physics Project in general—I kept on making analogies to the way that the gas laws and fluid dynamics emerge from the complicated underlying dynamics of molecules. And at the core of this is the Second Law of thermodynamics.

Well, as it happens, the very first foundational question in physics that I ever seriously studied was the origin of the Second Law. But that was when I was 12 years old, in 1972. For more than a century the Second Law had been quite mysterious. But when I discovered computational irreducibility in 1984 I soon realized that it might be the key to the Second Law. And in the summer of 2022—armed with a new perspective on the importance of observers—I decided I’d better once and for all write down how the Second Law works.

Once again, there were lots of technical details. And as a way to check my ideas I decided to go back and try to untangle the rather confused 150-year history of the Second Law. It was an interesting exercise, satisfying for seeing how my new ways of thinking clarified things, but cautionary in seeing how wrong turns had been taken—and solidified—in the past. But in the end, there it was: the Second Law was a consequence of the interplay between underlying computational irreducibility, and our limitations as observers.

It had taken half a century, but finally I had finished the project I’d started when I was 12 years old. I was on a roll finishing things. But I was also realizing that a bigger structure than I’d ever imagined was emerging. The Second Law project completed what I think is the most beautiful thing I’ve ever discovered. That all three of the core theories of twentieth century physics—general relativity, quantum mechanics and the Second Law (statistical mechanics)—have the same origin: the interplay between the underlying computational structure of the ruliad, and our characteristics and limitations as observers.

And I knew it didn’t stop there. I’d already applied the same kind of thinking to the foundations of mathematics. And I was ready to start applying it to all sorts of deep questions in science, in philosophy, and beyond. But at the end of 2022, just as I was finishing my pieces about the Second Law, there was a surprise: ChatGPT.

I’d been following AI and neural nets for decades. I first simulated a neural net in 1981. My first company, started in 1981, had, to my chagrin, been labeled an “AI company”. And from the early 2010s we’d integrated neural nets into the Wolfram Language. But—like the creators of ChatGPT—I didn’t expect the capabilities that emerged in ChatGPT. And as soon as I saw ChatGPT I started trying to understand it. What was it really doing? What would its capabilities be?

In the world at large, there was a sense of shock: if AI can do this now, soon it’ll be able to do everything. But I immediately thought about computational irreducibility. Yes, it places limitations on us humans. But those limitations would inevitably apply to AIs as well. There would be things that couldn’t be “quickly figured out by pure thought”—by humans and AIs alike. And, by the way, I’d just spent four decades building a way to represent things computationally, and actually do systematic computations on them—because that was the point of the Wolfram Language.

So immediately I could see we were in a very interesting position. The Wolfram Language had the unique mission of creating a full-scale computational language. And now this was a crucial tool for AIs. The AIs could provide a very interesting and useful broad linguistic interface. But when it came to solid computation, they were—like humans—going to need a tool. Conveniently, Wolfram|Alpha already communicated in natural language. And it took only a few weeks to hook up Wolfram|Alpha—and Wolfram Language—to ChatGPT. We’d given “computational superpowers” to the AI.

ChatGPT was everywhere. And people kept asking me about it. And over and over again I ended up explaining things about it. So at the beginning of February 2023 I decided it’d be better for me just to write down once and for all what I knew. It took a little over a week (yes, I’m a fast writer)—and then I had an “explainer” (that ran altogether to 76 pages) of ChatGPT.

Partly it talked in general about how machine learning and neural nets work, and how ChatGPT in particular works. But what a lot of people wanted to know was not “how” but “why” ChatGPT works. Why was something like that possible? Well, in effect ChatGPT was showing us a new science discovery—about language. Everyone knows that there’s a certain syntactic grammar of language—like that, in English, sentences typically have the form noun-verb-noun. But what ChatGPT was showing us is that there’s also a semantic grammar—some pattern of rules for what words can be put together and make sense.

I’ve thought about the foundations of language for a long time (which isn’t too surprising, given the four decades I’ve spent as a computational language designer). So in effect I was well primed to think about its interaction with ChatGPT. And it also helped that—as I’ll talk about below—one of my long-unfinished projects is precisely on a formal framework for capturing meaning that I call “symbolic discourse language”.

In technology and other things I always like best situations where basically nothing is known, and one has to invent everything from scratch. And that’s what was happening for functionality based on LLMs in the middle of 2023. How would LLM-based Wolfram Language functions work? How would a prompt repository work? How would LLMs interact with notebooks?

Meanwhile, there was still lots of ferment in the world about the “AI shock”. Before the arrival of the Physics Project in 2019, I’d been quite involved in AI philosophy, AI ethics, etc. And in March 2023 I wrote a piece on “Will AIs Take All Our Jobs and End Human History—or Not?” In the end—after all sorts of philosophical arguments, and an analysis of actual historical data—the answer was: “It’s Complicated”. But along the way computational irreducibility and the ruliad were central elements: limiting the controllability of AIs, allowing for an infinite frontier of invention, and highlighting the inevitable meaninglessness of everything in the absence of human choice.

By this point (and actually, with remarkable speed) my explainer on ChatGPT had turned into a book that proved extremely popular (and now, for example, exists in over 10 languages). It was nice that people found the book useful—and perhaps it helped remove some of the alarming mystique of AI. But I couldn’t help noticing that of all the many things I’d written, this had been one of the fastest to write, yet it was garnering one of the largest readerships.

One might have imagined that AI was pretty far from our Physics Project, the ruliad, etc. But actually it soon became clear that there were close connections, and that there were things to learn in both directions. In particular, I’d come to think of minds that work in different ways as occupying different positions in the ruliad. But how could one get intuition about what such minds would experience—or observe? Well, I realized, one could just look at generative AI. In July I wrote “Generative AI Space and the Mental Imagery of Alien Minds”. I called this the “cats in hats piece”, because, yes, it has lots of pictures of (often bizarrely distorted) cats (in hats)—used as examples of what happens if one moves a mind around in rulial space. But despite the whimsy of the cats, this piece provided a surprisingly useful window into what for me has been a very longstanding question of how other minds might perceive things.

And this fed quite directly into my piece on “Observer Theory” in December 2023. Ever since things like Turing machines we’ve had a formal model for the process of computation. My goal was to do the same kind of thing for the process of observation. In a sense, computation constructs sequences of new things, say with time. Observation, on the other hand, equivalences things together, so they fit in finite minds. And just what equivalencing is done—by our senses, our measuring devices, our thinking—determines what our ultimate perceptions will be. Or, put another way, if we can characterize well enough what we’re like as observers, it’ll show us how we sample the ruliad, and what we’ll perceive the laws of physics to be.

When I started the Physics Project I wasn’t counting on it having any applications for hundreds of years. But quite soon it became clear that actually there were going to be all sorts of near-term applications, particularly of the formalism of multicomputation. And every time one used that formalism one could get more intuition about features of the Physics Project, particularly related to quantum mechanics. I ended up writing a variety of “ruliological” pieces, all, as it happens, expanding on footnotes in A New Kind of Science. There was “Multicomputation with Numbers” (October 2021), “Games and Puzzles as Multicomputational Systems” (June 2022) and “Aggregation and Tiling as Multicomputational Processes” (November 2023). And in September 2023 there was also “Expression Evaluation and Fundamental Physics”.

Back around 1980—when I was working on SMP—I’d become interested in the theory of expression evaluation. And finally, now, with the Physics Project—and my work on combinators and metamathematics—four decades later I had a principled way to study it (potentially with immediate applications in distributed computing and computational language design). And I could check off progress on another long-pending project.

I give many talks, and do many podcasts and livestreams—essentially all unprepared. But in October 2023 I agreed to give a TED talk. And I just didn’t see any way to fit a reasonable snapshot of my activities into 18 minutes without preparation. How was I to coherently explain the Physics Project, the ruliad and computational language in such a short time? I called the talk “How to Think Computationally about AI, the Universe and Everything”. And I began with what for me was a new condensation: “Human language. Mathematics. Logic. These are all ways to formalize the world. And in our century there’s a new and yet more powerful one: computation.”

Over the years I’d done all sorts of seemingly very different projects in science and in technology. But somehow it seemed like they were now all converging. Back in 1979, for example, I’d invented the idea of transformations for symbolic expressions as a foundation for computational language. But now—more than four decades later—our Physics Project was saying that those kinds of transformations (specifically on hypergraphs) were just what the “machine code of the universe” was made of.

Since the 1980s I’d thought that computation was a useful paradigm with which to think about the world. But now our Physics Project and the ruliad were saying that it wasn’t just useful; it was the underlying paradigm of the world. For some time I’d been viewing our whole Wolfram Language effort as a way to formalize computation for the purposes of both humans and machines. Four hundred years ago mathematical notation had streamlined mathematical thinking, allowing what became the mathematical sciences to develop. I saw what we were doing with our computational language as a way to streamline computational thinking, and allow “computational X” for all fields “X” to develop.

I began to see computational thinking as a way to “humanize” the ruliad; to pick out those parts that are meaningful to humans. And I began to see computational language as the bridge between the power of raw computation, and the kinds of things we humans think about.

But how did AI fit in? At the beginning of 2024, lots of people were still asking in effect “Can AI Solve Science?” So I decided to analyze that. I certainly didn’t expect AI to be able to “break computational irreducibility”. And it didn’t. Yes, it could automate much of what humans could do in a quick look. But formalized, irreducible computation: that was going to need computational language, not AI.

It’s easy to be original in the computational universe: if you pick a rule at random, it’s overwhelmingly likely nobody’s ever looked at it before. But will anyone care? They’ll care if in effect that part of the ruliad has been “colonized”; if there’s already a human connection to it. But what if you define some attribute that you want, then just “search out there” for a rule that exhibits it? That’s basically what biological evolution—or machine learning training—seems to do.
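As a trivial concrete illustration of what “picking a rule at random” can mean, here is a sketch in Wolfram Language (the particular rule space, with k = 2 colors and range r = 2, is an arbitrary choice):

(* pick a cellular automaton rule "at random" from the k = 2, r = 2 rule space *)
(* and run it for 200 steps from a single-cell seed                            *)
rule = RandomInteger[{0, 2^32 - 1}];
ArrayPlot[CellularAutomaton[{rule, 2, 2}, {{1}, 0}, 200]]

Searching for a rule with some desired attribute then amounts to repeatedly sampling like this and keeping only the rules that pass whatever test one has defined.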

And as a kind of off-hand note I decided to just see if I could make a minimal model for that. I’d tried before—in the mid-1980s. And in the 1990s when I was writing A New Kind of Science I’d become convinced that computational irreducibility was in a sense a stronger force than adaptive evolution, and that when complex behavior was seen in biology, it was computational irreducibility that should take most of the credit.

But I decided to just do the experiment and see. And although computational irreducibility in a sense tells one to always “expect the unexpected”, in all these years I’ve never fully come to terms with that—and I’m still regularly surprised by what simple systems somehow “cleverly” manage to do. And so it was with my minimal model of biological evolution.

I’d always wondered why biological evolution managed to work at all, why it didn’t “get stuck”, and how it managed to come up with the ornate “solutions” it did. Well, now I knew: and it turned out it was, once again, a story of computational irreducibility. And I’d managed to finish another project that I started in the 1980s.

But then there was machine learning. And despite all the energy around it—as well as practical experience with it—it didn’t seem like there was a good foundational understanding of what it was doing or why it worked. For a couple of years I’d been asking all the machine learning experts I ran into what they knew. But mostly they confirmed that, yes, it wasn’t well understood. And in fact several of them suggested that I’d be the best person to figure it out.

So just a few weeks ago, starting with ideas from the biological evolution project, and mixing in some things I tried back in 1985, I decided to embark on exploring minimal models of machine learning. I just posted the results last week. And, yes, one seems to be able to see the essence of machine learning in systems vastly simpler than neural nets. In these systems one can visualize what’s going on—and it’s basically a story of finding ways to put together lumps of irreducible computation to do the tasks we want. Like stones one might pick up off the ground to put together into a stone wall, one gets something that works, but there’s no reason for there to be any understandable structure to it.

Like so many of the projects I’ve done in the past five years, I could in principle have done this project much earlier—even in the 1980s. But back then I didn’t have the intuition, the tools or the intellectual confidence to actually dive in and get the project done. And what’s been particularly exciting over the past five years is that I can feel—and even very tangibly see—how what I can do has grown. With every project I’ve done I’ve further honed my intuition, developed more tools (both conceptual and practical), and built my intellectual confidence. Could I have gotten here earlier in my life? I don’t think so. I think to get to where I am now required the kind of journey I’ve taken through science, technology and the other things I’ve done. A living example of the phenomenon of computational irreducibility.

The Process of Getting Things Done

I started my career young—and usually found myself the “youngest person in the room”. But shockingly fast all those years whizzed by, and now I’m usually the “oldest person in the room”. But somehow I always still seem to feel like a young whippersnapper—not settled into some expected pattern, and “pushing for the future”.

I’ve always done projects that are hard. Projects that many people thought were impossible. Projects that stretched my capabilities to the limit. And to do this has required a certain mixture of confidence and humility. Confidence that it’s worth me trying the project. Humility in not assuming that it’ll be easy for me.

I’ve learned a lot of fields by now, and with them a lot of different ways of thinking. But somehow it’s never enough to make the projects I do easy. Somehow the projects are always far enough out on the frontier that I have to learn new things and new ways of thinking to succeed at them. And so there I am, often the only person in the room whose project isn’t somehow easy for them. And who still has to be pushing, whippersnapper style.

At this point, a fair fraction of the projects I do are ones that I’ve thought about for a long time; a smaller fraction are opportunistic—coming into scope just now as a result of something I’ve done, or something that’s happened in the world at large. Before the past five years I had a lot of projects that had languished, often for decades. Yes, I thought they would be interesting, and I gradually collected information about them. But somehow I wasn’t quite in a place to tackle them.

But now I feel quite differently. In the past five years, I’ve gone back and finished a fair fraction of all those languishing projects. And it’s been great. Without exception, the projects turned out to be richer and more interesting than I expected. Often I realized I really couldn’t have done them without the tools and ideas (and infrastructure) I now have. And—often to my great surprise—the projects turned out to have very direct connections to big themes around the ruliad, the Physics Project and, for that matter, computational language.

Why was this happening? Partly it’s a tribute to the breadth of the computational (and now multicomputational) paradigm. But partly it has to do with the specific character of projects I was choosing—always seeking what seemed like the simplest, most foundational versions of things.

I’ve done quite a few big projects in my life, many seemingly very different. But as I look back, I realize that all my projects have a certain overall pattern to them. They’re all about taking something that seems complicated, then drilling down to find the foundations of what’s going on, and then building up from these—often with considerable engineering-style effort. And the methods and tools I’ve developed have in a sense implicitly been optimized for this pattern of work.

I suppose one gets used to the rhythm of it all. The time when one’s drilling down, slowly trying to understand things. The time when one’s doing all the work to build the big structure up. And yes, it’s all hard. But by now I know the signs of progress, and they’re always energizing to see.

At any given time, I’ll have many projects gestating—often for years or decades. But once a project becomes active, it’s usually the only one I’m working on. And I’ll work on it with great intensity, pushing hard to keep going until it’s done. Often I’ll be working with other people, usually much younger than me. And I think it’s always a surprise that I’ll routinely be the one who works with the greatest intensity—every day, at all hours.

I think I’m pretty efficient too. Of course, it helps that I have a tool—Wolfram Language—that I’ve been building for decades to support me. And it helps that I’ve developed all kinds of practices around how I organize code and notebooks I create, and how I set up my process of writing about things. Of course, it also helps that I have very capable people around me to make suggestions, explore additional directions, fill in details, check things, and get my write-ups produced and published.

As I have written about elsewhere, my life is in many ways set up to be quite simple and routine. I get up at the same time every day, eat the same thing for breakfast, and so on. But in a sense this frees me to concentrate on the intellectual things I’m doing—which are different every day, often in unexpected ways.

But how is it that I even get the time to do all these intellectual things? After all, I am—as I have been for the past 38 years—the CEO of a very active tech company. Two things I think help (in addition, of course, to the fact that I have such a great long-term team at the company). First, organization. And second, resolve. Every day I’ll have tightly scheduled meetings over the course of the working day. (And there are lots of details to this. I get up in the late morning, then do my first two meetings while walking, and so on.) But somehow—mostly on evenings and weekends—I find time to work intensely on my intellectual projects.

It’s not as if I ignore everything else in the world. But I do have a certain drive—and resolve—that fills any time available with my projects, and somehow seems to succeed in getting them done. (And, yes, there are many optimizations in the details of my life, saving me all sorts of time. And it probably helps that I’ve been a work-from-home CEO now for 33 years.)

One might have thought that CEOing would greatly detract from being able to do intellectual work. But I find the exact opposite. Because in my experience the discipline of strategy and decision making (as well as communicating thoughts and ideas to other people) that comes with CEOing is critical to being able to do incisive intellectual work. And, by the way, the kind of thinking that goes with intellectual work is also incredibly valuable in being an effective CEO.

There’s another critical part to my “formula”. And that has to do with exposition. For me, the exposition of a project is an integral part of the project. Part of it is that the very definition of the question is often one of the most important parts of a project. But more than that, it’s through exposition that I find I really understand things. It takes a certain discipline. It can be easy enough to make some highfalutin technical statement. But can one grind it down into truly simple pieces that one can immediately understand? Yes, that means other people will be able to understand it too. But for me, what’s critical is that that’s the way I can tell if I’m getting things right. And for me the exposition is what in the end defines the backbone of a project.

Normally I write quickly, and basically without revision. But whenever there’s a piece I’m finding unduly hard to write I know that’s where I’m muddled, and need to go back and understand what’s going on. Some of my projects (like creating this piece, for example) end up being essentially “pure writing”. But most are deeply computational—and full of computer experiments. And just as I put a lot of effort into making written exposition clear, I do the same for computational language, and for pictures. Indeed, many of my projects are in large measure driven by pictures. Usually these are what one can think of as “algorithmic diagrams”—created automatically with a structure optimized for exposition.

And the pictures aren’t just useful for presenting what I’ve done; they’re also critical to my own efforts to figure things out. And I’ve learned that it’s important to get the presentational details of pictures right as early as possible in a project—to give myself the best chance to notice things.

Often the projects I do require exploring large numbers of possible systems. And somehow with great regularity this leads to me ending up looking at large arrays of little pictures. Yes, there’s a lot of “looking” that can be automated. But in the end computational irreducibility means there’ll always be the unexpected, that I basically have to see for myself.

A great thing about the Wolfram Language is that it’s been very stable ever since it was first released. And that means that I can take notebooks even from the 1980s and immediately run them today. And, yes, given all the “old” projects I’ve worked on in the past five years, that’s been very important.

But in addition to being very stable, the Wolfram Language is also very self-contained—and very much intended to be readable by humans. And the result is something that I’ve found increasingly important: every computational picture in everything I write has Wolfram Language code “behind it” that you can get by clicking. All the time I find myself going back to previous things I’ve written, and picking up click-to-copy code to run for some new case, or use as the basis for something new I’m doing.

And of course that click-to-copy code is open for anyone to use. Not only for its “computational content”, but also for the often-elaborate visuals it implements.

Most of my writings over the past five years have been about new basic science. But interspersed with this—along with pieces about technology and about philosophy—are pieces about history. And in fact many of my scientific pieces have had extensive historical sections as well.

Why do I put such effort into history? Partly I just find it fun to figure out. But mostly it’s to contextualize my understanding of things. Particularly in the past five years I’ve ended up working on a whole sequence of projects that are in a sense about changing longstanding directions in science. And to feel confident about making such changes, one has to know why people went in those directions in the first place. And that requires studying history.

Make no mistake: history—or at least good history—is hard. Often there’ll be a standard simple story about how some discovery was suddenly made, or how some direction was immediately defined. But the real story is usually much more complicated—and much more revealing of the true intellectual foundations of what was figured out. Almost never did someone discover something “one day”; almost always it took many years to build up the conceptual framework so that “one day” the key thing could even be noticed.

When I do history I always make a big effort to look at the original documents. And often I realize that’s critical—because it’s only with whatever new understanding I’ve developed that one would stand a chance of correctly interpreting what’s in the documents. And even if one’s mainly interested in the history of ideas, I’ve always found it’s crucial to also understand the people who were involved with them. What was their motivation? What was their practical situation? What kinds of things did they know about? What was their intellectual style in thinking about things?

It has helped me greatly that I’ve had my own experiences in making discoveries—that gives me an intuition for how the process of discovery works. And it also helps that I’ve had my fair share of “worldly” experiences. Still, often it’s at first a mystery how some idea developed or some discovery got made. But my consistent experience is that with enough effort one can almost always solve it.

Particularly for the projects I’ve done in recent years, it often leaves me with a strange feeling of connection. For in many cases I find out that the things I’ve now done can be viewed as direct follow-ons to ideas that were thought about a century or more ago, and for one reason or another ignored or abandoned since.

And I’m then usually left with a strong sense of responsibility. An idea that was someone’s great achievement had been buried and lost to the world. But now I have found it again, and it rests on me to bring it into the future.

In addition to writing about “other people’s history”, I’ve also been writing quite a bit about my own history. And in the last few years I’ve made a point of explaining my personal history around the science—and technology—I describe. In doing this, it helps a lot that I have excellent personal archives—that routinely let me track to within minutes discoveries I made even four decades ago.

My goal in describing my own history is to help other people contextualize things I write about. But I have to say that time and time again I’ve found the effort to piece together my own history extremely valuable just for me. As I go through life, I try to build up a repertoire of patterns for how things I do fit together. But often those patterns aren’t visible at the time. And it takes going back—often years later—to see them.

I do the projects I do first and foremost for myself. But I’ve always liked the idea that other people can get their own pleasure and benefit from my projects. And—basically starting with the Physics Project—I’ve tried to open to the world not just the results of my projects, but the process by which they’re done.

I post my working notebooks. Whenever practical I livestream my working meetings. And, perhaps taking things to an extreme, I record even my own solitary work, posting it in “video work logs”. (Except I just realized I forgot to record the writing I’m doing right now!)

A couple of years before the Physics Project I actually also opened up my technology development activities—livestreaming our software design reviews, in the past five years 692 hours of them. (And, yes, I put a lot of work and effort into designing the Wolfram Language!)

At the beginning of the pandemic I thought: “There are all these kids out of school. Let me try to do a little bit of public service and livestream something about science and technology for them.” And that’s how I started my “Science & Technology Q&A for Kids & Others” livestreams, that I’ve now been doing for four and a half years. Along the way, I’ve added “History of Science & Technology Q&A”, “Future of Science & Technology Q&A”, and “Business, Innovation & Managing Life Q&A”. Altogether I’ve done 272 hours of these, that have generated 376 podcast episodes.

Twice a week I sit down in front of a camera, watch the feed of questions, and try to answer them. It’s always off the cuff, completely unprepared. And I find it a great experience. I can tell that over the time I’ve been doing this, I’ve become a better and more fluent explainer, which no doubt helps my written exposition too. Often in answering questions I’ll come up with a new way to explain something, that I’ve never thought of before. And often there’ll be questions that make me think about things I’ve never thought about at all before. Indeed, several of my recent projects actually got started as a result of questions people asked.

When I was younger I always just wanted to get on with research, create things, and so on; I wasn’t interested in education. But as I’ve gotten older I’ve come to really like education. Partly it’s because I feel I learn a lot myself from it, but mostly it’s because I find it fulfilling to use what I know and try to help people develop.

I’ve always been interested in people—a useful attribute in running a talent-rich company for four decades. (I’m particularly interested in how people develop through their lives—leading me recently, for example, to organize a 50-year reunion for my elementary school class.) I’ve had a long-time “hobby” of mentoring CEOs and kids (both being categories of people who tend to believe that anything is possible).

But my main educational efforts are concentrated in a few weeks of the year when we do our Wolfram Summer School (started in 2003) and our Wolfram High School Summer Research Program (started in 2012). All the students in these programs (775 of them over the past five years) do an original project, and one of my jobs is to come up with what all these projects should be. Over the course of the year I’ll accumulate ideas—though rather often when I actually meet a student I’ll invent something new.

I obviously do plenty of projects myself. But it’s always an interesting—and invigorating—experience to see so many projects get done with such intensity at our summer programs. Plus, I get lots of extra practice in framing projects that helps when I come to frame my own projects.

At this point, I’ve spent years trying to organize my life to optimize it for what I want to get out of it. I need long stretches of time when I can concentrate coherently. But I like having a diversity of activities, and I’m pretty sure I wouldn’t have the energy and effectiveness I do without that. Over the years, I’ve added in little pieces. Like my weekly virtual sessions where I “do my homework” with a group of kids, working on something that I need to get done, but that doesn’t quite fit elsewhere. Or my weekly sessions with local kids, talking about things that make me and them think. Or, for that matter, my “call while driving” list of calls it’s good to make, but wouldn’t usually quite get the priority to happen.

Doing all the things I do is hard work. But it’s what I want to do. Yes, things can drag from time to time. But at this point I’m so used to the rhythm of projects that I don’t think I notice much. And, yes, I work basically every hour of every day I can. Do I have hobbies? Well, back when I was an academic, business was my main “hobby”. When I started CEOing, science became a “hobby”. Writing. Education. Livestreaming. These were all “hobbies” too. But somehow one of the patterns of my life is that nothing really stays quite as a “true hobby”.

What’s Next?

The past five years have not only been my most productive ever, but they’ve also built more “productivity momentum” than I’ve had before. So, what’s next? I have a lot of projects currently “in motion”, or ready to “get into motion”. Then I have many more that are in gestation, for which the time may finally have come. But I know there’ll also be surprises: projects that suddenly occur to me, or that I suddenly realize are possible. And one of the great challenges is to be in a position to actually jump into such things.

It has to be said that there’s always a potentially complicated tradeoff. To what extent should one “tend” the things one’s already done, and to what extent should one do new things? Of course, there are some things that are never “done”—like the Wolfram Language, which I started building 38 years ago, and still (energetically) work on every day. Or the Physics Project, where there’s just so much to figure out. But one of the things that’s worked well in most of the basic science projects I’ve done in the past five years or so is that once I’ve written my piece about the project, I can usually consider the project “done for now”. It always takes a lot of effort to get a project to the point where I can write about it. But I work hard to make sure I only have to do it once; that I’ve “picked the low-hanging fruit”, so I don’t feel I have to come back “to add a little more”.

I put a lot of effort into the pieces I write about my projects. And I also give talks, do interviews, etc. (about 500 altogether in the past five years). But I certainly don’t “market” my efforts as much as I could. It’s a decision I’ve made: that at this point in my life—particularly with the burst of productivity I’m experiencing—I want to spend as much of my time as possible doing new things. And so I need to count on others to follow up and spread knowledge about what I’ve done, whether in the academic world, on Wikipedia, the web, etc. (And, yes, pieces I write and the pictures they contain are set up to be immediately reproducible wherever appropriate.)

OK, so what specific new things are currently in my pipeline? Well, there’s lots of science (and related intellectual things). And there’s also lots of technology. But let’s talk about science first.

A big story is the Physics Project—where there’s a lot to be done, in many different directions. There’s foundational theory to be developed. And there are experimental implications to be found.

It’d be great if we could find experimental evidence of the discreteness of space, or maximum entanglement speed, or a host of other unexpected phenomena in our models. A century or so ago it was something of a stroke of luck that atoms were big enough that they could be detected. And we don’t know if the discreteness of space is something we’ll be able to detect now—or only centuries from now.

There are phenomena—particularly associated with black holes—that might effectively serve as powerful “spacetime microscopes”. And there are phenomena like dimension fluctuations that could potentially show up in a variety of astrophysical settings. But one direction I’m particularly interested in exploring is what one might call “spacetime heat”—the effect of detailed microscopic dynamics in the hypergraph that makes up spacetime. Could “dark matter”, for example, not be “matter” at all, but instead be associated with spacetime heat?

Part of investigating this involves building practical simulation software to investigate our models on as large a scale as possible. And part of it involves “good, old-fashioned physics”, figuring out how to go from underlying foundational effects to observable phenomena.

And there’s a foundational piece to this too. How does one set up mathematics—and mathematical physics—when one’s starting from a hypergraph? A traditional manifold is ultimately built up from Euclidean space. But what kind of object is the limit of a hypergraph? To understand this, we need to construct what I’m calling infrageometry—and infracalculus alongside it. Infrageometry—as its name suggests—starts from something lower level than traditional geometry. And the challenge is in effect to build a “21st century Euclid”, then Newton, etc.—eventually finding generalizations of things like differential geometry and algebraic topology that answer questions like what 3-dimensional curvature tensors are like, or how we might distinguish local gauge degrees of freedom from spatial ones in a limiting hypergraph.

Another direction has to do with particles—like electrons. The fact is that existing quantum field theory in a sense only really deals with particles indirectly, by thinking of them as perturbations in a field—which in turn is full of (usually unobservable) zero-point fluctuations. In our models, the structure of everything—from spacetime up—is determined by the “fluctuating” structure of the underlying hypergraph (or, more accurately, by the whole multiway graph of “possible fluctuations”). And what this suggests is that there’s in a sense a much lower-level version of the Feynman diagrams we use in quantum field theory, in which we can discuss the “effect of particles” without ever having to say exactly what a particle “is”.

I must say that I expected we’d have to know what particles were even to talk about energy. But it turned out there was a “bulk” way to do that. And maybe similarly there’s an indirect way to talk about interactions between particles. My guess is that in our model particles are structures a bit like black holes—but we may be able to go a very long way without having to know the details.

One of the important features of our models is that quantum mechanics is “inevitable” in them. And one of the projects I’m hoping to do is to finally “really understand quantum mechanics”. In general terms, it’s connected to the way branching observers (like us) perceive branching universes. But how do we get intuition for this, and what effects can we expect? Several projects over the past years (like multiway Turing machines, multiway games, multiway aggregation, etc.) I’ve done in large part to bolster my intuition about branchial space and quantum mechanics.

I first worked on quantum computers back in 1980. And at the time, I thought that the measurement process (whose mechanism isn’t described in the standard formalism of quantum mechanics) would be a big problem for them. Years have gone by, and enthusiasm for quantum computers has skyrocketed. In our models there’s a rather clear picture that inside a quantum computer there are “many threads of history” that can in effect do computations in parallel. But for an observer like us to “know what the answer is” we have to knit those threads together. And in our models (particularly with my observer theory efforts) we start to be able to see how that might happen, and what the limitations might be.

Meanwhile, in the world at large there are all sorts of experimental quantum computers being built. But what are their limitations? I have a suspicion that there’s some as-yet-unknown fundamental physics associated with these limitations. It’s like building telescopes: you polish the mirror, and keep on making engineering tweaks. But unless you know about diffraction, you won’t understand why your resolution is limited. And I have a slight hope that even existing results on quantum computers may be enough to see limitations perhaps associated with maximum entanglement speed in our models. And the way our models work, knowing this speed, you can for example immediately deduce the discreteness scale of space.

Back in 1982, I and another physicist wrote two papers on “Properties of the Vacuum”. Part 1 was mechanical properties. Part 2 was electrodynamic. We announced a part 3, on gravitational properties. But we never wrote it. Well, finally, it looks as if our Physics Project shows us how to think about such properties. So perhaps it’s time to finally write “Part 3”, and respond to all those people who sent preprint request cards for it four decades ago.

One of the great conclusions of our Physics Project—and the concept of the ruliad—is that we have the laws of physics we do because we are observers of the kind we are. And just knowing very coarsely about us as observers seems to already imply the major laws of twentieth century physics. And to be able to say more, I think we need more characterization of us as observers. And my guess is, for example, that some feature of us that we probably consider completely obvious is what leads us to perceive space as (roughly) three dimensional. And indeed I increasingly suspect that the whole structure of our Physics Project can be derived—a bit like early derivations of special relativity—from certain axiomatic assumptions about our nature as observers, and fundamental features of computation.

There’s plenty to do on our Physics Project, and I’m looking forward to making progress with all of it. But the ideas of the Physics Project—and multicomputation in general—apply to lots of other fields too. And I have many projects planned on these.

Let’s talk first about chemistry. I never found chemistry interesting as a kid. But as we’ve added chemistry functionality in the Wolfram Language, I’ve understood more about it, and why it’s interesting. And I’ve also followed molecular computing since the 1980s. And now, largely inspired by thinking about multicomputation, I’ve become very interested in what one might call the foundations of chemistry. Actually, what I’m most interested in is what I’m calling “subchemistry”. I suppose one can think of it as having a similar kind of relation to chemistry as infrageometry has to geometry.

In ordinary chemistry, one thinks about reactions between different species of molecules. And to calculate rates of reactions, one multiplies concentrations of different species, implicitly assuming that there’s perfect randomness in which specific molecules interact. But what if one goes to a lower level, and starts talking about the interactions not of species of molecules, but individual molecules? From our Physics Project we get the idea of making causal graphs that represent the causal relations between different specific interaction events.

In a gas the assumption of molecular-level randomness will probably be pretty good. But even in a liquid it’ll be more questionable. And in more exotic materials it’ll be a completely different story. And I suspect that there are “subchemical” processes that can potentially be important, perhaps in a sense finding a new “slice of computational reducibility” within the general computational irreducibility associated with the Second Law.

But the most important potential application of subchemistry is in biology. If we look at biological tissue, a basic question might be: “What phase of matter is it?” One of the major takeaways from molecular biology in the last few decades has been that in biological systems, molecules (or at least large ones) are basically never just “bouncing around randomly”. Instead, their motion is typically carefully orchestrated.

So when we look at biological tissue—or a biological system—we’re basically seeing the result of “bulk orchestration”. But what are the laws of bulk orchestration? We don’t know. But I want to find out. I think the “mechanoidal phase” that I identified in studying the Second Law is potentially a good test case.

If we look at a microprocessor, it’s not very useful to describe it as “containing a gas of electrons”. And similarly, it’s not useful to describe a biological cell as “being liquid inside”. But just what kind of theory is needed to have a more useful description we don’t know. And my guess is that there’ll be some new level of abstraction that’s needed to think about this (perhaps a bit like the new abstraction that was needed to formulate information theory).

Biology is not big on theory. Yes, there’s natural selection. And there’s the digital nature of biomolecules. But mostly biology has ended up just accumulating vast amounts of data (using ever better instrumentation) without any overarching theory. But I suspect that in fact there’s another foundational theory to be found in biology. And if we find it, a lot of the data that’s been collected will suddenly fall into place.

There’s the “frankly molecular” level of biology. And there’s the more “functional” level. And I was surprised recently to be able to find a very minimal model that seems to capture “functional” aspects of biological evolution. It’s a surprisingly rich model, and there’s much more to explore with it, notably about how different “ideas” get propagated and developed in the process of adaptive evolution—and what kinds of tree-of-life-style branchings occur.

And then there’s the question of self replication—a core feature of biology. Just how simple a system can exhibit it in a “biologically relevant way”? I had thought that self replication was “just relevant for biology”. But in thinking about the problem of observers in the ruliad, I’ve come to realize that it’s also relevant at a foundational level there. It’s no good to just have one observer; you have to have a whole “rulial flock” of similar ones. And to get similar ones you need something like self replication.

Talking of “societies of observers” brings me to another area I want to study: economics. How does a coherent economic system emerge from all the microscopic transactions and other events in a society? I suspect it’s a story that’s in the end similar to the theories we’ve studied in physics—from the emergence of bulk properties in fluids, to the emergence of continuum spacetime, and so on. But now in economics we’re dealing not with fluid density or metric, but instead with things like price. I don’t yet know how it will work out. Maybe computational reducibility will be associated with value. Maybe computational irreducibility will be what determines robustness of value. But I suspect that there’s a way of thinking about “economic observers” in the ruliad—and figuring out what “natural laws” they’ll “inevitably observe”. And maybe some of those natural laws will be relevant in thinking about the kind of questions we humans care about in economics.

It’s rather amazing in how many different areas one seems to be able to apply the kind of approach that’s emerged from the Physics Project, the ruliad, etc. One that I’ve very recently tackled is machine learning. And in my effort to understand its foundations, I’ve ended up coming up with some very minimal models. My purpose was to understand the essence of machine learning. But—somewhat to my surprise—it looks as if these minimal models can actually be practical ways to do machine learning. Their hardware-level tradeoffs are somewhat different. But—given my interest in practical technology—I want to see if one can build out a practical machine-learning framework that’s based on these (fundamentally discrete) models.

And while I’m not currently planning to investigate this myself, I suspect that the approach I’ve used to study machine learning can also be applied to neuroscience, and perhaps to linguistics. And, yes, there’ll probably be a lot of computational irreducibility in evidence. And once again one has to hope that the pockets of computational reducibility that exist will give rise to “natural laws” that are useful for what we care about in these fields.

In addition to these “big” projects, I’m also hoping to do a variety of “smaller” projects. Many I started decades ago, and in fact mentioned in A New Kind of Science. But now I feel I have the tools, intuition and intellectual momentum to finally finish them. Nestedly recursive functions. Deterministic random tilings. Undecidability in the three-body problem. “Meta-engineering” in the Game of Life. These might on their own seem esoteric. But my repeated experience—particularly in the past five years—is that by solving problems like these one builds examples and intuition that have surprisingly broad application.

And then there are history projects. Just what did happen to theories of discrete space in the early twentieth century (and how close did people like Einstein get to the ideas of our Physics Project)? What was the “ancient history” of neural nets, and why did people come to assume they should be based on continuous real numbers? I fully expect that as I investigate these things, I’ll encounter all sorts of “if only” situations—where for example some unpublished note languishing in an archive (or attic) would have changed the course of science if it had seen the light of day long ago. And when I find something like this, it’s yet more motivation to actually finish those projects of mine that have been languishing so long in the filesystem of my computer.

There’s a lot I want to do “down in the computational trenches”, in physics, chemistry, biology, economics, etc. But there are also things at a more abstract level in the ruliad. There’s more to study about metamathematics, and about how mathematics that we humans care about can emerge from the ruliad. And there are also foundational questions in computer science. P vs. NP, for example, can be formulated as an essentially geometric problem in the ruliad—and conceivably there are mathematical methods (say from higher category theory) that might give insight into it.

Then there are questions about hyperruliads and hyporuliads. In a hyperruliad that’s based on hypercomputation, there will be hyperobservers. But is there a kind of “rulial relativity” that makes their perception of things just the same as “ordinary observers” in the ordinary ruliad? A way to get some insight into this may be to study hyporuliads—versions of the ruliad in which there are only limited levels of computation possible. A bit like the way a spacelike singularity associated with a black hole supports only limited time histories, or a decidable axiomatic theory supports only proofs of limited length, there will be limitations in the hyporuliad. And by studying them, there’s a possibility that we’ll be able to see more about issues like what kinds of mathematical axioms can be compatible with observers like us.

It’s worth commenting that our Physics Project—and the ruliad—have all sorts of connections and resonances with long-studied ideas in philosophy. “Didn’t Kant talk about that? Isn’t that similar to Leibniz?”, etc. I’ve wanted to try to understand these historical connections. But while I’ve done a lot of work on the historical development of ideas, the ideas in question have tended to be more focused, and more tied to concrete formalism than they usually are in philosophy. “Did Kant actually mean that, or something completely different?” You might have to understand all his works to know. And that’s more than I think I can do.

I invented the concept of the ruliad as a matter of science. But it’s now clear that the ruliad has all sorts of connections and resonances not only with philosophy but also with theology. Indeed, in a great many belief systems there’s always been the idea that somehow in the end “everything is one”. In cases where this gets slightly more formalized, there’s often some kind of combinatorial enumeration involved (think: I Ching, or various versions of “counting the names of God”).

There are all sorts of examples where long-surviving “ancient beliefs” end up having something to them, even if the specific methods of post-1600s science don’t have much to say about them. One example is the notion of a soul, which we might now see as an ancient premonition of the modern notion of abstract computation. And whenever there’s a belief that’s ancient, there’s likely to have been lots of thinking done around it over the millennia. So if we can, for example, see a connection to the ruliad, we can expect to leverage that thinking. And perhaps also be able to provide new input that can refine the belief system in interesting and valuable ways.

I’m always interested in different viewpoints about things—whether from science, philosophy, theology, wherever. And an extreme version of this is to think about how other “alien” minds might view things. Nowadays I think of different minds as effectively being at different places in the ruliad. Humans with similar backgrounds have minds that are close in rulial space. Cats and dogs have minds that are further away. And the weather (with its “mind of its own”) is still further.

Now that we have AIs we potentially have a way to study the correspondence—and communication—between “different minds”. I looked at one aspect of this in my “cats” piece. But my recent work on the foundations of machine learning suggests a broader approach, that can also potentially tell us things about the fundamental character of language, and about how it serves as a medium that can “transport thoughts” from one mind to another.

Many non-human animals seem to have at least some form of language—though mostly in effect just a few standalone words. But pretty unquestionably the greatest single invention of our species is language—and particularly compositional language where words and phrases can fit together in an infinite number of ways. But is there something beyond compositional language? And, for example, where might we get if our brains were bigger?

With the 100 billion neurons in our brains, we seem to be able to handle about 50,000 words. If we had a trillion neurons we’d probably be able to handle more words (though perhaps more slowly), in effect letting us describe more things more easily. But what about something fundamentally beyond compositional language? Something perhaps “higher order”?

With a word we are in effect conflating all instances of a certain concept into a single object that we can then work with. But typically with ordinary words we’re dealing with what we might call “static concepts”. So what about “ways of thinking”, or paradigms? They’re more like active, functional concepts. And it’s a bit like dogs versus us: dogs deal with a few standalone words; we “package” those together into whole sentences and beyond. And at the next level, we could imagine in effect packaging things like generators of meaningful sentences.

Interestingly enough, we have something of a preview of ideas like this—in computational language. And this is one of those places where my efforts in science—and philosophy—start to directly intersect with my efforts in technology.

The foundation of the Wolfram Language is the idea of representing everything in computational terms, and in particular in symbolic computational terms. And one feature of such a representation is that it can encompass both “data” and “code”—i.e. both things one might think about, and ways one might think about them.

I first started building Wolfram Language as a practical tool—though one very much informed by my foundational ideas. And now, four decades later, the Wolfram Language has emerged as the largest single project of my life, and something that, yes, I expect to always put immense effort into. It wasn’t long ago that we finally finished my 1991 to-do list for Wolfram Language—and we have many projects running now that will take years to complete. But the mission has always remained the same: to take the concept of computation and apply it as broadly as possible, through the medium of computational language.

Now, however, I have some additional context for that—viewing computational language as a bridge from what we humans think about to what’s possible in the computational universe. And this helps in framing some of the ways to expand the foundations of our computational language, for example to multicomputation, or to hypergraph-based representations. It also helps in understanding the character of current AI, and how it needs to interact with computational language.

In the Wolfram Language we’ve been steadily trying to create a representation for everything. And when it comes to definitive, objective things we’ve gotten a long way. But there’s more than that in everyday discourse. For example, I might say “I’m going to drink a glass of orange juice.” Well, we do just fine at representing “a glass of orange juice” in the Wolfram Language, and we can compute lots of things—like nutrition content—about it. But what about “I’m going to drink…”? For that we need something different.

And, actually, I’ve been thinking for a shockingly long time about what one might need. I first considered the question in the early 1980s, in connection with “extending SMP to AI”. I learned about the attempts to make “philosophical languages” in the 1600s, and about some of the thinking around modern conlangs (constructed languages). Something that always held me back, though, was use cases. Yes, I could see how one could use things like this for tasks like customer service. But I wasn’t too excited about that.

But finally there was blockchain, and with it, smart contracts. And around 2015 I started thinking about how one might represent contracts in general not in legalese but in some precise computational way. And the result was that I began to crispen my ideas about what I called “symbolic discourse language”. I thought about how this might relate to questions like a “constitution for AIs” and so on. But I never quite got around to actually starting to design the specifics of the symbolic discourse language.

But then along came LLMs, together with my theory that their success had to do with a “semantic grammar” of language. And finally now we’ve launched a serious project to build a symbolic discourse language. And, yes, it’s a difficult language design problem, deeply entangled with a whole range of foundational issues in philosophy. But as, by now at least, the world’s most experienced language designer (for better or worse), I feel a responsibility to try to do it.

In addition to language design, there’s also the question of making all the various “symbolic calculi” that describe in appropriately coarse terms the operation of the world. Calculi of motion. Calculi of life (eating, dying, etc.). Calculi of human desires. Etc. As well as calculi that are directly supported by the computation and knowledge in the Wolfram Language.

And just as LLMs can provide a kind of conversational linguistic interface to the Wolfram Language, one can expect them also to do this to our symbolic discourse language. So the pattern will be similar to what it is for Wolfram Language: the symbolic discourse language will provide a formal and (at least within its purview) correct underpinning for the LLM. It may lose the poetry of language that the LLM handles. But from the outset it’ll get its reasoning straight.

The symbolic discourse language is a broad project. But in some sense breadth is what I have specialized in. Because that’s what’s needed to build out the Wolfram Language, and that’s what’s needed in my efforts to pull together the foundations of so many fields.

And in maintaining a broad range of interests there are some where I imagine that someday there’ll be a project I can do, but there may for example be many years of “ambient technology” that are needed before that project will be feasible. Usually, though, I have some “conceptual idea” of what the project might be. For example, I’ve followed robotics, imagining that one day there’ll be a way to do “general-purpose robotics”, perhaps constructing everything out of modular elements. I’ve followed biomedicine, partly out of personal self interest, and partly because I think it’ll relate to some of the foundational questions I’m asking in biology.

But in addition to all the projects where the goal is basic research, or technology development, I’m also hoping to pursue my interests in education. Much of what I hope to do relates to content, but some of it relates to access and motivation. I don’t have perfect evidence, but I strongly believe there’s a lot of young talent out there in the world that never manages to connect for example with things like the educational programs we put on. We–and I—have tried quite hard over the years to “bridge the gap”. But with the world as it is, it’s proved remarkably difficult. But it’s still a problem I’d like to solve, and I’ll keep picking away at it, hoping to change for the better some kids’ “trajectories”.

But about content I believe my path is clearer. With the modern Wolfram Language I think we’ve gone a long way towards being able to take computational thinking about almost anything, and being able to represent it in a formalized way, and compute from it. But how do people manage to do the computational thinking in the first place? Well, like mathematical thinking and other formalized kinds of thinking, they have to learn how to do it.

For years people have been telling me I should “write the book” to teach this. And finally in January of this year I started. I’m not sure how long it will take, but I’ll soon be starting to post sections I’ve written so far.

My goal is to create a general book—and course—that’s an introduction to computational thinking at a level suitable for typical first-year college students. Lots of college students these days say they want to study “computer science”. But really it’s computational X for some field X that they’re ultimately interested in. And neither the theoretical nor the engineering aspects of typical “computer science” are what’s most relevant to them. What they need to know is computational thinking as it might be applied to computational X—not “CS” but what one might call “CX”.

So what will CX101 be like? In some ways more like a philosophy course than a CS one. Because in the end it’s about generally learning to think, albeit in the new paradigm of computation. And the point is that once someone has a clear computational conceptualization of something, then it’s our job in the Wolfram Language to make sure that it’s easy for them to concretely implement it.

But how does one teach computational conceptualization? What I’ve concluded is that one needs to anchor it in actual things in the world. Geography. Video. Genomics. Yes, there are principles to explain. But they need practical context to make them useful, or even understandable. And what I’m finding is that framing everything computationally makes things incredibly much easier to explain than before. (A test example coming soon is whether I can easily explain math ideas like algebra and calculus this way.)

OK, so that’s a lot of projects. But I’m excited about all of them, and can’t wait to make them happen. At an age when many of my contemporaries are retiring, I feel like I’m just getting started. And somehow the way my projects keep on connecting back to things I did decades ago makes me feel—in a computational irreducibility kind of way—that there’s something necessary about all the steps I’ve taken. I feel like the things I’ve done have let me climb some hills. But now there are many more hills that have come into view. And I look forward to being able to climb those too. For myself and for the world.

What’s Really Going On in Machine Learning? Some Minimal Models

2024-08-23 02:28:17

What's Really Going On in Machine Learning? Some Minimal Models

The Mystery of Machine Learning

It’s surprising how little is known about the foundations of machine learning. Yes, from an engineering point of view, an immense amount has been figured out about how to build neural nets that do all kinds of impressive and sometimes almost magical things. But at a fundamental level we still don’t really know why neural nets “work”—and we don’t have any kind of “scientific big picture” of what’s going on inside them.

The basic structure of neural networks can be pretty simple. But by the time they’re trained up with all their weights, etc. it’s been hard to tell what’s going on—or even to get any good visualization of it. And indeed it’s far from clear even what aspects of the whole setup are actually essential, and what are just “details” that have perhaps been “grandfathered” all the way from when computational neural nets were first invented in the 1940s.

Well, what I’m going to try to do here is to get “underneath” this—and to “strip things down” as much as possible. I’m going to explore some very minimal models—that, among other things, are more directly amenable to visualization. At the outset, I wasn’t at all sure that these minimal models would be able to reproduce any of the kinds of things we see in machine learning. But, rather surprisingly, it seems they can.

And the simplicity of their construction makes it much easier to “see inside them”—and to get more of a sense of what essential phenomena actually underlie machine learning. One might have imagined that even though the training of a machine learning system might be circuitous, somehow in the end the system would do what it does through some kind of identifiable and “explainable” mechanism. But we’ll see that in fact that’s typically not at all what happens.

Instead it looks much more as if the training manages to home in on some quite wild computation that “just happens to achieve the right results”. Machine learning, it seems, isn’t building structured mechanisms; rather, it’s basically just sampling from the typical complexity one sees in the computational universe, picking out pieces whose behavior turns out to overlap what’s needed. And in a sense, therefore, the possibility of machine learning is ultimately yet another consequence of the phenomenon of computational irreducibility.

Why is that? Well, it’s only because of computational irreducibility that there’s all that richness in the computational universe. And, more than that, it’s because of computational irreducibility that things end up being effectively random enough that the adaptive process of training a machine learning system can reach success without getting stuck.

But the presence of computational irreducibility also has another important implication: that even though we can expect to find limited pockets of computational reducibility, we can’t expect a “general narrative explanation” of what a machine learning system does. In other words, there won’t be a traditional (say, mathematical) “general science” of machine learning (or, for that matter, probably also neuroscience). Instead, the story will be much closer to the fundamentally computational “new kind of science” that I’ve explored for so long, and that has brought us our Physics Project and the ruliad.

In many ways, the problem of machine learning is a version of the general problem of adaptive evolution, as encountered for example in biology. In biology we typically imagine that we want to adaptively optimize some overall “fitness” of a system; in machine learning we typically try to adaptively “train” a system to make it align with certain goals or behaviors, most often defined by examples. (And, yes, in practice this is often done by trying to minimize a quantity normally called the “loss”.)

And while in biology there’s a general sense that “things arise through evolution”, quite how this works has always been rather mysterious. But (rather to my surprise) I recently found a very simple model that seems to do well at capturing at least some of the most essential features of biological evolution. And while the model isn’t the same as what we’ll explore here for machine learning, it has some definite similarities. And in the end we’ll find that the core phenomena of machine learning and of biological evolution appear to be remarkably aligned—and both fundamentally connected to the phenomenon of computational irreducibility.

Most of what I’ll do here focuses on foundational, theoretical questions. But in understanding more about what’s really going on in machine learning—and what’s essential and what’s not—we’ll also be able to begin to see how in practice machine learning might be done differently, potentially with more efficiency and more generality.

Traditional Neural Nets

Note: Click any diagram to get Wolfram Language code to reproduce it.

To begin the process of understanding the essence of machine learning, let’s start from a very traditional—and familiar—example: a fully connected (“multilayer perceptron”) neural net that’s been trained to compute a certain function f[x]:

If one gives a value x as input at the top, then after “rippling through the layers of the network” one gets a value at the bottom that (almost exactly) corresponds to our function f[x]:

Scanning through different inputs x, we see different patterns of intermediate values inside the network:

And here’s (on a linear and log scale) how each of these intermediate values changes with x. And, yes, the way the final value (highlighted here) emerges looks very complicated:

So how is the neural net ultimately put together? How are these values that we’re plotting determined? We’re using the standard setup for a fully connected multilayer network. Each node (“neuron”) on each layer is connected to all nodes on the layer above—and values “flow” down from one layer to the next, being multiplied by the (positive or negative) “weight” (indicated by color in our pictures) associated with the connection through which they flow. The value of a given neuron is found by totaling up all its (weighted) inputs from the layer before, adding a “bias” value for that neuron, and then applying to the result a certain (nonlinear) “activation function” (here ReLU or Ramp[z], i.e. If[z < 0, 0, z]).
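In rough Wolfram Language terms, and purely as a sketch (not the code behind the diagrams above, which you can get by clicking them), this computation is just a weighted sum, a bias and an activation function at each layer, folded over the whole network; the names layerForward and netForward here are illustrative:

(* one fully connected layer: weighted sum of the incoming values, plus a bias, then ReLU *)
layerForward[values_, weights_, biases_] := Ramp[weights . values + biases]

(* the whole net: fold the layer computation over a list of {weights, biases} pairs;
   the final layer is assumed to produce a single output value *)
netForward[x_, layers_] := First[Fold[layerForward[#1, First[#2], Last[#2]] &, {x}, layers]]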

What overall function a given neural net will compute is determined by the collection of weights and biases that appear in the neural net (along with its overall connection architecture, and the activation function it’s using). The idea of machine learning is to find weights and biases that produce a particular function by adaptively “learning” from examples of that function. Typically we might start from a random collection of weights, then successively tweak weights and biases to “train” the neural net to reproduce the function:

We can get a sense of how this progresses (and, yes, it’s complicated) by plotting successive changes in individual weights over the course of the training process (the spikes near the end come from “neutral changes” that don’t affect the overall behavior):

The overall objective in the training is progressively to decrease the “loss”—the average (squared) difference between true values of f[x] and those generated by the neural net. The evolution of the loss defines a “learning curve” for the neural net, with the downward glitches corresponding to points where the neural net in effect “made a breakthrough” in being able to represent the function better:
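In the same sketchy spirit, and assuming the illustrative netForward above together with training data given as {x, f[x]} pairs, the loss being described is simply:

(* mean squared difference between the net's outputs and the true values of the function *)
loss[layers_, data_] := Mean[(netForward[First[#], layers] - Last[#])^2 & /@ data]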

It’s important to note that typically there’s randomness injected into neural net training. So if one runs the training multiple times, one will get different networks—and different learning curves—every time:

But what’s really going on in neural net training? Effectively we’re finding a way to “compile” a function (at least to some approximation) into a neural net with a certain number of (real-valued) parameters. And in the example here we happen to be using about 100 parameters.

But what happens if we use a different number of parameters, or set up the architecture of our neural net differently? Here are a few examples, indicating that for the function we’re trying to generate, the network we’ve been using so far is pretty much the smallest that will work:

And, by the way, here’s what happens if we change our activation function from ReLU to the smoother ELU:

Later we’ll talk about what happens when we do machine learning with discrete systems. And in anticipation of that, it’s interesting to see what happens if we take a neural net of the kind we’ve discussed here, and “quantize” its weights (and biases) in discrete levels:

The result is that (as recent experience with large-scale neural nets has also shown) the basic “operation” of the neural net does not require precise real numbers, but survives even when the numbers are at least somewhat discrete—as this 3D rendering as a function of the discreteness level δ also indicates:
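To be concrete about one thing “quantizing in discrete levels” can mean, here is a minimal sketch in which every weight and bias is simply rounded to the nearest multiple of a discreteness level δ (this particular rounding scheme is an assumption made for illustration, not necessarily what was used for the pictures above):

(* round every parameter, at every level of a nested list of weights and biases, to a multiple of δ *)
quantize[params_, δ_] := Map[δ Round[#/δ] &, params, {-1}]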

Simplifying the Topology: Mesh Neural Nets

So far we’ve been discussing very traditional neural nets. But to do machine learning, do we really need systems that have all those details? For example, do we really need every neuron on each layer to get an input from every neuron on the previous layer? What happens if instead every neuron just gets input from at most two others—say with the neurons effectively laid out in a simple mesh? Quite surprisingly, it turns out that such a network is still perfectly able to generate a function like the one we’ve been using as an example:

And one advantage of such a “mesh neural net” is that—like a cellular automaton—its “internal behavior” can readily be visualized in a rather direct way. So, for example, here are visualizations of “how the mesh net generates its output”, stepping through different input values x:

And, yes, even though we can visualize it, it’s still hard to understand “what’s going on inside”. Looking at the intermediate values of each individual node in the network as a function of x doesn’t help much, though we can “see something happening” at places where our function f[x] has jumps:

So how do we train a mesh neural net? Basically we can use the same procedure as for a fully connected network of the kind we saw above (ReLU activation functions don’t seem to work well for mesh nets, so we’re using ELU here):
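To be concrete about what a mesh net of this kind computes, here is a rough sketch of a single layer, in one possible arrangement in which each node combines just two neighboring values from the layer above (so that each successive layer is one node narrower), using the ELU activation just mentioned; the names are illustrative:

(* ELU activation *)
elu[z_] := If[z > 0, z, Exp[z] - 1]

(* one mesh layer: node i combines only values i and i + 1 from the layer above *)
meshLayerForward[values_, weights_, biases_] := Table[
  elu[weights[[i, 1]] values[[i]] + weights[[i, 2]] values[[i + 1]] + biases[[i]]],
  {i, Length[values] - 1}]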

Here’s the evolution of differences in each individual weight during the training process:

And here are results for different random seeds:

At the size we’re using, our mesh neural nets have about the same number of connections (and thus weights) as our main example of a fully connected network above. And we see that if we try to reduce the size of our mesh neural net, it doesn’t do well at reproducing our function:

Making Everything Discrete: A Biological Evolution Analog

Mesh neural nets simplify the topology of neural net connections. But, somewhat surprisingly at first, it seems as if we can go much further in simplifying the systems we’re using—and still successfully do versions of machine learning. And in particular we’ll find that we can make our systems completely discrete.

The typical methodology of neural net training involves progressively tweaking real-valued parameters, usually using methods based on calculus, and on finding derivatives. And one might imagine that any successful adaptive process would ultimately have to rely on being able to make arbitrarily small changes, of the kind that are possible with real-valued parameters.

But in studying simple idealizations of biological evolution I recently found striking examples where this isn’t the case—and where completely discrete systems seemed able to capture the essence of what’s going on.

As an example consider a (3-color) cellular automaton. The rule is shown on the left, and the behavior one generates by repeatedly applying that rule (starting from a single-cell initial condition) is shown on the right:

The rule has the property that the pattern it generates (from a single-cell initial condition) survives for exactly 40 steps, and then dies out (i.e. every cell becomes white). And the important point is that this rule can be found by a discrete adaptive process. The idea is to start, say, from a null rule, and then at each step to randomly change a single outcome out of the 27 in the rule (i.e. make a “single-point mutation” in the rule). Most such changes will cause the “lifetime” of the pattern to get further from our target of 40—and these we discard. But gradually we can build up “beneficial mutations”

that through “progressive adaptation” eventually get to our original lifetime-40 rule:

We can make a plot of all the attempts we made that eventually let us reach lifetime 40—and we can think of this progressive “fitness” curve as being directly analogous to the loss curves in machine learning that we saw before:

If we make different sequences of random mutations, we’ll get different paths of adaptive evolution, and different “solutions” for rules that have lifetime 40:

Two things are immediately notable about these. First, that they essentially all seem to be “using different ideas” to reach their goal (presumably analogous to the phenomenon of different branches in the tree of life). And second, that none of them seem to be using a clear “mechanical procedure” (of the kind we might construct through traditional engineering) to reach their goal. Instead, they seem to be finding “natural” complicated behavior that just “happens” to achieve the goal.

It’s nontrivial, of course, that this behavior can achieve a goal like the one we’ve set here, as well as that simple selection based on random point mutations can successfully reach the necessary behavior. But as I discussed in connection with biological evolution, this is ultimately a story of computational irreducibility—particularly in generating diversity both in behavior, and in the paths necessary to reach it.

But, OK, so how does this model of adaptive evolution relate to systems like neural nets? In the standard language of neural nets, our model is like a discrete analog of a recurrent convolutional network. It’s “convolutional” because at any given step the same rule is applied—locally—throughout an array of elements. It’s “recurrent” because in effect data is repeatedly “passed through” the same rule. The kinds of procedures (like “backpropagation”) typically used to train traditional neural nets wouldn’t be able to train such a system. But it turns out that—essentially as a consequence of computational irreducibility—the very simple method of successive random mutation can be successful.
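For concreteness, here is a minimal sketch of that mutate-and-select loop, assuming the rule is given as its list of 27 outcomes in the standard descending-neighborhood order (so that its rule number is FromDigits[outcomes, 3]), and using a lifetime capped at tmax as a stand-in for the fitness (in the model itself, patterns that never die out are assigned zero fitness; the cap is a simplification):

(* number of nonblank steps, counting the initial seed, up to a maximum of tmax *)
lifetime[outcomes_, tmax_] := LengthWhile[
  CellularAutomaton[{FromDigits[outcomes, 3], 3}, {{1}, 0}, tmax], Max[#] > 0 &]

(* single-point mutation: change one of the 27 outcomes to a random value *)
mutate[outcomes_] := ReplacePart[outcomes, RandomInteger[{1, 27}] -> RandomInteger[{0, 2}]]

(* one adaptive step: keep the mutation only if it gets no further from the target lifetime *)
adaptStep[outcomes_, target_, tmax_] := Module[{new = mutate[outcomes]},
  If[Abs[lifetime[new, tmax] - target] <= Abs[lifetime[outcomes, tmax] - target], new, outcomes]]

(* e.g. Nest[adaptStep[#, 40, 100] &, ConstantArray[0, 27], 10000] starts from the null rule *)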

Machine Learning in Discrete Rule Arrays

Let’s say we want to set up a system like a neural net—or at least a mesh neural net—but we want it to be completely discrete. (And I mean “born discrete”, not just discretized from an existing continuous system.) How can we do this? One approach (that, as it happens, I first considered in the mid-1980s—but never seriously explored) is to make what we can call a “rule array”. Like in a cellular automaton there’s an array of cells. But instead of these cells always being updated according to the same rule, each cell at each place in the cellular automaton analog of “spacetime” can make a different choice of what rule it will use. (And although it’s a fairly extreme idealization, we can potentially imagine that these different rules represent a discrete analog of different local choices of weights in a mesh neural net.)
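Here is a minimal sketch of how such a rule array might be represented and run: ruleArray[[t, i]] is taken to give the (k = 2, r = 1) rule number used to update cell i at step t, with white padding assumed at the edges (the f[x] example later uses cyclic boundaries instead); the names are illustrative:

(* output of an elementary (k = 2, r = 1) rule for one 3-cell neighborhood *)
ruleStep[rule_, {a_, b_, c_}] := IntegerDigits[rule, 2, 8][[8 - (4 a + 2 b + c)]]

(* one step: each cell is updated by its own rule from the corresponding row of the rule array *)
arrayStep[row_, rules_] := MapThread[ruleStep, {rules, Partition[row, 3, 1, {2, 2}, 0]}]

(* full evolution of an initial row of cells down through the whole rule array *)
evolveArray[init_, ruleArray_] := FoldList[arrayStep, init, ruleArray]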

As a first example, let’s consider a rule array in which there are two possible choices of rules: k = 2, r = 1 cellular automaton rules 4 and 146 (which are respectively class 2 and class 3):

A particular rule array is defined by which of these rules is going to be used at each (“spacetime”) position in the array. Here are a few examples. In all cases we’re starting from the same single-cell initial condition. But in each case the rule array has a different arrangement of rule choices—with cells “running” rule 4 shown on one background color, and those running rule 146 on another:

We can see that different choices of rule array can yield very different behaviors. But (in the spirit of machine learning) can we in effect “invert this”, and find a rule array that will give some particular behavior we want?

A simple approach is to do the direct analog of what we did in our minimal modeling of biological evolution: progressively make random “single-point mutations”—here “flipping” the identity of just one rule in the rule array—and then keeping only those mutations that don’t make things worse.

As our sample objective, let’s ask to find a rule array that makes the pattern generated from a single cell using that rule array “survive” for exactly 50 steps. At first it might not be obvious that we’d be able to find such a rule array. But in fact our simple adaptive procedure easily manages to do this:

As the dots here indicate, many mutations don’t lead to longer lifetimes. But every so often, the adaptive process has a “breakthrough” that increases the lifetime—eventually reaching 50:

Just as in our model of biological evolution, different random sequences of mutations lead to different “solutions”, here to the problem of “living for exactly 50 steps”:

Some of these are in effect “simple solutions” that require only a few mutations. But most—like most of our examples in biological evolution—seem more as if they just “happen to work”, effectively by tapping into just the right, fairly complex behavior.

Is there a sharp distinction between these cases? Looking at the collection of “fitness” (AKA “learning”) curves for the examples above, it doesn’t seem so:

It’s not too difficult to see how to “construct a simple solution” just by strategically placing a single instance of the second rule in the rule array:

But the point is that adaptive evolution by repeated mutation normally won’t “discover” this simple solution. And what’s significant is that the adaptive evolution can nevertheless still successfully find some solution—even though it’s not one that’s “understandable” like this.

The cellular automaton rules we’ve been using so far take 3 inputs. But it turns out that we can make things even simpler by just putting ordinary 2-input Boolean functions into our rule array. For example, we can make a rule array from And and Xor functions (r = 1/2 rules 8 and 6):

Different And+Xor rule arrays show different behavior:

But are there for example And+Xor rule arrays that will compute any of the 16 possible (2-input) functions? We can’t get Not or any of the other “odd” functions (those whose output is black when all their inputs are white)—but it turns out we can get all 8 “even” functions (additional inputs here are assumed to be white):

And in fact we can also set up And+Xor rule arrays for all other “even” Boolean functions. For example, here are rule arrays for the 3-input rule 30 and rule 110 Boolean functions:

It may be worth commenting that the ability to set up such rule arrays is related to functional completeness of the underlying rules we’re using—though it’s not quite the same thing. Functional completeness is about setting up arbitrary formulas, that can in effect allow long-range connections between intermediate results. Here, all information has to explicitly flow through the array. But for example the functional completeness of Nand (r = 1/2 rule 7) allows it to generate all Boolean functions when combined for example with First (r = 1/2 rule 12), though sometimes the rule arrays required are quite large:

OK, but what happens if we try to use our adaptive evolution process—say to solve the problem of finding a pattern that survives for exactly 30 steps? Here’s a result for And+Xor rule arrays:

And here are examples of other “solutions” (none of which in this case look particularly “mechanistic” or “constructed”):

But what about learning our original function f[x]? Well, first we have to decide how we’re going to represent the numbers x and f[x] in our discrete rule array system. And one approach is to do this simply in terms of the position of a black cell (“one-hot encoding”). So, for example, in this case there’s an initial black cell at a position corresponding to about x = –1.1. And then the result after passing through the rule array is a black cell at a position corresponding to f[x] = 1.0:
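A minimal sketch of this kind of position encoding, assuming x runs over the interval [–1, 1] across an array of n cells (the names and the specific interval are illustrative):

(* put a single black cell at the position corresponding to x *)
encode[x_, n_] := ReplacePart[ConstantArray[0, n], Round[(n - 1) (x + 1)/2] + 1 -> 1]

(* read the position of the black cell back off as a number; assumes exactly one black cell *)
decode[row_, n_] := 2 (First[FirstPosition[row, 1]] - 1)/(n - 1) - 1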

So now the question is whether we can find a rule array that successfully maps initial to final cell positions according to the mapping x → f[x] we want. Well, here’s an example that comes at least close to doing this (note that the array is taken to be cyclic):

So how did we find this? Well, we just used a simple adaptive evolution process. In direct analogy to the way it’s usually done in machine learning, we set up “training examples”, here of the form:

Then we repeatedly made single-point mutations in our rule array, keeping those mutations where the total difference from all the training examples didn’t increase. And after 50,000 mutations this gave the final result above.
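Here, in the same sketchy spirit, is roughly what that mutate-and-keep loop looks like. It assumes the illustrative evolveArray and encode from the sketches above (for And+Xor arrays the cell updates would use 2-cell rather than 3-cell neighborhoods, but the loop itself is unchanged), training examples given as {x, f[x]} pairs, and one simple choice of loss: the total cell-by-cell difference between the final row and the one-hot encoding of the target value:

(* loss: how far the final rows are from the one-hot encodings of the target values *)
arrayLoss[ruleArray_, examples_, n_] := Total[
  Total[Abs[Last[evolveArray[encode[First[#], n], ruleArray]] - encode[Last[#], n]]] & /@ examples]

(* single-point mutation: flip the rule at one randomly chosen position between the two allowed rules *)
mutateArray[ruleArray_, {r1_, r2_}] := Module[
  {t = RandomInteger[{1, Length[ruleArray]}], i = RandomInteger[{1, Length[First[ruleArray]]}]},
  ReplacePart[ruleArray, {t, i} -> If[ruleArray[[t, i]] === r1, r2, r1]]]

(* one training step: keep the mutation only if the loss does not increase *)
trainStep[ruleArray_, examples_, n_, rules_] := Module[{new = mutateArray[ruleArray, rules]},
  If[arrayLoss[new, examples, n] <= arrayLoss[ruleArray, examples, n], new, ruleArray]]

(* e.g. Nest[trainStep[#, examples, n, {8, 6}] &, initialArray, 50000], with examples, n and initialArray supplied *)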

We can get some sense of “how we got there” by showing the sequence of intermediate results where we got closer to the goal (as opposed to just not getting further from it):

Here are the corresponding rule arrays, in each case highlighting elements that have changed (and showing the computation of f[0] in the arrays):

Different sequences of random mutations will lead to different rule arrays. But with the setup defined here, the resulting rule arrays will almost always succeed in accurately computing f[x]. Here are a few examples—in which we’re specifically showing the computation of f[0]:

And once again an important takeaway is that we don’t see “identifiable mechanism” in what’s going on. Instead, it looks more as if the rule arrays we’ve got just “happen” to do the computations we want. Their behavior is complicated, but somehow we can manage to “tap into it” to compute our f[x].

But how robust is this computation? A key feature of typical machine learning is that it can “generalize” away from the specific examples it’s been given. It’s never been clear just how to characterize that generalization (when does an image of a cat in a dog suit start being identified as an image of a dog?). But—at least when we’re talking about classification tasks—we can think of what’s going on in terms of basins of attraction that lead to attractors corresponding to our classes.

It’s all considerably easier to analyze, though, in the kind of discrete system we’re exploring here. For example, we can readily enumerate all our training inputs (i.e. all initial states containing a single black cell), and then see how frequently these cause any given cell to be black:
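A minimal sketch of this enumeration, assuming the illustrative evolveArray from above and an array n cells wide:

(* fraction of single-black-cell inputs for which each spacetime cell ends up black *)
cellFrequencies[ruleArray_, n_] := Mean[
  Table[evolveArray[ReplacePart[ConstantArray[0, n], i -> 1], ruleArray], {i, n}]]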

By the way, here’s what happens to this plot at successive “breakthroughs” during training:

But what about all possible inputs, including ones that don’t just contain a single black cell? Well, we can enumerate all of them, and compute the overall frequency for each cell in the array to be black:

As we would expect, the result is considerably “fuzzier” than what we got purely with our training inputs. But there’s still a strong trace of the discrete values for f[x] that appeared in the training data. And if we plot the overall probability for a given final cell to be black, we see peaks at positions corresponding to the values 0 and 1 that f[x] takes on:

But because our system is discrete, we can explicitly look at what outcomes occur:

The most common overall is the “meaningless” all-white state—that basically occurs when the computation from the input “never makes it” to the output. But the next most common outcomes correspond exactly to f[x] = 0 and f[x] = 1. After that is the “superposition” outcome where f[x] is in effect “both 0 and 1”.

But, OK, so what initial states are “in the basins of attraction of” (i.e. will evolve to) the various outcomes here? The fairly flat plots in the last column above indicate that the overall density of black cells gives little information about what attractor a particular initial state will evolve to.

So this means we have to look at specific configurations of cells in the initial conditions. As an example, start from the initial condition

which evolves to:

Now we can ask what happens if we look at a sequence of slightly different initial conditions. And here we show in black and white initial conditions that still evolve to the original “attractor” state, and in pink ones that evolve to some different state:

What’s actually going on inside here? Here are a few examples, highlighting cells whose values change as a result of changing the initial condition:

As is typical in machine learning, there doesn’t seem to be any simple characterization of the form of the basin of attraction. But now we have a sense of what the reason for this is: it’s another consequence of computational irreducibility. Computational irreducibility gives us the effective randomness that allows us to find useful results by adaptive evolution, but it also leads to changes having what seem like random and unpredictable effects. (It’s worth noting, by the way, that we could probably dramatically improve the robustness of our attractor basins by specifically including in our training data examples that have “noise” injected.)

Multiway Mutation Graphs

In doing machine learning in practice, the goal is typically to find some collection of weights, etc. that successfully solve a particular problem. But in general there will be many such collections of weights, etc. With typical continuous weights and random training steps it’s very difficult to see what the whole “ensemble” of possibilities is. But in our discrete rule array systems, this becomes more feasible.

Consider a tiny 2×2 rule array with two possible rules. We can make a graph whose edges represent all possible “point mutations” that can occur in this rule array:
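A minimal sketch of how such a mutation graph might be constructed, with r1 and r2 standing for the two possible rules (the names are illustrative):

(* all 2×2 rule arrays over two possible rules *)
allArrays = Partition[#, 2] & /@ Tuples[{r1, r2}, 4];

(* connect any two arrays that differ in exactly one position *)
mutationGraph = Graph[UndirectedEdge @@@ Select[Subsets[allArrays, {2}],
   Count[MapThread[UnsameQ, {Flatten[First[#]], Flatten[Last[#]]}], True] == 1 &]]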

In our adaptive evolution process, we’re always moving around a graph like this. But typically most “moves” will end up in states that are rejected because they increase whatever loss we’ve defined.

Consider the problem of generating an And+Xor rule array in which we end with lifetime-4 patterns. Defining the loss as how far we are from this lifetime, we can draw a graph that shows all possible adaptive evolution paths that always progressively decrease the loss:

The result is a multiway graph of the type we’ve now seen in a great many kinds of situations—notably our recent study of biological evolution.

And although this particular example is quite trivial, the idea in general is that different parts of such a graph represent “different strategies” for solving a problem. And—in direct analogy to our Physics Project and our studies of things like game graphs—one can imagine such strategies being laid out in a “branchial space” defined by common ancestry of configurations in the multiway graph.

And one can expect that while in some cases the branchial graph will be fairly uniform, in other cases it will have quite separated pieces—that represent fundamentally different strategies. Of course, the fact that underlying strategies may be different doesn’t mean that the overall behavior or performance of the system will be noticeably different. And indeed one expects that in most cases computational irreducibility will lead to enough effective randomness that there’ll be no discernable difference.

But in any case, here’s an example starting with a rule array that contains both And and Xor—where we observe distinct branches of adaptive evolution that lead to different solutions to the problem of finding a configuration with a lifetime of exactly 4:

Optimizing the Learning Process

How should one actually do the learning in machine learning? In practical work with traditional neural nets, learning is normally done using systematic algorithmic methods like backpropagation. But so far, all we’ve done here is something much simpler: we’ve “learned” by successively making random point mutations, and keeping only ones that don’t lead us further from our goal. And, yes, it’s interesting that such a procedure can work at all—and (as we’ve discussed elsewhere) this is presumably very relevant to understanding phenomena like biological evolution. But, as we’ll see, there are more efficient (and probably much more efficient) methods of doing machine learning, even for the kinds of discrete systems we’re studying.
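Schematically, the mutate-and-keep procedure we have been using is just a few lines of code. Here is a minimal sketch (not the exact code behind the pictures here), assuming a user-supplied loss function and a rule array flattened into a list of 0/1 choices:

```wolfram
(* Adaptive evolution by single-point mutations: flip one randomly chosen element,
   and keep the change whenever it doesn't make the loss worse *)
adaptiveEvolve[initial_List, loss_, steps_Integer] :=
 Module[{rules = initial, cand, i},
  Do[
   i = RandomInteger[{1, Length[rules]}];
   cand = ReplacePart[rules, i -> 1 - rules[[i]]];
   If[loss[cand] <= loss[rules], rules = cand],
   {steps}];
  rules]
```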

Let’s start by looking again at our earlier example of finding an And+Xor rule array that gives a “lifetime” of exactly 30. At each step in our adaptive (“learning”) process we make a single-point mutation (changing a single rule in the rule array), keeping the mutation if it doesn’t take us further from our goal. The mutations gradually accumulate—every so often reaching a rule array that gives a lifetime closer to 30. Just as above, here’s a plot of the lifetime achieved by successive mutations—with the “internal” red dots corresponding to rejected mutations:

We see a series of “plateaus” at which mutations are accumulating but not changing the overall lifetime. And between these we see occasional “breakthroughs” where the lifetime jumps. Here are the actual rule array configurations for these breakthroughs, with mutations since the last breakthrough highlighted:

But in the end the process here is quite wasteful; in this example, we make a total of 1705 mutations, but only 780 of them actually contribute to generating the final rule array; all the others are discarded along the way.

So how can we do better? One strategy is to try to figure out at each step which mutation is “most likely to make a difference”. And one way to do this is to try every possible mutation in turn at every step (as in multiway evolution)—and see what effect each of them has on the ultimate lifetime. From this we can construct a “change map” in which we give the change of lifetime associated with a mutation at every particular cell. The results will be different for every configuration of rule array, i.e. at every step in the adaptive evolution. But for example here’s what they are for the particular “breakthrough” configurations shown above (elements in regions that are colored gray won’t affect the result if they are changed; ones colored red will have a positive effect (with more intense red being more positive), and ones colored blue a negative one):

Let’s say we start from a random rule array, then repeatedly construct the change map and apply the mutation that it implies gives the most positive change—in effect at each step following the “path of steepest descent” to get to the lifetime we want (i.e. reduce the loss). Then the sequence of “breakthrough” configurations we get is:

And this in effect corresponds to a slightly more direct “path to a solution” than our sequence of pure single-point mutations.
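As a sketch of what computing a change map and taking a “steepest descent” step involves (again illustrative rather than the exact code used here, and assuming a user-supplied lifetime function on flattened 0/1 rule arrays):

```wolfram
(* The change map: the effect on the lifetime of flipping each element in turn *)
changeMap[rules_List, lifetime_] :=
 Table[lifetime[ReplacePart[rules, i -> 1 - rules[[i]]]] - lifetime[rules],
  {i, Length[rules]}]

(* One steepest-descent step: apply the single mutation with the largest positive effect *)
steepestStep[rules_List, lifetime_] :=
 Module[{cm = changeMap[rules, lifetime], i},
  i = First[Ordering[cm, -1]];
  If[cm[[i]] > 0, ReplacePart[rules, i -> 1 - rules[[i]]], rules]]
```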

By the way, the particular problem of reaching a certain lifetime has a simple enough structure that this “steepest descent” method—when started from a simple uniform rule array—finds a very “mechanical” (if slow) path to a solution:

What about the problem of learning f[x] = ? Once again we can make a change map based on the loss we define. Here are the results for a sequence of “breakthrough” configurations. The gray regions are ones where changes will be “neutral”, so that there’s still exploration that can be done without affecting the loss. The red regions are ones that are in effect “locked in” and where any changes would be deleterious in terms of loss:

So what happens in this case if we follow the “path of steepest descent”, always making the change that would be best according to the change map? Well, the results are actually quite unsatisfactory. From almost any initial condition the system quickly gets stuck, and never finds any satisfactory solution. In effect it seems that deterministically following the path of steepest descent leads us to a “local minimum” from which we cannot escape. So what are we missing in just looking at the change map? Well, the change map as we’ve constructed it has the limitation that it’s separately assessing the effect of each possible individual mutation. It doesn’t deal with multiple mutations at a time—which could well be needed in general if one’s going to find the “fastest path to success”, and avoid getting stuck.

But even in constructing the change map there’s already a problem. Because at least the direct way of computing it scales quite poorly. In an n×n rule array we have to check the effect of flipping about n² values, and for each one we have to run the whole system—taking altogether about n⁴ operations. And one has to do this separately for each step in the learning process.

So how do traditional neural nets avoid this kind of inefficiency? The answer in a sense involves a mathematical trick. And at least as it’s usually presented it’s all based on the continuous nature of the weights and values in neural nets—which allow us to use methods from calculus.

Let’s say we have a neural net like this

that computes some particular function f[x]:

We can ask how this function changes as we change each of the weights in the network:

And in effect this gives us something like our “change map” above. But there’s an important difference. Because the weights are continuous, we can think about infinitesimal changes to them. And then we can ask questions like “How does f[x] change when we make an infinitesimal change to a particular weight wᵢ?”—or equivalently, “What is the partial derivative of f with respect to wᵢ at the point x?” But now we get to use a key feature of infinitesimal changes: that they can always be thought of as just “adding linearly” (essentially because ε² can always be ignored compared to ε). Or, in other words, we can summarize any infinitesimal change just by giving its “direction” in weight space, i.e. a vector that says how much of each weight should be (infinitesimally) changed. So if we want to change f[x] (infinitesimally) as quickly as possible, we should go in the direction of steepest descent defined by all the derivatives of f with respect to the weights.

In machine learning, we’re typically trying in effect to set the weights so that the form of f[x] we generate successfully minimizes whatever loss we’ve defined. And we do this by incrementally “moving in weight space”—at every step computing the direction of steepest descent to know where to go next. (In practice, there are all sorts of tricks like “ADAM” that try to optimize the way to do this.)
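Stated explicitly (as the standard formula, with learning rate η), each such step updates every weight according to the gradient of the loss L:

$$w_i \leftarrow w_i - \eta\,\frac{\partial L}{\partial w_i}$$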

But how do we efficiently compute the partial derivative of f with respect to each of the weights? Yes, we could do the analog of generating pictures like the ones above, separately for each of the weights. But it turns out that a standard result from calculus gives us a vastly more efficient procedure that in effect “maximally reuses” parts of the computation that have already been done.

It all starts with the textbook chain rule for the derivative of nested (i.e. composed) functions:
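In conventional notation, this standard formula reads:

$$\frac{d}{dx}\,d(c(b(a(x)))) = d'(c(b(a(x))))\; c'(b(a(x)))\; b'(a(x))\; a'(x)$$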

This basically says that the (infinitesimal) change in the value of the “whole chain” d[c[b[a[x]]]] can be computed as a product of (infinitesimal) changes associated with each of the “links” in the chain. But the key observation is then that when we get to the computation of the change at a certain point in the chain, we’ve already had to do a lot of the computation we need—and so long as we stored those results, we always have only an incremental computation to perform.

So how does this apply to neural nets? Well, each layer in a neural net is in effect doing a function composition. So, for example, our d[c[b[a[x]]]] is like a trivial neural net:

But what about the weights, which, after all, are what we are trying to find the effect of changing? Well, we could include them explicitly in the function we’re computing:

And then we could in principle symbolically compute the derivatives with respect to these weights:

For our network above

the corresponding expression (ignoring biases) is

where ϕ denotes our activation function. Once again we’re dealing with nested functions, and once again—though it’s a bit more intricate in this case—the computation of derivatives can be done by incrementally evaluating terms in the chain rule and in effect using the standard neural net method of “backpropagation”.

So what about the discrete case? Are there similar methods we can use there? We won’t discuss this in detail here, but we’ll give some indications of what’s likely to be involved.

As a potentially simpler case, let’s consider ordinary cellular automata. The analog of our change map asks how the value of a particular “output” cell is affected by changes in other cells—or in effect what the “partial derivative” of the output value is with respect to changes in values of other cells.

For example, consider the highlighted “output” cell in this cellular automaton evolution:

Now we can look at each cell in this array, and make a change map based on seeing whether flipping the value of just that cell (and then running the cellular automaton forwards from that point) would change the value of the output cell:

The form of the change map is different if we look at different “output cells”:

Here, by the way, are some larger change maps for this and a couple of other cellular automaton rules:

But is there a way to construct such change maps incrementally? One might have thought there would be, at least for cellular automata that (unlike the cases here) are fundamentally reversible. But actually such reversibility doesn’t seem to help much—because although it allows us to “backtrack” whole states of the cellular automaton, it doesn’t allow us to trace the separate effects of individual cells.

So how about using discrete analogs of derivatives and the chain rule? Let’s for example call the function computed by one step in rule 30 cellular automaton evolution w[x, y, z]. We can think of the “partial derivative” of this function with respect to x at the point x as representing whether the output of w changes when x is flipped starting from the value given:

(Note that “no change” is indicated as False, while a change is indicated as True. And, yes, one can either explicitly compute the rule outcomes here, and then deduce from them the functional form, or one can use symbolic rules to directly deduce the functional form.)
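A minimal way to set up such a “derivative” (a sketch, with rule 30 written out explicitly as Xor[x, Or[y, z]]) is just to ask whether flipping a given input flips the output:

```wolfram
rule30[x_, y_, z_] := Xor[x, Or[y, z]]

(* Boolean "partial derivative" with respect to input i: does flipping that input flip the output? *)
booleanD[f_, args_List, i_Integer] := Xor[f @@ args, f @@ MapAt[Not, args, i]]

(* With respect to x this is True for every input, since flipping x always flips rule 30's output *)
Table[booleanD[rule30, {x, y, z}, 1],
 {x, {True, False}}, {y, {True, False}}, {z, {True, False}}]
```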

One can compute a discrete analog of a derivative for any Boolean function. For example, we have

and

which we can write as:

We also have:

And here is a table of “Boolean derivatives” for all 2-input Boolean functions:

And indeed there’s a whole “Boolean calculus” one can set up for these kinds of derivatives. And in particular, there’s a direct analog of the chain rule:

where Xnor[x,y] is effectively the equality test x == y:

But, OK, how do we use this to create our change maps? In our simple cellular automaton case, we can think of our change map as representing how a change in an output cell “propagates back” to previous cells. But if we just try to apply our discrete calculus rules we run into a problem: different “chain rule chains” can imply different changes in the value of the same cell. In the continuous case this path dependence doesn’t happen because of the way infinitesimals work. But in the discrete case it does. And ultimately we’re doing a kind of backtracking that can really be represented faithfully only as a multiway system. (Though if we just want probabilities, for example, we can consider averaging over branches of the multiway system—and the change maps we showed above are effectively the result of thresholding over the multiway system.)

But despite the appearance of such difficulties in the “simple” cellular automaton case, such methods typically seem to work better in our original, more complicated rule array case. There’s a bunch of subtlety associated with the fact that we’re finding derivatives not only with respect to the values in the rule array, but also with respect to the choice of rules (which are the analog of weights in the continuous case).

Let’s consider the And+Xor rule array:

Our loss is the number of cells whose values disagree with the row shown at the bottom. Now we can construct a change map for this rule array both in a direct “forward” way, and “backwards” using our discrete derivative methods (where we effectively resolve the small amount of “multiway behavior” by always picking “majority” values):

The results are similar, though in this case not exactly the same. Here are a few other examples:

And, yes, in detail there are essentially always local differences between the results from the forward and backward methods. But the backward method—like in the case of backpropagation in ordinary neural nets—can be implemented much more efficiently. And for purposes of practical machine learning it’s actually likely to be perfectly satisfactory—especially given that the forward method is itself only providing an approximation to the question of which mutations are best to do.

And as an example, here are the results of the forward and backward methods for the problem of learning the function f[x] = , for the “breakthrough” configurations that we showed above:

What Can Be Learned?

We’ve now shown quite a few examples of machine learning in action. But a fundamental question we haven’t yet addressed is what kind of thing can actually be learned by machine learning. And even before we get to this, there’s another question: given a particular underlying type of system, what kinds of functions can it even represent?

As a first example consider a minimal neural net of the form (essentially a single-layer perceptron):

With ReLU (AKA Ramp) as the activation function and the first set of weights all taken to be 1, the function computed by such a neural net has the form:

With enough weights and biases this form can represent any piecewise linear function—essentially just by moving around ramps using biases, and scaling them using weights. So for example consider the function:

This is the function computed by the neural net above—and here’s how it’s built up by adding in successive ramps associated with the individual intermediate nodes (neurons):

(It’s similarly possible to get all smooth functions from activation functions like ELU, etc.)
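As a small illustration of the sum-of-ramps construction (with made-up weights and biases, not the ones in the network shown above):

```wolfram
(* A piecewise-linear function built as a weighted sum of shifted ramps, the general
   form computed by a one-hidden-layer ReLU network *)
f[x_] := Total[{1, -2, 1.5} Ramp[x - {0, 1, 2}]]
Plot[f[x], {x, -1, 4}]
```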

Things get slightly more complicated if we try to represent functions with more than one argument. With a single intermediate layer we can only get “piecewise (hyper)planar” functions (i.e. functions that change direction only at linear “fault lines”):

But already with a total of two intermediate layers—and sufficiently many nodes in each of these layers—we can generate any piecewise linear function of any number of arguments.

If we limit the number of nodes, then roughly we limit the number of boundaries between different linear regions in the values of the functions. But as we increase the number of layers with a given number of nodes, we basically increase the number of sides that polygonal regions within the function values can have:

So what happens with the mesh nets that we discussed earlier? Here are a few random examples, showing results very similar to shallow, fully connected networks with a comparable total number of nodes:

OK, so how about our fully discrete rule arrays? What functions can they represent? We already saw part of the answer earlier when we generated rule arrays to represent various Boolean functions. It turns out that there is a fairly efficient procedure based on Boolean satisfiability for explicitly finding rule arrays that can represent a given function—or determine that no rule array (say of a given size) can do this.

Using this procedure, we can find minimal And+Xor rule arrays that represent all (“even”) 3-input Boolean functions (i.e. r = 1 cellular automaton rules):

It’s always possible to specify any n-input Boolean function by an array of 2ⁿ bits, as in:
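For example (a concrete case of this), the 2³ = 8 bits specifying 3-input cellular automaton rule 30 can be read off from the rule number:

```wolfram
IntegerDigits[30, 2, 8]
(* {0, 0, 0, 1, 1, 1, 1, 0} *)
```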

But we see from the pictures above that when we “compile” Boolean functions into And+Xor rule arrays, they can take different numbers of bits (i.e. different numbers of elements in the rule array). (In effect, the “algorithmic information content” of a function varies with the “language” we’re using to represent it.) And, for example, in the n = 3 case shown here, the distribution of minimal rule array sizes is:

There are some functions that are difficult to represent as And+Xor rule arrays (and seem to require 15 rule elements)—and others that are easier. And this is similar to what happens if we represent Boolean functions as Boolean expressions (say in conjunctive normal form) and count the total number of (unary and binary) operations used:

OK, so we know that there is in principle an And+Xor rule array that will compute any (even) Boolean function. But now we can ask whether an adaptive evolution process can actually find such a rule array—say with a sequence of single-point mutations. Well, if we do such adaptive evolution—with a loss that counts the number of “wrong outputs” for, say, rule 254—then here’s a sequence of successive breakthrough configurations that can be produced:

The results aren’t as compact as the minimal solution above. But it seems to always be possible to find at least some And+Xor rule array that “solves the problem” just by using adaptive evolution with single-point mutations.

Here are results for some other Boolean functions:

And so, yes, not only are all (even) Boolean functions representable in terms of And+Xor rule arrays, they’re also learnable in this form, just by adaptive evolution with single-point mutations.

In what we did above, we were looking at how machine learning works with our rule arrays in specific cases like for the function. But now we’ve got a case where we can explicitly enumerate all possible functions, at least of a given class. And in a sense what we’re seeing is evidence that machine learning tends to be very broad—and capable at least in principle of learning pretty much any function.

Of course, there can be specific restrictions. Like the And+Xor rule arrays we’re using here can’t represent (“odd”) functions where . (The Nand+First rule arrays we discussed above nevertheless can.) But in general it seems to be a reflection of the Principle of Computational Equivalence that pretty much any setup is capable of representing any function—and also adaptively “learning” it.

By the way, it’s a lot easier to discuss questions about representing or learning “any function” when one’s dealing with discrete (countable) functions—because one can expect to either be able to “exactly get” a given function, or not. But for continuous functions, it’s more complicated, because one’s pretty much inevitably dealing with approximations (unless one can use symbolic forms, which are basically discrete). So, for example, while we can say (as we did above) that (ReLU) neural nets can represent any piecewise-linear function, in general we’ll only be able to imagine successively approaching an arbitrary function, much like when you progressively add more terms in a simple Fourier series:
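For instance (a standard illustration, not tied to the specific function shown here), successive partial sums of the Fourier series of a square wave approach it only gradually:

```wolfram
partialSum[n_, x_] := (4/Pi) Sum[Sin[(2 k - 1) x]/(2 k - 1), {k, n}]
Plot[Evaluate[Table[partialSum[n, x], {n, {1, 3, 10}}]], {x, 0, 2 Pi}]
```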

Looking back at our results for discrete rule arrays, one notable observation is that while we can successfully reproduce all these different Boolean functions, the actual rule array configurations that achieve this tend to look quite messy. And indeed it’s much the same as we’ve seen throughout: machine learning can find solutions, but they’re not “structured solutions”; they’re in effect just solutions that “happen to work”.

Are there more structured ways of representing Boolean functions with rule arrays? Here are the two possible minimum-size And+Xor rule arrays that represent rule 30:

At the next-larger size there are more possibilities for rule 30:

And there are also rule arrays that can represent rule 110:

But in none of these cases is there obvious structure that allows us to immediately see how these computations work, or what function is being computed. But what if we try to explicitly construct—effectively by standard engineering methods—a rule array that computes a particular function? We can start by taking something like the function for rule 30 and writing it in terms of And and Xor (i.e. in ANF, or “algebraic normal form”):

We can imagine implementing this using an “evaluation graph”:

But now it’s easy to turn this into a rule array (and, yes, we haven’t gone all the way and arranged to copy inputs, etc.):

“Evaluating” this rule array for different inputs, we can see that it indeed gives rule 30:

Doing the same thing for rule 110, the And+Xor expression is

the evaluation graph is

and the rule array is:

And at least with the evaluation graph as a guide, we can readily “see what’s happening” here. But the rule array we’re using is considerably larger than our minimal solutions above—or even than the solutions we found by adaptive evolution.

It’s a typical situation that one sees in many other kinds of systems (like for example sorting networks): it’s possible to have a “constructed solution” that has clear structure and regularity and is “understandable”. But minimal solutions—or ones found by adaptive evolution—tend to be much smaller; they almost always look in many ways random, and aren’t readily understandable or interpretable.

So far, we’ve been looking at rule arrays that compute specific functions. But in getting a sense of what rule arrays can do, we can consider rule arrays that are “programmable”, in that their input specifies what function they should compute. So here, for example, is an And+Xor rule array—found by adaptive evolution—that takes the “bit pattern” of any (even) Boolean function as input on the left, then applies that Boolean function to the inputs on the right:

And with this same rule array we can now compute any possible (even) Boolean function. So here, for example, it’s evaluating Or:

Other Kinds of Models and Setups

Our general goal here has been to set up models that capture the most essential features of neural nets and machine learning—but that are simple enough in their structure that we can readily “look inside” and get a sense of what they are doing. Mostly we’ve concentrated on rule arrays as a way to provide a minimal analog of standard “perceptron-style” feed-forward neural nets. But what about other architectures and setups?

In effect, our rule arrays are “spacetime-inhomogeneous” generalizations of cellular automata—in which adaptive evolution determines which rule (say from a finite set) should be used at every (spatial) position and every (time) step. A different idealization (that in fact we already used in one section above) is to have an ordinary homogeneous cellular automaton—but with a single “global rule” determined by adaptive evolution. Rule arrays are the analog of feed-forward networks in which a given rule in the rule array is in effect used only once as data “flows through” the system. Ordinary homogeneous cellular automata are like recurrent networks in which a single stream of data is in effect subjected over and over again to the same rule.

There are various interpolations between these cases. For example, we can imagine a “layered rule array” in which the rules at different steps can be different, but those on a given step are all the same. Such a system can be viewed as an idealization of a convolutional neural net in which a given layer applies the same kernel to elements at all positions, but different layers can apply different kernels.

A layered rule array can’t encode as much information as a general rule array. But it’s still able to show machine-learning-style phenomena. And here, for example, is adaptive evolution for a layered And+Xor rule array progressively solving the problem of generating a pattern that lives for exactly 30 steps:

One could also imagine “vertically layered” rule arrays, in which different rules are used at different positions, but any given position keeps running the same rule forever. However, at least for the kinds of problems we’ve considered here, it doesn’t seem sufficient to just be able to pick the positions at which different rules are run. One seems to either need to change rules at different (time) steps, or one needs to be able to adaptively evolve the underlying rules themselves.

Rule arrays and ordinary cellular automata share the feature that the value of each cell depends only on the values of neighboring cells on the step before. But in neural nets it’s standard for the value at a given node to depend on the values of lots of nodes on the layer before. And what makes this straightforward in neural nets is that (weighted, and perhaps otherwise transformed) values from previous nodes are taken to be combined just by simple numerical addition—and addition (being n-ary and associative) can take any number of “inputs”. In a cellular automaton (or Boolean function), however, there’s always a definite number of inputs, determined by the structure of the function. In the most straightforward case, the inputs come only from nearest-neighboring cells. But there’s no requirement that this is how things need to work—and for example we can pick any “local template” to bring in the inputs for our function. This template could either be the same at every position and every step, or it could be picked from a certain set differently at different positions—in effect giving us “template arrays” as well as rule arrays.

So what about having a fully connected network, as we did in our very first neural net examples above? To set up a discrete analog of this we first need some kind of discrete n-ary associative “accumulator” function to fill the place of numerical addition. And for this we could pick a function like And, Or, Xor—or Majority. And if we’re not just going to end up with the same value at each node on a given layer, we need to set up some analog of a weight associated with each connection—which we can achieve by applying either Identity or Not (i.e. flip or not) to the value flowing through each connection.
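One layer of such a network is then easy to write down explicitly. Here is a minimal sketch (with my own encoding conventions: 0 for a copying connection, 1 for a flipping one):

```wolfram
(* Each node outputs 1 iff a majority of its incoming (possibly flipped) bits are 1;
   conns[[j]] lists the flip/copy choices on the connections feeding node j *)
node[inputs_List, flips_List] := Boole[Total[BitXor[inputs, flips]] > Length[inputs]/2]
layer[inputs_List, conns_List] := node[inputs, #] & /@ conns

layer[{1, 0, 1}, {{0, 0, 0}, {1, 1, 0}}]
(* {1, 1} *)
```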

Here’s an example of a network of this type, trained to compute the function we discussed above:

There are just two kinds of connections here: ones that flip the value passing through them, and ones that leave it unchanged. And at each node we’re computing the majority function—giving value 1 if the majority of its inputs are 1, and 0 otherwise. With the “one-hot encoding” of input and output that we used before, here are a few examples of how this network evaluates our function:

This was trained just using 1000 steps of single-point mutation applied to the connection types. The loss systematically goes down—but the configuration of the connection types continues to look quite random even as it achieves zero loss (i.e. even after the function has been completely learned):

In what we’ve just done we assume that all connections continue to be present, though their types (or effectively signs) can change. But we can also consider a network where connections can end up being zeroed out during training—so that they are effectively no longer present.

Much of what we’ve done here with machine learning has centered around trying to learn transformations of the form x → f[x]. But another typical application of machine learning is autoencoding—or in effect learning how to compress data representing a certain set of examples. And once again it’s possible to do such a task using rule arrays, with learning achieved by a series of single-point mutations.

As a starting point, consider training a rule array (of cellular automaton rules 4 and 146) to reproduce unchanged a block of black cells of any width. One might have thought this would be trivial. But it’s not, because in effect the initial data inevitably gets “ground up” inside the rule array, and has to be reconstituted at the end. But, yes, it’s nevertheless possible to train a rule array to at least roughly do this—even though once again the rule arrays we find that manage to do this look quite random:

But to set up a nontrivial autoencoder let’s imagine that we progressively “squeeze” the array in the middle, creating an increasingly narrow “bottleneck” through which the data has to flow. At the bottleneck we effectively have a compressed version of the original data. And we find that at least down to some width of bottleneck, it’s possible to create rule arrays that—with reasonable probability—can act as successful autoencoders of the original data:

The success of LLMs has highlighted the use of machine learning for sequence continuation—and the effectiveness of transformers for this. But just as with other neural nets, the forms of transformers that are used in practice are typically very complicated. But can one find a minimal model that nevertheless captures the “essence of transformers”?

Let’s say that we have a sequence that we want to continue, like:

We want to encode each possible value by a vector, as in

so that, for example, our original sequence is encoded as:

Then we have a “head” that reads a block of consecutive vectors, picking off certain values and feeding pairs of them into And and Xor functions, to get a vector of Boolean values:

Ultimately this head is going to “slide” along our sequence, “predicting” what the next element in the sequence will be. But somehow we have to go from our vector of Boolean values to (probabilities of) sequence elements. Potentially we might be able to do this just with a rule array. But for our purposes here we’ll use a fully connected single-layer Identity+Not network in which at each output node we just find the sum of the number of values that come to it—and treat this as determining (through a softmax) the probability of the corresponding element:

In this case, the element with the maximum value is 5, so at “zero temperature” this would be our “best prediction” for the next element.

To train this whole system we just make a sequence of random point mutations to everything, keeping mutations that don’t increase the loss (where the loss is basically the difference between predicted next values and actual next values, or, more precisely, the “categorical cross-entropy”). Here’s how this loss progresses in a typical such training:
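For reference, the two standard formulas involved here are the softmax that turns the accumulated values vᵢ into probabilities, and the categorical cross-entropy loss against the one-hot “actual next element” yᵢ:

$$p_i = \frac{e^{v_i}}{\sum_j e^{v_j}}, \qquad L = -\sum_i y_i \log p_i$$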

At the end of this training, here are the components of our minimal transformer:

First come the encodings of the different possible elements in the sequence. Then there’s the head, here shown applied to the encoding of the first elements of the original sequence. Finally there’s a single-layer discrete network that takes the output from the head, and deduces relative probabilities for different elements to come next. In this case the highest-probability prediction for the next element is that it should be element 6.

To do the analog of an LLM we start from some initial “prompt”, i.e. an initial sequence that fits within the width (“context window”) of the head. Then we progressively apply our minimal transformer, for example at each step taking the next element to be the one with the highest predicted probability (i.e. operating “at zero temperature”). With this setup the collection of “prediction strengths” is shown in gray, with the “best prediction” shown in red:

Running this even far beyond our original training data, we see that we get a “prediction” of a continued sine wave:

As we might expect, the fact that our minimal transformer can make such a plausible prediction relies on the simplicity of our sine curve. If we use “more complicated” training data, such as the “mathematically defined” () blue curve in

the result of training and running a minimal transformer is now:

And, not surprisingly, it can’t “figure out the computation” to correctly continue the curve. By the way, different training runs will involve different sequences of mutations, and will yield different predictions (often with periodic “hallucinations”):

In looking at “perceptron-style” neural nets we wound up using rule arrays (or, in effect, spacetime-inhomogeneous cellular automata) as our minimal models. Here we’ve ended up with a slightly more complicated minimal model for transformer neural nets. But if we were to simplify it further, we would end up not with something like a cellular automaton but instead with something like a tag system, in which one has a sequence of elements, and at each step removes a block from the beginning, and, depending on its form, adds a certain block at the end, as in:
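A minimal tag-system-like step is easy to write down. Here is a sketch with made-up rules (remove two elements from the front; append a block determined by the first of them):

```wolfram
tagStep[{a_, b_, rest___}] := Join[{rest}, If[a == 0, {1, 0}, {0, 1, 1}]]
NestList[tagStep, {1, 0, 1, 1, 0}, 6]
```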

And, yes, such systems can generate extremely complex behavior, reinforcing the idea (that we have repeatedly seen here) that machine learning works by selecting complexity that aligns with goals that have been set.

And along these lines, one can consider all sorts of different computational systems as foundations for machine learning. Here we’ve been looking at cellular-automaton-like and tag-system-like examples. But for example our Physics Project has shown us the power and flexibility of systems based on hypergraph rewriting. And from what we’ve seen here, it seems very plausible that something like hypergraph rewriting can serve as a yet more powerful and flexible substrate for machine learning.

So in the End, What’s Really Going On in Machine Learning?

There are, I think, several quite striking conclusions from what we’ve been able to do here. The first is just that models much simpler than traditional neural nets seem capable of capturing the essential features of machine learning—and indeed these models may well be the basis for a new generation of practical machine learning.

But from a scientific point of view, one of the things that’s important about these models is that they are simple enough in structure that it’s immediately possible to produce visualizations of what they’re doing inside. And studying these visualizations, the most immediately striking feature is how complicated they look.

It could have been that machine learning would somehow “crack systems”, and find simple representations for what they do. But that doesn’t seem to be what’s going on at all. Instead what seems to be happening is that machine learning is in a sense just “hitching a ride” on the general richness of the computational universe. It’s not “specifically building up behavior one needs”; rather what it’s doing is to harness behavior that’s “already out there” in the computational universe.

The fact that this could possibly work relies on the crucial—and at first unexpected—fact that in the computational universe even very simple programs can ubiquitously produce all sorts of complex behavior. And the point then is that this behavior has enough richness and diversity that it’s possible to find instances of it that align with machine learning objectives one’s defined. In some sense what machine learning is doing is to “mine” the computational universe for programs that do what one wants.

It’s not that machine learning nails a specific precise program. Rather, it’s that in typical successful applications of machine learning there are lots of programs that “do more or less the right thing”. If what one’s trying to do involves something computationally irreducible, machine learning won’t typically be able to “get well enough aligned” to correctly “get through all the steps” of the irreducible computation. But it seems that many “human-like tasks” that are the particular focus of modern machine learning can successfully be done.

And by the way, one can expect that with the minimal models explored here, it becomes more feasible to get a real characterization of what kinds of objectives can successfully be achieved by machine learning, and what cannot. Critical to the operation of machine learning is not only that there exist programs that can do particular kinds of things, but also that they can realistically be found by adaptive evolution processes.

In what we’ve done here we’ve often used what’s essentially the very simplest possible process for adaptive evolution: a sequence of point mutations. And what we’ve discovered is that even this is usually sufficient to lead us to satisfactory machine learning solutions. It could be that our paths of adaptive evolution would always be getting stuck—and not reaching any solution. But the fact that this doesn’t happen seems crucially connected to the computational irreducibility that’s ubiquitous in the systems we’re studying, and that leads to effective randomness that with overwhelming probability will “give us a way out” of anywhere we got stuck.

In some sense computational irreducibility “levels the playing field” for different processes of adaptive evolution, and lets even simple ones be successful. Something similar seems to happen for the whole framework we’re using. Any of a wide class of systems seem capable of successful machine learning, even if they don’t have the detailed structure of traditional neural nets. We can see this as a typical reflection of the Principle of Computational Equivalence: that even though systems may differ in their details, they are ultimately all equivalent in the computations they can do.

The phenomenon of computational irreducibility leads to a fundamental tradeoff, of particular importance in thinking about things like AI. If we want to be able to know in advance—and broadly guarantee—what a system is going to do or be able to do, we have to set the system up to be computationally reducible. But if we want the system to be able to make the richest use of computation, it’ll inevitably be capable of computationally irreducible behavior. And it’s the same story with machine learning. If we want machine learning to be able to do the best it can, and perhaps give us the impression of “achieving magic”, then we have to allow it to show computational irreducibility. And if we want machine learning to be “understandable” it has to be computationally reducible, and not able to access the full power of computation.

At the outset, though, it’s not obvious whether machine learning actually has to access such power. It could be that there are computationally reducible ways to solve the kinds of problems we want to use machine learning to solve. But what we’ve discovered here is that even in solving very simple problems, the adaptive evolution process that’s at the heart of machine learning will end up sampling—and using—what we can expect to be computationally irreducible processes.

Like biological evolution, machine learning is fundamentally about finding things that work—without the constraint of “understandability” that’s forced on us when we as humans explicitly engineer things step by step. Could one imagine constraining machine learning to make things understandable? To do so would effectively prevent machine learning from having access to the power of computationally irreducible processes, and from the evidence here it seems unlikely that with this constraint the kind of successes we’ve seen in machine learning would be possible.

So what does this mean for the “science of machine learning”? One might have hoped that one would be able to “look inside” machine learning systems and get detailed narrative explanations for what’s going on; that in effect one would be able to “explain the mechanism” for everything. But what we’ve seen here suggests that in general nothing like this will work. All one will be able to say is that somewhere out there in the computational universe there’s some (typically computationally irreducible) process that “happens” to be aligned with what we want.

Yes, we can make general statements—strongly based on computational irreducibility—about things like the findability of such processes, say by adaptive evolution. But if we ask “How in detail does the system work?”, there won’t be much of an answer to that. Of course we can trace all its computational steps and see that it behaves in a certain way. But we can’t expect what amounts to a “global human-level explanation” of what it’s doing. Rather, we’ll basically just be reduced to looking at some computationally irreducible process and observing that it “happens to work”—and we won’t have a high-level explanation of “why”.

But there is one important loophole to all this. Within any computationally irreducible system, there are always inevitably pockets of computational reducibility. And—as I’ve discussed at length particularly in connection with our Physics Project—it’s these pockets of computational reducibility that allow computationally bounded observers like us to identify things like “laws of nature” from which we can build “human-level narratives”.

So what about machine learning? What pockets of computational reducibility show up there, from which we might build “human-level scientific laws”? Much as with the emergence of “simple continuum behavior” from computationally irreducible processes happening at the level of molecules in a gas or ultimate discrete elements of space, we can expect that at least certain computationally reducible features will be more obvious when one’s dealing with larger numbers of components. And indeed in sufficiently large machine learning systems, it’s routine to see smooth curves and apparent regularity when one’s looking at the kind of aggregated behavior that’s probed by things like training curves.

But the question about pockets of reducibility is always whether they end up being aligned with things we consider interesting or useful. Yes, it could be that machine learning systems would exhibit some kind of collective (“EEG-like”) behavior. But what’s not clear is whether this behavior will tell us anything about the actual “information processing” (or whatever) that’s going on in the system. And if there is to be a “science of machine learning” what we have to hope for is that we can find in machine learning systems pockets of computational reducibility that are aligned with things we can measure, and care about.

So given what we’ve been able to explore here about the foundations of machine learning, what can we say about the ultimate power of machine learning systems? A key observation has been that machine learning works by “piggybacking” on computational irreducibility—and in effect by finding “natural pieces of computational irreducibility” that happen to fit with the objectives one has. But what if those objectives involve computational irreducibility—as they often do when one’s dealing with a process that’s been successfully formalized in computational terms (as in math, exact science, computational X, etc.)? Well, it’s not enough that our machine learning system “uses some piece of computational irreducibility inside”. To achieve a particular computationally irreducible objective, the system would have to do something closely aligned with that actual, specific objective.

It has to be said, however, that by laying bare more of the essence of machine learning here, it becomes easier to at least define the issues of merging typical “formal computation” with machine learning. Traditionally there’s been a tradeoff between the computational power of a system and its trainability. And indeed in terms of what we’ve seen here this seems to reflect the sense that “larger chunks of computational irreducibility” are more difficult to fit into something one’s incrementally building up by a process of adaptive evolution.

So how should we ultimately think of machine learning? In effect its power comes from leveraging the “natural resource” of computational irreducibility. But when it uses computational irreducibility it does so by “foraging” pieces that happen to advance its objectives. Imagine one’s building a wall. One possibility is to fashion bricks of a particular shape that one knows will fit together. But another is just to look at stones one sees lying around, then to build the wall by fitting these together as best one can.

And if one then asks “Why does the wall have such-and-such a pattern?” the answer will end up being basically “Because that’s what one gets from the stones that happened to be lying around”. There’s no overarching theory to it in itself; it’s just a reflection of the resources that were out there. Or, in the case of machine learning, one can expect that what one sees will be to a large extent a reflection of the raw characteristics of computational irreducibility. In other words, the foundations of machine learning are as much as anything rooted in the science of ruliology. And it’s in large measure to that science we should look in our efforts to understand more about “what’s really going on” in machine learning, and quite possibly also in neuroscience.

Historical & Personal Notes

In some ways it seems like a quirk of intellectual history that the kinds of foundational questions I’ve been discussing here weren’t already addressed long ago—and in some ways it seems like an inexorable consequence of the only rather recent development of certain intuitions and tools.

The idea that the brain is fundamentally made of connected nerve cells was considered in the latter part of the nineteenth century, and took hold in the first decades of the twentieth century—with the formalized concept of a neural net that operates in a computational way emerging in full form in the work of Warren McCulloch and Walter Pitts in 1943. By the late 1950s there were hardware implementations of neural nets (typically for image processing) in the form of “perceptrons”. But despite early enthusiasm, practical results were mixed, and at the end of the 1960s it was announced that simple cases amenable to mathematical analysis had been “solved”—leading to a general belief that “neural nets couldn’t do anything interesting”.

Ever since the 1940s there had been a trickle of general analyses of neural nets, particularly using methods from physics. But typically these analyses ended up with things like continuum approximations—that could say little about the information-processing aspects of neural nets. Meanwhile, there was an ongoing undercurrent of belief that somehow neural networks would both explain and reproduce how the brain works—but no methods seemed to exist to say quite how. Then at the beginning of the 1980s there was a resurgence of interest in neural networks, coming from several directions. Some of what was done concentrated on very practical efforts to get neural nets to do particular “human-like” tasks. But some was more theoretical, typically using methods from statistical physics or dynamical systems.

Before long, however, the buzz died down, and for several decades only a few groups were left working with neural nets. Then in 2011 came a surprise breakthrough in using neural nets for image analysis. It was an important practical advance. But it was driven by technological ideas and development—not any significant new theoretical analysis or framework.

And this was also the pattern for almost all of what followed. People spent great effort to come up with neural net systems that worked—and all sorts of folklore grew up about how this should best be done. But there wasn’t really even an attempt at an underlying theory; this was a domain of engineering practice, not basic science.

And it was in this tradition that ChatGPT burst onto the scene in late 2022. Almost everything about LLMs seemed to be complicated. Yes, there were empirically some large-scale regularities (like scaling laws). And I quickly suspected that the success of LLMs was a strong hint of general regularities in human language that hadn’t been clearly identified before. But beyond a few outlier examples, almost nothing about “what’s going on inside LLMs” has seemed easy to decode. And efforts to put “strong guardrails” on the operation of the system—in effect so as to make it in some way “predictable” or “understandable”—typically seem to substantially decrease its power (a point that now makes sense in the context of computational irreducibility).

My own interaction with machine learning and neural nets began in 1980 when I was developing my SMP symbolic computation system, and wondering whether it might be possible to generalize the symbolic pattern-matching foundations of the system to some kind of “fuzzy pattern matching” that would be closer to human thinking. I was aware of neural nets but thought of them as semi-realistic models of brains, not for example as potential sources of algorithms of the kind I imagined might “solve” fuzzy matching.

And it was partly as a result of trying to understand the essence of systems like neural nets that in 1981 I came up with what I later learned could be thought of as one-dimensional cellular automata. Soon I was deeply involved in studying cellular automata and developing a new intuition about how complex behavior could arise even from simple rules. But when I learned about recent efforts to make idealized models of neural nets using ideas from statistical mechanics, I was at least curious enough to set up simulations to try to understand more about these models.

But what I did wasn’t a success. I could neither get the models to do anything of significant practical interest—nor did I manage to derive any good theoretical understanding of them. I kept wondering, though, what relationship there might be between cellular automata that “just run”, and systems like neural nets that can also “learn”. And in fact in 1985 I tried to make a minimal cellular-automaton-based model to explore this. It was what I’m now calling a “vertically layered rule array”. And while in many ways I was already asking the right questions, this was an unfortunate specific choice of system—and my experiments on it didn’t reveal the kinds of phenomena we’re now seeing.

Years went by. I wrote a section on “Human Thinking” in A New Kind of Science that discussed the possibility of simple foundational rules for the essence of thinking, and even included a minimal discrete analog of a neural net. At the time, though, I didn’t develop these ideas. By 2017, 15 years after the book was published—and knowing about the breakthroughs in deep learning—I had begun to think more concretely about neural nets as getting their power by sampling programs from across the computational universe. But still I didn’t see quite how this would work.

Meanwhile, there was a new intuition emerging from practical experience with machine learning: that if you “bashed” almost any system “hard enough”, it would learn. Did that mean that perhaps one didn’t need all the details of neural networks to successfully do machine learning? And could one perhaps make a system whose structure was simple enough that its operation would for example be accessible to visualization? I particularly wondered about this when I was writing an exposition of ChatGPT and LLMs in early 2023. And I kept talking about “LLM science”, but didn’t have much of a chance to work on it.

But then, a few months ago, as part of an effort to understand the relation between what science does and what AI does, I tried a kind of “throwaway experiment”—which, to my considerable surprise, seemed to successfully capture some of the essence of what makes biological evolution possible. But what about other adaptive evolution—and in particular, machine learning? The models that seemed to be needed were embarrassingly close to what I’d studied in 1985. But now I had a new intuition—and, thanks to Wolfram Language, vastly better tools. And the result has been my effort here.

Of course this is only a beginning. But I’m excited to be able to see what I consider to be the beginnings of foundational science around machine learning. Already there are clear directions for practical applications (which, needless to say, I plan to explore). And there are signs that perhaps we may finally be able to understand just why—and when—the “magic” of machine learning works.

Thanks

Thanks to Richard Assar of the Wolfram Institute for extensive help. Thanks also to Brad Klee, Tianyi Gu, Nik Murzin and Max Niederman for specific results, to George Morgan and others at Symbolica for their early interest, and to Kovas Boguta for suggesting many years ago to link machine learning to the ideas in A New Kind of Science.

Yet More New Ideas and New Functions: Launching Version 14.1 of Wolfram Language & Mathematica

2024-08-01 05:53:02

For the 36th Time… the Latest from Our R&D Pipeline

Today we celebrate the arrival of the 36th (x.x) version of the Wolfram Language and Mathematica: Version 14.1. We’ve been doing this since 1986: continually inventing new ideas and implementing them in our larger and larger tower of technology. And it’s always very satisfying to be able to deliver our latest achievements to the world.

We released Version 14.0 just half a year ago. And—following our modern version scheduling—we’re now releasing Version 14.1. For most technology companies a .1 release would contain only minor tweaks. But for us it’s a snapshot of what our whole R&D pipeline has delivered—and it’s full of significant new features and new enhancements.

If you’ve been following our livestreams, you may have already seen many of these features and enhancements being discussed as part of our open software design process. And we’re grateful as always to members of the Wolfram Language community who’ve made suggestions—and requests. And in fact Version 14.1 contains a particularly large number of long-requested features, some of which involved development that has taken many years and required many intermediate achievements.

There’s lots of both extension and polishing in Version 14.1. There are a total of 89 entirely new functions—more than in any other version for the past couple of years. And there are also 137 existing functions that have been substantially updated. Along with more than 1300 distinct bug fixes and specific improvements.

Some of what’s new in Version 14.1 relates to AI and LLMs. And, yes, we’re riding the leading edge of these kinds of capabilities. But the vast majority of what’s new has to do with our continued mission to bring computational language and computational knowledge to everything. And today that mission is even more important than ever, supporting not only human users, but also rapidly proliferating AI “users”—who are beginning to be able to routinely make even broader and deeper use of our technology than humans.

Each new version of Wolfram Language represents a large amount of R&D by our team, and the encapsulation of a surprisingly large number of ideas about what should be implemented, and how it should be implemented. So, today, here it is: the latest stage in our four-decade journey to bring the superpower of the computational paradigm to everything.

There’s Now a Unified Wolfram App

In the beginning we just had “Mathematica”—that we described as “A System for Doing Mathematics by Computer”. But the core of “Mathematica”—based on the very general concept of transformations for symbolic expressions—was always much broader than “mathematics”. And it didn’t take long before “mathematics” was an increasingly small part of the system we had built. We agonized for years about how to rebrand things to better reflect what the system had become. And eventually, just over a decade ago, we did the obvious thing, and named what we had “the Wolfram Language”.

But when it came to actual software products and executables, so many people were familiar with having a “Mathematica” icon on their desktop that we didn’t want to change that. Later we introduced Wolfram|One as a general product supporting Wolfram Language across desktop and cloud—with Wolfram Desktop being its desktop component. But, yes, it’s all been a bit confusing. Ultimately there’s just one “bag of bits” that implements the whole system we’ve built, even though there are different usage patterns, and differently named products that the system supports. Up to now, each of these different products has been a different executable, that’s separately downloaded.

But starting with Version 14.1 we’re unifying all these things—so that now there’s just a single unified Wolfram app, that can be configured and activated in different ways corresponding to different products.

So now you just go to wolfram.com/download-center and download the Wolfram app:

Wolfram app

After you’ve installed the app, you activate it as whatever product(s) you’ve got: Wolfram|One, Mathematica, Wolfram|Alpha Notebook Edition, etc. Why have separate products? Each one has a somewhat different usage pattern, and provides a somewhat different interface optimized for that usage pattern. But now the actual downloading of bits has been unified; you just have to download the unified Wolfram app and you’ll get what you need.

Vector Databases and Semantic Search

Let’s say you’ve got a million documents (or webpages, or images, or whatever) and you want to find the ones that are “closest” to something. Version 14.1 now has a function—SemanticSearch—for doing this. How does SemanticSearch work? Basically it uses machine learning methods to find “vectors” (i.e. lists) of numbers that somehow represent the “meaning” of each of your documents. Then when you want to know which documents are “closest” to something, SemanticSearch computes the vector for the something, and then sees which of the document vectors are closest to this vector.

In principle one could use Nearest to find closest vectors. And indeed this works just fine for small examples where one can readily store all the vectors in memory. But SemanticSearch uses a full industrial-strength approach based on the new vector database capabilities of Version 14.1—which can work with huge collections of vectors stored in external files.
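
As a point of comparison, here's the small-scale, in-memory version of the same idea, using only the long-standing Nearest function (nothing here is specific to the new vector database machinery):

  vecs = RandomReal[1, {1000, 64}];    (* 1000 feature vectors of length 64 *)
  query = RandomReal[1, 64];
  Nearest[vecs -> "Index", query, 3]   (* positions of the 3 closest vectors *)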

There are lots of ways to use both SemanticSearch and vector databases. You can use them to find documents, snippets within documents, images, sounds or anything else whose “meaning” can somehow be captured by a vector of numbers. Sometimes the point is to retrieve content directly for human consumption. But a particularly strong modern use case is to set up “retrieval-augmented generation” (RAG) for LLMs—in which relevant content found with a vector database is used to provide a “dynamic prompt” for the LLM. And indeed in Version 14.1—as we’ll discuss later—we now have LLMPromptGenerator to implement exactly this pipeline.

But let’s come back to SemanticSearch on its own. Its basic design is modeled after TextSearch, which does keyword-based searching of text. (Note, though, that SemanticSearch also works on many things other than text.)

In direct analogy to CreateSearchIndex for TextSearch, there’s now a CreateSemanticSearchIndex for SemanticSearch. Let’s do a tiny example to see how it works. Essentially we’re going to make an (extremely restricted) “inverse dictionary”. We set up a list of definition word elements:

Now create a semantic search index from this:

Behind the scenes this is a vector database. But we can access it with SemanticSearch:

And since “whale” is considered closest, it comes first.
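
Put together, the whole tiny example looks something like the following sketch (the rule form definition -> word is my assumption about how the content is associated with labels, and the illustrative definitions are made up):

  defs = {"a very large marine mammal" -> "whale",
          "a small domesticated feline" -> "cat",
          "a tall woody perennial plant" -> "tree"};
  index = CreateSemanticSearchIndex[defs];
  SemanticSearch[index, "an enormous animal that lives in the ocean"]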

What about a more realistic example? Instead of just using 3 words, let’s set up definitions for all words in the dictionary. It takes a little while (like a few minutes) to do the machine learning feature extraction for all the definitions. But in the end you get a new semantic search index:

This time it has 39,186 entries—but SemanticSearch picks out the (by default) 10 that it considers closest to what you asked for (and, yes, there’s an archaic definition of “seahorse” as “walrus”):

We can see a bit more detail about what’s going on by asking SemanticSearch to explicitly show us distances:

SemanticSearch distances

And plotting these we can see that “whale” is the winner by a decent margin:

One subtlety when dealing with semantic search indices is where to store them. When they’re sufficiently small, you can store them directly in memory, or in a notebook. But usually you’ll want to store them in a separate file, and if you want to share an index you’ll want to put this file in the cloud. You can do this either interactively from within a notebook

SemanticSearchIndex

or programmatically:

And now the SemanticSearchIndex object you have can be used by anyone, with its data being accessed in the cloud.

In most cases SemanticSearch will be what you need. But sometimes it’s worthwhile to “go underneath” and directly work with vector databases. Here’s a collection of small vectors:

We can use Nearest to find the nearest vector to one we give:

But we can also do this with a vector database. First we create the database:

And now we can search for the nearest vector to the one we give:

In this case we get exactly the same answer as from Nearest. But whereas the mission of Nearest is to give us the mathematically precise nearest vector, VectorDatabaseSearch is doing something less precise—but is able to do it for extremely large numbers of vectors that don’t need to be stored directly in memory.
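
Roughly, the workflow looks like this (I'm assuming here that CreateVectorDatabase accepts a plain list of numeric vectors, and that VectorDatabaseSearch takes the database and a query vector):

  vecs = RandomReal[1, {10000, 8}];           (* 10,000 vectors of length 8 *)
  db = CreateVectorDatabase[vecs];            (* stored outside main memory *)
  VectorDatabaseSearch[db, RandomReal[1, 8]]  (* approximate nearest entries *)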

Those vectors can come from anywhere. For example, here they’re coming from extracting features from some images:

Now let’s say we’ve got a specific image. Then we can search our vector database to get the image whose feature vector is closest to the one for the image we provided:

And, yes, this works for other kinds of objects too:

CreateSemanticSearchIndex and CreateVectorDatabase create vector databases from scratch using data you provide. But—just like with text search indices—an important feature of vector databases is that you can incrementally add to them. So, for example, UpdateSemanticSearchIndex and AddToVectorDatabase let you efficiently add individual entries or lists of entries to vector databases.

In addition to providing capabilities for building (and growing) your own vector databases, there are several pre-built vector databases that are now available in the Wolfram Data Repository:

Vector Databases

So now we can use a pre-built vector database of Wolfram Language function documentation to do a semantic search for snippets that are “semantically close” to being about iterating functions:

(In the next section, we’ll see how to actually “synthesize a report” based on this.)

The basic function of SemanticSearch is to determine what “chunks of content” are closest to what you are asking about. But given a semantic search index (AKA vector database) there are also other important things you can do. One of them is to use TextSummarize to ask not for specific chunks but rather for some kind of overall summary of what can be said about a given topic from the content in the semantic search index:

RAGs and Dynamic Prompting for LLMs

How does one tell an LLM what one wants it to do? Fundamentally, one provides a prompt, and then the LLM generates output that “continues” that prompt. Typically the last part of the prompt is the specific question (or whatever) that a user is asking. But before that, there’ll be “pre-prompts” that prime the LLM in various ways to determine how it should respond.

In Version 13.3 in mid-2023 (i.e. a long time ago in the world of LLMs!) we introduced LLMPrompt as a symbolic way to specify a prompt, and we launched the Wolfram Prompt Repository as a broad source for pre-built prompts. Here’s an example of using LLMPrompt with a prompt from the Wolfram Prompt Repository:

In its simplest form, LLMPrompt just adds fixed text to “pre-prompt” the LLM. LLMPrompt is also set up to take arguments that modify the text it’s adding:

But what if one wants the LLM to be pre-prompted in a way that depends on information that’s only available once the user actually asks their question (like, for example, the text of the question itself)? In Version 14.1 we’re adding LLMPromptGenerator to dynamically generate pre-prompts. And it turns out that this kind of “dynamic prompting” is remarkably powerful, and—particularly together with tool calling—opens up a whole new level of capabilities for LLMs.
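
Schematically, the setup looks something like this (the "PromptGenerators" key in the LLMEvaluator specification is my assumption about the option structure, and the generator function itself is just an illustration):

  gen = LLMPromptGenerator[
     Function[input, "The current date and time is " <> DateString[] <> "."]];
  LLMSynthesize["Roughly how many hours are left until midnight?",
     LLMEvaluator -> <|"PromptGenerators" -> {gen}|>]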

For example, we can set up a prompt generator that produces a pre-prompt that gives the registered name of the user, so the LLM can use this information when generating its answer:

Or for example here the prompt generator is producing a pre-prompt about sunrise, sunset and the current time:

And, yes, if the pre-prompt contains extra information (like about the Moon) the LLM will (probably) ignore it:

As another example, we can take whatever the user asks, and first do a web search on it, then include as a pre-prompt snippets we get from the web. The result is that we can get answers from the LLM that rely on specific “web knowledge” that we can’t expect will be “known in detail” by the raw LLM:

But often one doesn’t want to just “search at random on the web”; instead one wants to systematically retrieve information from some known source to give as “briefing material” to the LLM to help it in generating its answer. And a typical way to implement this kind of “retrieval-augmented generation (RAG)” is to set up an LLMPromptGenerator that uses the SemanticSearch and vector database capabilities that we introduced in Version 14.1.

So, for example, here’s a semantic search index generated from my (rather voluminous) writings:

By setting up a prompt generator based on this, I can now ask the LLM “personal questions”:

How did the LLM “know that”? Internally the prompt generator used SemanticSearch to generate a collection of snippets, which the LLM then “trawled through” to produce a specific answer:

It’s already often very useful just to “retrieve static text” to “brief” the LLM. But even more powerful is to brief the LLM with what it needs to call tools that can do further computation, etc. So, for example, if you want the LLM to write and run Wolfram Language code that uses functions you’ve created, you can do that by having it first “read the documentation” for those functions.

As an example, this uses a prompt generator that uses a semantic search index built from the Wolfram Function Repository:

Connect to Your Favorite LLM

There are now many ways to use LLM functionality from within the Wolfram Language, and Wolfram Notebooks. You can do it programmatically, with LLMFunction, LLMSynthesize, etc. You can do it interactively through Chat Notebooks and related chat capabilities.

But (at least for now) there’s no full-function LLM built directly into the Wolfram Language. So that means that (at least for now) you have to choose your “flavor” of external LLM to power Wolfram Language LLM functionality. And in Version 14.1 we have support for basically all major available foundation-model LLMs.

We’ve made it as straightforward as possible to set up connections to external LLMs. Once you’ve done it, you can select what you want directly in any Chat Notebook

Choose your LLM

or from your global Preferences:

LLM global preferences

When you’re using a function you specify the “model” (i.e. service and specific model name) as part of the setting for LLMEvaluator:

In general you can use LLMConfiguration to define the whole configuration of an LLM you want to connect to, and you can make a particular configuration your default either interactively using Preferences, or by explicitly setting the value of $LLMEvaluator.
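
For example, something like this (the particular service and model name here are just placeholders for whatever you've set up):

  config = LLMConfiguration[<|"Model" -> {"OpenAI", "gpt-4o"}|>];
  LLMSynthesize["Summarize the idea of symbolic computation in one sentence.",
     LLMEvaluator -> config]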

So how do you initially set up a connection to a new LLM? You can do it interactively by pressing Connect in the AI Settings pane of Preferences. Or you can do it programmatically using ServiceConnect:

ServiceConnect

At the “ServiceConnect level” you have very direct access to the features of LLM APIs, though unless you’re studying LLM APIs you probably won’t need to use these. But talking of LLM APIs, one of the things that’s now easy to do with Wolfram Language is to compare LLMs, for example programmatically sending the same question to multiple LLMs:
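
A rough sketch of such a comparison (the model specifications are just examples, and I'm assuming an association given for LLMEvaluator is interpreted as configuration details):

  models = {{"OpenAI", "gpt-4o"}, {"Anthropic", "claude-3-5-sonnet-latest"}};
  AssociationMap[
     LLMSynthesize["Write a one-line Wolfram Language definition of factorial.",
        LLMEvaluator -> <|"Model" -> #|>] &, models]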

And in fact we’ve recently started posting weekly results that we get from a full range of LLMs on the task of writing Wolfram Language code (conveniently, the exercises in my book An Elementary Introduction to the Wolfram Language have textual “prompts”, and we have a well-developed system that we’ve used for many years in assessing code for the online course based on the book):

Wolfram LLM Benchmarking Project

Symbolic Arrays and Their Calculus

I want A to be an n×n matrix. I don’t want to say what its elements are, and I don’t even want to say what n is. I just want to have a way to treat the whole thing symbolically. Well, in Version 14.1 we’ve introduced MatrixSymbol to do that.

A MatrixSymbol has a name (just like an ordinary symbol)—and it has a way to specify its dimensions. We can use it, for example, to set up a symbolic representation for our matrix A:
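
In code, this looks roughly like the following (giving the dimensions as a list is my assumption; the name string is arbitrary):

  A = MatrixSymbol["a", {n, n}]   (* an n×n symbolic matrix named "a" *)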

Hovering over this in a notebook, we’ll get a tooltip that explains what it is:

Matrix dimensions tooltip

We can ask for its dimensions as a tensor:

Here’s its inverse, again represented symbolically:

That also has dimensions n×n:

In Version 14.1 you can not only have symbolic matrices, you can also have symbolic vectors and, for that matter, symbolic arrays of any rank. Here’s a length-n symbolic vector (and, yes, we can have a symbolic vector named v that we assign to a symbol v):

So now we can construct something like the quadratic form:

A classic thing to compute from this is its gradient with respect to the vector v:

And actually this is just the same as the “vector derivative”:

If we do a second derivative we get:

What happens if we differentiate v with respect to v? Well, then we get a symbolic identity matrix

which again has dimensions n×n:

This is a rank-2 example of a symbolic identity array:

If we give n an explicit value, we can get an explicit componentwise array:

Let's say we have a function of v, like Total. Once again we can find the derivative with respect to v:

And now we see another symbolic array construct: SymbolicOnesArray:

This is simply an array whose elements are all 1:

Differentiating a second time gives us a SymbolicZerosArray:

Although we're not defining explicit elements for v, it's sometimes important to specify, for example, that all the elements are reals:

For a vector whose elements are reals, it’s straightforward to find the derivative of the norm:

The third derivative, though, is a bit more complicated:

The ⊗ here is TensorProduct, and the T:(1,3,2) represents Transpose[..., {1, 3, 2}].

In the Wolfram Language, a symbol, say s, can stand on its own, and represent a “variable”. It can also appear as a head—as in s[x]—and represent a function. And the same is true for vector and matrix symbols:

Importantly, the chain rule also works for matrix and vector functions:

Things get a bit trickier when one’s dealing with functions of matrices:

The operator shown here represents ArrayDot[..., ..., 2], which is a generalization of Dot. Given two arrays u and v, Dot will contract the last index of u with the first index of v:

ArrayDot[u, v, n], on the other hand, contracts the last n indices of u with the first n of v. ArrayDot[u, v, 1] is just the same as Dot[u, v]:
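
As a quick numeric check of the index counting (using explicit random arrays, separate from the symbolic example above):

  u = RandomReal[1, {2, 3, 4}];
  w = RandomReal[1, {3, 4, 5}];
  Dimensions[ArrayDot[u, w, 2]]
  (* the trailing {3, 4} of u contracts with the leading {3, 4} of w, leaving {2, 5} *)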

But now in this particular example all the indices get “contracted out”:

We’ve talked about symbolic vectors and matrices. But—needless to say—what we have is completely general, and will work for arrays of any rank. Here’s an example of a p×q×r array:

The overscript indicates that this is an array of rank 3.

When one takes derivatives, it’s very easy to end up with high-rank arrays. Here’s the result of differentiating with respect to a matrix:

The result is a rank-4 n×n×n×n identity array.

When one’s dealing with higher-rank objects, there’s one more construct that appears—that we call SymbolicDeltaProductArray. Let’s set up a rank-3 array with dimensions 3×3×3:

Now let’s compute a derivative:

The result is a rank-5 array that’s effectively a combination of two KroneckerDelta objects for indices 1,4 and 2,5, respectively:

We can visualize this with ArrayPlot3D:

The most common way to deal with arrays in the Wolfram Language has always been in terms of explicit lists of elements. And in this representation it’s extremely convenient that operations are normally done elementwise:

Non-lists are then by default treated as scalars—and for example here added into every element:

But now there’s something new, namely symbolic arrays—which in effect implicitly contain multiple list elements, and thus can’t be “added into every element”:

This is what happens when we have an “ordinary scalar” together with a symbolic vector:

How does this work? “Under the hood” there’s a new attribute NonThreadable which specifies that certain heads (like ArraySymbol) shouldn’t be threaded by Listable functions (like Plus).

By the way, ever since Version 9 a dozen years ago we’ve had a limited mechanism for assuming that symbols represent vectors, matrices or arrays—and now that mechanism interoperates with all our new symbolic array functionality:

When you’re doing explicit computations there’s often no choice but to deal directly with individual array elements. But it turns out that there are all sorts of situations where it’s possible to work instead in terms of “whole” vectors, matrices, etc. And indeed in the literature of fields like machine learning, optimization, statistics and control theory, it’s become quite routine to write down formulas in terms of symbolic vectors, matrices, etc. And what Version 14.1 now adds is a streamlined way to compute in terms of these symbolic array constructs.

The results are often very elegant. So, for example, here’s how one might set up a general linear least-squares problem using our new symbolic array constructs. First we define a symbolic n×m matrix A, and symbolic vectors b and x:

Our goal is to find a vector x that minimizes the norm of A.x − b. And with our definitions we can now immediately write down this quantity:

To extremize it, we compute its derivative

and to ensure we get a minimum, we compute the second derivative:
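
Putting the pieces together, the whole calculation is only a few lines. Here's a sketch, assuming D works directly on these symbolic array objects (which is the new capability being described), and writing the squared norm as an explicit dot product:

  A = MatrixSymbol["a", {n, m}];
  b = VectorSymbol["b", n];
  x = VectorSymbol["x", m];
  q = (A . x - b) . (A . x - b);   (* the squared norm, for real-valued elements *)
  D[q, x]                          (* gradient with respect to the whole vector x *)
  D[q, {x, 2}]                     (* second derivative *)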

These are standard textbook formulas, but the cool thing is that in Version 14.1 we’re now in a position to generate them completely automatically. By the way, if we take another derivative, the result will be a zero tensor:

We can look at other norms too:

Binomials and Pitchforks: Navigating Mathematical Conventions

Binomial coefficients have been around for at least a thousand years, and one might not have thought there could possibly be anything shocking or controversial about them anymore (notwithstanding the fictional Treatise on the Binomial Theorem by Sherlock Holmes’s nemesis Professor Moriarty). But in fact we have recently been mired in an intense debate about binomial coefficients—which has caused us in Version 14.1 to introduce a new function PascalBinomial alongside our existing Binomial.

When one’s dealing with positive integer arguments, there’s no issue with binomials. And even when one extends to generic complex arguments, there’s again a unique way to do this. But negative integer arguments are a special degenerate case. And that’s where there’s trouble—because there are two different definitions that have historically been used.

In early versions of Mathematica, we picked one of these definitions. But over time we realized that it led to some subtle inconsistencies, and so for Version 7—in 2008—we changed to the other definition. Some of our users were happy with the change, but some were definitely not. A notable (vociferous) example was my friend Don Knuth, who has written several well-known books that make use of binomial coefficients—always choosing what amounts to our pre-2008 definition.

So what could we do about this? For a while we thought about adding an option to Binomial, but to do this would have broken our normal conventions for mathematical functions. And somehow we kept on thinking that there was ultimately a “right answer” to how binomial coefficients should be defined. But after a lot of discussion—and historical research—we finally concluded that since at least before 1950 there have just been two possible definitions, each with their own advantages and disadvantages, with no obvious “winner”. And so in Version 14.1 we decided just to introduce a new function PascalBinomial to cover the “other definition”.

And—though at first it might not seem like much—here’s a big difference between Binomial and PascalBinomial:
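
A minimal example of the kind of disagreement involved (it takes negative integer arguments to see it):

  {Binomial[-1, -1], PascalBinomial[-1, -1]}
  (* under the Pascal convention the result is 0 whenever k < 0;
     Binomial instead preserves the reflection symmetry *)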

Part of why things get complicated is the relation to symbolic computation. Binomial has a symbolic simplification rule, valid for any n:

But there isn’t a corresponding generic simplification rule for PascalBinomial:

FunctionExpand shows us the more nuanced result in this case:

To see a bit more of what’s going on, we can compute arrays of nonzero results for Binomial and PascalBinomial:

Binomial[n, k] has the "nice feature" that it's symmetric under k ↔ n − k even when n is a negative integer—but Pascal's identity (that says a particular binomial coefficient is the sum of two coefficients "above it") isn't always true. PascalBinomial, on the other hand, always satisfies the identity, and it's in recognition of this that we put "Pascal" in its name.

And, yes, this is all quite subtle. And, remember, the differences between Binomial and PascalBinomial only show up at negative integer values. Away from such values, they’re both given by the same expression, involving gamma functions. But at negative integer values, they correspond to different limits, respectively:

The story of Binomial and PascalBinomial is a complicated one that mainly affects only the upper reaches of discrete mathematics. But there’s another, much more elementary convention that we’ve also tackled in Version 14.1: the convention of what the arguments of trigonometric functions mean.

We’ve always taken the “fundamentally mathematical” point of view that the x in Sin[x] is in radians:

You’ve always been able to explicitly give the argument in degrees (using Degree—or after Version 3 in 1996—using °):

But a different convention would just say that the argument to Sin should always be interpreted as being in degrees, even if it’s just a plain number. Calculators would often have a physical switch that globally toggles to this convention. And while that might be OK if you are just doing a small calculation and can physically see the switch, nothing like that would make any sense at all in our system. But still, particularly in elementary mathematics, one might want a “degrees version” of trigonometric functions. And in Version 14.1 we’ve introduced these:
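
So now, for example, one can write things like:

  {SinDegrees[30], Sin[30 Degree]}   (* both give 1/2 *)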

One might think this was somewhat trivial. But what's nontrivial is that the "degrees trigonometric functions" are consistently integrated throughout the system. Here, for example, is the period of SinDegrees:

You can take the integral as well

and the messiness of this form shows why for more than three decades we’ve just dealt with Sin[x] and radians.

Fixed Points and Stability for Differential and Difference Equations

All sorts of differential equations have the feature that their solutions exhibit fixed points. It’s always in principle been possible to find these by looking for points where derivatives vanish. But in Version 14.1 we now have a general, robust function that takes the same form of input as DSolve and finds all fixed points:
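
As a sketch of the kind of input involved (the function name DFixedPoints is my understanding of what this new function is called, with DSolve-style arguments; the system here is just an illustration, not the one used in the plots below):

  DFixedPoints[{x'[t] == x[t] (1 - y[t]), y'[t] == -y[t] (1 - x[t])},
     {x[t], y[t]}, t]
  (* for this Lotka–Volterra-style system the fixed points are
     {x -> 0, y -> 0} and {x -> 1, y -> 1} *)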

Here’s a stream plot of the solutions to our equations, together with the fixed points we’ve found:

And we can see that there are two different kinds of fixed points here. The ones on the left and right are “stable” in the sense that solutions that start near them always stay near them. But it’s a different story for the fixed points at the top and bottom; for these, solutions that start nearby can diverge. The function DStabilityConditions computes fixed points, and specifies whether they are stable or not:

As another example, here are the Lorenz equations, which have one unstable fixed point, and two stable ones:

If your equations have parameters, the stability of their fixed points can depend on those parameters:

Extracting the conditions here, we can now plot the region of parameter space where this fixed point is stable:

This kind of stability analysis is important in all sorts of fields, including dynamical systems theory, control theory, celestial mechanics and computational ecology.

And just as one can find fixed points and do stability analysis for differential equations, one can also do it for difference equations—and this is important for discrete dynamical systems, digital control systems, and for iterative numerical algorithms. Here’s a classic example in Version 14.1 for the logistic map:

The Steady Advance of PDEs

Five years ago—in Version 11.3—we introduced our framework for symbolically representing physical systems using PDEs. And in every version since we’ve been steadily adding more and more capabilities. At this point we’ve now covered the basics of heat transfer, mass transport, acoustics, solid mechanics, fluid mechanics, electromagnetics and (one-particle) quantum mechanics. And with our underlying symbolic framework, it’s easy to mix components of all these different kinds.

Our goal now is to progressively cover what’s needed for more and more kinds of applications. So in Version 14.1 we’re adding von Mises stress analysis for solid mechanics, electric current density models for electromagnetics and anisotropic effective masses for quantum mechanics.

So as an example of what’s now possible, here’s a piece of geometry representing a spiral inductor of the kind that might be used in a modern MEMS device:

Let’s define our variables—voltage and position:

And let’s specify parameters—here just that the material we’re going to deal with is copper:

Now we’re in a position to set up the PDE for this system, making use of the new constructs ElectricCurrentPDEComponent and ElectricCurrentDensityValue:

All it takes to solve this PDE for the voltage is then:

From the voltage we can compute the current density

and then plot it (and, yes, the current tends to avoid the corners):

Symbolic Biomolecules and Their Visualization

Ever since Version 12.2 we’ve had the ability to represent and manipulate bio sequences of the kind that appear in DNA, RNA and proteins. We’ve also been able to do things like import PDB (Protein Data Bank) files and generate graphics from them. But now in Version 14.1 we’re adding a symbolic BioMolecule construct, to represent the full structure of biomolecules:

Ultimately this is “just a molecule” (and in this case its data is so big it’s not by default stored locally in your notebook):

But what BioMolecule does is also to capture the “higher-order structure” of the molecule, for example how it’s built up from distinct chains, where structures like α-helices occur in these, and so on. For example, here are the two (bio sequence) chains that appear in this case:

And here are where the α-helices occur:

What about visualization? Well, there’s BioMoleculePlot3D for that:

There are different “themes” you can use for this:

Here’s a raw-atom-level view:

You can combine the views—and for example add coordinate values (specified in angstroms):

You can also specify “color rules” that define how particular parts of the biomolecule should be rendered:

But the structure here isn’t just something you can make graphics out of; it’s also something you can compute with. For example, here’s a geometric region formed from the biomolecule:

And this computes its surface area (in square angstroms):

The Wolfram Language has built-in data on a certain number of proteins. But you can get data on many more proteins from external sources—specifying them with external identifiers:

When you get a protein—say from an external source—it’ll often come with a 3D structure specified, for example as deduced from experimental measurements. But even without that, Version 14.1 will attempt to find at least an approximate structure—by using machine-learning-based protein-folding methods. As an example, here’s a random bio sequence:

If you make a BioMolecule out of this, a “predicted” 3D structure will be generated:

Here’s a visualization of this structure—though more work would be needed to determine how it’s related to what one might actually observe experimentally:

Optimizing Neural Nets for GPUs and NPUs

Many computers now come with GPU and NPU hardware accelerators for machine learning, and in Version 14.1 we’ve added more support for these. Specifically, on macOS (Apple Silicon) and Windows machines, built-in functions like ImageIdentify and SpeechRecognize now automatically use CoreML (Neural Engine) and DirectML capabilities—and the result is typically 2X to 10X faster performance.

We've always supported explicit CUDA GPU acceleration, for both training and inference. And in Version 14.1 we now also support CoreML and DirectML acceleration for inference tasks with explicitly specified neural nets. But whereas this acceleration is now the default for built-in machine-learning-based functions, for explicitly specified models it isn't yet turned on by default.

So, for example, this doesn’t use GPU acceleration:

But you can explicitly request it—and then (assuming all features of the model can be accelerated) things will typically run significantly faster:

We’re continually sprucing up our infrastructure for machine learning. And as part of that, in Version 14.1 we’ve enhanced our diagrams for neural nets to make layers more visually distinct—and to immediately produce diagrams suitable for publication:

The Statistics of Dates

We’ve been releasing versions of what’s now the Wolfram Language for 36 years. And looking at that whole collection of release dates, we can ask statistical questions. Like “What’s the median date for all the releases so far?” Well, in Version 14.1 there’s a direct way to answer that—because statistical functions like Median now just immediately work on dates:
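
Here's the shape of that computation, with a few made-up dates standing in for the actual release dates:

  dates = {DateObject[{1988, 6, 23}], DateObject[{1996, 9, 3}],
     DateObject[{2003, 6, 12}], DateObject[{2014, 7, 9}], DateObject[{2024, 1, 9}]};
  Median[dates]   (* the middle date of this odd-length list: DateObject[{2003, 6, 12}] *)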

What if we ask about all 7000 or so functions in the Wolfram Language? Here’s a histogram of when they were introduced:

And now we can compute the median, showing quantitatively that, yes, Wolfram Language development has speeded up:

Dates are a bit like numbers, but not quite. For example, their “zero” shifts around depending on the calendar. And their granularity is more complicated than precision for numbers. In addition, a single date can have multiple different representations (say in different calendars or with different granularities). But it nevertheless turns out to be possible to define many kinds of statistics for dates. To understand these statistics—and to compute them—it’s typically convenient to make one’s whole collection of dates have the same form. And in Version 14.1 this can be achieved with the new function ConformDates (which here converts all dates to the format of the first one listed):

By the way, in Version 14.1 the whole pipeline for handling dates (and times) has been dramatically speeded up, most notably conversion from strings, as needed in the import of dates.

The concept of doing statistics on dates introduces another new idea: date (and time) distributions. And in Version 14.1 there are two new functions DateDistribution and TimeDistribution for defining such distributions. Unlike for numerical (or quantity) distributions, date and time distributions require the specification of an origin, like Today, as well as of a scale, like "Days":

But given this symbolic specification, we can now do operations just like for any other distribution, say generating some random variates:

Building Videos with Programs

Introduced in Version 6 back in 2007, Manipulate provides an immediate way to create an interactive “manipulable” interface. And it’s been possible for a long time to export Manipulate objects to video. But just what should happen in the video? What sliders should move in what way? In Version 12.3 we introduced AnimationVideo to let you make a video in which one parameter is changing with time. But now in Version 14.1 we have ManipulateVideo which lets you create a video in which many parameters can be varied simultaneously. One way to specify what you want is to say for each parameter what value it should get at a sequence of times (by default measured in seconds from the beginning of the video). ManipulateVideo then produces a smooth video by interpolating between these values:

(An alternative is to specify complete “keyframes” by giving operations to be done at particular times.)

ManipulateVideo in a sense provides a “holistic” way to create a video by controlling a Manipulate. And in the last several versions we’ve introduced many functions for creating videos from “existing structures” (for example FrameListVideo assembles a video from a list of frames). But sometimes you want to build up videos one frame at a time. And in Version 14.1 we’ve introduced SowVideo and ReapVideo for doing this. They’re basically the analog of Sow and Reap for video frames. SowVideo will “sow” one or more frames, and all frames you sow will then be collected and assembled into a video by ReapVideo:
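
Schematically, that looks something like this (I'm assuming these follow the calling pattern of Sow and Reap, with frames given as graphics or images):

  ReapVideo[
     Do[SowVideo[Plot[Sin[w x], {x, 0, 10}, PlotRange -> {-1.1, 1.1}]],
        {w, 1, 3, 0.2}]]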

One common application of SowVideo/ReapVideo is to assemble a video from frames that are programmatically picked out by some criterion from some other video. So, for example, this “sows” frames that contain a bird, then “reaps” them to assemble a new video.

Another way to programmatically create one video from another is to build up a new video by progressively “folding in” frames from an existing video—which is what the new function VideoFrameFold does:

Version 14.1 also has a variety of new “convenience functions” for dealing with videos. One example is VideoSummaryPlot which generates various “at-a-glance” summaries of videos (and their audio):

Another new feature in Version 14.1 is the ability to apply audio processing functions directly to videos:

And, yes, it’s a bird:

Optimizing the Speech Recognition Workflow

We first introduced SpeechRecognize in 2019 in Version 12.0. And now in Version 14.1 SpeechRecognize is getting a makeover.

The most dramatic change is speed. In the past, SpeechRecognize would typically take at least as long to recognize a piece of speech as the duration of the speech itself. But now in Version 14.1, SpeechRecognize runs many tens of times faster, so you can recognize speech much faster than real time.

And what’s more, SpeechRecognize now produces full, written text, complete with capitalization, punctuation, etc. So here, for example, is a transcription of a little video:

There’s also a new function, VideoTranscribe, that will take a video, transcribe its audio, and insert the transcription back into the subtitle track of the video.

And, by the way, SpeechRecognize runs entirely locally on your computer, without having to access a server (except maybe for updates to the neural net it’s using).

In the past SpeechRecognize could only handle English. In Version 14.1 it can handle 100 languages—and can automatically produce translated transcriptions. (By default it produces transcriptions in the language you’ve specified with $Language.) And if you want to identify what language a piece of audio is in, LanguageIdentify now works directly on audio.

SpeechRecognize by default produces a single string of text. But it now also has the option to break up its results into a list, say of sentences:

And in addition to producing a transcription, SpeechRecognize can give time intervals or audio fragments for each element:

Historical Geography Becomes Computable

History is complicated. But that doesn’t mean there isn’t much that can be made computable about it. And in Version 14.1 we’re taking a major step forward in making historical geography computable. We’ve had extensive geographic computation capabilities in the Wolfram Language for well over a decade. And in Version 14.1 we’re extending that to historical geography.

So now you can not only ask for a map of where the current country of Italy is, you can also ask to make a map of the Roman Empire in 100 AD:

And “the Roman Empire in 100 AD” is now a computable entity. So you can ask for example what its approximate area was:
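
Something like this (the entity type "HistoricalCountry" is the one described here, though the particular canonical name I'm using is a guess):

  GeoArea[Dated[Entity["HistoricalCountry", "RomanEmpire"], DateObject[{100}]]]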

And you can even make a plot of how the area of the Roman Empire changed over the period from 0 AD to 200 AD:

We’ve been building our knowledgebase of historical geography for many years. Of course, country borders may be disputed, and—particularly in the more distant past—may not have been well defined. But by now we’ve accumulated computable data on basically all of the few thousand known historical countries. Still—with history being complicated—it’s not surprising that there are all sorts of often subtle issues.

Let’s start by asking what historical countries the location that’s now Mexico City has been in. GeoIdentify gives the answer:

And already we see subtlety. For example, our historical country entities are labeled by their overall beginning and ending dates. But most of them covered Mexico City only for part of their existence. And here we can see what’s going on:

Often there’s subtlety in identifying what should count as a “different country”. If there was just an “acquisition” or a small “change of management” maybe it’s still the same country. But if there was a “dramatic reorganization”, maybe it’s a different country. Sometimes the names of countries (if they even had official names) give clues. But in general it’s taken lots of case-by-case curation, trying to follow the typical conventions used by historians of particular times and places.

For London we see several “close-but-we-consider-it-a-different-country” issues—along with various confusing repeated conquerings and reconquerings:

Here’s a timeline plot of the countries that have contained London:

And because everything is computable, it’s easy to identify the longest contiguous segment here:

GeoIdentify can tell us what entities something like a city is inside. GeoEntities, on the other hand, can tell us what entities are inside something like a country. So, for example, this tells us what historical countries were inside (or at least overlapped with) the current boundaries of the UK in 800 AD:

This then makes a map (the extra list makes these countries be rendered separately):

In the Wolfram Language we have data on quite a few kinds of historical entities beyond countries. For example, we have extensive data on military conflicts. Here we’re asking what military conflicts occurred within the borders of what’s now France between 200 BC and 200 AD:

Here’s a map of their locations:

And here are conflicts in the Atlantic Ocean in the period 1939–1945:

And—combining several things—here’s a map of conflicts that, at the time when they occurred, were within the region of what was then Carthage:

There are all sorts of things that we can compute from historical geography. For example, this asks for the (minimum) geo distance between the territory of the Roman Empire and the Han Dynasty in 100 AD:

But what about the overall minimum distance across all years when these historical countries existed? This gives the result for that:

Let’s compare this with a plot of these two entities:

But there’s a subtlety here. What version of the Roman Empire is it that we’re showing on the map here? Our convention is by default to show historical countries “at their zenith”, i.e. at the moment when they had their maximum extent.

But what about other choices? Dated gives us a way to specify a particular date. But another possibility is to include in what we consider to be a particular historical country any territory that was ever part of that country, at any time in its history. And you can do this using GeoVariant[…, "UnionArea"]. In the particular case we’re showing here, it doesn’t make much difference, except that there’s more territory in Germany and Scotland included in the Roman Empire:

By the way, you can combine Dated and GeoVariant, to get things like “the zenith within a certain period” or “any territory that was included at any time within a period”. And, yes, it can get quite complicated. In a rather physics-like way you can think of the extent of a historical country as defining a region in spacetime—and indeed GeoVariant[…, "TimeSeries"] in effect represents a whole “stack of spacelike slices” in this spacetime region:

And—though it takes a little while—you can use it to make a video of the rise and fall of the Roman Empire:

Astronomical Graphics and Their Axes

It’s complicated to define where things are in the sky. There are four main coordinate systems that get used in doing this: horizon (relative to local horizon), equatorial (relative to the Earth’s equator), ecliptic (relative to the orbit of the Earth around the Sun) and galactic (relative to the plane of the galaxy). And when we draw a diagram of the sky (here on white for clarity) it’s typical to show the “axes” for all these coordinate systems:

But here’s a tricky thing: how should those axes be labeled? Each one is different: horizon is most naturally labeled by things like cardinal directions (N, E, S, W, etc.), equatorial by hours in the day (in sidereal time), ecliptic by months in the year, and galactic by angle from the center of the galaxy.

In ordinary plots axes are usually straight, and labeled uniformly (or perhaps, say, logarithmically). But in astronomy things are much more complicated: the axes are intrinsically circular, and then get rendered through whatever projection we’re using.

And we might have thought that such axes would require some kind of custom structure. But not in the Wolfram Language. Because in the Wolfram Language we try to make things general. And axes are no exception:

So in AstroGraphics all our various axes are just AxisObject constructs—that can be computed with. And so, for example, here’s a Mollweide projection of the sky:

If we insist on “seeing the whole sky”, the bottom half is just the Earth (and, yes, the Sun isn’t shown because I’m writing this after it’s set for the day…):

Things get a bit wild if we start adding grid lines, here for galactic coordinates:

And, yes, the galactic coordinate axis is indeed aligned with the plane of the Milky Way (i.e. our galaxy):

When Is Earthrise on Mars? New Level of Astronomical Computation

When will the Earth next rise above the horizon from where the Perseverance rover is on Mars? In Version 14.1 we can now compute this (and, yes, this is an “Earth time” converted from Mars time using the standard barycentric celestial reference system (BCRS) solar-system-wide spacetime coordinate system):

This is a fairly complicated computation that takes into account not only the motion and rotation of the bodies involved, but also various other physical effects. A more “down to Earth” example that one might readily check by looking out of one’s window is to compute the rise and set times of the Moon from a particular point on the Earth:

There’s a slight variation in the times between moonrises:

Over the course of a year we see systematic variations associated with the periods of different kinds of lunar months:

There are all sorts of subtleties here. For example, when exactly does one define something (like the Sun) to have “risen”? Is it when the top of the Sun first peeks out? When the center appears? Or when the “whole Sun” is visible? In Version 14.1 you can ask about any of these:

Oh, and you could compute the same thing for the rise of Venus, but now to see the differences, you’ve got to go to millisecond granularity (and, by the way, granularities of milliseconds down to picoseconds are new in Version 14.1):

By the way, particularly for the Sun, the concept of ReferenceAltitude is useful in specifying the various kinds of sunrise and sunset: for example, “civil twilight” corresponds to a reference altitude of –6°.

Geometry Goes Color, and Polar

Last year we introduced the function ARPublish to provide a streamlined way to take 3D geometry and publish it for viewing in augmented reality. In Version 14.1 we’ve now extended this pipeline to deal with color:

(Yes, the color is a little different on the phone because the phone tries to make it look “more natural”.)

Augmented reality via QR code

And now it’s easy to view this not just on a phone, but also, for example, on the Apple Vision Pro:

Graphics have always had color. But now in Version 14.1 symbolic geometric regions can have color too:

And constructive geometric operations on regions preserve color:

Two other new functions in Version 14.1 are PolarCurve and FilledPolarCurve:

And while at this level this may look simple, what’s going on underneath is actually seriously complicated, with all sorts of symbolic analysis needed in order to determine what the “inside” of the parametric curve should be.

Talking about geometry and color brings up another enhancement in Version 14.1: plot themes for diagrams in synthetic geometry. Back in Version 12.0 we introduced symbolic synthetic geometry—in effect finally providing a streamlined computable way to do the kind of geometry that Euclid did two millennia ago. In the past few versions we’ve been steadily expanding our synthetic geometry capabilities, and now in Version 14.1 one notable thing we’ve added is the ability to use plot themes—and explicit graphics options—to style geometric diagrams. Here’s the default version of a geometric diagram:

Now we can “theme” this for the web:

New Computation Flow in Notebooks: Introducing Cell-Linked %

In building up computations in notebooks, one very often finds oneself wanting to take a result one just got and then do something with it. And ever since Version 1.0 one’s been able to do this by referring to the result one just got as %. It’s very convenient. But there are some subtle and sometimes frustrating issues with it, the most important of which has to do with what happens when one reevaluates an input that contains %.

Let’s say you’ve done this:

Range

But now you decide that actually you wanted Median[ % ^ 2 ] instead. So you edit that input and reevaluate it:

Edit and reevaluate

Oops! Even though what’s right above your input in the notebook is a list, the value of % is the latest result that was computed, which you can’t now see, but which was 3.

OK, so what can one do about this? We’ve thought about it for a long time (and by “long” I mean decades). And finally now in Version 14.1 we have a solution—that I think is very nice and very convenient. The core of it is a new notebook-oriented analog of %, that lets one refer not just to things like “the last result that was computed” but instead to things like “the result computed in a particular cell in the notebook”.

So let’s look at our sequence from above again. Let’s start typing another cell—say to “try to get it right”. In Version 14.1 as soon as we type % we see an autosuggest menu:

Autosuggest menu

The menu is giving us a choice of (output) cells that we might want to refer to. Let’s pick the last one listed:

Last menu option

The object is a reference to the output from the cell that's currently labeled In[1]—and using it now gives us what we wanted.

But let’s say we go back and change the first (input) cell in the notebook—and reevaluate it:

Reevaluate Range

The cell now gets labeled In[5]—and the cell-linked % (in In[4]) that refers to that cell will immediately change to reflect the new label:

Median

And if we now evaluate this cell, it’ll pick up the value of the output associated with In[5], and give us a new answer:

New answer

So what's really going on here? The key idea is that this signifies a new type of notebook element that's a cell-linked analog of %. It represents the latest result from evaluating a particular cell, wherever the cell may be, and whatever the cell may be labeled. (The object always shows the current label of the cell it's linked to.) In effect a cell-linked % is "notebook front end oriented", while ordinary % is kernel oriented. It's linked to the contents of a particular cell in a notebook; ordinary % refers to the state of the Wolfram Language kernel at a certain time.

A cell-linked % gets updated whenever the cell it's referring to is reevaluated. So its value can change either through the cell being explicitly edited (as in the example above) or because reevaluation gives a different value, say because it involves generating a random number:

RandomInteger

OK, so a cell-linked % always refers to "a particular cell". But what makes a cell a particular cell? It's defined by a unique ID that's assigned to every cell. When a new cell is created it's given a universally unique ID, and it carries that same ID wherever it's placed and whatever its contents may be (and even across different sessions). If the cell is copied, then the copy gets a new ID. And although you won't explicitly see cell IDs, a cell-linked % works by linking to a cell with a particular ID.

One can think of a cell-linked % as providing a "more stable" way to refer to outputs in a notebook. And actually, that's true not just within a single session, but also across sessions. Say one saves the notebook above and opens it in a new session. Here's what you'll see:

Saving across sessions

The cell-linked % is now grayed out. So what happens if we try to reevaluate it? Well, we get this:

Reconstruct or reevaluate

If we press Reconstruct from output cell the system will take the contents of the first output cell that was saved in the notebook, and use this to get input for the cell we’re evaluating:

Reconstruct from output cell

In almost all cases the contents of the output cell will be sufficient to allow the expression “behind it” to be reconstructed. But in some cases—like when the original output was too big, and so was elided—there won’t be enough in the output cell to do the reconstruction. And in such cases it’s time to take the Go to input cell branch, which in this case will just take us back to the first cell in the notebook, and let us reevaluate it to recompute the output expression it gives.

By the way, whenever you see a “positional %” you can hover over it to highlight the cell it’s referring to:

Positional % highlighting

Having talked a bit about “cell-linked %” it’s worth pointing out that there are still cases when you’ll want to use “ordinary %”. A typical example is if you have an input line that you’re using a bit like a function (say for post-processing) and that you want to repeatedly reevaluate to see what it produces when applied to your latest output.

In a sense, ordinary % is the “most volatile” in what it refers to. Cell-linked % is “less volatile”. But sometimes you want no volatility at all in what you’re referring to; you basically just want to burn a particular expression into your notebook. And in fact the % autosuggest menu gives you a way to do just that.

Notice the iconize button that appears in whatever row of the menu you're selecting:

Iconize option

Press this and you’ll insert (in iconized form) the whole expression that’s being referred to:

Iconized expression

Now—for better or worse—whatever changes you make in the notebook won’t affect the expression, because it’s right there, in literal form, “inside” the icon. And yes, you can explicitly “uniconize” to get back the original expression:

Uniconize

Once you have a cell-linked % it always has a contextual menu with various actions:

Contextual menu

One of those actions is to do what we just mentioned, and replace the positional % by an iconized version of the expression it's currently referring to. You can also highlight the output and input cells that the % is "linked to". (Incidentally, another way to replace a cell-linked % by the expression it's referring to is simply to "evaluate it in place", which you can do by selecting it and pressing Cmd+Return or Shift+Ctrl+Enter.)

Another item in the menu is Replace With Rolled-Up Inputs. What this does is—as it says—to "roll up" a sequence of % references and create a single expression from them:

Replace with rolled-up inputs

What we've talked about so far one can think of as being "normal and customary" uses of cell-linked %. But there are all sorts of corner cases that can show up. For example, what happens if you have a cell-linked % that refers to a cell you delete? Well, within a single (kernel) session that's OK, because the expression "behind" the cell is still available in the kernel (unless you reset your $HistoryLength etc.). Still, the reference will show up with a "red broken link" to indicate that "there could be trouble":

Red broken link

And indeed if you go to a different (kernel) session there will be trouble—because the information you need to get the expression to which the % refers is simply no longer available, so it has no choice but to show up in a kind of everything-has-fallen-apart "surrender state" as:

Surrender state

A cell-linked % is primarily useful when it refers to cells in the notebook you're currently using (and indeed the autosuggest menu will contain only cells from your current notebook). But what if it ends up referring to a cell in a different notebook, say because you copied the cell from one notebook to another? It's a precarious situation. But if all relevant notebooks are open, it can still work, though it's displayed in purple with an action-at-a-distance "wi-fi icon" to indicate its precariousness:

Wi-fi icon

And if, for example, you start a new session, and the notebook containing the "source" of the cell-linked % isn't open, then you'll get the "surrender state". (If you open the necessary notebook it'll "unsurrender" again.)

Yes, there are lots of tricky cases to cover (in fact, many more than we've explicitly discussed here). And indeed seeing all these cases makes us not feel bad about how long it's taken for us to conceptualize and implement cell-linked %.

The most common way to access a cell-linked % is to use the % autosuggest menu. But if you know you want one, you can always get it by "pure typing", using for example ESC%ESC. (And, yes, ESC%%ESC or ESC%5ESC etc. also work, so long as the necessary cells are present in your notebook.)

The UX Journey Continues: New Typing Affordances, and More

We invented Wolfram Notebooks more than 36 years ago, and we’ve been improving and polishing them ever since. And in Version 14.1 we’re implementing several new ideas, particularly around making it even easier to type Wolfram Language code.

It’s worth saying at the outset that good UX ideas quickly become essentially invisible. They just give you hints about how to interpret something or what to do with it. And if they’re doing their job well, you’ll barely notice them, and everything will just seem “obvious”.

So what’s new in UX for Version 14.1? First, there’s a story around brackets. We first introduced syntax coloring for unmatched brackets back in the late 1990s, and gradually polished it over the following two decades. Then in 2021 we started “automatching” brackets (and other delimiters), so that as soon as you type “f[” you immediately get f[ ].

But how do you keep on typing? You could use a right-arrow keystroke to "move through" the ]. But we've set it up so you can just "type through" ] by typing ]. In one of those typical pieces of UX subtlety, however, "type through" doesn't always make sense. For example, let's say you typed f[x]. Now you click right after [ and you type g[, so you've got f[g[x]. You might think there should be an autotyped ] to go along with the [ after g. But where should it go? Maybe you want to get f[g[x]], or maybe you're really trying to type f[g[],x]. We definitely don't want to autotype ] in the wrong place. So the best we can do is not autotype anything at all, and just let you type the ] yourself, where you want it. But remember that with f[x] on its own, the ] is autotyped, and so if you type ] yourself in this case, it'll just type through the autotyped ] and you won't explicitly see it.

So how can you tell whether a ] you type will explicitly show up, or will just be “absorbed” as type-through? In Version 14.1 there’s now different syntax coloring for these cases: yellow if it’ll be “absorbed”, and pink if it’ll explicitly show up.

This is an example of type-through, so Range is colored yellow and the ] you type is "absorbed":

Range highlighted yellow

And this is an example of non-type-through, so Round is colored pink and the ] you type is explicitly inserted:

Round highlighted pink

This may all sound very fiddly and detailed—and for us in developing it, it is. But the point is that you don’t explicitly have to think about it. You quickly learn to just “take the hint” from the syntax coloring about when your closing delimiters will be “absorbed” and when they won’t. And the result is that you’ll have an even smoother and faster typing experience, with even less chance of unmatched (or incorrectly matched) delimiters.

The new syntax coloring we just discussed helps in typing code. In Version 14.1 there’s also something new that helps in reading code. It’s an enhanced version of something that’s actually common in IDEs: when you click (or select) a variable, every instance of that variable immediately gets highlighted:

Highlighted variable

What’s subtle in our case is that we take account of the scoping of localized variables—putting a more colorful highlight on instances of a variable that are in scope:

Multiple instances of a variable

One place this tends to be particularly useful is in understanding nested pure functions that use #. By clicking a # you can see which other instances of # are in the same pure function, and which are in different ones (the highlight is bluer inside the same function, and grayer outside):

Highlighting in nested functions

On the subject of finding variables, another change in Version 14.1 is that fuzzy name autocompletion now also works for contexts. So if you have a symbol whose full name is context1`subx`var2 you can type c1x and you’ll get a completion for the context; then accept this and you get a completion for the symbol.

There are also several other notable UX “tune-ups” in Version 14.1. For many years, there’s been an “information box” that comes up whenever you hover over a symbol. Now that’s been extended to entities—so (alongside their explicit form) you can immediately get to information about them and their properties:

Entity information box

Next there’s something that, yes, I personally have found frustrating in the past. Say you’ve got a file, an image, or something else somewhere on your computer’s desktop. Normally if you want it in a Wolfram Notebook you can just drag it there, and it will very beautifully appear. But what if the thing you’re dragging is very big, or has some other kind of issue? In the past, the drag just failed. Now what happens is that you get the explicit Import that the dragging would have done, so that you can run it yourself (getting progress information, etc.), or modify it, say by adding relevant options.

Another small piece of polish that’s been added in Version 14.1 has to do with Preferences. There are a lot of things you can set in the notebook front end. And they’re explained, at least briefly, in the many Preferences panels. But in Version 14.1 there are now ⓘ buttons that give direct links to the relevant workflow documentation:

Direct link to workflow documentation

Syntax for Natural Language Input

Ever since shortly after Wolfram|Alpha was released in 2009, there’ve been ways to access its natural language understanding capabilities in the Wolfram Language. Foremost among these has been CTRL=—which lets you type free-form natural language and immediately get a Wolfram Language version, often in terms of entities, etc.:

Wolfram|Alpha entities

Generally this is a very convenient and elegant capability. But sometimes one may want to just use plain text to specify natural language input, for example so that one doesn’t interrupt one’s textual typing of input.

In Version 14.1 there’s a new mechanism for this: syntax for directly entering free-form natural language input. The syntax is a kind of “textified” version of CTRL=: =[…]. When you type =[…] as input nothing immediately happens. It’s only when you evaluate your input that the natural language gets interpreted—and then whatever it specifies is computed.

Here’s a very simple example, where each =[…] just turns into an entity:

But when the result of interpreting the natural language is an expression that can be further evaluated, what will come out is the result of that evaluation:

One feature of using =[…] instead of CTRL= is that =[…] is something anyone can immediately see how to type:

But what actually is =[…]? Well, it’s just input syntax for the new function FreeformEvaluate:

You can use FreeformEvaluate inside a program—here, rather whimsically, to see what interpretations are chosen by default for “a” followed by each letter of the alphabet:
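
As a rough sketch of that kind of program (it needs access to the Wolfram|Alpha interpretation service, and the particular interpretations you get back may differ):

  (* interpret "a" followed by each letter of the alphabet *)
  Table[FreeformEvaluate["a" <> letter], {letter, CharacterRange["a", "z"]}]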

By default, FreeformEvaluate interprets your input, then evaluates it. But you can also specify that you want to hold the result of the interpretation:

Diff[ ] … for Notebooks and More!

It’s been a very long-requested capability: give me a way to tell what changed, particularly in a notebook. It’s fairly easy to do “diffs” for plain text. But for notebooks—as structured symbolic documents—it’s a much more complicated story. But in Version 14.1 it’s here! We’ve got a function Diff for doing diffs in notebooks, and actually also in many other kinds of things.

Here’s an example, where we’re requesting a “side-by-side view” of the diff between two notebooks:

And here’s an “alignment chart view” of the diff:


Like everything else in the Wolfram Language, a “diff” is a symbolic expression. Here’s an example:

There are lots of different ways to display a diff object; many of them one can select interactively with the menu:

Diff object viewing options

But the most important thing about diff objects is that they can be used programmatically. And in particular DiffApply applies the diffs from a diff object to an existing object, say a notebook.

What’s the point of this? Well, let’s imagine you’ve made a notebook, and given a copy of it to someone else. Then both you and the person to whom you’ve given the copy make changes. You can create a diff object of the diffs between the original version of the notebook, and the version with your changes. And if the changes the other person made don’t overlap with yours, you can just take your diffs and use DiffApply to apply your diffs to their version, thereby getting a “merged notebook” with both sets of changes made.
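
Here’s a minimal sketch of that workflow (the file names are hypothetical, and it assumes the other person’s changes don’t overlap with yours):

  original = File["talk-original.nb"];  (* the notebook you started from *)
  mine = File["talk-mine.nb"];          (* your edited copy *)
  theirs = File["talk-theirs.nb"];      (* the other person's edited copy *)
  mydiffs = Diff[original, mine];       (* a diff object capturing just your changes *)
  merged = DiffApply[mydiffs, theirs]   (* apply your changes to their version *)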

But what if your changes might conflict? Well, then you need to use the function Diff3. Diff3 takes your original notebook and two modified versions, and does a “three-way diff” to give you a diff object in which any conflicts are explicitly identified. (And, yes, three-way diffs are familiar from source control systems in which they provide the back end for making the merging of files as automated as possible.)

Notebooks are an important use case for Diff and related functions. But they’re not the only one. Diff can perfectly well be applied, for example, just to lists:
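
The input here is just something like this (a sketch with made-up symbolic lists):

  Diff[{a, b, c, d, e}, {a, x, c, e, f}]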

There are many ways to display this diff object; here’s a side-by-side view:

Side-by-side diff view

And here’s a “unified view” reminiscent of how one might display diffs for lines of text in a file:

Unified diff view

And, speaking of files, Diff, etc. can immediately be applied to files:

Diff, etc. can also be applied to cells, where they can analyze changes in both content and styles or metadata. Here we’re creating two cells and then diffing them—showing the result in a side-by-side view:

In “Combined” view the “pure insertions” are highlighted in green, the “pure deletions” in red, and the “edits” are shown as deletion/insertion stacks:

Combined diff view highlighting

Many uses of diff technology revolve around content development—editing, software engineering, etc. But in the Wolfram Language Diff, etc. are set up also to be convenient for information visualization and for various kinds of algorithmic operations. For example, to see what letters differ between the Spanish and Polish alphabets, we can just use Diff:
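
The input here is just (Alphabet gives the list of letters for a language):

  Diff[Alphabet["Spanish"], Alphabet["Polish"]]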

Here’s the “pure visualization”:

And here’s an alternate “unified summary” form:

Another use case for Diff is bioinformatics. We retrieve two genome sequences—as strings—then use Diff:

We can take the resulting diff object and show it in a different form—here character alignment:

Under the hood, by the way, Diff is finding the differences using SequenceAlignment. But while Diff is giving a “high-level symbolic diff object”, SequenceAlignment is giving a direct low-level representation of the sequence alignment:
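
For comparison, here’s the kind of direct call involved (a small sketch with short made-up strings rather than genome sequences):

  SequenceAlignment["the quick brown fox", "the quick red fox"]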

Information visualization isn’t restricted to two-way diffs; here’s an example with a three-way diff:

And here it is as a “unified summary”:

There are all sorts of options for diffs. One that is sometimes important is DiffGranularity. By default the granularity for diffs of strings is "Characters":

But it’s also possible to set it to be "Words":
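
Schematically (a minimal sketch with made-up strings):

  Diff["the quick brown fox", "the quick red fox"]                              (* default granularity: "Characters" *)
  Diff["the quick brown fox", "the quick red fox", DiffGranularity -> "Words"]  (* word-level diff *)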

Coming back to notebooks, the most “interactive” form of diff is a “report”:

In such a report, you can open cells to see the details of a specific change, and you can also click to jump to where the change occurred in the underlying notebooks.

When it comes to analyzing notebooks, there’s another new feature in Version 14.1: NotebookCellData. NotebookCellData gives you direct programmatic access to lots of properties of notebooks. By default it generates a dataset of some of them, here for the notebook in which I’m currently authoring this:
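
The basic call is just something like this (a sketch; EvaluationNotebook[] refers to the notebook in which it’s evaluated):

  NotebookCellData[EvaluationNotebook[]]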

There are properties like the word count in each cell, the style of each cell, the memory footprint of each cell, and a thumbnail image of each cell.

Ever since Version 6 in 2007 we’ve had the CellChangeTimes option which records when cells in notebooks are created or modified. And now in Version 14.1 NotebookCellData provides direct programmatic access to this data. So, for example, here’s a date histogram of when the cells in the current notebook were last changed:

Lots of Little Language Tune-Ups

It’s part of a journey of almost four decades: steadily discovering—and inventing—new “lumps of computational work” that make sense to implement as functions or features in the Wolfram Language. The Wolfram Language is of course more than strong enough that one can build essentially any functionality from the primitives that already exist in it. But part of the point of the language is to define the best “elements of computational thought”. And particularly as the language progresses, there’s a continual stream of new opportunities for convenient elements that get exposed. And in Version 14.1 we’ve implemented quite a diverse collection of them.

Let’s say you want to nestedly compose a function. Ever since Version 1.0 there’s been Nest for that:

But what if you want the abstract nested function, not yet applied to anything? Well, in Version 14.1 there’s now an operator form of Nest (and NestList) that represents an abstract nested function that can, for example, be composed with other functions, as in

or equivalently:
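
Schematically, the operator form works like this (a minimal sketch):

  Nest[f, 3][x]         (* the operator form applied to an argument: f[f[f[x]]] *)
  (g @* Nest[f, 3])[x]  (* composed with another function: g[f[f[f[x]]]] *)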

A decade ago we introduced functions like AllTrue and AnyTrue that effectively “in one gulp” do a whole collection of separate tests. If one wants to test whether there are any primes in a list, one can always do:

But it’s better to “package” this “lump of computational work” into the single function AnyTrue:

In Version 14.1 we’re extending this idea by introducing AllMatch, AnyMatch and NoneMatch:
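
For example (a small sketch of the pattern-based forms):

  AnyMatch[{1, 2, x}, _Symbol]    (* True: at least one element matches *)
  AllMatch[{1, 2, 3}, _Integer]   (* True: every element matches *)
  NoneMatch[{1, 2, 3}, _String]   (* True: no element matches *)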

Another somewhat related new function is AllSameBy. SameQ tests whether a collection of expressions are immediately the same. AllSameBy tests whether expressions are the same by the criterion that the value of some function applied to them is the same:
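
A minimal sketch of the difference:

  SameQ[2, 2, 2]                             (* True: the expressions are identical *)
  AllSameBy[{{1, 2}, {3, 2}, {5, 2}}, Last]  (* True: they all have the same Last element *)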

Talking of tests, another new feature in Version 14.1 is a second argument to QuantityQ (and KnownUnitQ), which lets you test not only whether something is a quantity, but also whether it’s a specific type of physical quantity:
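
For example, something like this (a sketch; it assumes the physical quantity type is given by name as the second argument):

  QuantityQ[Quantity[3, "Meters"], "Length"]  (* True: a quantity, and specifically a length *)
  QuantityQ[Quantity[3, "Meters"], "Mass"]    (* False: a quantity, but not a mass *)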

And now talking about “rounding things out”, Version 14.1 does that in a very literal way by enhancing the RoundingRadius option. For a start, you can now specify a different rounding radius for particular corners:

And, yes, that’s useful if you’re trying to fit button-like constructs together:

By the way, RoundingRadius now also works for rectangles inside Graphics:
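
For example, something like this (a minimal sketch):

  Graphics[{LightBlue, EdgeForm[Gray], Rectangle[{0, 0}, {3, 1}, RoundingRadius -> 0.3]}]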

Let’s say you have a string, like “hello”. There are many functions that operate directly on strings. But sometimes you really just want to use a function that operates on lists—and apply it to the characters in a string. Now in Version 14.1 you can do this using StringApply:

Another little convenience in Version 14.1 is the function BitFlip, which, yes, flips a bit in the binary representation of a number:
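
Schematically (a sketch, assuming the form BitFlip[n, k] with k a bit position; check the documentation for the exact indexing convention):

  BitFlip[5, 1]  (* flip one bit in the binary representation of 5 *)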

When it comes to Boolean functions, a detail that’s been improved in Version 14.1 is the conversion to NAND representation. By default, functions like BooleanConvert have allowed Nand[p] (which is equivalent to Not[p]). But in Version 14.1 there’s now "BinaryNAND" which yields for example Nand[p, p] instead of just Nand[p] (i.e. Not[p]). So here’s a representation of Or in terms of Nand:
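
Concretely, that’s something like this (a sketch of the input described):

  BooleanConvert[Or[p, q], "BinaryNAND"]  (* e.g. Nand[Nand[p, p], Nand[q, q]] *)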

Making the Wolfram Compiler Easier to Use

Let’s say you have a piece of Wolfram Language code that you know you’re going to run a zillion times—so you want it to run absolutely as fast as possible. Well, you’ll want to make sure you’re doing the best algorithmic things you can (and making the best possible use of Wolfram Language superfunctions, etc.). And perhaps you’ll find it helpful to use things like DataStructure constructs. But ultimately if you really want your code to run absolutely as fast as your computer can make it, you’ll probably want to set it up so that it can be compiled using the Wolfram Compiler, directly to LLVM code and then machine code.
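
As a reminder of the basic workflow (a minimal sketch, not a real inner loop):

  cf = FunctionCompile[Function[Typed[x, "MachineInteger"], x^2 + 1]];
  cf[12]  (* runs as compiled machine code, giving 145 *)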

We’ve been developing the Wolfram Compiler for many years, and it’s becoming steadily more capable (and efficient). And for example it’s become increasingly important in our own internal development efforts. In the past, when we wrote critical inner-loop internal code for the Wolfram Language, we did it in C. But in the past few years we’ve almost completely transitioned instead to writing pure Wolfram Language code that we then compile with the Wolfram Compiler. And the result of this has been a dramatically faster and more reliable development pipeline for writing inner-loop code.

Ultimately what the Wolfram Compiler needs to do is to take the code you write and align it with the low-level capabilities of your computer, figuring out what low-level data types can be used for what, etc. Some of this can be done automatically (using all sorts of fancy symbolic and theorem-proving-like techniques). But some needs to be based on collaboration between the programmer and the compiler. And in Version 14.1 we’re adding several important ways to enhance that collaboration.

The first thing is that it’s now easy to get access to information the compiler has. For example, here’s the type declaration the compiler has for the built-in function Dimensions:

And here’s the source code of the actual implementation the compiler is using for Dimensions, calling its intrinsic low-level internal functions like CopyTo:

Compiler source code

A function like Map has a vastly more complex set of type declarations:

For types themselves, CompilerInformation lets you see their type hierarchy:

And for data structure types, you can do things like see the fields they contain, and the operations they support:

And, by the way, something new in Version 14.1 is the function OperationDeclaration which lets you declare operations to add to a data structure type you’ve defined.

Once you actually start running the compiler, a convenient new feature in Version 14.1 is a detailed progress monitor that lets you see what the compiler is doing at each step:

As we said, the key to compilation is figuring out how to align your code with the low-level capabilities of your computer. The Wolfram Language can do arbitrary symbolic operations. But many of those don’t align with low-level capabilities of your computer, and can’t meaningfully be compiled. Sometimes those failures to align are the result of sophistication that’s possible only with symbolic operations. But sometimes the failures can be avoided if you “unpack” things a bit. And sometimes the failures are just the result of programming mistakes. And now in Version 14.1 the Wolfram Compiler is starting to be able to annotate your code to show where the misalignments are happening, so you can go through and figure out what to do with them. (It’s something that’s uniquely possible because of the symbolic structure of the Wolfram Language and even more so of Wolfram Notebooks.)

Here’s a very simple example:

Misalignment error message

In compiled code, Sin expects a numerical argument, so a Boolean argument won’t work. Clicking the Source button lets you see where specifically something went wrong:

Error source

If you have several levels of definitions, the Source button will show you the whole chain:

Here’s a slightly more complicated piece of code, in which the specific place where there’s a problem is highlighted:

In a typical workflow you might start from pure Wolfram Language code, without Typed and other compilation information. Then you start adding such information, repeatedly trying the compilation, seeing what issues arise, and fixing them. And, by the way, because it’s completely efficient to call small pieces of compiled code within ordinary Wolfram Language code, it’s common to start by annotating and compiling the “innermost inner loops” in your code, and gradually “working outwards”.

But, OK, let’s say you’ve successfully compiled a piece of code. Most of the time it’ll handle certain cases, but not others (for example, it might work fine with machine-precision numbers, but not be capable of handling arbitrary precision). By default, compiled code that’s running is set up to generate a message and revert to ordinary Wolfram Language evaluation if it can’t handle something:

But in Version 14.1 there’s a new option CompilerRuntimeErrorAction that lets you specify an action to take (or, in general, a function to apply) whenever a runtime error occurs. A setting of None aborts the whole computation if there’s a runtime error:
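
For example, something like this (a sketch; it assumes the option is given directly to FunctionCompile):

  cf = FunctionCompile[Function[Typed[n, "MachineInteger"], {1, 2, 3}[[n]]],
    CompilerRuntimeErrorAction -> None];
  cf[10]  (* an out-of-range part extraction is a runtime error; with None the whole computation aborts *)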

Even Smoother Integration with External Languages

Let’s say there’s some functionality you want to use, but the only implementation you have is in a package in some external language, like Python. Well, it’s now basically seamless to work with such functionality directly in the Wolfram Language—plugging into the whole symbolic framework and functionality of the Wolfram Language.

As a simple example, here’s a function that uses the Python package faker to produce a random sentence (which of course would also be straightforward to do directly in Wolfram Language):
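
Here’s a sketch of the kind of definition involved (it assumes the faker package can be loaded in your Python environment; the automatic dependency handling described below may involve specifications not shown here):

  RandomSentence = ExternalFunction["Python",
    "def random_sentence(): from faker import Faker; return Faker().sentence()"];
  RandomSentence[]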

The first time you run RandomSentence, the progress monitor will show you all sorts of messy things happening under the hood, as Python versions get loaded, dependencies get set up, and so on. But the point is that it’s all automatic, and so you don’t have to worry about it. And in the end, out pops the answer:

And if you run the function again, all the setup will already have been done, and the answer will pop out immediately:

An important piece of automation here is the conversion of data types. One of the great things about the Wolfram Language is that it has fully integrated symbolic representations for a very wide range of things—from videos to molecules to IP addresses. And when there are standard representations for these things in a language like Python, we’ll automatically convert to and from them.

But particularly with more sophisticated packages, there’ll be a need to let the package deal with its own “external objects” that are basically opaque to the Wolfram Language, but can be handled as atomic symbolic constructs there.

For example, let’s say we’ve started a Python external package chess (and, yes, there’s a paclet in the Wolfram Paclet Repository that has considerably more chess functionality):

Now the state of a chessboard can be represented by an external object:
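
Schematically, the setup is something like this (a sketch; it assumes the Python chess package is installed in the external environment):

  session = StartExternalSession["Python"];
  ExternalEvaluate[session, "import chess"];
  board = ExternalEvaluate[session, "chess.Board()"]  (* comes back as an ExternalObject *)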

We can define a function to plot the board:

And now in Version 14.1 you can just pass your external object to the external function:

You can also directly extract attributes of the external object:

And you can call methods (here to make a chess move), changing the state of the external object:

Here’s a plot of a new board configuration:

This computes all legal moves from the current position, representing them as external objects:

Here are UCI string representations of these:

In what we’re doing here we’re immediately performing each external operation. But Version 14.1 introduces the construct ExternalOperation which lets you symbolically represent an external operation, and for example build up collections of such operations that can all be performed together in a single external evaluation. ExternalObject supports various built-in operations for each environment. So, for example, in Python we can use Call and GetAttribute to get the symbolic representation:

If we evaluate this, all these operations will get done together in the external environment:

Standalone Wolfram Language Applications!

Let’s say you’re writing an application in pretty much any programming language—and inside it you want to call Wolfram Language functionality. Well, you could always do that by using a web API served from the Wolfram Cloud. And you could also do it locally by running the Wolfram Engine. But in Version 14.1 there’s something new: a way of integrating a standalone Wolfram Language runtime right into your application. The Wolfram Language runtime is a dynamic library that you link into your program, and then call using a C-based API. How big is the runtime? Well, it depends on what you want to use in the Wolfram Language. Because we now have the technology to prune a runtime to include only capabilities needed for particular Wolfram Language code. And the result is that adding the Wolfram Language will often increase the disk requirements of your application only by a remarkably small amount—like just a few hundred megabytes or even less. And, by the way, you can distribute the Wolfram runtime as an integrated part of an application, with its users not needing their own licenses to run it.

OK, so how does creating a standalone Wolfram-enabled application actually work? There’s a lot of software engineering (associated with the Wolfram Language runtime, how it’s called, etc.) under the hood. But at the level of the application programmer you only have to deal with our Standalone Applications SDK—whose interface is rather simple.

As an example, here’s the C code part of a standalone application that uses the Wolfram Language to identify what (human) language a piece of text is in. The program here takes a string of text on its command line, then runs the Wolfram Language LanguageIdentify function on it, and then prints a string giving the result:

C code using Wolfram Language

If we ignore issues of pruning, etc. we can compile this program with a single standard compiler command (and, yes, the file paths are necessarily a bit long):

Compiled C program

Now we can run the resulting executable directly from the command line—and it’ll act just like any other executable, even though inside it’s got all the power of a Wolfram Language runtime:

Command-line executable

If we look at the C program above, it basically begins just by starting the Wolfram Language runtime (using WLR_SDK_START_RUNTIME()). But then it takes the string (argv[1]) from the command line, embeds it in a Wolfram Language expression LanguageIdentify[string], evaluates this expression, and extracts a raw string from the result.

The functions, etc. that are involved here are part of the new Expression API supported by the Wolfram Language runtime dynamic library. The Expression API provides very clean capabilities for building up and taking apart Wolfram Language expressions from C. There are functions like wlr_Symbol("string") that form symbols, as well as macros like wlr_List(elem1, elem2, …) and wlr_E(head, arg1, arg2, …) that build up lists and general expressions. Then there’s the function wlr_Eval(expr) that calls the Wolfram Language evaluator. With functions like wlr_StringData(expr, &result, …) you can then extract content from expressions (here the characters in a string) and put it into C data structures.

How does the Expression API relate to WSTP? WSTP (“Wolfram Symbolic Transfer Protocol”) is our protocol for transferring symbolic expressions between processes. The Expression API, on the other hand, operates within a single process, providing the “glue” that connects C code to expressions in the Wolfram Language runtime.

One example of a real-world use of our new Standalone Applications technology is the LSPServer application that will soon be in full distribution. LSPServer started from a pure (though somewhat lengthy) Wolfram Language paclet that provides Language Server Protocol services for annotating Wolfram Language code in programs like Visual Studio Code. To build the LSPServer standalone application we just wrote a tiny C program that calls the paclet, then compiled this and linked it against our Standalone Applications SDK. Along the way (using tools that we’re planning to soon make available)—and based on the fact that only a small part of the full functionality of the Wolfram Language is needed to support LSPServer—we pruned the Wolfram Language runtime, in the end getting a complete LSPServer application that’s only about 170 MB in size, and that shows no outside signs of having Wolfram Language functionality inside.

And Yet More…

Is that all? Well, no. There’s more. Like new formatting of Root objects (yes, I was frustrated with the old one). Or like a new drag-and-drop-to-answer option for QuestionObject quizzes. Or like all the documentation we’ve added for new types of entities and interpreters.

In addition, there’s also the continual stream of new data that we’ve curated, or that’s flowed in real time into the Wolfram Knowledgebase. And beyond the core Wolfram Language itself, there’ve also been lots of functions added to the Wolfram Function Repository, lots of paclets added to the Wolfram Language Paclet Repository, not to mention new entries in the Wolfram Neural Net Repository, Wolfram Data Repository, etc.

Yes, as always it’s been a lot of work. But today it’s here, and we’re proud of it: Version 14.1!