
Launching Version 14.2 of Wolfram Language & Mathematica: Big Data Meets Computation & AI

2025-01-24 03:00:09

The Drumbeat of Releases Continues…

Just under six months ago (176 days ago, to be precise) we released Version 14.1. Today I’m pleased to announce that we’re releasing Version 14.2, delivering the latest from our R&D pipeline.

This is an exciting time for our technology, both in terms of what we’re now able to implement, and in terms of how our technology is now being used in the world at large. A notable feature of these times is the increasing use of Wolfram Language not only by humans, but also by AIs. And it’s very nice to see that all the effort we’ve put into consistent language design, implementation and documentation over the years is now paying dividends in making Wolfram Language uniquely valuable as a tool for AIs—complementing their own intrinsic capabilities.

But there’s another angle to AI as well. With our Wolfram Notebook Assistant launched last month we’re using AI technology (plus a lot more) to provide what amounts to a conversational interface to Wolfram Language. As I described when we released Wolfram Notebook Assistant, it’s something extremely useful for experts and beginners alike, but ultimately I think its most important consequence will be to accelerate the ability to go from any field X to “computational X”—making use of the whole tower of technology we’ve built around Wolfram Language.

So, what’s new in 14.2? Under the hood there are changes to make Wolfram Notebook Assistant more efficient and more streamlined. But there are also lots of extensions and enhancements to the user-visible parts of the Wolfram Language. In total there are 80 completely new functions—along with 177 functions that have been substantially updated.

There are continuations of long-running R&D stories, like additional functionality for video, and additional capabilities around symbolic arrays. Then there are completely new areas of built-in functionality, like game theory. But the largest new development in Version 14.2 is around handling tabular data, and particularly, big tabular data. It’s a whole new subsystem for Wolfram Language, with powerful consequences throughout the system. We’ve been working on it for quite a few years, and we’re excited to be able to release it for the first time in Version 14.2.

Talking of working on new functionality: starting more than seven years ago we pioneered the concept of open software design, livestreaming our software design meetings. And, for example, since the release of Version 14.1, we’ve done 43 software design livestreams, for a total of 46 hours (I’ve also done 73 hours of other livestreams in that time). Some of the functionality that’s now in Version 14.2 we started work on quite a few years ago. But we’ve been livestreaming long enough that pretty much anything that’s now in Version 14.2 we designed live and in public on a livestream at some time or another. It’s hard work doing software design (as you can tell if you watch the livestreams). But it’s always exciting to see the fruits of those efforts come to fruition in the system we’ve been progressively building for so long. And so, today, it’s a pleasure to be able to release Version 14.2 and to let everyone use the things we’ve been working so hard to build.

Notebook Assistant Chat inside Any Notebook

Last month we released the Wolfram Notebook Assistant to “turn words into computation”—and help experts and novices alike make broader and deeper use of Wolfram Language technology. In Version 14.1 the primary way to use Notebook Assistant is through the separate “side chat” Notebook Assistant window. But in Version 14.2 “chat cells” have become a standard feature of any notebook available to anyone with a Notebook Assistant subscription.

Just type as the first character of any cell, and it’ll become a chat cell:

Chat cell

Now you can start chatting with the Notebook Assistant:

With the side chat you have a “separate channel” for communicating with the Notebook Assistant—that won’t, for example, be saved with your notebook. With chat cells, your chat becomes an integral part of the notebook.

We actually first introduced Chat Notebooks in the middle of 2023—just a few months after the arrival of ChatGPT. Chat Notebooks defined the interface, but at the time, the actual content of chat cells was purely from external LLMs. Now in Version 14.2, chat cells are not limited to separate Chat Notebooks, but are available in any notebook. And by default they make use of the full Notebook Assistant technology stack, which goes far beyond a raw LLM. In addition, once you have a Notebook Assistant + LLM Kit subscription, you can seamlessly use chat cells; no account with external LLM providers is needed.

The chat cell functionality in Version 14.2 inherits all the features of Chat Notebooks. For example, typing ~ in a new cell creates a chat break, that lets you start a “new conversation”. And when you use a chat cell, it’s able to see anything in your notebook up to the most recent chat break. (By the way, when you use Notebook Assistant through side chat it can also see what selection you’ve made in your “focus” notebook.)

By default, chat cells are “talking” to the Notebook Assistant. But if you want, you can also use them to talk to external LLMs, just like in our original Chat Notebook—and there’s a convenient menu to set that up. Of course, if you’re using an external LLM, you don’t have all the technology that’s now in the Notebook Assistant, and unless you’re doing LLM research, you’ll typically find it much more useful and valuable to use chat cells in their default configuration—talking to the Notebook Assistant.

Bring Us Your Gigabytes! Introducing Tabular

Lists, associations, datasets. These are very flexible ways to represent structured collections of data in the Wolfram Language. But now in Version 14.2 there’s another: Tabular. Tabular provides a very streamlined and efficient way to handle tables of data laid out in rows and columns. And when we say “efficient” we mean that it can routinely juggle gigabytes of data or more, both in core and out of core.

Let’s do an example. Let’s start off by importing some tabular data:

This is data on trees in New York City, 683,788 of them, each with 45 properties (sometimes missing). Tabular introduces a variety of new ideas. One of them is treating tabular columns much like variables. Here we’re using this to make a histogram of the values of the "tree_dbh" column in this Tabular:

You can think of a Tabular as being like an optimized form of a list of associations, where each row consists of an association whose keys are column names. Functions like Select then just work on Tabular:

Length gives the number of rows:

CountsBy treats the Tabular as a list of associations, extracting the value associated with the key "spc_latin" (“Latin species”) in each association, and counting how many times that value occurs ("spc_latin" here is short for #"spc_latin"&):
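Since the notebook outputs aren’t reproduced here, here is a rough sketch of what these first steps might look like in code (the file name is a placeholder, and the part-extraction form used to get at a column is my assumption based on the descriptions in this post):

    trees = Import["2015_street_tree_census.csv", "Tabular"];

    (* histogram of the "tree_dbh" column, treating the column like a variable *)
    Histogram[trees[All, "tree_dbh"]]

    (* each row behaves like an association, so Select just works *)
    pines = Select[trees, #"spc_latin" === "Pinus virginiana" &];
    Length[pines]

    (* count how many trees there are of each Latin species *)
    CountsBy[trees, "spc_latin"]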

To get the names of the columns we can use the new function ColumnKeys:

Viewing Tabular as being like a list of associations we can extract parts—giving first a specification of rows, and then a specification of columns:

There are lots of new operations that we’ve been able to introduce now that we have Tabular. An example is AggregateRows, which constructs a new Tabular from a given Tabular by aggregating groups of rows, in this case ones with the same value of "spc_latin", and then applying a function to those rows, in this case finding the mean value of "tree_dbh":

An operation like ReverseSortBy then “just works” on this table, here reverse sorting by the value of "meandbh":
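A sketch of that aggregation step (the argument layout of AggregateRows and the ColumnwiseValue form shown here are assumptions pieced together from the descriptions in this post, not verbatim from it):

    (* group rows by "spc_latin" and compute the mean "tree_dbh" in each group *)
    bySpecies = AggregateRows[trees, "spc_latin",
       {"meandbh" -> ColumnwiseValue[Mean, "tree_dbh"]}];

    (* ReverseSortBy then "just works" on the aggregated Tabular *)
    ReverseSortBy[bySpecies, "meandbh"]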

Here we’re making an ordinary matrix out of a small slice of data from our Tabular:

And now we can plot the result, giving the positions of Virginia pine trees in New York City:

When should you use a Tabular, rather than, say a Dataset? Tabular is specifically set up for data that is arranged in rows and columns—and it supports many powerful operations that make sense for data in this “rectangular” form. Dataset is more general; it can have an arbitrary hierarchy of data dimensions, and so can’t in general support all the “rectangular” data operations of Tabular. In addition, by being specialized for “rectangular” data, Tabular can also be much more efficient, and indeed we’re making use of the latest type-specific methods for large-scale data handling.

If you use TabularStructure you can see some of what lets Tabular be so efficient. Every column is treated as data of a specific type (and, yes, the types are consistent with the ones in the Wolfram Language compiler). And there’s streamlined treatment of missing data (with several new functions added specifically to handle this):

What we’ve seen so far is Tabular operating with “in-core” data. But you can quite transparently also use Tabular on out-of-core data, for example data stored in a relational database.

Here’s an example of what this looks like:

It’s a tabular that points to a table in a relational database. It doesn’t by default explicitly display the data in the Tabular (and in fact it doesn’t even get it into memory—because it might be huge and might be changing quickly as well). But you can still specify operations just like on any other Tabular. This finds out what columns are there:

And this specifies an operation, giving the result as a symbolic out-of-core Tabular object:

You can “resolve” this, and get an explicit in-memory Tabular using ToMemory:

Manipulating Data in Tabular

Let’s say you’ve got a Tabular—like this one based on penguins:

There are lots of operations you can do that manipulate the data in this Tabular in a structured way—giving you back another Tabular. For example, you could just take the last 2 rows of the Tabular:

Or you could sample 3 random rows:

Other operations depend on the actual content of the Tabular. And because you can treat each row like an association, you can set up functions that effectively refer to elements by their column names:

Note that we can always use #[name] to refer to elements in a column. If name is an alphanumeric string then we can also use the shorthand #name. And for other strings, we can use #"name". Some functions let you just use "name" to indicate the function #["name"]:
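As a quick illustration of these equivalent forms (a sketch against the penguin Tabular above; "Adelie" is just an example value):

    (* three ways to refer to the same column *)
    Select[penguins, #species === "Adelie" &]
    Select[penguins, #["species"] === "Adelie" &]
    CountsBy[penguins, "species"]    (* here "species" stands for #["species"] & *)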

So far we’ve talked only about arranging or selecting rows in a Tabular. What about columns? Here’s how we can construct a tabular that has just two of the columns from our original Tabular:

What if we don’t just want existing columns, but instead want new columns that are functions of these? ConstructColumns lets us define new columns, giving their names and the functions to be used to compute values in them:

(Note the trick of writing out Function to avoid having to put parentheses, as in "species" → (StringTake[#species,1]&).)

ConstructColumns lets you take an existing Tabular and construct a new one. TransformColumns lets you transform columns in an existing Tabular, here replacing species names by their first letters:

TransformColumns also lets you add new columns, specifying the content of the columns just like in ConstructColumns. But where does TransformColumns put your new columns? By default, they go at the end, after all existing columns. But if you specifically list an existing column, that’ll be used as a marker to determine where to put the new column ("name" → Nothing removes a column):
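A sketch of the two functions (the rule-based forms follow the descriptions above; the exact syntax, and the "body_mass_g" column name, are assumptions):

    (* build a new Tabular containing only derived columns *)
    ConstructColumns[penguins,
      {"initial" -> Function[StringTake[#species, 1]],
       "mass_kg" -> Function[#"body_mass_g"/1000]}]

    (* transform one column in place and drop another *)
    TransformColumns[penguins,
      {"species" -> Function[StringTake[#species, 1]], "island" -> Nothing}]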

Everything we’ve seen so far operates separately on each row of a Tabular. But what if we want to “gulp in” a whole column to use in our computation—say, for example, computing the mean of a whole column, then subtracting it from each value. ColumnwiseValue lets you do this, by supplying to the function (here Mean) a list of all the values in whatever column or columns you specify:

ColumnwiseValue effectively lets you compute a scalar value by applying a function to a whole column. There’s also ColumnwiseThread, which lets you compute a list of values, that will in effect be “threaded” into a column. Here we’re creating a column from a list of accumulated values:

By the way, as we’ll discuss below, if you’ve externally generated a list of values (of the right length) that you want to use as a column, you can do that directly by using InsertColumns.

There’s another concept that’s very useful in practice in working with tabular data, and that’s grouping. In our penguin data, we’ve got an individual row for each penguin of each species. But what if we want instead to aggregate all the penguins of a given species, for example computing their average body mass? Well, we can do this with AggregateRows. AggregateRows works like ConstructColumns in the sense that you specify columns and their contents. But unlike ConstructColumns it creates new “aggregated” rows:

What is that first column here? The gray background of its entries indicates that it’s what we call a “key column”: a column whose entries (perhaps together with other key columns) can be used to reference rows. And later, we’ll see how you can use RowKey to indicate a row by giving a value from a key column:

But let’s go on with our aggregation efforts. Let’s say that we want to group not just by species, but also by island. Here’s how we can do that with AggregateRows:

In a sense what we have here is a table whose rows are specified by pairs of values (here “species” and “island”). But it’s often convenient to “pivot” things so that these values are used respectively for rows and for columns. And you can do that with PivotTable:

Note the —’s, which indicate missing values; apparently there are no Gentoo penguins on Dream island, etc.

PivotTable normally gives exactly the same data as AggregateRows, but in a rearranged form. One additional feature of PivotTable is the option IncludeGroupAggregates which includes All entries that aggregate across each type of group:

If you have multiple functions that you’re computing, AggregateRows will just give them as separate columns:

PivotTable can also deal with multiple functions—by creating columns with “extended keys”:

And now you can use RowKey and ExtendedKey to refer to elements of the resulting Tabular:
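Putting the grouping and pivoting pieces together, here is a heavily hedged sketch (every argument layout in it is an assumption based on the prose above, and the column names are the usual penguin-dataset ones):

    (* group by species and island, computing mean body mass per group *)
    agg = AggregateRows[penguins, {"species", "island"},
       {"meanmass" -> ColumnwiseValue[Mean, "body_mass_g"]}];

    (* pivot so species label the rows and islands label the columns *)
    piv = PivotTable[penguins, "species", "island",
       {"meanmass" -> ColumnwiseValue[Mean, "body_mass_g"]}];

    (* reference one element via its row key and extended column key *)
    piv[RowKey["Adelie"], ExtendedKey["meanmass", "Dream"]]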

Getting Data into Tabular

We’ve seen some of the things you can do when you have data as a Tabular. But how does one get data into a Tabular? There are several ways. The first is just to convert from structures like lists and associations. The second is to import from a file, say a CSV or XLSX (or, for larger amounts of data, Parquet)—or from an external data store (S3, Dropbox, etc.). And the third is to connect to a database. You can also get data for Tabular directly from the Wolfram Knowledgebase or from the Wolfram Data Repository.

Here’s how you can convert a list of lists into a Tabular:

And here’s how you can convert back:

It works with sparse arrays too, here instantly creating a million-row Tabular

that takes 80 MB to store:

Here’s what happens with a list of associations:

You can get the same Tabular by entering its data and its column names separately:

By the way, you can convert a Tabular to a Dataset

and in this simple case you can convert it back to a Tabular too:

In general, though, there are all sorts of options for how to convert lists, datasets, etc. to Tabular objects—and ToTabular is set up to let you control these. For example, you can use ToTabular to create a Tabular from columns rather than rows:
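A sketch of these conversions (whether the constructor here should be Tabular or ToTabular, and the exact form of the column-wise call, are assumptions on my part):

    (* list of lists plus column names -> Tabular; Normal converts back *)
    tab = ToTabular[{{1, "a"}, {2, "b"}, {3, "c"}}, {"n", "letter"}];
    Normal[tab]    (* a list of associations, one per row *)

    (* building the same thing from columns rather than rows *)
    ToTabular[<|"n" -> {1, 2, 3}, "letter" -> {"a", "b", "c"}|>, "Columns"]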

How about external data? In Version 14.2 Import now supports a "Tabular" element for tabular data formats. So, for example, given a CSV file

CSV file

Import can immediately import it as a Tabular:

This works very efficiently even for huge CSV files with millions of entries. It also does well at automatically identifying column names and headers. The same kind of thing works with more structured files, like ones from spreadsheets and statistical data formats. And it also works with modern columnar storage formats like Parquet, ORC and Arrow.
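The basic Import calls look like this (the file names are placeholders):

    Import["trees.csv", "Tabular"]
    Import["measurements.parquet", "Tabular"]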

Import transparently handles both ordinary files, and URLs (and URIs), requesting authentication if needed. In Version 14.2 we’re adding the new concept of DataConnectionObject, which provides a symbolic representation of remote data, essentially encapsulating all the details of how to get the data. So, for example, here’s a DataConnectionObject for an S3 bucket, whose contents we can immediately import:

(In Version 14.2 we’re supporting Amazon S3, Azure Blob Storage, Dropbox, IPFS—with many more to come. And we’re also planning support for data warehouse connections, APIs, etc.)

But what about data that’s too big—or too fast-changing—to make sense to explicitly import? An important feature of Tabular (mentioned above) is that it can transparently handle external data, for example in relational databases.

Here’s a reference to a large external database:

RelationalDatabase

This defines a Tabular that points to a table in the external database:

tab = Tabular

We can ask for the dimensions of the Tabular—and we see that it has 158 million rows:

Dimensions

The table we’re looking at happens to be all the line-oriented data in OpenStreetMap. Here are the first 3 rows and 10 columns:

ToMemory

Most operations on the Tabular will now actually get done in the external database. Here we’re asking to select rows whose “name” field contains "Wolfram":

Select

The actual computation is only done when we use ToMemory, and in this case (because there’s a lot of data in the database) it takes a little while. But soon we get the result, as a Tabular:

ToMemory

And we learn that there are 58 Wolfram-named items in the database:

Length

Another source of data for Tabular is the built-in Wolfram Knowledgebase. In Version 14.2 EntityValue supports direct output in Tabular form:

The Wolfram Knowledgebase provides lots of good examples of data for Tabular. And the same is true of the Wolfram Data Repository—where you can typically just apply Tabular to get data in Tabular form:
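A sketch of both routes (the "Tabular" output specification for EntityValue is described above; the particular entities, properties and repository item are just examples I’ve picked):

    (* Knowledgebase data directly as a Tabular *)
    EntityValue[EntityClass["Planet", All], {"Mass", "Radius"}, "Tabular"]

    (* a Wolfram Data Repository item, wrapped in Tabular *)
    Tabular[ResourceData["Fireballs and Bolides"]]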

Cleaning Data for Tabular

In many ways it’s the bane of data science. Yes, data is in digital form. But it’s not clean; it’s not computable. The Wolfram Language has long been a uniquely powerful tool for flexibly cleaning data (and, for example, for advancing through the ten levels of making data computable that I defined some years ago).

But now, in Version 14.2, with Tabular, we have a whole new collection of streamlined capabilities for cleaning data. Let’s start by importing some data “from the wild” (and, actually, this example is cleaner than many):

(By the way, if there was really crazy stuff in the file, we might have wanted to use the option MissingValuePattern to specify a pattern that would just immediately replace the crazy stuff with Missing[].)

OK, but let’s start by surveying what came in here from our file, using TabularStructure:

We see that Import successfully managed to identify the basic type of data in most of the columns—though for example it can’t tell if numbers are just numbers or are representing quantities with units, etc. And it also identifies that some number of entries in some columns are “missing”.

As a first step in data cleaning, let’s get rid of what seems like an irrelevant "id" column:

Next, we see that the elements in the first column are being identified as strings—but they’re really dates, and they should be combined with the times in the second column. We can do this with TransformColumns, removing what’s now an “extra column” by replacing it with Nothing:

Looking at the various numerical columns, we see that they’re really quantities that should have units. But first, for convenience, let’s rename the last two columns:

Now let’s turn the numerical columns into columns of quantities with units, and, while we’re at it, also convert from °C to °F:

Here’s how we can now plot the temperature as a function of time:

There’s a lot of wiggling there. And looking at the data we see that we’re getting temperature values from several different weather stations. This selects data from a single station:

What’s the break in the curve? If we just scroll to that part of the tabular we’ll see that it’s because of missing data:

So what can we do about this? Well, there’s a powerful function TransformMissing that provides many options. Here we’re asking it to interpolate to fill in missing temperature values:

And now there are no gaps, but, slightly mysteriously, the whole plot extends further:

The reason is that it’s interpolating even in cases where basically nothing was measured. We can remove those rows using Discard:

And now we won’t have that “overhang” at the end:

Sometimes there’ll explicitly be data that’s missing; sometimes (more insidiously) the data will just be wrong. Let’s look at the histogram of pressure values for our data:

Oops. What are those small values? Presumably they’re wrong. (Perhaps they were transcription errors?) We can remove such “anomalous” values by using TransformAnomalies. Here we’re telling it to just completely trim out any row where the pressure was “anomalous”:

We can also get TransformAnomalies to try to “fix” the data. Here we’re just replacing any anomalous pressure by the previous pressure listed in the tabular:

You can also tell TransformAnomalies to “flag” any anomalous value and make it “missing”. But, if we’ve got missing values what then happens if we try to do computations on them? That’s where MissingFallback comes in. It’s fundamentally a very simple function—that just returns its first non-missing argument:

But even though it’s simple, it’s important in making it easy to handle missing values. So, for example, this computes a “northspeed”, falling back to 0 if data needed for the computation is missing:
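For example (the derived-column part of this sketch assumes the TransformColumns syntax from above, and the "weather", "windspeed" and "winddir" names are placeholders):

    MissingFallback[Missing["NotAvailable"], 0]    (* -> 0 *)
    MissingFallback[3.2, 0]                        (* -> 3.2 *)

    (* treat missing wind data as 0 in a per-row "northspeed" computation *)
    TransformColumns[weather, {"northspeed" ->
        Function[MissingFallback[#windspeed, 0]*Cos[MissingFallback[#winddir, 0] Degree]]}]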

The Structure of Tabular

We’ve said that a Tabular is “like” a list of associations. And, indeed, if you apply Normal to it, that’s what you’ll get:

But internally Tabular is stored in a much more compact and efficient way. And it’s useful to know something about this, so you can manipulate Tabular objects without having to “take them apart” into things like lists and associations. Here’s our basic sample Tabular:

What happens if we extract a row? Well, we get a TabularRow object:

If we apply Normal, we get an association:

Here’s what happens if we instead extract a column:

Now Normal gives a list:

We can create a TabularColumn from a list:

Now we can use InsertColumns to insert a symbolic column like this into an existing Tabular (including the "b" tells InsertColumns to insert the new column after the “b” column):

But what actually is a Tabular inside? Let’s look at the example:

TabularStructure gives us a summary of the internal structure here:

The first thing to notice is that everything is stated in terms of columns, reflecting the fact that Tabular is a fundamentally column-oriented construct. And part of what makes Tabular so efficient is then that within a column everything is uniform, in the sense that all the values are the same type of data. In addition, for things like quantities and dates, we factor the data so that what’s actually stored internally in the column is just a list of numbers, with a single copy of “metadata information” on how to interpret them.

And, yes, all this has a big effect. Like here’s the size in bytes of our New York trees Tabular from above:

But if we turn it into a list of associations using Normal, the result is about 14x larger:

OK, but what are those “column types” in the tabular structure? ColumnTypes gives a list of them:

These are low-level types of the kind used in the Wolfram Language compiler. And part of what knowing these does is that it immediately tells us what operations we can do on a particular column. And that’s useful both in low-level processing, and in things like knowing what kind of visualization might be possible.

When Import imports data from something like a CSV file, it tries to infer what type each column is. But sometimes (as we mentioned above) you’ll want to “cast” a column to a different type, specifying the “destination type” using Wolfram Language type description. So, for example, this casts column “b” to a 32-bit real number, and column “c” to units of meters:

By the way, when a Tabular is displayed in a notebook, the column headers indicate the types of data in the corresponding columns. So in this case, there’s a little marker in the first column to indicate that it contains strings. Numbers and dates basically just “show what they are”. Quantities have their units indicated. And general symbolic expressions (like column “f” here) get their own marker. (If you hover over a column header, it gives you more detail about the types.)

The next thing to discuss is missing data. Tabular always treats columns as being of a uniform type, but keeps an overall map of where values are missing. If you extract the column you’ll see a symbolic Missing:

But if you operate on the tabular column directly it’ll just behave as if the missing data is, well, missing:

By the way, if you’re bringing in data “from the wild”, Import will attempt to automatically infer the right type for each column. It knows how to deal with common anomalies in the input data, like NaN or null in a column of numbers. But if there are other weird things—like, say, notfound in the middle of a column of numbers—you can tell Import to turn such things into ordinary missing data by giving them as settings for the option MissingValuePattern.

There are a couple more subtleties to discuss in connection with the structure of Tabular objects. The first is the notion of extended keys. Let’s say we have the following Tabular:

We can “pivot this to columns” so that the values x and y become column headers, but “under” the overall column header “value”:

But what is the structure of this Tabular? We can use ColumnKeys to find out:

You can now use these extended keys as indices for the Tabular:

In this particular case, because the “subkeys” "x" and "y" are unique, we can just use those, without including the other part of the extended key:

Our final subtlety (for now) is somewhat related. It concerns key columns. Normally the way we specify a row in a Tabular object is just by giving its position. But if the values of a particular column happen to be unique, then we can use those instead to specify a row. Consider this Tabular:

The fruit column has the feature that each entry appears only once—so we can create a Tabular that uses this column as a key column:

Notice that the numbers for rows have now disappeared, and the key column is indicated with a gray background. In this Tabular, you can then reference a particular row using for example RowKey:

Equivalently, you can also use an association with the column name:

What if the values in a single column are not sufficient to uniquely specify a row, but several columns together are? (In a real-world example, say one column has first names, and another has last names, and another has dates of birth.) Well, then you can designate all those columns as key columns:

And once you’ve done that, you can reference a row by giving the values in all the key columns:
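Schematically (the Tabular names and the extra column names here are placeholders, and the square-bracket access form is my assumption):

    fruits[RowKey["apple"]]            (* the whole row keyed by "apple" *)
    fruits[<|"fruit" -> "apple"|>]     (* the same row, referenced via an association *)

    (* with several key columns, give a value for each of them *)
    people[<|"firstname" -> "Alice", "lastname" -> "Smith"|>]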

Tabular Everywhere

Tabular provides an important new way to represent structured data in the Wolfram Language. It’s powerful in its own right, but what makes it even more powerful is how it integrates with all the other capabilities in the Wolfram Language. Many functions just immediately work with Tabular. But in Version 14.2 hundreds have been enhanced to make use of the special features of Tabular.

Most often, it’s to be able to operate directly on columns in a Tabular. So, for example, given the Tabular

we can immediately make a visualization based on two of the columns:

If one of the columns has categorical data, we’ll recognize that, and plot it accordingly:

Another area where Tabular can immediately be used is machine learning. So, for example, this creates a classifier function that will attempt to determine the species of a penguin from other data about it:

Now we can use this classifier function to predict species from other data about a penguin:

We can also take the whole Tabular and make a feature space plot, labeling with species:

Or we could “learn the distribution of possible penguins”

and randomly generate 3 “fictitious penguins” from this distribution:
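A sketch of that machine learning workflow (Classify, LearnDistribution and RandomVariate are long-standing functions; the target-column form of Classify and the placeholder column names are assumptions about how the Tabular integration is used):

    cl = Classify[penguins -> "species"];
    cl[<|"island" -> "Biscoe", "bill_length_mm" -> 45.2, "body_mass_g" -> 4800|>]

    dist = LearnDistribution[penguins];
    RandomVariate[dist, 3]    (* three "fictitious penguins" *)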

Algebra with Symbolic Arrays

One of the major innovations of Version 14.1 was the introduction of symbolic arrays—and the ability to create expressions involving vector, matrix and array variables, and to take derivatives of them. In Version 14.2 we’re taking the idea of computing with symbolic arrays a step further—for the first time systematically automating what has in the past been the manual process of doing algebra with symbolic arrays, and simplifying expressions involving symbolic arrays.

Let’s start by talking about ArrayExpand. Our longstanding function Expand just deals with expanding ordinary multiplication, effectively of scalars—so in this case it does nothing:

But in Version 14.2 we also have ArrayExpand which will do the expansion:

ArrayExpand deals with many generalizations of multiplication that aren’t commutative:

In an example like this, we really don’t need to know anything about a and b. But sometimes we can’t do the expansion without, for example, knowing their dimensions. One way to specify those dimensions is as a condition in ArrayExpand:

An alternative is to use an explicit symbolic array variable:

In addition to expanding generalized products using ArrayExpand, Version 14.2 also supports general simplification of symbolic array expressions:

The function ArraySimplify will specifically do simplification on symbolic arrays, while leaving other parts of expressions unchanged. Version 14.2 supports many kinds of array simplifications:

We could do these simplifications without knowing anything about the dimensions of a and b. But sometimes we can’t go as far without knowing these. For example, if we don’t know the dimensions we get:

But with the dimensions we can explicitly simplify this to an n×n identity matrix:
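A sketch of the kinds of calls involved (how dimension assumptions get attached to ArraySimplify is my assumption; Matrices is the existing symbolic-array domain from Version 14.1):

    (* expansion over the non-commutative Dot product, with no assumptions needed *)
    ArrayExpand[(a + b) . (a + b)]

    (* simplification that needs the dimensions, here resolving to an identity matrix *)
    ArraySimplify[Inverse[a] . a, Element[a, Matrices[{n, n}]]]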

ArraySimplify can also take account of the symmetries of arrays. For example, let’s set up a symbolic symmetric matrix:

And now ArraySimplify can immediately resolve this:

The ability to do algebraic operations on complete arrays in symbolic form is very powerful. But sometimes it’s also important to look at individual components of arrays. And in Version 14.2 we’ve added ComponentExpand to let you get components of arrays in symbolic form.

So, for example this takes a 2-component vector and writes it out as an explicit list with two symbolic components:

Underneath, those components are represented using Indexed:

Here’s the determinant of a 3×3 matrix, written out in terms of symbolic components:

And here’s a matrix power:

Given two symbolic 3D vectors we can also for example form the cross product

and we can then go ahead and dot it into an inverse matrix:

Language Tune-Ups

As a daily user of the Wolfram Language I’m very pleased with how smoothly I find I can translate computational ideas into code. But the more we’ve made it easy to do, the more we can see new places where we can polish the language further. And in Version 14.2—like every version before it—we’ve added a number of “language tune-ups”.

A simple one—whose utility becomes particularly clear with Tabular—is Discard. You can think of it as a complement to Select: it discards elements according to the criterion you specify:
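For example, with Select shown for comparison:

    Select[{1, 2.5, x, 4}, NumberQ]     (* -> {1, 2.5, 4} *)
    Discard[{1, 2.5, x, 4}, NumberQ]    (* -> {x} *)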

And along with adding Discard, we’ve also enhanced Select. Normally, Select just gives a list of the elements it selects. But in Version 14.2 you can specify other results. Here we’re asking for the “index” (i.e. position) of the elements that NumberQ is selecting:

Something that can be helpful in dealing with very large amounts of data is getting a bit vector data structure from Select (and Discard), that provides a bit mask of which elements are selected or not:

By the way, here’s how you can ask for multiple results from Select and Discard:

In talking about Tabular we already mentioned MissingFallback. Another function related to code robustification and error handling is the new function Failsafe. Let’s say you’ve got a list which contains some “failed” elements. If you map a function f over that list, it’ll apply itself to the failure elements just as to everything else:

But quite possibly f wasn’t set up to deal with these kinds of failure inputs. And that’s where Failsafe comes in. Because Failsafe[f][x] is defined to give f[x] if x is not a failure, and to just return the failure if it is. So now we can map f across our list with impunity, knowing it’ll never be fed failure input:
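Schematically, with f left purely symbolic:

    Map[Failsafe[f], {1, Failure["err", <||>], 3}]
    (* -> {f[1], Failure["err", <||>], f[3]} *)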

Talking of tricky error cases, another new function in Version 14.2 is HoldCompleteForm. HoldForm lets you display an expression without doing ordinary evaluation of the expression. But—like Hold—it still allows certain transformations to get made. HoldCompleteForm—like HoldComplete—prevents all these transformations. So while HoldForm gets a bit confused here when the sequence “resolves”

HoldCompleteForm just completely holds and displays the sequence:

Another piece of polish added in Version 14.2 concerns Counts. I often find myself wanting to count elements in a list, including getting 0 when a certain element is missing. By default, Counts just counts elements that are present:

But in Version 14.2 we’ve added a second argument that lets you give a complete list of all the elements you want to count—even if they happen to be absent from the list:
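So, for example:

    Counts[{a, a, b}]             (* -> <|a -> 2, b -> 1|> *)
    Counts[{a, a, b}, {a, b, c}]  (* -> <|a -> 2, b -> 1, c -> 0|> *)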

As a final example of language tune-up in Version 14.2 I’ll mention AssociationComap. In Version 14.0 we introduced Comap as a “co-” (as in “co-functor”, etc.) analog of Map:

In Version 14.2 we’re introducing AssociationComap—the “co-” version of AssociationMap:

Think of it as a nice way to make labeled tables of things, as in:
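For example (the exact shape of the result is my reading of the analogy with AssociationMap):

    Comap[{Min, Mean, Max}, {1, 2, 3, 10}]             (* -> {1, 4, 10} *)
    AssociationComap[{Min, Mean, Max}, {1, 2, 3, 10}]  (* presumably <|Min -> 1, Mean -> 4, Max -> 10|> *)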

Brightening Our Colors; Spiffing Up for 2025

In 2014—for Version 10.0—we did a major overhaul of the default colors for all our graphics and visualization functions, coming up with what we felt was a good solution. (And as we’ve just noticed, somewhat bizarrely, it turned out that in the years that followed, many of the graphics and visualization libraries out there seemed to copy what we did!) Well, a decade has now passed, visual expectations (and display technologies) have changed, and we decided it was time to spiff up our colors for 2025.

Here’s what a typical plot looked like in Versions 10.0 through 14.1:

And here’s the same plot in Version 14.2:

By design, it’s still completely recognizable, but it’s got a little extra zing to it.

With more curves, there are more colors. Here’s the old version:

And here’s the new version:

Histograms are brighter too. The old:

And the new:

Here’s the comparison between old (“2014”) and new (“2025”) colors:

It’s subtle, but it makes a difference. I have to say that increasingly over the past few years, I’ve felt I had to tweak the colors in almost every Wolfram Language image I’ve published. But I’m excited to say that with the new colors that urge has gone away—and I can just use our default colors again!

LLM Streamlining & Streaming

We first introduced programmatic access to LLMs in Wolfram Language in the middle of 2023, with functions like LLMFunction and LLMSynthesize. At that time, these functions needed access to external LLM services. But with the release last month of LLM Kit (along with Wolfram Notebook Assistant) we’ve made these functions seamlessly available for everyone with a Notebook Assistant + LLM Kit subscription. Once you have your subscription, you can use programmatic LLM functions anywhere and everywhere in Version 14.2 without any further set up.

There are also two new functions: LLMSynthesizeSubmit and ChatSubmit. Both are concerned with letting you get incremental results from LLMs (and, yes, that’s important, at least for now, because LLMs can be quite slow). Like CloudSubmit and URLSubmit, LLMSynthesizeSubmit and ChatSubmit are asynchronous functions: you call them to start something that will call an appropriate handler function whenever a certain specified event occurs.

Both LLMSynthesizeSubmit and ChatSubmit support a whole variety of events. An example is "ContentChunkReceived": an event that occurs when there’s a chunk of content received from the LLM.

Here’s how one can use that:

The LLMSynthesizeSubmit returns a TaskObject, but then starts to synthesize text in response to the prompt you’ve given, calling the handler function you specified every time a chunk of text comes in. After a few moments, the LLM will have finished its process of synthesizing text, and if you ask for the value of c you’ll see each of the chunks it produced:

Let’s try this again, but now setting up a dynamic display for a string s and then running LLMSynthesizeSubmit to accumulate the synthesized text into this string:
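A sketch of what that pattern might look like (the HandlerFunctions option and the "ContentChunk" key are assumptions by analogy with URLSubmit; the prompt is a placeholder):

    s = "";
    Dynamic[s]    (* a dynamically updating display of the accumulated text *)

    LLMSynthesizeSubmit["Write a limerick about Tabular.",
      HandlerFunctions -> <|
        "ContentChunkReceived" -> Function[s = s <> #["ContentChunk"]]|>]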

ChatSubmit is the analog of ChatEvaluate, but asynchronous—and you can use it to create a full chat experience, in which content is streaming into your notebook as soon as the LLM (or tools called by the LLM) generate it.

Streamlining Parallel Computation: Launch All the Machines!

For nearly 20 years we’ve had a streamlined capability to do parallel computation in Wolfram Language, using functions like ParallelMap, ParallelTable and Parallelize. The parallel computation can happen on multiple cores on a single machine, or across many machines on a network. (And, for example, in my own current setup I have 7 machines right now with a total of 204 cores. )

In the past few years, partly responding to the increasing number of cores typically available on individual machines, we’ve been progressively streamlining the way that parallel computation is provisioned. And in Version 14.2 we’ve, yes, parallelized the provisioning of parallel computation. Which means, for example, that my 7 machines all start their parallel kernels in parallel—so that the whole process is now finished in a matter of seconds, rather than potentially taking minutes, as it did before:

Another new feature for parallel computation in Version 14.2 is the ability to automatically parallelize across multiple variables in ParallelTable. ParallelTable has always had a variety of algorithms for optimizing the way it splits up computations for different kernels. Now that’s been extended so that it can deal with multiple variables:
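For example:

    (* both iteration variables are split across the available kernels automatically *)
    ParallelTable[Mean[RandomReal[1, 10^5]], {i, 8}, {j, 8}]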

As someone who very regularly does large-scale computations with the Wolfram Language it’s hard to overstate how seamlessly important its parallel computation capabilities have been to me. Usually I’ll first figure out a computation with Map, Table, etc. Then when I’m ready to do the full version I’ll swap in ParallelMap, ParallelTable, etc. And it’s remarkable how much difference a 200x increase in speed makes (assuming my computation doesn’t have too much communication overhead).

(By the way, talking of communication overhead, two new functions in Version 14.2 are ParallelSelect and ParallelCases, which allow you to select and find cases in lists in parallel, saving communication overhead by sending only final results back to the master kernel. This functionality has actually been available for a while through Parallelize[ Select[ ] ] etc., but it’s streamlined in Version 14.2.)
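In sketch form (assuming their arguments mirror Select and Cases):

    ParallelSelect[Range[10^6], PrimeQ]           (* only the selected elements come back *)
    ParallelCases[RandomInteger[100, 10^6], 7]    (* only the matching cases come back *)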

Follow that ____! Tracking in Video

Let’s say we’ve got a video, for example of people walking through a train station. We’ve had the capability for some time to take a single frame of such a video, and find the people in it. But in Version 14.2 we’ve got something new: the capability to track objects that move around between frames of the video.

Let’s start with a video:



We could take an individual frame, and find image bounding boxes. But as of Version 14.2 we can just apply ImageBoundingBoxes to the whole video at once:

Then we can apply the data on bounding boxes to highlight people in the video—using the new HighlightVideo function:



But this just separately indicates where people are in each frame; it doesn’t connect them from one frame to another. In Version 14.2 we’ve added VideoObjectTracking to follow objects between frames:

Now if we use HighlightVideo, different objects will be annotated with different colors:
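Schematically, the whole pipeline looks something like this (assuming the video is in a variable video; the exact argument forms are assumptions based on the descriptions above):

    boxes = ImageBoundingBoxes[video];      (* bounding boxes, frame by frame *)
    HighlightVideo[video, boxes]            (* per-frame highlights, no identity between frames *)

    tracks = VideoObjectTracking[video];    (* objects followed from frame to frame *)
    HighlightVideo[video, tracks]           (* each tracked object keeps its own color *)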



This picks out all the unique objects identified in the course of the video, and counts them:

“Where’s the dog?”, you might ask. It’s certainly not there for long:

And if we find the first frame where it is supposed to appear, it does seem as if what’s presumably a person on the lower right has been mistaken for a dog:

And, yup, that’s what it thought was a dog:

Game Theory

“What about game theory?”, people have long asked. And, yes, there has been lots of game theory done with the Wolfram Language, and lots of packages written for particular aspects of it. But in Version 14.2 we’re finally introducing built-in system functions for doing game theory (both matrix games and tree games).

Here’s how we specify a (zero-sum) 2-player matrix game:

This defines payoffs when each player takes each action. We can represent this by a dataset:

An alternative is to “plot the game” using MatrixGamePlot:

OK, so how can we “solve” this game? In other words, what action should each player take, with what probability, to maximize their average payoff over many instances of the game? (It’s assumed that in each instance the players simultaneously and independently choose their actions.) A “solution” that maximizes expected payoffs for all players is called a Nash equilibrium. (As a small footnote to history, John Nash was a long-time user of Mathematica and what’s now the Wolfram Language—though many years after he came up with the concept of Nash equilibrium.) Well, now in Version 14.2, FindMatrixGameStrategies computes optimal strategies (AKA Nash equilibria) for matrix games:

This result means that for this game player 1 should play action 1 with probability and action 2 with probability , and player 2 should do these with probabilities and . But what are their expected payoffs? MatrixGamePayoff computes that:
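As a rough sketch of how such a game might be set up and solved (the nesting of the payoff arrays expected by MatrixGame, and the second argument of MatrixGamePayoff, are assumptions; the game here is “matching pennies”):

    (* payoffs for player 1 and player 2, for each pair of actions *)
    g = MatrixGame[{{{1, -1}, {-1, 1}}, {{-1, 1}, {1, -1}}}];

    strategies = FindMatrixGameStrategies[g];   (* mixed-strategy Nash equilibria *)
    MatrixGamePayoff[g, First[strategies]]      (* expected payoff for each player *)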

It can get pretty hard to keep track of the different cases in a game, so MatrixGame lets you give whatever labels you want for players and actions:

These labels are then used in visualizations:

What we just showed is actually a standard example game—the “prisoner’s dilemma”. In the Wolfram Language we now have GameTheoryData as a repository of about 50 standard games. Here’s one, specified to have 4 players:

And it’s less trivial to solve this game, but here’s the result—with 27 distinct solutions:

And, yes, the visualizations keep on working, even when there are more players (here we’re showing the 5-player case, indicating the 50th game solution):

It might be worth mentioning that the way we’re solving these kinds of games is by using our latest polynomial equation solving capabilities—and not only are we able to routinely find all possible Nash equilibria (not just a single fixed point), but we’re also able to get exact results:

In addition to matrix games, which model games in which players simultaneously pick their actions just once, we’re also supporting tree games, in which players take turns, producing a tree of possible outcomes, ending with a specified payoff for each of the players. Here’s an example of a very simple tree game:

We can get at least one solution to this game—described by a nested structure that gives the optimal probabilities for each action of each player at each turn:

Things with tree games can get more elaborate. Here’s an example—in which other players sometimes don’t know which branches were taken (as indicated by states joined by dashed lines):

What we’ve got in Version 14.2 represents rather complete coverage of the basic concepts in a typical introductory game theory course. But now, in typical Wolfram Language fashion, it’s all computable and extensible—so you can study more realistic games, and quickly do lots of examples to build intuition.

We’ve so far concentrated on “classic game theory”, notably with the feature (relevant to many current applications) that all action nodes are the result of a different sequence of actions. However, games like tic-tac-toe (that I happened to recently study using multiway graphs) can be simplified by merging equivalent action nodes. Multiple sequences of actions may lead to the same game of tic-tac-toe, as is often the case for iterated games. These graph structures don’t fit into the kind of classic game theory trees we’ve introduced in Version 14.2—though (as my own efforts I think demonstrate) they’re uniquely amenable to analysis with the Wolfram Language.

Computing the Syzygies, and Other Advances in Astronomy

There are lots of “coincidences” in astronomy—situations where things line up in a particular way. Eclipses are one example. But there are many more. And in Version 14.2 there’s now a general function FindAstroEvent for finding these “coincidences”, technically called syzygies (“sizz-ee-gees”), as well as other “special configurations” of astronomical objects.

A simple example is the September (autumnal) equinox:

Roughly this is when day and night are of equal length. More precisely, it’s when the sun is at one of the two positions in the sky where the plane of the ecliptic (i.e. the orbital plane of the earth around the sun) crosses the celestial equator (i.e. the projection of the earth’s equator)—as we can see here (the ecliptic is the yellow line; the celestial equator the blue one):

As another example, let’s find the next time over the next century when Jupiter and Saturn will be closest in the sky:

They’ll get close enough to see their moons together:

There are an incredible number of astronomical configurations that have historically been given special names. There are equinoxes, solstices, equiluxes, culminations, conjunctions, oppositions, quadratures—as well as periapses and apoapses (specialized to perigee, perihelion, periareion, perijove, perikrone, periuranion, periposeideum, etc.). In Version 14.2 we support all these.

So, for example, this gives the next time Triton will be closest to Neptune:

A famous example has to do with the perihelion (closest approach to the Sun) of Mercury. Let’s compute the position of Mercury (as seen from the Sun) at all its perihelia in the first couple of decades of the nineteenth century:

We see that there’s a systematic “advance” (along with some wiggling):

So now let’s quantitatively compute this advance. We start by finding the times for the first perihelia in 1800 and 1900:

Now we compute the angular separation between the positions of Mercury at these times:

Then divide this by the time difference

and convert units:

Famously, 43 arcseconds per century of this is the result of deviations from the inverse square law of gravity introduced by general relativity—and, of course, accounted for by our astronomical computation system. (The rest of the advance is the result of traditional gravitational effects from Venus, Jupiter, Earth, etc.)

PDEs Now Also for Magnetic Systems

More than a decade and a half ago we made the commitment to make the Wolfram Language a full strength PDE modeling environment. Of course it helped that we could rely on all the other capabilities of the Wolfram Language—and what we’ve been able to produce is immeasurably more valuable because of its synergy with the rest of the system. But over the years, with great effort, we’ve been steadily building up symbolic PDE modeling capabilities across all the standard domains. And at this point I think it’s fair to say that we can handle—at an industrial scale—a large part of the PDE modeling that arises in real-world situations.

But there are always more cases for which we can build in capabilities, and in Version 14.2 we’re adding built-in modeling primitives for static and quasistatic magnetic fields. So, for example, here’s how we can now model an hourglass-shaped magnet. This defines boundary conditions—then solves the equations for the magnetic scalar potential:

We can then take that result, and, for example, immediately plot the magnetic field lines it implies:

Version 14.2 also adds the primitives to deal with slowly varying electric currents, and the magnetic fields they generate. All of this immediately integrates with our other modeling domains like heat transfer, fluid dynamics, acoustics, etc.

There’s much to say about PDE modeling and its applications, and in Version 14.2 we’ve added more than 200 pages of additional textbook-style documentation about PDE modeling, including some research-level examples.

New Features in Graphics, Geometry & Graphs

Graphics has always been a strong area for the Wolfram Language, and over the past decade we’ve also built up very strong computational geometry capabilities. Version 14.2 adds some more “icing on the cake”, particularly in connecting graphics to geometry, and connecting geometry to other parts of the system.

As an example, Version 14.2 adds geometry capabilities for more of what were previously just graphics primitives. For example, this is a geometric region formed by filling a Bezier curve:

And we can now do all our usual computational geometry operations on it:

Something like this now works too:

Something else new in Version 14.2 is MoleculeMesh, which lets you build computable geometry from molecular structures. Here’s a graphical rendering of a molecule:

And here now is a geometric mesh corresponding to the molecule:

We can then do computational geometry on this mesh:

Another new feature in Version 14.2 is an additional method for graph drawing that can make use of symmetries. If you make a layered graph from a symmetrical grid, it won’t immediately render in a symmetrical way:

But with the new "SymmetricLayeredEmbedding" graph layout, it will:
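Assuming the new layout is given as an ordinary GraphLayout setting, that looks something like this (the grid graph is just a placeholder example):

    Graph[GridGraph[{3, 4}], GraphLayout -> "SymmetricLayeredEmbedding"]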

User Interface Tune-Ups

Making a great user interface is always a story of continued polishing, and we’ve now been doing that for the notebook interface for nearly four decades. In Version 14.2 there are several notable pieces of polish that have been added. One concerns autocompletion for option values.

We’ve long shown completions for options that have a discrete collection of definite common settings (such as All, Automatic, etc.). In Version 14.2 we’re adding “template completions” that give the structure of settings, and then let you tab through to fill in particular values. In all these years, one of the places I pretty much always find myself going to in the documentation is the settings for FrameLabel. But now autocompletion immediately shows me the structure of these settings:

Interface settings autocompletion

Also in autocompletion, we’ve added the capability to autocomplete context names, context aliases, and symbols that include contexts. And in all cases, the autocompletion is “fuzzy” in the sense that it’ll trigger not only on characters at the beginning of a name but on ones anywhere in the name—which means that you can just type characters in the name of a symbol, and relevant contexts will appear as autocompletions.

Another small convenience added in Version 14.2 is the ability to drag images from one notebook to any other notebook, or, for that matter, to any other application that can accept dragged images. It’s been possible to drag images from other applications into notebooks, but now you can do it the other way too.

Something else that’s for now specific to macOS is enhanced support for icon preview (as well as Quick Look). So now if you have a folder full of notebooks and you select Icon view, you’ll see a little representation of each notebook as an icon of its content:

Notebook icon preview

Under the hood in Version 14.2 there are also some infrastructural developments that will enable significant new features in subsequent versions. Some of these involve generalized support for dark mode. (Yes, one might initially imagine that dark mode would somehow be trivial, but when you start thinking about all the graphics and interface elements that involve colors, it’s clear it’s not. Though, for example, after significant effort we did recently release dark mode for Wolfram|Alpha.)

So, for example, in Version 14.2 you’ll find the new symbol LightDarkSwitched, which is part of the mechanism for specifying styles that will automatically switch for light and dark modes. And, yes, there is a style option LightDark that will switch modes for notebooks—and which is at least experimentally supported.

Related to light/dark mode is also the notion of theme colors: colors that are defined symbolically and can be switched together. And, yes, there’s an experimental symbol ThemeColor related to these. But the full deployment of this whole mechanism won’t be there until the next version.

The Beginnings of Going Native on GPUs

Many important pieces of functionality inside the Wolfram Language automatically make use of GPUs when they are available. And already 15 years ago we introduced primitives for low-level GPU programming. But in Version 14.2 we’re beginning the process of making GPU capabilities more readily available as a way to optimize general Wolfram Language usage. The key new construct is GPUArray, which represents an array of data that will (if possible) be stored so as to be immediately and directly accessible to your GPU. (On some systems, it will be stored in separate “GPU memory”; on others, such as modern Macs, it will be stored in shared memory in such a way as to be directly accessible by the GPU.)

In Version 14.2 we’re supporting an initial set of operations that can be performed directly on GPU arrays. The operations available vary slightly from one type of GPU to another. Over time, we expect to use or create many additional GPU libraries that will extend the set of operations that can be performed on GPU arrays.

Here is a random ten-million-element vector stored as a GPU array:

The GPU on the Mac on which I am writing this supports the necessary operations to do this purely in its GPU, giving back a GPUArray result:

Here’s the timing:

And here’s the corresponding ordinary (CPU) result:

In this case, the GPUArray result is about a factor of 2 faster. What factor you get will vary with the operations you’re doing, and the particular hardware you’re using. So far, the largest factors I’ve seen are around 10x. But as we build more GPU libraries, I expect this to increase—particularly when what you’re doing involves a lot of compute “inside the GPU”, and not too much memory access.

By the way, if you sprinkle GPUArray in your code it’ll normally never affect the results you get—because operations always default to running on your CPU if they’re not supported on your GPU. (Usually GPUArray will make things faster, but if there are too many “GPU misses” then all the “attempts to move data” may actually slow things down.) It’s worth realizing, though, that GPU computation is still not at all well standardized or uniform. Sometimes there may only be support for vectors, sometimes also matrices—and there may be different data types with different numerical precision supported in different cases.
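A sketch of the basic pattern (whether Normal is the way to pull the data back into an ordinary array is an assumption; which operations actually stay on the GPU depends on your hardware):

    v = GPUArray[RandomReal[1, 10^7]];
    w = Sqrt[v] + v^2;    (* stays a GPUArray where the GPU supports these operations *)
    Normal[w]             (* back to an ordinary in-memory array *)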

And Even More…

In addition to all the things we’ve discussed here so far, there are also a host of other “little” new features in Version 14.2. But even though they may be “little” compared to other things we’ve discussed, they’ll be big if you happen to need just that functionality.

For example, there’s MidDate—that computes the midpoint of dates:

And like almost everything involving dates, MidDate is full of subtleties. Here it’s computing the week 2/3 of the way through this year:

In math, functions like DSolve and SurfaceIntegrate can now deal with symbolic array variables:

SumConvergence now lets one specify the range of summation, and can give conditions that depend on it:

A little convenience that, yes, I asked for, is that DigitCount now lets you specify how many digits altogether you want to assume your number has, so that it appropriately counts leading 0s:

Talking of conveniences, for functions like MaximalBy and TakeLargest we added a new argument that says how to sort elements to determine “the largest”. Here’s the default numerical order

and here’s what happens if we use “symbolic order” instead:

There are always so many details to polish. Like in Version 14.2 there’s an update to MoonPhase and related functions, both new things to ask about, and new methods to compute them:

In another area, in addition to major new import/export formats (particularly to support Tabular) there’s an update to "Markdown" import that gives results in plaintext, and there’s an update to "PDF" import that gives a mixed list of text and images.

And there are lots of other things too, as you can find in the “Summary of New and Improved Features in 14.2”. By the way, it’s worth mentioning that if you’re looking at a particular documentation page for a function, you can always find out what’s new in this version just by pressing show changes:

Show changes

Who Can Understand the Proof? A Window on Formalized Mathematics

2025-01-10 06:42:31

Related writings:
“Logic, Explainability and the Future of Understanding” (2018) »
“The Physicalization of Metamathematics and Its Implications for the Foundations of Mathematics” (2022) »
“Computational Knowledge and the Future of Pure Mathematics” (2014) »

The Simplest Axiom for Logic

Theorem (Wolfram with Mathematica, 2000):
The single axiom ((a·b)·c)·(a·((a·c)·a)) = c is a complete axiom system for Boolean algebra (and is the simplest possible)

For more than a century people had wondered how simple the axioms of logic (Boolean algebra) could be. On January 29, 2000, I found the answer—and made the surprising discovery that they could be about twice as simple as anyone knew. (I also showed that what I found was the simplest possible.)

It was an interesting result—that gave new intuition about just how simple the foundations of things can be, and for example helped inspire my efforts to find a simple underlying theory of physics.

But how did I get the result? Well, I used automated theorem proving (specifically, what’s now FindEquationalProof in Wolfram Language). Automated theorem proving is something that’s been around since at least the 1950s, and its core methods haven’t changed in a long time. But in the rare cases it’s been used in mathematics it’s typically been to confirm things that were already believed to be true. And in fact, to my knowledge, my Boolean algebra axiom is actually the only truly unexpected result that’s ever been found for the first time using automated theorem proving.

But, OK, so we know it’s true. And that’s interesting. But what about the proof? Does the proof, for example, show us why the result is true? Well, actually, in a quarter of a century, nobody (including me) has ever made much headway at all in understanding the proof (which, at least in the form we currently know it, is long and complicated). So is that basically inevitable—say as a consequence of computational irreducibility? Or is there some way—perhaps using modern AI—to “humanize” the proof to a point where one can understand it?

It is, I think, an interesting challenge—that gets at the heart of what one can (and can’t) expect to achieve with formalized mathematics. In what follows, I’ll discuss what I’ve been able to figure out—and how it relates to foundational questions about what mathematics is and how it can be done. And while I think I’ve been able to clarify some of the issues, the core problem is still out there—and I’d like to issue it here as a challenge:

Challenge: Understand the proof of the Theorem

What do I mean by “understand”? Inevitably, “understand” has to be defined in human terms. Something like “so a human can follow and reproduce it”—and, with luck, feel like saying “aha!” at some point, the kind of way they might on hearing a proof of the Pythagorean theorem (or, in logic, something like de Morgan’s law Not[And[p, q]] == Or[Not[p], Not[q]]).

It should be said that it’s certainly not clear that such an understanding would ever be possible. After all, as we’ll discuss, it’s a basic metamathematical fact that out of all possible theorems almost none have short proofs, at least in terms of any particular way of stating the proofs. But what about an “interesting theorem” like the one we’re considering here? Maybe that’s different. Or maybe, at least, there’s some way of building out a “higher-level mathematical narrative” for a theorem like this that will take one through the proof in human-accessible steps.

In principle one could always imagine a somewhat bizarre scenario in which people would just rote learn chunks of the proof, perhaps giving each chunk some name (a bit like how people learned bArbArA and cElArEnt syllogisms in the Middle Ages). And in terms of these chunks there’d presumably then be a “human way” to talk about the proof. But learning the chunks—other than as some kind of recreational or devotional activity—doesn’t seem to make much sense unless there’s metamathematical structure that somehow connects the chunks to “general concepts” that are widely useful elsewhere.

But of course it’s still conceivable that there might be a “big theory” that would lead us to the theorem in an “understandable way”. And that could be a traditional mathematical theory, built up with precise, if potentially very abstract, constructs. But what about something more like a theory in natural science? In which we might treat our automatically generated proof as an object for empirical study—exploring its characteristics, trying to get intuition about it, and ultimately trying to deduce the analog of “natural laws” that give us a “human-level” way of understanding it.

Of course, for many purposes it doesn’t really matter why the theorem is true. All that matters is that it is true, and that one can deduce things on the basis of it. But as one thinks about the future of mathematics, and the future of doing mathematics, it’s interesting to explore to what extent it might or might not ultimately be possible to understand in a human-accessible way the kind of seemingly alien result that the theorem represents.

The Proof as We Know It

I first presented a version of the proof on two pages of my 2002 book A New Kind of Science, printing it in 4-point type to make it fit:

Axiom proof

Today, generating a very similar proof is a one-liner in Wolfram Language (as we’ll discuss below, the · dot here can be thought of as representing the Nand operation):
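As a sketch of what such a call looks like, written out explicitly and here targeting just the a · b = b · a lemma that most of the discussion below focuses on (rather than the full set of standard Boolean algebra axioms):

  wolframAxiom = ForAll[{a, b, c},
    CenterDot[CenterDot[CenterDot[a, b], c],
      CenterDot[a, CenterDot[CenterDot[a, c], a]]] == c];
  proof = FindEquationalProof[
    ForAll[{a, b}, CenterDot[a, b] == CenterDot[b, a]], wolframAxiom]
  proof["ProofGraph"]   (* the lemma-interdependence graph discussed below *)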

The proof involves 307 (mostly rather elaborate) steps. And here’s one page of it (out of about 30)—presented in the form of a computable Wolfram Language dataset:

Example proof steps page

What’s the basic idea of this proof? Essentially it’s to perform a sequence of purely structural symbolic operations that go from our axiom to known axioms of Boolean algebra. And the proof does this by proving a series of lemmas which can be combined to eventually give what we want:

The highlighted “targets” here are the standard Sheffer axioms for Boolean algebra from 1913:

And, yes, even though these are quite short, the intermediate lemmas involved in the proof get quite long—the longest involving 60 symbols (i.e. having LeafCount 60):

It’s as if to get to where it’s going, the proof ends up having to go through the wilds of metamathematical space. And indeed one gets a sense of this if one plots the sizes (i.e. LeafCount) of successive lemmas:

Here’s the distribution of these sizes, showing that while they’re often small, there’s a long tail (note, by the way, that if dot · appears n times in a lemma, the LeafCount will be 2n + 3):

So how are these lemmas related? Here’s a graph of their interdependence (with the size of each dot being proportional to the size of the lemma it represents):

Zooming in on the top we see more detail:

We start from our axiom, then derive a whole sequence of lemmas—as we’ll see later, always combining two lemmas to create a new one. (And, yes, we could equally well call these things theorems—but we generate so many of them it seems more natural to call them “lemmas”.)

So, OK, we’ve got a complicated proof. But how can we check that it’s correct? Well, from the symbolic representation of the proof in the Wolfram Language we can immediately generate a “proof function” that in effect contains executable versions of all the lemmas—implemented using simple structural operations:

And when you run this function, it applies all these lemmas and checks that the result comes out right:
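In sketch form, using the properties of the proof object returned by FindEquationalProof:

  pf = proof["ProofFunction"];
  pf[]   (* applies all the lemmas in sequence and returns True when everything checks out *)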

And, yes, this is basically what one would do in a proof assistant system (like Lean or Metamath)—except that here the steps in the proof were generated purely automatically, without any human guidance (or effort). And, by the way, the fact that we can readily translate our symbolic proof representation into a function that we can run provides an operational manifestation of the equivalence between proofs and programs.

But let’s look back at our lemma-interdependence “proof graph”. One notable feature is that we see several nodes with high out-degree—corresponding to what we can think of as “pivotal lemmas” from which many other lemmas end up directly being proved. So here’s a list of the “most pivotal” lemmas in our proof:

Or, more graphically, here are the results for all lemmas that occur:

So what are the “pivotal lemmas”? a · b = b · a we readily recognize as commutativity. But the others—despite their comparative simplicity—don’t seem to correspond to things that have specifically shown up before in the mathematical literature (or, as we’ll discuss later, that’s at least what the current generation of LLMs tell us).

But looking at our proof graph, something we can conclude is that a large fraction of the “heavy lifting” needed for the whole proof has already happened by the time we can prove a · b = b · a. So, for the sake of avoiding at least some of the hairy detail in the full proof, in most of what follows, we’ll concentrate on the proof of a · b = b · a—which FindEquationalProof tells us we can accomplish in 104 steps, with a proof graph of the form

with the sizes of successive lemmas (in what is basically a breadth-first traversal of the proof graph) being:

The “Machine Code” of the Proof

It’s already obvious from the previous section that the proof as we currently know it is long, complicated, and fiddly—and in many ways reminiscent of something at a “machine-code” level. But to get a grounded sense of what’s going on in the proof, it’s useful to dive into the details—even if, yes, they can be seriously hard to wrap one’s head around.

At a fundamental level, the way the proof—say of a · b = b · a—works is by starting from our axiom, and then progressively deducing new lemmas from pairs of existing lemmas. In the simplest case, that deduction works by straightforward symbolic substitution. So, for example, let’s say we have the lemmas

and

Then it turns out that from these lemmas we can deduce:

Or, in other words, knowing that the first two lemmas hold for any a gives us enough information about · that the third lemma must inevitably also hold. So how do we derive this?

Our lemmas in effect define two-way equivalences: their left-hand sides are defined as equal to their right-hand sides, which means that if we see an expression that (structurally) matches one side of a lemma, we can always replace it by the other side of the lemma. And to implement this, we can write our second lemma explicitly as a rule—where to avoid confusion we’re using x rather than a:

But if we now look at our first lemma, we see that there’s part of it (indicated with a frame) that matches the left-hand side of our rule:

If we replace this part (which is at position {2,2}) using our rule we then get

which is precisely the lemma we wanted to deduce.
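In Wolfram Language terms, this kind of deduction is just pattern replacement at a particular position. Here is a generic sketch of the mechanics, with made-up expressions standing in for the actual lemmas:

  expr = CenterDot[a, CenterDot[a, CenterDot[a, a]]];          (* a hypothetical lemma side *)
  rule = CenterDot[x_, x_] -> CenterDot[x, CenterDot[x, x]];   (* a hypothetical rule made from another lemma *)
  MapAt[Replace[#, rule] &, expr, {2, 2}]                      (* replace the matching part at position {2, 2} *)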

We can summarize what happened here as a fragment of our proof graph—in which a “substitution event” node takes our first two lemmas as input, and “outputs” our final lemma:

As always, the symbolic expressions we’re working with here can be represented as trees:

The substitution event then corresponds to a tree rewriting:

The essence of automated theorem proving is to find a particular sequence of substitutions etc. that get us from whatever axioms or lemmas we’re starting with, to whatever lemmas or theorems we want to reach. Or in effect to find a suitable “path” through the multiway graph of all possible substitutions etc. that can be made.

So, for example, in the particular case we’re considering here, this is the graph that represents all possible transformations that can occur through a single substitution event:

The particular transformation (or “path”) we’ve used to prove a · a = a · ((a · a) · a) is highlighted. But as we can see, there are many other possible lemmas that can be generated, or in other words that can be proved from the two lemmas we’ve given as input. Put another way, we can think of our input lemmas as implying or entailing all the other lemmas shown here. And, by analogy to the concept of a light cone in physics, we can view the collection of everything entailed by given lemmas or given events as the (future) “entailment cone” of those lemmas or events. A proof that reaches a particular lemma is then effectively a path in this entailment cone—analogous in physics to a world line that reaches a particular spacetime point.

If we continue building out the entailment cone from our original lemmas, then after two (substitution) events we get:

There are 21 lemmas generated here. But it turns out that beyond the lemma we already discussed there are only three (highlighted here) that appear in the proof we are studying here:

And indeed the main algorithmic challenge of theorem proving is to figure out which lemmas to generate in order to get a path to the theorem one’s trying to prove. And, yes, as we’ll discuss later, there are typically many paths that will work, and different algorithms will yield different paths and therefore different proofs.

But, OK, seeing how new lemmas can be derived from old by substitution is already quite complicated. But actually there’s something even more complicated we need to discuss: deriving lemmas not only by substitution but also by what we’ve called bisubstitution.

We can think of both substitution and bisubstitution as turning one lemma X == Y into a transformation rule (either X → Y or Y → X), and then applying this rule to another lemma, to derive a new lemma. In ordinary substitution, the left-hand side of the rule directly matches (in a Wolfram Language pattern-matching sense) a subexpression in the lemma we’re transforming. But the key point is that all the variables that appear in both our lemmas are really “pattern variables” (x_ etc. in Wolfram Language). So that means there’s another way that one lemma can transform another, in which in effect replacements are made not only in the lemma being transformed, but also in the lemma that’s doing the transforming.

The net effect, though, is still to take two lemmas and derive another, as in:

But in tracing through the details of our proof, we need to distinguish “substitution events” (shown yellowish) from “bisubstitution” ones (shown reddish). (In FindEquationalProof in Wolfram Language, lemmas produced by ordinary substitution are called “substitution lemmas”, while lemmas produced by bisubstitution are called “critical pair lemmas”.)

OK, so how does bisubstitution work? Let’s look at an example. We’re going to be transforming the lemma

using the lemma (which in this case happens to be our original axiom)

to derive the new lemma:

We start by creating a rule from the second lemma. In this case, the rule we need happens to be reversed relative to the way we wrote the lemma, and this means that (in the canonical form we’re using) it’s convenient to rename the variables that appear:

To do our bisubstitution we’re going to apply this rule to a subterm of our first lemma. We can write that first lemma with explicit pattern variables:

As always, the particular names of those variables don’t matter. And to avoid confusion, we’re going to rename them:

Now look at this subterm of this lemma (which is part {2,1,1,2} of the expression):

It turns out that with appropriate bindings for pattern variables this can be matched (or “unified”) with the left-hand side of our rule. This provides a way to find such bindings:

(Note that in these bindings things like c_ stand only for explicit expressions, like c_, not for expressions that the ordinary Wolfram Language pattern c_ would match.)

Now if we apply the bindings we’ve found to the left-hand side of our rule

and to the subterm we picked out from our lemma

we see that we get the same expression. Which means that with these bindings the subterm matches the left-hand side of our rule, and we can therefore replace this subterm with the right-hand side of the rule. To see all this in operation, we first apply the bindings we’ve found to the lemma we’re going to transform (and, as it happens, the binding for y_ is the only one that matters here):

Now we take this form and apply the rule at the position of the subterm we identified:

Renaming variables

we now finally get exactly the lemma that we were trying to derive:

And, yes, getting here was a pretty complicated process. But with the symbolic character of our lemmas, it’s one that is inevitably possible, and so can be used in our proof. And in the end, out of the 101 lemmas used in the proof, 47 were derived by ordinary substitution, while 54 were derived by bisubstitution.

And indeed the first few steps of the proof turn out to use only bisubstitution. An example is the first step—which effectively applies the original axiom to itself using bisubstitution:

And, yes, even this very first step is pretty difficult to follow.

If we start from the original axiom, there are 16 lemmas that can be derived purely by a single ordinary substitution (effectively of the axiom into itself)—resulting in the following entailment cone:

As it happens, though, none of the 16 new lemmas here actually get used in our proof. On the other hand, in the bisubstitution entailment cone

there are 24 new lemmas, and 4 of them get used in the proof—as we can see from the first level of the proof graph (here rotated for easier rendering):

At the next level of the entailment cone from ordinary substitution, there are 5062 new lemmas—none of which get used in the proof. But of the 31431 new lemmas in the (pure) bisubstitution entailment cone, 13 do get used:

At the next level, lemmas generated by ordinary substitution also start to get used:

Here’s another rendering of these first few levels of the proof graph:

Going to another couple of levels we’re starting to see quite a few independent chains of lemmas developing

which eventually join up when we assemble the whole proof graph:

A notable feature of this proof graph is that it has more bisubstitution events at the top, and more ordinary substitution events at the bottom. So why is that? Essentially it seems to be because bisubstitution events tend to produce larger lemmas, and ordinary substitution events tend to produce smaller ones—as we can see if we plot input and output lemma sizes for all events in the proof:

So in effect what seems to be happening is that the proof first has to “spread out in metamathematical space”, using bisubstitution to generate large lemmas “far out in metamathematical space”. Then later the proof has to “corral things back in”, using ordinary substitution to generate smaller lemmas. And for example, at the very end, it’s a substitution event that yields the final theorem we’re trying to prove:

And earlier in the graph, there’s a similar “collapse” to a small (and rather pivotal) lemma:

As the plot above indicates, ordinary substitution can lead to large lemmas, and indeed bisubstitution can also lead to smaller ones, as in

or slightly more dramatically:

But, OK, so this is some of what’s going on at a “machine-code” level inside the proof we have. Of course, given our axiom and the operations of substitution and bisubstitution there are inevitably a huge number of different possible proofs that could be given. The particular proof we’re considering is what the Wolfram Language FindEquationalProof gives. (In the Appendix, we’ll also look at results from some other automated theorem proving systems. The results will be very comparable, if usually a little lengthier.)

We won’t discuss the detailed (and rather elaborate) algorithms inside FindEquationalProof. But fundamentally what they’re doing is to try constructing certain lemmas, then to find sequences of lemmas that eventually form a “path” to what we’re trying to prove. And as some indication of what’s involved in this, here’s a plot of the number of “candidate lemmas” that are being maintained as possible when different lemmas in the proof are generated:

And, yes, for a while there’s roughly exponential growth, leveling off at just over a million when we get to the “pulling everything together” stage of the proof.

Unrolling the Proof

In what we’ve done so far, we’ve viewed our proof as working by starting from an axiom, then progressively building up lemmas, until eventually we get to the theorem we want. But there’s an alternative view that’s in some ways useful in getting a more direct, “mechanical” intuition about what’s going on in the proof.

Let’s say we’re trying to prove that our axiom implies that p · q = q · p. Well, then there must be some way to start from the expression p · q and just keep on judiciously applying the axiom until eventually we get to the expression q · p. And, yes, the number of axiom application steps required might be very large. But ultimately, if it’s true that the axiom implies p · q = q · p there must be a path that gets from p · q to q · p.

But before considering the case of our full proof, let’s start with something simpler. Let’s assume that we’ve already established the lemmas:

Then we can treat them as axioms, and ask a question like whether they imply the lemma

or, in our current approach, whether they can be used to form a path from a · a to a · (a · (a · a)).

Well, it’s not too hard to see that in fact there is such a path. Apply our second lemma to a · a to get:

But now this subterm

matches the left-hand side of the first lemma, so that it can be replaced by the right-hand side of that lemma (i.e. by a · (a · a)), giving in the end the desired a · (a · (a · a)).
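In Wolfram Language terms, such a path is just a chain of single replacements. A generic sketch (with steps standing for the list of {rule, position} pairs that would have to be filled in from the actual lemmas) would be:

  applyStep[expr_, {rule_, pos_}] := MapAt[Replace[#, rule] &, expr, pos];
  FoldList[applyStep, CenterDot[a, a], steps]   (* the successive expressions along the path *)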

So now we can summarize this process as:

In what follows, it’ll be convenient to label lemmas. We’ll call our original axiom A1, we’ll call our successive lemmas generated by ordinary substitution Sn and the ones generated by bisubstitution Bn:

In our proof we’ll also use markers to indicate whether we’re going to use the lemma (say X = Y) in the “forward direction” X → Y or the “reverse direction” Y → X. And with this labeling, the proof we just gave (which is for the lemma S23) becomes:

Each step here is a pure substitution, and requires no replacement in the rule (i.e. “axiom”) being used. But proofs like this can also be done with bisubstitution, where replacements are applied to the rule to get it in a form where it can directly be applied to transform an expression:

OK, so how about the first lemma in our full proof? Here’s a proof that its left-hand side can be transformed to its right-hand side just by judiciously applying the original axiom:

Here’s a corresponding proof for the second lemma:

Both these involve bisubstitution. Here’s a proof of the first lemma derived purely by ordinary substitution:

This proof is using not only the original axiom but also the lemma B5. Meanwhile, B5 can be proved using the original axiom together with B2:

But now, inserting the proof we just gave above for B2, we can give a proof of B5 just in terms of the original axiom:

And recursively continuing this unrolling process, we can then prove S1 purely using the original axiom:

What about the whole proof? Well, at the very end we have:

If we “unroll” one step we have

and after 2 steps:

In principle we could go on with this unrolling, in effect recursively replacing each rule by the sequence of transformations that represents its proof. Typically this process will, however, generate exponentially longer proof sequences. But say for lemma S5

the result is still very easily manageable:

We can summarize this result by in effect plotting the sizes of the intermediate expressions involved—and indicating what part of each expression is replaced at each step (with markers, as above, indicating “forward” and “backward” uses of the axiom A1):

For lemma B33

the unrolled proof is now 30 steps long

while for lemma S11

the unrolled proof is 88 steps long:

But here there is a new subtlety. Doing a direct substitution of the “proof paths” for the lemmas used to prove S11 in our original proof gives a proof of length 104:

But this proof turns out to be repetitive, with the whole gray section going from one copy to another of:

As an example of a larger proof, we can consider lemma B47:

And despite the simplicity of this lemma, our proof for it is 1008 steps long:

If we don’t remove repetitive sections, it’s 6805 steps:

Can we unroll the whole proof of a · b = b · a? We can get closer by considering lemma S36:

Its proof is 27105 steps long:

The distribution of expression sizes follows a roughly exponential distribution, with a maximum of 20107:

Plotting the expression sizes on a log scale one gets:

And what stands out most here is a kind of recursive structure—which is the result of long sequences that basically represent the analog of “subroutine calls” that go back and repeatedly prove lemmas that are needed.

OK, so what about the whole proof of a · b = b · a? Yes, it can be unrolled—in terms of 83,314 applications of the original axiom. The sequence of expression sizes is:

Or on a log scale:

The distribution of expression sizes now shows clear deviation from being exponential:

The maximum is 63245, which occurs just 81 steps after the exact midpoint of the proof. In other words, in the middle, the proof has wandered incredibly far out in metamathematical space (there are altogether CatalanNumber[63245] ≈ 10^38070 possible expressions of the size it reaches).
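(As a quick arithmetic check of that count: N[Log10[CatalanNumber[63245]]] evaluates to roughly 38070.)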

The proof returns to small expressions just a few times; here are all the cases in which the size is below 10:

So, yes, it is possible to completely unroll the proof into a sequence of applications of the original axiom. But if one does this, it inevitably involves repeating lots of work. Being able to use intermediate lemmas in effect lets one “share common subparts” in the proof. So that one ends up with just 104 “rule applications”, rather than 83314. Not that it’s easy to understand those 104 steps…

Is There a Better Notation?

Looking at our proof—either in its original “lemma” form, or in its “unrolled” form—the most striking aspect of it is how complicated (and incomprehensible) it seems to be. But one might wonder whether much of that complexity is just the result of not “using the right notation”. In the end, we’ve got a huge number of expressions written in terms of · operations that we can interpret as Nand (or Nor). And maybe it’s a little like seeing the operation of a microprocessor down at the level of individual gates implementing Nands or Nors. And might there perhaps be an analog of a higher-level representation—with higher-level operations (even like arithmetic) that are more accessible to us humans?

It perhaps doesn’t help that Nand itself is a rather non-human construct. For example, not a single natural human language seems to have a word for Nand. But there are combinations of Nands that have more familiar interpretations:
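For instance, one can get such interpretations directly from BooleanConvert:

  BooleanConvert[Not[a], "NAND"]       (* equivalent to Nand[a, a] *)
  BooleanConvert[And[a, b], "NAND"]    (* equivalent to Nand[Nand[a, b], Nand[a, b]] *)
  BooleanConvert[Or[a, b], "NAND"]     (* equivalent to Nand[Nand[a, a], Nand[b, b]] *)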

But what combinations actually occur in our proof? Here are the most common subexpressions that appear in lemmas in the proof:

And, yes, we could give the most common of these special names. But it wouldn’t really help in “compressing” the proof—or making it easier to understand.

What about “upgrading” our “laws of inference”, i.e. the way that we can derive new lemmas from old? Perhaps instead of substitution and bisubstitution, which both take two lemmas and produce one more, we could set up more elaborate “tactics” that for example take in more input lemmas. We’ve seen that if we completely unroll the proof, it gets much longer. So perhaps there is a “higher-order” setup that for example dramatically shortens the proof.

One way one might identify this is by seeing commonly repeating structures in the subgraphs that lead to lemmas. But in fact these subgraphs are quite diverse:

What Are the Popular Lemmas?

A typical feature of human-written mathematical proofs is that they’re “anchored” by famous theorems or lemmas. They may have fiddly technical pieces. But usually there’s a backbone of “theorems people know”.

We have the impression that the proof we’re discussing here “spends most of its time wandering around the wilds of metamathematical space”. But perhaps it visits waypoints that are somehow recognizable, or at least should be. Or in other words, perhaps out in the metamathematical space of lemmas there are ones that are somehow sufficiently popular that they’re worth giving names to, and learning—and can then be used as “reference points” in terms of which our proof becomes simpler and more human accessible.

It’s a story very much like what happens with human language. There are things out there in the world, but when there’s a certain category of them that are somehow common or important enough, we make a word for them in our language, which we can then use to “compactly” refer to them. (It’s again the same story when it comes to computational language, and in particular the Wolfram Language, except that in that case it’s been my personal responsibility to come up with the appropriate definitions and names for functions to represent “common lumps of computation”.)

But, OK, so what are the “popular lemmas” of Nand proofs? One way to explore this is to enumerate statements that are “true about Nand”—then to look at proofs of these statements (say found with FindEquationalProof from our axiom) and see what lemmas show up frequently in them.
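Here’s a sketch of one way such an enumeration might be set up (the particular scheme here, enumerating · expressions and testing them as Boolean functions with · read as Nand, is just an illustrative choice):

  nandExprs[vars_, n_] :=          (* all ways to combine n leaves drawn from vars using the binary · *)
    Flatten[Groupings[#, CenterDot -> 2] & /@ Tuples[vars, n]];
  equivalentQ[u_, v_, vars_] :=    (* u = v is "true about Nand" if both sides agree as Boolean functions *)
    TautologyQ[Equivalent[u /. CenterDot -> Nand, v /. CenterDot -> Nand], vars];

Collecting pairs of such expressions that satisfy equivalentQ then gives the smallest statements, each of which can be handed to FindEquationalProof.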

Enumerating statements “true about Nand”, starting from the smallest, we get

where we have highlighted statements from this list that appear as lemmas in our proof.

Proving each of these statements from our original axiom, here are the lengths of proofs we find (for all 1341 distinct theorems with up to LeafCount 4 on each side):

A histogram shows that it’s basically a bimodal distribution

with the smallest “long-proof” theorem being:

In aggregate, all these proofs use about 200,000 lemmas. But only about 1200 of these are distinct. And we can plot which lemmas are used in which proofs—and we see that there are indeed many lemmas that are used across wide ranges of proofs, while there are a few others that are “special” to each proof (the diagonal stripe is associated with lemmas close to the statement being proved):

If we rank all distinct lemmas from most frequently to least frequently used, we get the following distribution of lemma usage frequencies across all our proofs:

It turns out that there is a “common core” of 49 lemmas that are used in every single one of the proofs. So what are these lemmas? Here’s a plot of the usage frequency of lemmas against their size—with the “common ones” populating the top line:

And at first this might seem surprising. We might have expected that short lemmas would be the most frequent, but instead we’re seeing long lemmas that always appear, the very longest being:

So why is this? Basically it’s that these long lemmas are being used at the beginning of every proof. They’re the result of applying bisubstitution to the original axiom, and in some sense they seem to be laying down a kind of net in metamathematical space that then allows more diverse—and smaller—lemmas to be derived.

But how are these “common core” popular lemmas distributed within proofs? Here are a few examples:

And what we see is that while, yes, the common core lemmas are always at the beginning, they don’t seem to have a uniform way of “plugging into” the rest of the proof. And it doesn’t, for example, seem as if there’s just some small set of (perhaps simple) “waypoint” lemmas that one can introduce that will typically shorten these proofs.

If one effectively allows all the common core lemmas to be used as axioms, then inevitably proofs will be shortened; for example, the proof of a · b = b · a—which only ends up using 5 of the common core lemmas—is now shortened to 51 lemmas:

It doesn’t seem to become easier to understand, though. And if it’s unrolled, it’s still 5013 steps.

Still, one can ask what happens if one just introduces particular “recognizable” lemmas as additional axioms. For example, if we include “commutativity” a · b = b · a then we find that, yes, we do manage to reduce the lengths of some proofs, but certainly not all:
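As a sketch of how one does this: just include the extra lemma among the axioms handed to FindEquationalProof (with thm standing for whichever statement is being proved, and wolframAxiom as written out earlier):

  FindEquationalProof[thm,
    {wolframAxiom, ForAll[{a, b}, CenterDot[a, b] == CenterDot[b, a]]}]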

Are there any other “pivotal” lemmas we could add? In particular, what about lemmas that can help with the length-200 or more proofs? It turns out that all of these proofs involve the lemma:

So what happens if we add this? Well, it definitely reduces proof lengths:

And sometimes it even seems like it brings proofs into “human range”. For example, a proof of

from our original axiom has length 56. Adding in commutativity reduces it to length 18. And adding our third lemma reduces it to just length 9—and makes it not even depend directly on the original axiom:

But despite the apparent simplicity here, the steps involved—particularly when bisubstitution is used—are remarkably hard to follow. (Note the use of a = a as a kind of “implicit axiom”—something that has actually also appeared, without comment, in many of our other proofs.)

Can We Get a Shorter Proof?

The proof that we’ve been studying can be seen in some ways as a rather arbitrary artifact. It’s the output of FindEquationalProof, with all its specific detailed internal algorithms and choices. In the Appendix, we’ll see that other automated theorem proving systems give very similar results. But we still might wonder whether actually the complexity of the proof as we’ve been studying it is just a consequence of the details of our automated theorem proving—and that in fact there’s a much shorter (and perhaps easier to understand) proof that exists.

One approach we could take—reminiscent of higher category theory—is to think about just simplifying the proof we have, effectively using proof-to-proof transformations. And, yes, this is technically difficult, though it doesn’t seem impossible. But what if there are “holes” in proof space? Then a “continuous deformation” of one proof into another will get stuck, and even if there is a much shorter proof, we’re liable to get “topologically stuck” before we find it.

One way to be sure we’re getting the shortest proof of a particular lemma is to explicitly find the first place that lemma appears in the (future) entailment cone of our original axiom. For example, as we saw above, a single substitution event leads to the entailment cone:

Every lemma produced here is, by construction, in principle derivable by a proof involving a single substitution event. But if we actually use FindEquationalProof to prove these lemmas, the proofs we get mostly involve 2 events (and in one case 4):

If we take another step in the entailment cone, we get a total of 5062 lemmas. From the way we generated them, we know that all these lemmas can in principle be reached by proofs of length 2. But if we run FindEquationalProof on them, we find a distribution of proof lengths:

And, yes, there is one lemma (with LeafCount 183) that is found only by a proof of length 15. But most often the proof length is 4—or about double what it could be.

If we generate the entailment cone for lemmas using bisubstitution rather than just ordinary substitution, there are slightly more cases where FindEquationalProof does worse at getting minimal proofs.

For example, the lemma

and 3 others can be generated by a single bisubstitution from the original axiom, but FindEquationalProof gives only proofs of length 4 for all of these.

What about unrolled proofs, in which one can generate an entailment cone by starting from a particular expression, and then applying the original axiom in all possible ways? For example, let’s say we start with:

Then applying bisubstitution with the original axiom once in all possible ways gives:

Applying bisubstitution a second time gives a larger entailment cone:

But now it turns out that—as indicated—one of the expressions in this cone is:

So this shows that the lemma

can in principle be reached with just two steps of “unrolled” proof:

And in this particular case, if we use FindEquationalProof and then unroll the resulting proof we also get a proof of length 3—but it goes through a different intermediate expression:

As it happens, this intermediate expression is also reached in the entailment cone that we get by starting from our “output” expression and then applying two bisubstitutions:

What Actually Is the “·”? Models and the Proof

We can think of logic (or Boolean algebra) as being associated with a certain collection of theorems. And what our axiom does is to provide something from which all theorems of logic (and nothing but theorems of logic) can be derived. At some level, we can think of it as just being about symbolic expressions. But in our effort to understand what’s going on—say with our proof—it’s sometimes useful to ask how we can “concretely” interpret these expressions.

For example, we might ask what the · operator actually is. And what kinds of things can our symbolic variables be? In effect we’re asking for what in model theory are called “models” of our axiom system. And in aligning with logic the most obvious model to discuss is one in which variables can be True or False, and the · represents either the logical operator Nand or the logical operator Nor.

The truth table, say for Nand, is:

And as expected, with this model for ·, we can confirm that our original axiom holds:
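For example, as a direct check, writing the axiom out with Nand:

  TautologyQ[
    Equivalent[Nand[Nand[Nand[a, b], c], Nand[a, Nand[Nand[a, c], a]]], c],
    {a, b, c}]

which evaluates to True.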

In general, though, our original axiom allows two size-2 models (that we can interpret as Nand and Nor):
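A sketch of the kind of search involved at size 2, with each candidate “multiplication table” encoded (purely as an illustrative choice) as a list of four values:

  ops = Tuples[{0, 1}, 4];                      (* all 16 binary operations on {0, 1} *)
  value[op_][x_, y_] := op[[2 x + y + 1]];
  satisfiesAxiom[op_] := With[{d = value[op]},
    And @@ Flatten @ Table[
      d[d[d[a, b], c], d[a, d[d[a, c], a]]] == c,
      {a, 0, 1}, {b, 0, 1}, {c, 0, 1}]];
  Select[ops, satisfiesAxiom]                   (* per the statement above: the Nand and Nor tables *)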

It allows no size-3 models, and in fact in general allows only models of size 2^n; for example, for size 4 its models are:

So what about a · b = b · a? What models does it allow? For size 2, it’s all 8 possible models with symmetric “multiplication tables”:

But the crucial point is that the 2 models for our original axiom system are part of these. In other words, at least for size-2 models, satisfying the original axiom system implies satisfying a · b = b · a.

And indeed any lemma derived from our axiom system must allow the models associated with our original axiom system. But it may also allow more—and sometimes many more. So here’s a map of our proof, showing how many models (out of 16 possible) each lemma allows:

Here are the results for size-3 models:

And, once again, these look complicated. We can think of models as defining—in some sense—what lemmas are “about”. So, for example, our original axiom is “about” Nand and Nor. The lemma a · b = b · a is “about” symmetric functions. And so on. And we might have hoped that we could gain some understanding of our proof by looking at how different lemmas that occur in it “sculpt” what is being talked about. But in fact we just seem to end up with complicated descriptions of sets that don’t seem to have any obvious relationship with each other.

What about a Higher-Level Abstraction?

If there’s one thing that stands out about our proof—and the analysis we’ve given of it here—it’s how fiddly and “in the weeds” it seems to be. But is that because we’re missing some big picture? Is there actually a more abstract way of discussing things, that gets to our result without having to go through all the details?

In the history of mathematics many of the most important themes have been precisely about finding such higher-level abstractions. We could start from the explicit symbolic axioms

or even

and start building up theorems much as we’ve done here. Or we could recognize that these are axioms for group theory, and then start using the abstract ideas of group theory to derive our theorems.

So is there some higher-level version of what we’re discussing here? Remember that the issue is not about the overall structure of Boolean algebra; rather it’s about the more metamathematical question of how one can prove that all of Boolean algebra can be generated from the axiom:

In the last few sections we’ve tried a few semi-empirical approaches to finding higher-level representations. But they haven’t gotten very far. And to get further we’re probably going to need a serious new idea.

And, if history is a guide, we’re going to need to come up with an abstraction that somehow “goes outside of the system” before “coming back”. It’s like trying to figure out the real roots of a cubic equation, and realizing that the best way to do this is to introduce complex numbers, even though the imaginary parts will cancel at the end.

In the direct exploration of our proof, it feels as if the intermediate lemmas we generate “wander off into the wilds of metamathematical space” before coming back to establish our final result. And if we were using a higher-level abstraction, we’d instead be “wandering off” into the space of that abstraction. But what we might hope is that—at least with the concepts we would use in discussing that abstraction—the path that would be involved would be “short enough to be accessible to human understanding”.

Will we be able to find such an abstraction? It’s a subtle question. Because in effect it asks whether we can reduce the computational effort needed for the proof—or, in other words, whether we can find a pocket of computational reducibility in what in general will be a computationally irreducible process. But it’s not a question that can really be answered just for our specific proof on its own. After all, our “abstraction” could in principle just involve introducing a primitive that represents our whole proof or a large part of it. But to make it what we can think of as a real abstraction we need something that spans many different specific examples—and, in our case, likely many axiomatic systems or symbolic proofs.

So is such an abstraction possible? In the history of mathematics the experience has been that after enough time (often measured in centuries) has passed, abstractions tend to be found. But at some level this has been self-fulfilling. Because the areas that are considered to have remained “interesting for mathematics” tend to be just those where general abstractions have in fact been found.

In ruliology, though, the typical experience has been different. Because there it’s been routine to sample the computational universe of possible simple programs and encounter computational irreducibility. In the end it’s still inevitable that among the computational irreducibility there must be pockets of computational reducibility. But the issue is that these pockets of computational reducibility may not involve features of our system that we care about.

So is a proof of the kind we’re discussing here more like ruliology, or more like “typical mathematics”? Insofar as it’s a mathematical-style proof of a mathematical statement it feels more like typical mathematics. But insofar as it’s something found by the computational process of automated theorem proving it perhaps seems more like ruliology.

But what might a higher-level abstraction for it look like? Figuring that out is probably tantamount to finding the abstraction. But perhaps one can at least expect that in some ways it will be metamathematical, and more about the structure and character of proofs than about their content. Perhaps it will be something related to the framework of higher category theory, or some form of meta-algebra. But as of now, we really don’t know—and we can’t even say that such an abstraction with any degree of generality is possible.

LLMs to the Rescue?

The unexpected success of LLMs in language generation and related tasks has led to the idea that perhaps eventually systems like LLMs will be able to “do everything”—including for example math. We already know—not least thanks to Wolfram Language—that lots of math can be done computationally. But often the computations are hard—and, as in the example of the proof we’re discussing here, incomprehensible to humans. So the question really is: can LLMs “humanize” what has to be done in math, turning everything into a human-accessible narrative? And here our proof seems like an excellent—if challenging—test case.

But what happens if we just ask a current LLM to generate the proof from scratch? It’s not a good picture. Very often the LLM will eagerly generate a proof, but it’ll be completely wrong, often with the same kind of mistakes that a student somewhat out of their depth might make. Here’s a typical response, where an LLM simply assumes that the · operator is associative (which it isn’t in Boolean algebra), and then produces a proof that at first blush looks at least vaguely plausible, but is in fact completely wrong:

Inadequate LLM proof

Coming up with an explanation for what went wrong is basically an exercise in “LLM psychology”. But in a first approximation one might say the following. LLMs are trained to “fill in what’s typical”, where “typical” is defined by what appears in the training set. But (absent some recent Wolfram Language and Wolfram|Alpha based technology of ours) what’s been available as a training set has been human-generated mathematical texts, where, yes, operators are often associative, and typical proofs are fairly short. And in the “psychology of LLMs” an LLM is much more likely to “do what’s typical” than to “rigorously follow the rules”.

If you press the LLM harder, then it might just “abdicate”, and suggest using the Wolfram Language as a tool to generate the proof. So what happens if we do that, then feed the finished proof to the LLM and ask it to explain? Well, typically it just does what LLMs do so well, and writes an essay:

LLM proof essay

So, yes, it does fine in “generally framing the problem”. But not on the details. And if you press it for details, it’ll typically eventually just start parroting what it was given as input.

How else might we try to get the LLM to help? One thing I’ve certainly wondered is how the lemmas in the proof relate to known theorems—perhaps in quite different areas of mathematics. It’s something one might imagine one would be able to answer by searching the literature of mathematics. But, for example, textual search won’t be sufficient: it has to be some form of semantic search based on the meaning or symbolic structure of lemmas, not their (fairly arbitrary) textual presentation. A vector database might be all one needs, but one can certainly ask an LLM too:

LLM semantic search results

It’s not extremely helpful, though charmingly it correctly identifies the source of our original axiom. I’ve tried similar queries for our whole set of lemmas across a variety of LLMs, with a variety of RAG systems. Often the LLM will talk about an interpretation for some lemma—but the lemma isn’t actually present in our proof. But occasionally the LLM will mention possible connections (“band theory”; “left self-distributive operations in quandles”; “Moufang loops”)—though so far none have seemed to quite hit the mark.

And perhaps this failure is itself actually a result—telling us that the lemmas that show up in our proof really are, in effect, out in the wilds of metamathematical space, probing places that haven’t ever been seriously visited before by human mathematics.

But beyond LLMs, what about more general machine learning and neural net approaches? Could we imagine using a neural net as a probe to find “exploitable regularities” in our proof? It’s certainly possible, but I suspect that the systematic algorithmic methods we’ve already discussed for finding optimal notations, popular lemmas, etc. will tend to do better. I suppose it would be one thing if our systematic methods had failed to even find a proof. Then we might have wanted something like neural nets to try to guess the right paths to follow, etc. But as it is, our systematic methods rather efficiently do manage to successfully find a proof.

Of course, there’s still the issue that we’re discussing here that the proof is very “non-human”. And perhaps we could imagine that neural nets, etc.—especially when trained on existing human knowledge—could be used to “form concepts” that would help us humans to understand the proof.

We can get at least a rough analogy for how this might work by looking at visual images produced by a generative AI system trained from billions of human-selected images. There’s a concept (like “a cube”) that exists somewhere in the feature space of possible images. But “around” that concept are other things—“out in interconcept space”—that we don’t (at least yet) explicitly have words for:

Interconcept space

And it’ll presumably be similar for math, though harder to represent in something like a visual way. There’ll be existing math concepts. But these will be embedded in a vast domain of “mathematical interconcept space” that we humans haven’t yet “colonized”. And what we can imagine is that—perhaps with the help of neural nets, etc.—we can identify a limited number of “points in interconcept space” that we can introduce as new concepts that will, for example, provide useful “waypoints” in understanding our proof.

But Why Is the Theorem True?

It’s a common human urge to think that anything that’s true must be true for a reason. But what about our theorem? Why is it true? Well, we’ve seen a proof. But somehow that doesn’t seem satisfactory. We want “an explanation we can understand”. But we know that in general we can’t always expect to get one.

It’s a fundamental implication of computational irreducibility that things can happen where the only way to “see how they happen” is just to “watch them happen”; there’s no way to “compress the explanation”.

Consider the following patterns. They’re all generated by cellular automata. And all live exactly 100 steps before dying out. But why?

In a few cases it seems like we can perhaps at least begin to imagine “narratively describing” a mechanism. But most of the time all we can say is basically that they “live 100 steps because they do”.

It’s a quintessential consequence of computational irreducibility. It might not be what we’d expect, or hope for. But it’s reality in the computational universe. And it seems very likely that our theorem—and its proof—is like this too. The theorem in effect “just happens to be true”—and if you run the steps in the proof (or find the appropriate path in the entailment cone) you’ll find that it is. But there’s no “narrative explanation”. No “understanding of why it’s true”.

Intuition and Automated Theorem Proving

We’ve been talking a lot about the proof of our theorem. But where did the theorem to prove come from in the first place? Its immediate origin was an exhaustive search I did of simple axiom systems, filtering for ones that could conceivably generate Boolean algebra, followed by testing each of the candidates using automated theorem proving.

But how did I even get the idea of searching for a simple axiom system for Boolean algebra? Based on the axiom systems for Boolean algebra known before—and the historical difficulty of finding them—one might have concluded that it was quite hopeless to find an axiom system for Boolean algebra by exhaustive search. But by 2000 I had nearly two decades of experience in exploring the computational universe—and I was well used to the remarkable phenomenon that even very simple computational rules can lead to behavior of great complexity. So the result was that when I came to think about axiom systems and the foundations of mathematics my intuition led me to imagine that perhaps the simplest axiom system for something like Boolean algebra might be simple enough to exhaustively search for.

And indeed discovering the axiom system we’ve discussed here helped further expand and deepen my intuition about the consequences of simple rules. But what about the proof? What intuition might one get from the proof as we now know it, and as we’ve discussed here?

There’s much intuition to be got from observing the world as it is. But for nearly half a century I’ve had another crucial source of intuition: observing the computational universe—and doing computational experiments. I was recently reflecting on how I came to start developing intuition in this way. And what it might mean for intuition I could now develop from things like automated theorem proving and AI.

Back in the mid-1970s my efforts in particle physics led me to start using computers to do not just numerical, but also algebraic computations. In numerical computations it was usual to just get a few numbers out, that perhaps one could plot to make a curve. But in algebraic computations one instead got out formulas—and often very ornate ones full of structure and detail. And for me it was routine to get not just one formula, but many. And looking at these formulas I started to develop intuition about them. What functions would they involve? What algebraic form would they take? What kind of numbers would they involve?

I don’t think I ever consciously realized that I was developing a new kind of computationally based intuition. But I soon began to take it for granted. And when—at the beginning of the 1980s—I started to explore the consequences of simple abstract systems like cellular automata it was natural to expect that I would get intuition from just “seeing” how they behaved. And here there was also another important element. Because part of the reason I concentrated on cellular automata was precisely because one could readily visualize their behavior on a computer.

I don’t think I would have learned much if I’d just been printing out “numerical summaries” of what cellular automata do. But as it was, I was seeing their behavior in full detail. And—surprising though what I saw was—I was soon able to start getting an intuition for what could happen. It wasn’t a matter of knowing what the value of every cell would be. But I started doing things like identifying four general classes of cellular automata, and then recognizing the phenomenon of computational irreducibility.

By the 1990s I was much more broadly exploring the computational universe—always trying to see what could happen there. And in almost all cases it was a story of defining simple rules, then running them, and making an explicit step-by-step visualization of what they do—and thereby in effect “seeing computation in action”.

In recent years—spurred by our Physics Project—I’ve increasingly explored not just computational processes, but also multicomputational ones. And although it’s more difficult I’ve made every effort to visualize the behavior of multiway systems—and to get intuition about what they do.

But what about automated theorem proving? In effect, automated theorem proving is about finding a particular path in a multiway system that leads to a theorem we want. We’re not getting to see “complete behavior”; we’re in effect just seeing one particular “solution” for how to prove a theorem.

And after one’s seen many examples, the challenge once again is to develop intuition. And that’s a large part of what I’ve been trying to do here. It’s crucial, I think, to have some way to visualize what’s happening—in effect because visual input is the most efficient way to get information into our brains. And while the visualizations we’ve developed here aren’t as direct and complete as, say, for cellular automaton evolution, I think they begin to give some overall sense of our proof—and other proofs like it.

In studying simple programs like cellular automata, the intuition I developed led me to things like my classification of cellular automaton behavior, as well as to bigger ideas like the Principle of Computational Equivalence and computational irreducibility. So having now exposed myself to automated theorem proving as I exposed myself to algebraic computation and the running of simple rules in the past, what general principles might I begin to see? And might they, for example, somehow make the fact that our proof works ultimately seem “obvious”?

In some ways yes, but in other ways no. Much as with simple programs, there are axiom systems so simple that, for example, the multiway systems they generate are highly regular. But beyond a low threshold, it’s common to get very complicated—and in many ways seemingly random—multiway system structures. Typically an infinite number of lemmas are generated, with little or no obvious regularity in their forms.

And one can expect that—following the ideas of universal computation—it’ll typically be possible to encode in any one such multiway system the behavior of any other multiway system. In terms of axioms what one’s saying is that if one sets up the right translation between theorems, one will be able to use any one such axiom system to generate the theorems of any other. But the issue is that the translation will often make major changes to the structure of the theorems, and in effect define not just a “mathematical translation” (like between geometry and algebra) but a metamathematical one (as one would need to get from Peano arithmetic to set theory).

And what this means is that it isn’t surprising that even a very simple axiom system can generate a complicated set of possible lemmas. But knowing this doesn’t immediately tell one whether those lemmas will align with some particular existing theory—like Boolean algebra. And in a sense that’s a much more detailed question.

At some metamathematical level it might not be a natural question. But at a “mathematical level” it is. And it’s what we have to address in connection with the theorem—and proof—we’re discussing here. Many aspects of the overall form and properties of the proof will be quite generic, and won’t depend on the particulars of the axiom system we’re using. But some will. And quite what intuition we may be able to get about these isn’t clear. And perhaps it’ll necessarily be fragmented and specific—in effect responding to the presence of computational irreducibility.

It’s perhaps worth commenting that LLMs—and machine learning in general—represent another potential source of intuition. That intuition may well be more about the general features of us as observers and thinkers. But such intuition is potentially critical in framing just what we can experience, not only in the natural world, but also in the mathematical and metamathematical worlds. And perhaps the apparent impotence of LLMs when faced with the proof we’ve been discussing already tells us something significant about the nature of “mathematical observers” like us.

So What Does It Mean for the Future of Mathematics?

Let’s say we never manage to “humanize” the proof we’ve been discussing here. Then in effect we’ll end up with a “black-box theorem”—that we can be sure is true—but we’ll never know quite how or why. So what would that mean for mathematics?

Traditionally, mathematics has tended to operate in a “white box” kind of way, trying to build narrative and understanding along with “facts”. And in this respect it’s very different from natural science. Because in natural science much of our knowledge has traditionally been empirical—derived from observing the world or experimenting on it—and without any certainty that we can “understand its origins”.

Automated theorem proving of the kind we’re discussing here—or, for that matter, pretty much any exploratory computational experimentation—aligns mathematics much more with natural science, deriving what’s true without an expectation of having a narrative explanation of why.

Could one imagine practicing mathematics that way? One’s already to some extent following such a path as soon as one introduces axiom systems to base one’s mathematics on. Where do the axiom systems come from? In the time of Euclid perhaps they were thought of as an idealization of nature. But in more modern times they are realistically much more the result of human choice and human aesthetics.

So let’s say we determine (given a particular axiom system) that some black-box theorem is true. Well, then we can just add it, just as we could another axiom. Maybe one day it’ll be possible to prove P≠NP or the Riemann Hypothesis from existing axioms of mathematics (if they don’t in fact turn out to be independent). And—black box or not—we can expect to add them to what we assume in subsequent mathematics we do, much as they’re routinely added right now, even though their status isn’t yet known.

It's one thing to add one or two "black-box theorems". But what happens when black-box theorems—that we can think of as "experimentally determined"—start to dominate the landscape of mathematics?

Well, then mathematics will take on much more of the character of ruliology—or of an experimental science. When it comes to the applications of mathematics, this probably won’t make much difference, except that in effect mathematics will be able to become much more powerful. But the “inner experience” of mathematics will be quite different—and much less “human”.

If one indeed starts from axioms, it’s not at the outset obvious why everything in mathematics should not be mired in the kind of alien-seeming metamathematical complexity that we’ve encountered in the discussion of our proof here. But what I’ve argued elsewhere is that the fact that in our experience of doing mathematics it’s not is a reflection of how “mathematical observers like us” sample the raw metamathematical structure generated by axioms (or ultimately by the subaxiomatic structure of the ruliad).

The physics analogy I’ve used is that we succeed in doing mathematics at a “fluid dynamics level”, far above the detailed “molecular dynamics level” of things like the proof we’ve discussed here. Yes, we can ask questions—like ones about the structure of our proof—that probe the axiomatic “molecular dynamics level”. But it’s an important fact that in doing what we normally think of as mathematics we almost never have to; there’s a coherent way to operate purely at the “fluid dynamics level”.

Is it useful to “dip down” to the molecular dynamics? Definitely yes, because that’s where we can readily do computations—like those in our proof, or in general those going on in the internals of the Wolfram Language. But a key idea in the design of the Wolfram Language is to provide a computational language that can express concepts at a humanized “fluid dynamics” level—in effect bridging between the way humans can think and understand things, and the way raw computation can be done with them.

And it’s notable that while we’ve had great success over the years in defining “human-accessible” high-level representations for what amount to the “inputs” and “outputs” of computations, that’s been much less true of the “ongoing processes” of computation—or, for example, of the innards of proofs.

Is there a good “human-level” way to represent proofs? If the proofs are short, it’s not too difficult (and the step-by-step solutions technology of Wolfram|Alpha provides a good large-scale example of what can be done). But—as we’ve discussed—computational irreducibility implies that some proofs will inevitably be long.

If they’re not too long, then at least some parts of them might be constructed by human effort, say in a system like a proof assistant. But as soon as there’s much automation (whether with automated theorem proving or with LLMs) it’s basically inevitable that one will end up with things that at least approach what we’ve seen with the proof we’re discussing here.

What can then be done? Well, that’s the challenge. Maybe there is some way to simplify, abstract or otherwise “humanize” the proof we’ve been discussing. But I rather doubt it. I think this is likely one of those cases where we inevitably find ourselves face to face with computational irreducibility.

And, yes, there’s important science (particularly ruliology) to do on the structures we see. But it’s not mathematics as it’s traditionally been practiced. But that’s not to say that the results that come out of things like our proof won’t be useful for mathematics. They will be. But they make mathematics more like an experimental science—where what matters most is in effect the input and output rather than a “publishable” or human-readable derivation in between. And where the key issue in making progress is less in the innards of derivations than in defining clear computational ways to express input and output. Or, in effect, in capturing “human-level mathematics” in the primitives and structure of computational language.

Appendix: What about a Different Theorem Proving System?

The proof we've been discussing here was created using FindEquationalProof in the Wolfram Language. But what if we were to use a different automated theorem proving system? How different would the results be? In the spectrum of things that automated theorem proving systems do, our proof here is on the difficult end. And many existing automated theorem proving systems don't manage to do it at all. But some of the stronger ones do. And in the end—despite their different internal algorithms and heuristics—it's remarkable how similar the results they give are to those from the Wolfram Language FindEquationalProof (differences in the way lemmas vs. inference steps, etc. are identified make detailed quantitative comparisons difficult):

Thanks

Thanks to Nik Murzin of the Wolfram Institute for his extensive help as part of the Wolfram Institute Empirical Metamathematics Project. Also Roger Germundsson, Sergio Sandoval, Adam Strzebonski, Michael Trott, Liubov Tupikina, James Wiles and Carlos Zapata for input. Thanks to Arnim Buch and Thomas Hillenbrand for their work in the 1990s on Waldmeister which is now part of FindEquationalProof (also to Jonathan Gorard for his 2017 work on the interface for FindEquationalProof). I was first seriously introduced to automated theorem proving in the late 1980s by Dana Scott, and have interacted with many people about it over the years, including Richard Assar, Bruno Buchberger, David Hillman, Norm Megill, Todd Rowland and Matthew Szudzik. (I’ve also interacted with many people about proof assistant, proof presentation and proof verification systems, both recently and in the past.)

Useful to the Point of Being Revolutionary: Introducing Wolfram Notebook Assistant

2024-12-10 02:38:15

Note: As of today, copies of Wolfram Version 14.1 are being auto-updated to allow subscription access to the capabilities described here. [For additional installation information see here.]

Just Say What You Want! Turning Words into Computation

Nearly a year and a half ago—just a few months after ChatGPT burst on the scene—we introduced the first version of our Chat Notebook technology to integrate LLM-based chat into Wolfram Notebooks. For the past year and a half we’ve been building on those foundations. And today I’m excited to be able to announce that we’re releasing the fruits of those efforts: the first version of our Wolfram Notebook Assistant.

There are all sorts of gimmicky AI assistants out there. But Notebook Assistant isn’t one of them. It’s a serious, deep piece of new technology, and what’s more important, it’s really, really useful! In fact, I think it’s so useful as to be revolutionary. Personally, I thought I was a pretty efficient user of Wolfram Language—but Notebook Assistant has immediately made me not only significantly more efficient, but also more ambitious in what I try to do. I hadn’t imagined just how useful Notebook Assistant was going to be. But seeing it now I can say for sure that it’s going to raise the bar for what everyone can do. And perhaps most important of all, it’s going to open up computational language and computational thinking to a vast range of new people, who in the past assumed that those things just weren’t accessible to them.

Leveraging the decades of work we've done on the design and implementation of the Wolfram Language (and Wolfram|Alpha), Notebook Assistant lets people just say in their own words what they want to do; then it does its best to crispen it up and give a computational implementation. Sometimes it goes all the way and just delivers the answer. But even when there's no immediate "answer" it does remarkably well at building up structures where things can be represented computationally and tackled concretely. People really don't need to know anything about computational language—or computational thinking—to get started; Notebook Assistant will take their ideas, rough as they may be, and frame them in computational language terms.

I’ve long seen Wolfram Language as uniquely providing the infrastructure and “notation” to enable “computational X” for all fields X. I’m excited to say that I think Notebook Assistant now bridges “the last mile” to let anyone—at almost any level—access the power of computational language, and “do computational X”. In its original conception, Wolfram Notebook Assistant was just intended to be “useful”. But it’s emerging as something much more than that; something positively revolutionary.

“I can’t believe it’ll do anything useful with that”, I’ll think. But then I’ll try it. And, very often, something amazing will happen. Something that gets me past some sticking point or over some confusion. Something that gives me an unexpected new building block—or new idea—for what I’m trying to do. And that uses the medium of our computational language to take me beyond where I would ever have reached before.

So how does one use Notebook Assistant? Once you’ve signed up you can just go to the toolbar of any notebook, and open a Notebook Assistant chat window:

Notebook Assistant chat window

Now tell Notebook Assistant what you want to do. The more precise and explicit you are, the better. But you don’t have to have thought things through. Just type what comes into your mind. Imagine you’ve been working in a notebook, and (somehow) you’ve got a picture of some cats. You wonder “How can I find the cats in this picture?” Well, just ask Notebook Assistant!

How can I find the cats in this picture?

Notebook Assistant gives some narrative text, and then a piece of Wolfram Language code—which you can just run in your notebook:

It seems a bit like magic. You say something vague, and Notebook Assistant turns it into something precise and computational—which you can then run. It’s not always as straightforward as in this example. But the important thing is that in practice (at least in my rather demanding experience) Notebook Assistant essentially always does spectacularly well at being useful—and at telling me things that move forward what I’m trying to do.

Big or Small, Just Try It!

Imagine that sitting next to you, you had someone very knowledgeable about Wolfram Language and about computational thinking in general. Think what you might ask them. That’s what you should ask Notebook Assistant. And if there’s one thing to communicate here, it’s “Just try it!” You might think what you’re thinking about is too vague, or too specific, or too technical. But just try asking Notebook Assistant. In my experience, you’ll be amazed at what it’s able to do, and how helpful it’s able to be.

Maybe you’re an experienced Wolfram Language user who “knows there must be a way to do something”, but can’t quite remember how. Just ask Notebook Assistant. And not only will it typically be able to find the function (or whatever) you need; it’ll also usually be able to create a code fragment that does the very specific thing you asked about. And, by the way, it’ll save you lots of typing (and debugging) by filling in those fiddly options and so on just how you need them. And even if it doesn’t quite nail it, it’ll have given a skeleton of what you need, that you can then readily edit. (And, yes, the fact that it’s realistic to edit it relies on the fact that Wolfram Language represents it in a way that humans can readily read as well as write.)

What if you’re a novice, who’s never used Wolfram Language before, and never really been exposed to computational thinking, or for that matter, “techie stuff” at all? Well, the remarkable thing is that Notebook Assistant will still be able to help you—a lot. You can ask it something very vague, that doesn’t even seem particularly computational. It does remarkably well at “computationalizing things”. Taking what you’ve said, and finding a way to address it computationally—and to lead you into the kind of computational thinking that’ll be needed for the particular thing you’re trying to do.

In what follows, we’ll see a whole range of different ways to use Notebook Assistant. In fact, even as I’ve been writing this, I’ve discovered quite a few new ways to use it that I’d never thought of before.

There are some general themes, though. The most important is the way Notebook Assistant pivotally relies on the Wolfram Language. In a sense, the main mission of Notebook Assistant is to make things computational. And the whole reason it can so successfully do that is that it has the Wolfram Language as its target. It’s leveraging the unique nature of the Wolfram Language as a full-scale computational language, able to coherently represent abstract and real-world things in a computational way.

One might think that the Wolfram Language would in the end be mainly an “implementation layer”—serving to make what Notebook Assistant produces runnable. But in reality it’s very, very much more than that. In particular, it’s basically the medium—the language—in which computational ideas are communicated. When Notebook Assistant generates Wolfram Language, it’s not just something for the computer to run; it’s also something for humans to read. Yes, Notebook Assistant can produce text, and that’s useful, especially for contextualizing things. But the most concentrated and poignant communication comes in the Wolfram Language it produces. Want the TL;DR? Just look at the Wolfram Language code!

Part of how Wolfram Language code manages to communicate so much so efficiently is that it’s precise. You can just mention the name of a function, and you know precisely what it does. You don’t have to “scaffold” it with text to make its meaning clear.

But there’s something else as well. With its symbolic character—and with all the coverage and consistency that we’ve spent so much effort on over the decades—the Wolfram Language is uniquely able to “communicate in fragments”. Any fragment of Wolfram Language code can be run, and more important, it can smoothly fit into a larger structure. And that means that even small fragments of code that Notebook Assistant generates can be used as building blocks.

It produces Wolfram Language code. You read the code (and it’s critical that it’s set up to be read). You figure out if it’s what you want. (And if it’s not, you edit it, or ask Notebook Assistant to do that.) Then you can use that code as a robust building block in whatever structure—large or small—that you might be building.

In practice, a critical feature is that you don’t have to foresee how Notebook Assistant is going to respond to what you asked. It might nail the whole thing. Or it might just take steps in the right direction. But then you just look at what it produced, and decide what to do next. Maybe in the end you’ll have to “break the problem down” to get Notebook Assistant to deal with it. But there’s no need to do that in advance—and Notebook Assistant will often surprise you by how far it’s able to get on its own.

You might imagine that Notebook Assistant would usually need you to break down what you’re asking into “pure computational questions”. But in effect it has good enough “general knowledge” that it doesn’t. And in fact it will usually do better the more context you give it about why you’re asking it to do something. (Is it for chemical engineering, or for sports analytics, or what?)

But how ambitious can what you ask Notebook Assistant be? What if you ask it something “too big”? Yes, it won’t be able to solve that 100-year-old problem or build a giant software system in its immediate output. But it does remarkably well at identifying pieces that it can say something about, and that can help you understand how to get started. So, as with many things about Notebook Assistant, you shouldn’t assume that it won’t be helpful; just try it and see what happens! And, yes, the more you use Notebook Assistant, the more you’ll learn just what kind of thing it does best, and how to get the most out of it.

So how should you ultimately think about Notebook Assistant? Mainly you should think of it as a very knowledgeable and hardworking expert. But at a more mundane level it can serve as a super-enhanced documentation lookup system or code completion system. It can also take something vague you might ask it, and somehow undauntedly find the "closest formalizable construct"—that it can then compute with.

An important feature is that it is—in human terms—almost infinitely patient and hardworking. Where a human might think: “it’s too much trouble to write out all those details”, Notebook Assistant just goes ahead and does it. And, yes, it saves you huge amounts of typing. But, more important, it makes it “cheap” to do things more perfectly and more completely. So that means you actually end up labeling those plot axes, or adding a comment to your code, or coming up with meaningful names for your variables.

One of the overarching points about Notebook Assistant is that it lowers the barrier to getting help. You don’t have to think carefully about formulating your question. You don’t have to go clicking through lots of links. And you don’t have to worry that it’s too trivial to waste a coworker’s time on the question. You can just ask Notebook Assistant. Oh, and it’ll give you a response immediately. (And you can go back and forth with it, and ask it to clarify and refine things.)

“How Can I Do That?”

At least for me it’s very common: you have something in your mind that you want to do, but you don’t quite know how to achieve it in the Wolfram Language. Well, now you can just ask Notebook Assistant!

I’ll show various examples here. It’s worth emphasizing that these examples typically won’t look exactly the same if you run them again. Notebook Assistant has a certain amount of “AI-style random creativity”—and it also routinely makes use of what you’ve done earlier in a session, etc. It also has to be said that Notebook Assistant will sometimes make mistakes—or will misunderstand what you’re asking it. But if you don’t like what it did, you can always press the button to generate a new response.

Let’s start off with a basic computational operation:

Click to enlarge

As an experienced user of Wolfram Language, a simple “do it with FoldList” would already have been enough. But Notebook Assistant goes all the way—generating specific code for exactly what I asked. Courtesy of Wolfram Language, the code is very short and easy to read. But Notebook Assistant does something else for one as well: it produces an example of the code in action—which lets one check that it really does what one wanted. Oh, and then it goes even further, and tells me about a function in the Wolfram Function Repository (that I, for one, had never heard of; wait did I write it?) that directly does the operation I want.
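For reference, the kind of code involved can be sketched in a couple of lines—an illustrative example, assuming the request was for something like cumulative totals of a list (it's not the actual code Notebook Assistant produced here):

(* accumulate successive totals with FoldList *)
cumulativeTotals[list_List] := FoldList[Plus, list]

cumulativeTotals[{1, 2, 3, 4, 5}]
(* {1, 3, 6, 10, 15} *)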

OK, so that was a basic computational operation. Now let’s try something a little more elaborate:

Click to enlarge

This involves several steps, but Notebook Assistant nails it, giving a nice example. (And, yes, it’s reading the Wolfram Language documentation, so often its examples are based on that.)

But even after giving an A+ result right at the top, Notebook Assistant goes on, talking about various options and extensions. And despite being (I think) quite an expert on what the Wolfram Language can do, I was frankly surprised by what it came up with; I didn’t know about these capabilities!

There’s an incredible amount of functionality built into the Wolfram Language (yes, four decades worth of it). And quite often things you want to do can be done with just a single Wolfram Language function. But which one? One of the great things about Notebook Assistant is that it’s very good at taking “raw thoughts”, sloppily worded, and figuring out what function you need. Like here, bam, “use LineGraph!”

Click to enlarge
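For reference, LineGraph is a built-in function that turns each edge of a graph into a vertex (with two vertices connected if the original edges share an endpoint); a minimal example:

(* the line graph of a 5-cycle *)
LineGraph[CycleGraph[5]]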

You can ask Notebook Assistant “fairly basic” questions, and it’ll respond with nice, synthesized-on-the-spot “custom documentation”:

Click to enlarge

You can also ask it about obscure and technical things; it knows about every Wolfram Language function, with all its details and options:

Click to enlarge

Click to enlarge

Click to enlarge

Notebook Assistant is surprisingly good at writing quite minimal code that does sophisticated things:

Click to enlarge

If you ask it open-ended questions, it’ll often answer with what amount to custom-synthesized computational essays:

Click to enlarge

Notebook Assistant is pretty good at “pedagogically explaining what you can do”:

Click to enlarge

In everything we’ve seen so far, the workflow is that you ask Notebook Assistant something, then it generates a result, and then you use it. But everything can be much more interactive, and you can go back and forth with Notebook Assistant—say refining what you want it to do.

Here I had something in mind, but I was quite sloppy in describing it. And although Notebook Assistant came up with a reasonable interpretation of what I asked, it wasn’t really what I had in mind:

Click to enlarge

So I went back and edited what I asked (right there in the Notebook Assistant window), and tried again:

Click to enlarge

The result was better, but still not right. But all I had to do was to tell it to make a change, and lo and behold, I got what I was thinking of:

Click to enlarge

By the way, you can also perfectly well ask about deployment to the web:

Click to enlarge

And while I might have some minor quibbles (why use a string for the molecule name rather than a "Chemical" entity; why not use CloudPublish; etc.) what Notebook Assistant produces works, and provides an excellent scaffold for further development. And, as it often does, Notebook Assistant adds a kind of "by the way, did you know?" at the end, showing how one could use ARPublish to produce output for augmented reality.
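For reference, a deployment along the lines those quibbles suggest might look roughly like this—an illustrative sketch, with the molecule-plotting form as my stand-in rather than the code Notebook Assistant actually produced:

(* deploy a public web form that takes a chemical and returns a 3D molecule plot *)
CloudDeploy[
 FormFunction[{"molecule" -> "Chemical"}, MoleculePlot3D[Molecule[#molecule]] &, "PNG"],
 Permissions -> "Public"]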

Here’s one last example: creating a user interface element. I want to make a slider-like control that goes around (like an analog clock):

Click to enlarge

Well, actually, I had in mind something more minimal:

Click to enlarge

Impressive. Even if maybe it got that from some documentation or other example. But what if I wanted to tweak it? Well, actually, Notebook Assistant does seem to understand what it has:

Click to enlarge

“Can You Just Do That for Me?”

What we’ve seen so far are a few examples of asking Notebook Assistant to tell us how to do things. But you can also just ask Notebook Assistant to do things for you, in effect producing “finished goods”:

Click to enlarge

Pretty impressive! And it even just went ahead and made the picture. By the way, if I wanted the code packaged up into a single line, I can just ask for that:

Click to enlarge

Notebook Assistant can generate interactive content too. And—very usefully—you don’t have to give precise specifications up front: Notebook Assistant will automatically pick “sensible defaults” (that, yes, you can trivially edit later, or just tell Notebook Assistant to change it for you):

Click to enlarge
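The sort of interactive output involved is typically a Manipulate; here's a minimal illustrative example (not the actual code generated above):

(* a slider controlling the frequency of a sine curve *)
Manipulate[Plot[Sin[n x], {x, 0, 2 Pi}], {n, 1, 10, 1}]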

Here’s an example that requires putting together several different ideas and functions. But Notebook Assistant manages it just fine—and in fact the code it produces is interesting and clarifying to read:

Click to enlarge

Notebook Assistant knows about every area of Wolfram Language functionality—here synthetic geometry:

Click to enlarge

And here chemistry:

Click to enlarge

It also knows about things like the Wolfram Function Repository, here running a function from there that generates a video:

Click to enlarge

Here’s something that again leverages Notebook Assistant’s encyclopedic knowledge of Wolfram Language capabilities, now pulling in real-time data:

Click to enlarge

I can’t resist trying a few more examples:

Click to enlarge

Let’s try something involving more sophisticated math:

Click to enlarge

(I would have used RegularPolygon[5], and I don’t think DiscretizeRegion is necessary … but what Notebook Assistant did is still very impressive.)

Or here’s some more abstract math:

Click to enlarge

OK, so Notebook Assistant provides a very powerful way to go from words to computational results. So what then is the role of computational language and of “raw Wolfram Language”? First of all, it’s the Wolfram Language that makes everything we’ve seen here work; it’s what the words are being turned into so that they can be computed from. But there’s something much more than that. The Wolfram Language isn’t just for computers to compute with. It’s also for humans to think with. And it’s an incredibly powerful medium for that thinking. Like a great generalization of mathematical notation from the distant past, it provides a streamlined way to broadly formalize things in computational terms—and to systematically build things up.

Notebook Assistant is great for getting started with things, and for producing a first level of results. But words aren't ultimately an efficient way to say how to build up from there. You need the crisp, formal structure of computational language—in which even the tiny amounts of code you write can be incredibly powerful.

Now that I’ve been using Notebook Assistant for a while I think I can say that on quite a few occasions it’s helped me launch things, it’s helped me figure out details, and it’s helped me debug things that have gone wrong. But the backbone of my computational progress has been me writing Wolfram Language myself (though quite often starting from something Notebook Assistant wrote). Notebook Assistant is an important new part of the “on ramp” to Wolfram Language; but it’s raw Wolfram Language that lets one really zoom forward to build new structures and achieve what’s computationally possible.

“Where Do I Start?”

Computational thinking is an incredibly powerful approach. But sometimes it’s hard to get started with, particularly if you’re not used to it. And although one might not imagine it, Notebook Assistant can be very useful here, essentially helping one brainstorm about what direction to take.

I was explaining this to our head of Sales, and tried:

Click to enlarge

I really didn’t expect this to do anything terribly useful … and I was frankly amazed at what happened. Pushing my luck I tried:

Click to enlarge

Obviously this isn’t the end of the story, but it’s a remarkably good beginning—going from a vague request to something that’s set up to be thought about computationally.

Here’s another example. I’m trying to invent a good system for finding books in my library. I just took a picture of a shelf of books behind my desk:

Click to enlarge

Once again, a very impressive result. Not the final answer, but a surprisingly good start. That points me in the direction of image processing and segmentation. At first, it’s running too slowly, so it downsamples the image. Then it tells me I might need to tweak the parameters. So I just ask it to create a tool to do that:

Click to enlarge

And then:

Click to enlarge

It’s very impressive how much Notebook Assistant can help one go “from zero to computation”. And when one gets used to using it, it starts to be quite natural to just try it on all sorts of things one’s thinking about. But if it’s just “quick, tell me something to compute”, it’s usually harder to come up with anything.

And that reminds me of the very first time I ever saw a computer in real life. It was 1969 and I was 9 years old (and the computer was an IBM mainframe). The person who was showing me the computer asked me: “So what do you want to compute?” I really had no idea at that time “what one might compute”. Rather lamely I said “the weight of a dinosaur”. So, 55 years later, let’s try that again:

Click to enlarge

And let’s try going further:

Click to enlarge

“Tweak the Details for Me”

Something I find very useful with Notebook Assistant is having it “tweak the details” of something I’ve already generated. For example, let’s say I have a basic plot of a sine curve in a notebook:

Click to enlarge

Assuming I have that notebook in focus, Notebook Assistant will “see” what’s there. So then I can tell it to modify my sine curve—and what it will do is produce new code with extra details added:

Click to enlarge

That’s a good result. But as a Wolfram Language aficionado I notice that the code is a bit more complicated than it needs to be. So what can I do about it? Well, I can just ask Notebook Assistant to simplify it:

Click to enlarge

I can keep going, asking it to further “embellish” the plot:

Click to enlarge

Let’s push our luck and try going even further:

Click to enlarge

Oops. Something went wrong. No callouts, and a pink “error” box. I tried regenerating a few times. Often that helps. But this time it didn’t seem to. So I decided to give Notebook Assistant a suggestion:

Click to enlarge

And now it basically got it. And with a little more back and forth I can expect to get exactly what I want.

In the Wolfram Language, functions (like Plot) are set up to have good automatic defaults. But when you want, for example, to achieve some particular, detailed look, you often have to end up specifying all sorts of additional settings. And Notebook Assistant is very good at doing this, and in effect, patiently typing out all those option settings, etc.
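As a rough illustration of what that looks like in practice, here's the kind of option-laden call involved (the particular settings are my own illustrative choices, not ones Notebook Assistant produced):

(* a sine plot with a specific, detailed look spelled out through options *)
Plot[Sin[x], {x, 0, 2 Pi},
 PlotStyle -> Directive[Thick, Orange],
 Filling -> Axis,
 Frame -> True,
 FrameLabel -> {"x", "Sin[x]"},
 GridLines -> Automatic,
 PlotLabel -> "A sine curve"]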

“What Went Wrong? Fix It!”

Let’s say you wrote some Wolfram Language (or perhaps Notebook Assistant did it for you). And let’s say it doesn’t work. Maybe it just produces the wrong output. Or maybe it generates all sorts of messages when it runs. Either way, you can just ask the Assistant “What went wrong?”

Click to enlarge

Here the Assistant rather patiently and clearly explained the message that was generated, then suggested “correct code”:

Click to enlarge

The Assistant tends to be remarkably helpful in situations like this—even for an experienced Wolfram Language user like me. In a sense, though, it has an “unfair advantage”. Not only has it learned “what’s reasonable” from seeing large amounts of Wolfram Language code; it also has access to “internal information”—like a stream of telemetry about messages that were generated (as well as stack traces, etc.).

In general, Notebook Assistant is rather impressive at “spotting errors” even in long and sophisticated pieces of Wolfram Language code—and in suggesting possible fixes. And I can say that this is a way in which using Notebook Assistant has immediately saved me significant time in doing things with Wolfram Language.

“Improve My Code”

Notebook Assistant doesn’t just know how to write Wolfram Language code; it knows how to write good Wolfram Language code. And in fact if you give it even a sloppy “outline” of Wolfram Language code, the Assistant is usually quite good at making it clean and complete. And that’s important not only in being able to produce code that will run correctly; it’s also important in making code that’s clear enough that you can understand it (courtesy of the readability of good Wolfram Language code).

Here’s an example starting with a rather horrible piece of Wolfram Language code on the right:

Click to enlarge

The code on the right is quite buggy (it doesn’t initialize list, for example). But Notebook Assistant guesses what it’s supposed to do, and then makes nice “Wolfram style” versions, explaining what it’s doing.

If the code you’re dealing with is long and complicated, Notebook Assistant may (like a person) get confused. But you can always select a particular part, then ask Notebook Assistant specifically about that. And the symbolic nature—and coherence—of the Wolfram Language will typically mean that Notebook Assistant will be able to act “modularly” on the piece that you’ve selected.

Something I’ve found rather useful is to have Notebook Assistant refactor code for me. Here I’m starting from a sequence of separate inputs (yes, itself generated by Notebook Assistant) and I’m turning it into a single function:

Click to enlarge

Now we can use the function however we want:

Click to enlarge

Going the other way is useful too. And Notebook Assistant is surprisingly good at grokking what a piece of code is “about”, and coming up with reasonable names for variables, functions, etc.:

Click to enlarge

Yet another thing Notebook Assistant is good at is knowing all sorts of tricks to make code run faster:

Click to enlarge
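One classic example of such a trick is memoization—having a recursive function cache each value the first time it computes it (an illustrative sketch, not taken from the screenshot above):

(* memoized recursive definition: each value is stored when first computed *)
fib[1] = fib[2] = 1;
fib[n_] := fib[n] = fib[n - 1] + fib[n - 2];

fib[100]
(* 354224848179261915075 *)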

“Explain That to Me”

“What does that piece of code actually do?” Good Wolfram Language code—like good prose or good mathematical formalism—can succinctly communicate ideas, in its case in computational terms, precisely grounded in the definition of the language. But (as with prose and math) you sometimes need a more detailed exploration. And providing narrative explanations of code is something else that Notebook Assistant is good at. Here it’s taking a single line of (rather elegant) Wolfram Language code and writing a whole essay about what the code is doing:

Click to enlarge

What if you have a long piece of code, and you just want to explain some small part of it? Well, since Notebook Assistant sees selections you make, you can just select one part of your code, and Notebook Assistant will know that’s what you want to explain.

“Fill in the Paperwork for Me”

The Wolfram Language is carefully designed to have built-in functions that just "do what you need", without having to use idioms or set up repeated boilerplate. But there are situations where there's inevitably a certain amount of "bureaucracy" to do. For example, let's say you're writing a function to deploy to the Function Repository. You enter the definition for the function into a Function Resource Definition Notebook. But now you have to fill in documentation, examples, etc. And in fact that's typically the part that takes the longest. But now you can ask Notebook Assistant to do it for you. Here I put the cursor in the Examples section:

Click to enlarge

It’s always a good idea to set up tests for functions you define. And this is another thing Notebook Assistant can help with:

Click to enlarge
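Tests like these are typically expressed with VerificationTest; a couple of minimal examples of the form they take (with stand-in computations, not the function from the screenshot):

VerificationTest[StringReverse["hello"], "olleh"]
VerificationTest[Total[Range[10]], 55, TestID -> "sum-of-first-ten"]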

The “Inspiration Button”

All the examples of interacting with Notebook Assistant that we've seen so far involve using the Notebook Assistant window, which you can open with the button on the notebook toolbar. But another method involves using a second toolbar button, which we've been calling the "inspiration button".

When you use the Notebook Assistant window, the Assistant will always try to figure out what you're talking about. For example, if you say "Plot that" it'll use what it knows about what notebook you're using, and where you are in it, to try to work out what you mean by "that". But when you use the inspiration button it'll specifically try to "provide inspiration at your current selection".

Let's say you've typed Plot[Sin[x]. Press the inspiration button and it'll suggest a possible completion:

Click to enlarge

After using that suggestion, you can keep going:

Click to enlarge

You can think of the button as providing a sophisticated meaning-aware autocomplete.

Click to enlarge

It also lets you do things like code simplification. Imagine you’ve written the (rather grotesque):

If you want to get rid of the For loops, just select them and press the inspiration button to get a much simpler version:

Click to enlarge

Want to go even further? Select that result and Notebook Assistant manages to get to a one-liner:

Click to enlarge
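The general shape of this kind of simplification—an explicit For-loop construction reduced to a functional one-liner—looks something like this (an illustrative example, not the code in the screenshots):

(* loop version *)
squares = {};
For[i = 1, i <= 10, i++, AppendTo[squares, i^2]];
squares

(* one-liner version *)
Range[10]^2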

Magic Writing, Magic Coding

At some level it seems bizarre. Write a text cell that describes the code that's to follow it. Then start an Input cell, press the inspiration button, and Notebook Assistant will try to magically write the code!

You can go the other way as well. Start with the code, then start a CodeText cell above it, and it’ll “magically” write a caption:

If you start a heading cell, it’ll try to make up a heading:

Start a Text cell, and it’ll try to “magically” write relevant textual content:

You can go even further: just put the cursor underneath the existing content and press the inspiration button—and Notebook Assistant will start suggesting how you can go on:

As I write this, of course I had to try it: what does Notebook Assistant think I should write next? Here’s what it suggests (and, yes, in this case, those aren’t such bad ideas):

Click to enlarge

The Practicalities of the Assistant

One of the objectives for Notebook Assistant is to have it provide “hassle-free” access to AI and LLM technology integrated into the Wolfram System. And indeed, once you’ve set up your subscription (within your Wolfram Account), everything “just works”. Under the hood, there’s all sorts of technology, servers, etc. But you don’t have to worry about any of that; you can just use Notebook Assistant as a built-in part of the Wolfram Notebook experience.

As you work with Notebook Assistant, you’ll get progressively better intuition about where it can best help you. (And, yes, we’ll be continually updating Notebook Assistant, so it’ll often be worth trying things again if a bit of time has passed.) Notebook Assistant—like any AI-based system—has definite human-like characteristics, including sometimes making mistakes. Often those mistakes will be obvious (e.g. code with incorrect syntax colored red); sometimes they may be more difficult to spot. But the great thing about Notebook Assistant is that it’s firmly anchored to the “solid ground” of Wolfram Language. And any time it writes Wolfram Language code that you can see does what you want, you can always confidently use it.

There are some things that will help Notebook Assistant do its best for you. Particularly important is giving it the best view of the “context” for what you ask it. Notebook Assistant will generally look at whatever has already been said in a particular chat. So if you’re going to change the subject, it’s best to use the button to start a new chat, so Notebook Assistant will focus on the new subject, and not get confused by what you (or it) said before.

When you open the Notebook Assistant chat window you’ll often want to talk about—or refer to—material in some other notebook. Generally Notebook Assistant will assume that the notebook you last used is the one that’s relevant—and that any selection you have in that notebook is the thing to concentrate on the most. If you want Notebook Assistant to focus exclusively on what you’re saying in the chat window, one way to achieve that is to start a blank notebook. Another approach is to use the menu, which provides more detailed control over what material Notebook Assistant will consider. (For now, it just deals with notebooks you have open—but external files, URLs, etc. are coming soon.)

Notebook Assistant will by default store all your chat sessions. You can see your chat history (with chats automatically assigned names by the Assistant) by pressing the History button. You can delete chats from your history here. You can also "pop out" chats into standalone notebooks that you can save, send to other people, etc.

So what’s inside Notebook Assistant? It’s quite a tower of technology. The core of its “linguistic interface” is an LLM (actually, several different LLMs)—trained on extensive Wolfram Language material, and with access to a variety of tools, especially Wolfram Language evaluators. Also critical to Notebook Assistant is its access to a variety of RAGs based on vector databases, that it uses for immediate semantic search of material such as Wolfram Language documentation. Oh, and then there’s a lot of technology to connect Notebook Assistant to the symbolic internal structure of notebooks, etc.

So when you use Notebook Assistant, where is it actually running? Its larger LLM tasks are currently running on cloud servers. But a substantial part of its functionality is running right on your computer—using Wolfram Language (notably the Wolfram machine learning framework, vector database system, etc.). And because these things are running locally, the Assistant can request access to local information on your computer—as well as avoiding the latency of accessing cloud-based systems.

Chats in Your Main Notebook (Coming Soon)

Much of the time, you want your interactions with Notebook Assistant to be somehow “off on the side”—say in the Notebook Assistant window, or in the inspiration button menu. But sometimes you want your interactions to be right in your main notebook.

And for this you’ll soon (in Version 14.2) be able to use an enhanced version of the Chat Notebook technology that we developed last year, not just in a separate “Chat Notebook”, but fully integrated into any notebook.

At the beginning of a cell in any notebook, you can just add a chat cell that communicates with Notebook Assistant:

Chat cell

And now the output from that chat cell is placed directly below in the notebook—so you can create a notebook that mixes standard notebook content with chat content.

It all works basically just like a fully integrated version of our Chat Notebook technology. (And this functionality is already available in Version 14.1 if you explicitly create a chat notebook with File > New > Chat Notebook.) As in Chat Notebooks, you use a chat break (with ~) to start a new chat within the same notebook. (In general, when you use a chat cell in an ordinary notebook to access Notebook Assistant, the assistant will see only material that occurs before the chat, and within the same chat block.)

Also Introducing: LLM Kit

In mid-2023 we introduced LLMFunction, LLMSynthesize and related functions (as well as ChatEvaluate, ImageSynthesize, etc.) to let you access LLM functionality directly within the Wolfram Language. Until now these functions required connection to an external LLM provider. But along with Notebook Assistant we’re introducing today LLM Kit—which allows you to access all LLM functionality in the Wolfram Language directly through a subscription within your Wolfram Account.

It’s all very easy: as soon as you enable your subscription, not only Notebook Assistant but also all LLM functionality will just work, going through our LLM service. (And, yes, Notebook Assistant is basically built on top of LLM Kit and the LLM service access it defines.)
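For reference, programmatic LLM access looks roughly like this—a minimal sketch, assuming an enabled LLM Kit subscription (or a configured external provider):

(* a templated LLM call; `1` is filled in with the argument when the function is applied *)
capitalOf = LLMFunction["What is the capital of `1`? Reply with just the city name."];
capitalOf["France"]

(* free-form text synthesis *)
LLMSynthesize["Write one sentence describing the Wolfram Language."]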

When you’ve enabled your Notebook Assistant + LLM Kit subscription, this is what you’ll see in the Preferences panel:

Notebook Assistant + LLM Kit Preferences panel

Our LLM service is primarily aimed at “human speed” LLM usage, in other words, things like responding to what you ask the Notebook Assistant. But the service also seamlessly supports programmatic things like LLMFunction. And for anything beyond small-scale uses of LLMFunction, etc. you’ll probably want to upgrade from the basic “Essentials” subscription level to the “Pro” level. And if you want to go “industrial scale” in your LLM usage, you can do that by explicitly purchasing Wolfram Service Credits.

Everything is set up to be easy if you use our Wolfram LLM service—and that’s what Notebook Assistant is based on. But for Chat Notebooks and programmatic LLM functionality, our Wolfram Language framework also supports connection to a wide range of external LLM service providers. You have to have your own external subscription to whatever external service you want to use. But once you have the appropriate access key you’ll be able to set things up so that you can pick that LLM provider interactively in Chat Notebooks, programmatically through LLMConfiguration, or in the Preferences panel.

(By the way, we’re continually monitoring the performance of different LLMs on Wolfram Language generation; you can see weekly benchmark results at the Wolfram LLM Benchmark Project website—or get the data behind that from the Wolfram Data Repository.)

Opening Up the Ability to “Go Computational”

There’s really never been anything quite like it before: a way of automatically taking what can be quite vague human thoughts and ideas, and making them crisp and structured—by expressing them computationally. And, yes, this is made possible now by the unexpectedly effective linguistic interface that LLMs give us. But ultimately what makes it possible is that the LLMs have a target: the Wolfram Language in all its breadth and depth.

For me it’s an exciting moment. Because it’s a moment where everything we’ve been building these past four decades is suddenly much more broadly accessible. Expert users of Wolfram Language will be able to make use of all sorts of amazing nooks of functionality they never knew about. And people who’ve never used Wolfram Language before—or never even formulated anything computationally—will suddenly be able to do so.

And it’s remarkable what kinds of things one can “make computational”. Let’s say you ask Wolfram Notebook Assistant to make up a story. Like pretty much anything today with LLMs inside, it’ll dutifully do that:

Click to enlarge

But how can one make something like this computational? Well, just ask Notebook Assistant:

Click to enlarge

Click to enlarge

And what it does is rather remarkable: it uses Wolfram Language to create an interactive agent-based computational game version of the story!

Computation is the great paradigm of our times. And the development of “computational X” for all X seems destined to be the future of pretty much every field. The whole tower of ideas and technology that is the modern Wolfram Language was built precisely to provide the computational language that is needed. But now Notebook Assistant is dramatically broadening access to that—making it possible to get “computational language superpowers” using just ordinary (and perhaps even vague) natural language.

And even though I've now been living the computational language story for more than four decades, Notebook Assistant keeps on surprising me with what it manages to make computational. It's incredibly powerful to be able to "go computational". And even if you can't imagine how it could work in what you're doing, you should still just try it! Notebook Assistant may well surprise you—and in that moment show you a path to leverage the great power of the computational paradigm in ways that you've never imagined.

Foundations of Biological Evolution: More Results & More Surprises

2024-12-06 07:13:27

This is a follow-on to Why Does Biological Evolution Work? A Minimal Model for Biological Evolution and Other Adaptive Processes [May 3, 2024].

Even More from an Extremely Simple Model

A few months ago I introduced an extremely simple “adaptive cellular automaton” model that seems to do remarkably well at capturing the essence of what’s happening in biological evolution. But over the past few months I’ve come to realize that the model is actually even richer and deeper than I’d imagined. And here I’m going to describe some of what I’ve now figured out about the model—and about the often-surprising things it implies for the foundations of biological evolution.

The starting point for the model is to view biological systems in abstract computational terms. We think of an organism as having a genotype that's represented by a program, that's then run to produce its phenotype. So, for example, the cellular automaton rules on the left correspond to a genotype, which is then run to produce the phenotype on the right (starting from a "seed" of a single red cell):

The key idea in our model is to adaptively evolve the genotype rules—say by making single “point mutations” to the list of outcomes from the rules:

At each step in the adaptive evolution we “accept” a mutation if it leads to a phenotype that has a higher—or at least equal—fitness relative to what we had before. So, for example, taking our fitness function to be the height (i.e. lifetime) of the phenotype pattern (with patterns that are infinite being assigned zero fitness), a sequence of (randomly chosen) adaptive evolution steps that go from the null rule to the rule above might be:
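For concreteness, here's a minimal Wolfram Language sketch of the adaptive loop just described. To keep it short it uses elementary (k = 2, r = 1) rules specified by their 8-element outcome lists, rather than the symmetric k = 2, r = 2 rules used for the pictures here, and it approximates "infinite" patterns as ones that survive to a cutoff tmax:

(* run a rule from a single-cell seed for up to tmax steps *)
phenotype[outcomes_List, tmax_Integer] :=
  CellularAutomaton[{FromDigits[outcomes, 2], 2, 1}, {{1}, 0}, tmax];

(* fitness = lifetime before the pattern dies out; 0 if it never dies within tmax *)
fitness[outcomes_List, tmax_Integer] :=
  Module[{dead = Flatten@Position[Total /@ phenotype[outcomes, tmax], 0]},
    If[dead === {}, 0, First[dead] - 1]];

(* one adaptive step: flip a random rule case, keep it if fitness doesn't decrease *)
mutate[outcomes_List] :=
  MapAt[1 - # &, outcomes, RandomInteger[{1, Length[outcomes]}]];

adaptiveStep[outcomes_List, tmax_Integer] :=
  With[{new = mutate[outcomes]},
    If[fitness[new, tmax] >= fitness[outcomes, tmax], new, outcomes]];

(* 100 adaptive steps starting from the null rule, and the fitness along the way *)
evolution = NestList[adaptiveStep[#, 200] &, ConstantArray[0, 8], 100];
ListStepPlot[fitness[#, 200] & /@ evolution]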

What if we make a different sequence of randomly chosen adaptive evolution steps? Here are a few examples of what happens—each in a sense “using a different idea” for how to achieve high fitness:

And, yes, one can’t help but be struck by how “lifelike” this all looks—both in the complexity of these patterns, and in their diversity. But what is ultimately responsible for what we’re seeing? It’s long been a core question about biological evolution. Are the forms it produces the result of careful “sculpting” by the environment (and by the fitness functions it implies)—or are their most important features somehow instead a consequence of something more intrinsic and fundamental that doesn’t depend on details of fitness functions?

Well, let’s say we pick a different fitness function—for example, not the height of a phenotype pattern, but instead its width (or, more specifically, the width of its bounding box). Here are some results of adaptive evolution in this case:

And, yes, the patterns we get are now ones that achieve larger “bounding box width”. But somehow there’s still a remarkable similarity to what we saw with a rather different fitness function above. And, for example, in both cases, high fitness, it seems, is normally achieved in a complicated and hard-to-understand way. (The last pattern is a bit of an exception; as can also happen in biology, this is a case where for once there’s a “mechanism” in evidence that we can understand.)

So what in the end is going on? As I discussed when I introduced the model a few months ago, it seems that the “dominant force” is not selection according to fitness functions, but instead the fundamental computational phenomenon of computational irreducibility. And what we’ll find here is that in fact what we see is, more than anything, the result of an interplay between the computational irreducibility of the process by which our phenotypes develop, and the computational boundedness of typical forms of fitness functions.

The importance of such an interplay is something that’s very much come into focus as a result of our Physics Project. And indeed it now seems that the foundations of both physics and mathematics are—more than anything—reflections of this interplay. And now it seems that’s true of biological evolution as well.

In studying our model, there are many detailed phenomena we’ll encounter—most of which seem to have surprisingly direct analogs in actual biological evolution. For example, here’s what happens if we plot the behavior of the fitness function for our first example above over the course of the adaptive evolution process:

We see a sequence of “plateaus”, punctuated by jumps in fitness that reflect some “breakthrough” being made. In the picture, each red dot represents the fitness associated with a genotype that was tried. Many fall below the line of “best results so far”. But there are also plenty of red dots that lie right on the line. And these correspond to genotypes that yield the same fitness that’s already been achieved. But here—as in actual biological evolution—it’s important that there can be “fitness-neutral evolution”, where genotypes change, but the fitness does not. Usually such changes of genotype yield not just the same fitness, but also the exact same phenotype. Sometimes, however, there can be multiple phenotypes with the same fitness—and indeed this happens at one stage in the example here

and at multiple stages in the second example we showed above:

The Multiway Graph of All Possible Evolutions

In the previous section we saw examples of the results of a few particular random sequences of mutations. But what if we were to look at all possible sequences of mutations? As I discussed when I introduced the model, it’s possible to construct a multiway graph that represents all possible mutation paths. Here’s what one gets for symmetric k = 2, r = 2 rules—starting from the null rule, and using height as a fitness function:

The way this graph is constructed, there are arrows from a given phenotype to all phenotypes with larger (finite) height that can be reached by a single mutation.

But what if our fitness function is width rather than height? Well, then we get a different multiway graph in which arrows go to phenotypes not with larger height but instead with larger width:

So what’s really going on here? Ultimately one can think of there being an underlying graph (that one might call the “mutation graph”) in which every edge represents a transformation between two phenotypes that can be achieved by a single mutation in the underlying genotype:

At this level, the transformations can go either way, so this graph is undirected. But the crucial point is that as soon as one imposes a fitness function, it defines a particular direction for each transformation (at least, each transformation that isn’t fitness neutral for this fitness function). And then if one starts, say, from the null rule, one will pick out a certain “evolution cone” subgraph of the original mutation graph.
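Schematically, one can sketch the construction of such a mutation graph—and its orientation by a fitness function—like this (using genotypes, i.e. rule outcome lists, as nodes for simplicity, whereas the graphs shown here use phenotypes, with fitness-neutral genotypes merged):

(* all 8-bit genotypes, and pairs differing by a single point mutation *)
genotypes = Tuples[{0, 1}, 8];
neighborPairs = Select[Subsets[genotypes, {2}], HammingDistance @@ # == 1 &];
mutationGraph = Graph[UndirectedEdge @@@ neighborPairs];

(* orient each non-neutral edge toward the fitter genotype, for a fitness function f *)
orient[{a_, b_}, f_] := Which[f[a] < f[b], a -> b, f[a] > f[b], b -> a, True, Nothing];
evolutionGraph[f_] := Graph[orient[#, f] & /@ neighborPairs];

(* e.g. evolutionGraph[fitness[#, 200] &], using the lifetime fitness sketched earlier *)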

So, for example, with width as the fitness function, the subgraph one gets is what’s highlighted here:

There are several subtleties here. First, we simplified the multiway graph by doing transitive reduction and drawing only the minimal edges necessary to define the connectivity of the graph. If we want to see all possible single-mutation transformations between phenotypes we need to do transitive completion, in which case for the width fitness function the multiway graph we get is:
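
In Wolfram Language terms these two views of the same graph are related by the built-in functions TransitiveReductionGraph and TransitiveClosureGraph; here is a tiny sketch, with a toy graph standing in for an evolution multiway graph:

g = Graph[{1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4, 1 -> 4}];
{TransitiveReductionGraph[g], TransitiveClosureGraph[g]}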

But now there’s another subtlety. The edges in the multiway graph represent fitness-changing transformations. But there are also fitness-neutral transformations. And occasionally these can even lead to different (though equal-fitness) phenotypes, so that really each node in the graph above (say, the transitively reduced one) should sometimes be associated with multiple phenotypes

which can “fitness neutrally” transform into each other, as in:

But even this isn’t the end of the subtleties. Fitness-neutral sets typically contain many genotypes differing by changes of rule cases that don’t affect the phenotype they produce. But it may be that just one or a few of these genotypes are “primed” to be able to generate another phenotype with just one additional mutation. Or, in other words, each node in the multiway graph above represents a whole class of genotypes “equivalent under fitness-neutral transformations”, and when we draw an arrow it indicates that some genotype in that class can be transformed by a single mutation to some genotype in the class associated with a different phenotype:

But beyond the subtleties, the key point is that particular fitness functions in effect just define particular orderings on the underlying mutation graph. It’s somewhat like choices of reference frames or families of simultaneity surfaces in physics. Different choices of fitness function in effect define different ways in which the underlying mutation graph can be “navigated” by evolution over the course of time.

As it happens, the results are not so different between height and width fitness functions. Here’s a combined multiway graph, indicating transformations variously allowed by these different fitness functions:

Homing in on a small part of this graph, we see that there are different “flows” associated with maximizing height and maximizing width:

With a single fitness function that for any two phenotypes systematically treats one phenotype as fitter than another, the multiway graph must always define a definite flow. But as soon as one considers changing fitness functions in the course of evolution, it’s possible to get cycles in the multiway graph, as in the example above—so that, in effect, “evolution can repeat itself”.

Fitness Functions Based on Aspect Ratio

We’ve looked at fitness functions based on maximizing height and on maximizing width. But what if we try to combine these? Here’s a plot of the widths and heights of all phenotypes that occur in the symmetric k = 2, r = 2 case we studied above:

We could imagine a variety of ways to define “fitness frontiers” here. But as a specific example, let’s consider fitness functions that are based on trying to achieve specific aspect ratios—i.e. phenotypes that are as close as possible to a particular constant-aspect-ratio line in the plot above.
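
One way to set up such a fitness function, as a rough sketch: grow the phenotype, measure the bounding box of its nonzero cells, and use minus the distance of width/height from the target ratio. The helper names and the particular form of the penalty are illustrative choices, not necessarily what was used for the pictures that follow:

aspectRatio[outputs_List, maxSteps_ : 200] :=
 Module[{evol, live, rows, cols},
  evol = CellularAutomaton[{FromDigits[outputs, 2], 2, 2}, {{1}, 0}, maxSteps];
  If[Max[evol[[-1]]] == 1, Infinity, (* still alive after maxSteps: treat as non-viable *)
   live = Position[evol, 1];
   rows = MinMax[live[[All, 1]]]; cols = MinMax[live[[All, 2]]];
   N[(cols[[2]] - cols[[1]] + 1)/(rows[[2]] - rows[[1]] + 1)]]]

aspectFitness[target_][outputs_List] :=
 With[{a = aspectRatio[outputs]}, If[a === Infinity, -Infinity, -Abs[a - target]]]

(* usage: aspectFitness[0.7][ruleOutputs] for a rule given as a 32-bit output list *)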

With the symmetric k = 2, r = 2 rules we’re using here, only a certain set of aspect ratios can ever be obtained:

The corresponding phenotypes (with their aspect ratios) are:

As we change the aspect ratio that we’re trying to achieve, the evolution multiway graph will change:

In all cases we’re starting from the null rule. For target aspect ratio 1.0 this rule itself already achieves that aspect ratio—so the multiway graph in that case is trivial. But in general, different aspect ratios yield evolution multiway graphs that are different subgraphs of the complete mutation graph we saw above.

So if we follow all possible paths of evolution, how close can we actually get to any given target aspect ratio? This plot shows what final aspect ratios can be achieved as a function of target aspect ratio:

And in a sense this is a summary of the effect of “developmental constraints” for “adaptive cellular automaton organisms” like this. If there were no constraints then for every target aspect ratio it’d be possible to get an “organism” with that aspect ratio—so in the plot there’d be a point lying on the red line. But in actuality the process of cellular automaton growth imposes constraints—that in particular allows only certain phenotypes, with certain aspect ratios, to exist. And beyond that, which phenotypes can actually be reached by adaptive evolution depends on the evolution multiway graph, with “different turns” on the graph leading to different fitness (i.e. different aspect ratio) phenotypes.

But what the plot above shows overall is that for a certain range of target aspect ratios, adaptive evolution is successfully able to get at least close to those aspect ratios. If the target aspect ratio gets out of that range, however, “developmental constraints” come in that prevent the target from being reached.

With “larger genomes”, i.e. rules with larger numbers of cases to specify, it’s possible to do better, and to more accurately achieve particular aspect ratios, over larger ranges of values. And indeed we can see some version of this effect even for symmetric k = 2, r = 2 rules by plotting aspect ratios that can be achieved as a function of the number of cases that need to be specified in the rule:

As an alternative visualization, we can plot the “best convergence to the target” as a function of the number of rule cases—and once again we see that larger numbers of rule cases let us get closer to target aspect ratios:

It’s worth mentioning that—just as we discussed for height and width fitness functions above—there are subtleties here associated with fitness-neutral sets. For example, here are sets of phenotypes that all have the specified aspect ratios—with phenotypes that can be reached by single point mutations being joined:

In the evolution multiway graphs above, we included only one phenotype for each fitness-neutral set. But here’s what we get for target aspect ratio 0.7 if we show all phenotypes with a given fitness:

Note that on the top line, we don’t just get the null rule. Instead, we get four phenotypes, all of which, like the null rule, have aspect ratio 1, and so are equally far from the target aspect ratio 0.7.

The picture above is only the transitively reduced graph. But if we include all possible transformations associated with single point mutations, we get instead:

Based on this graph, we can now make what amounts to a foliation, showing collections of phenotypes reached by a certain minimum number of mutations, progressively approaching our target aspect ratio (here 0.7):

Here’s what we get from the range of target aspect ratios shown above (where, as above, “terminal phenotypes” are highlighted):

In a sense these sequences show us what phenotypes can appear at progressive stages in the “fossil record” for different (aspect-ratio) fitness functions in our very simple model. The highlighted cases are “evolutionary dead ends”. The others can evolve further.
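
A foliation like this amounts to nothing more than grouping the vertices of the (transitively completed) multiway graph by their graph distance from the initial phenotype. As a sketch, with a toy graph standing in for the real one:

g = Graph[{"a" -> "b", "a" -> "c", "b" -> "d", "c" -> "d", "d" -> "e"}];
KeySort@GroupBy[VertexList[g], GraphDistance[g, "a", #] &]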

Unreachable Cases

Our model takes the process of adaptive evolution to never “go backwards”, or, in other words, to never evolve from a particular genotype to one with lower fitness. But this means that starting with a certain genotype (say the null rule) there may be genotypes (and hence phenotypes) that will never be reached.

With height as a fitness function, there are just two (“orphan”) phenotypes that can’t be reached:

And with width as the fitness function, it turns out the very same phenotypes also can’t be reached:

But if we use a fitness function that, for example, tries to achieve aspect ratio 0.7, we get many more phenotypes that can’t be reached starting from the null rule:

In the original mutation graph all the phenotypes appear. But when we foliate (or, more accurately, order) that graph using a particular fitness function, some phenotypes become unreachable by evolutionarily-possible transformations—in a rough analogy to the way some events in physics can become unreachable in the presence of an event horizon.

Multiway Graphs for Larger Rule Spaces

So far we’ve discussed multiway graphs here only for symmetric k = 2, r = 2 rules. There are a total of 524,288 (= 2^19) possible such rules, producing 77 distinct phenotypes. But what about larger classes of rules? As an example, we can consider all k = 2, r = 2 rules, without the constraint of symmetry. There are 2,147,483,648 (= 2^31) possible such rules, and there turn out to be 3137 distinct phenotypes.

For the height fitness function, the complete multiway graph in this case is

or, annotated with actual phenotypes:

If instead we just show bounding boxes, it’s easier to see where long-lifetime phenotypes occur:

With a different graph layout the evolution multiway graph (with initial node indicated) becomes:

One subtlety here is that the null rule has no successors under single point mutations. When we were talking about symmetric k = 2, r = 2 rules, we took a “single point mutation” always to change both a particular rule case and its mirror image. But if we don’t have the symmetry requirement, a single point mutation really can just change a single rule case. And if we start from the null rule and look at the results of changing just one bit (i.e. the output of just one rule case) in all possible ways we find that we either get the same pattern as with the null rule, or we get a pattern that grows without bound:

Or, put another way, we can’t get anywhere with single bit mutations starting purely from the null rule. So what we’ve done is instead to start our multiway graph from k = 2, r = 2 rule 20, which has two bits “on”, and gives phenotype:

But starting from this, just one mutation (together with a sequence of fitness-neutral mutations) is sufficient to give 94 phenotypes—or 49 after removing mirror images:

The total number of new phenotypes we can reach after successively more (non-fitness-neutral) mutations is

while the successive longest-lifetime patterns are:

And what we see here is that it’s in principle possible to achieve long lifetimes even with fairly few mutations. But when the mutations are done at random, it can still take a very large number of steps to successfully “random walk” to long lifetime phenotypes.

And out of a total of 2407 distinct phenotypes, 984 are “dead ends” where no further evolution is possible. Some of these dead ends have long lifetimes

but others have very short lifetimes:

There’s much more to explore in this multiway graph—and we’ll continue a bit below. But for now let’s look at another evolution multiway graph of accessible size: the one for symmetric k = 3, r = 1 rules. There are a total of 129,140,163 (= 3^17) possible such rules, which yield a total of 14,778 distinct phenotypes:

Showing only bounding boxes of patterns this becomes:

Unlike the k = 2, r = 2 case, we can now start this whole graph with the null rule. However, if we look at all possible symmetric k = 3, r = 1 rules, there turn out to be 6 “isolates” that can’t be reached from the null rule by adaptive evolution with the height fitness function:

Starting from the null rule, the number of phenotypes reached after successively more (non-fitness-neutral) mutations is

and the successive longest-lived of these phenotypes are:

Aspect Ratio Fitness

Just as we looked at fitness functions based on aspect ratio above for symmetric k = 2, r = 2 rules, so now we can do this for the whole space of all possible k = 2, r = 2 rules. Here’s a plot of the heights and widths of patterns that can be achieved with these rules:

These are the possible aspect ratios this implies:

And here’s their distribution (on a log scale):

The range of possible values extends much further than for symmetric k = 2, r = 2 rules: to rather than to . The patterns now with the largest aspect ratios are

while those with the smallest aspect ratios are:

Note that just as for symmetric k = 2, r = 2 rules, to reach a wider range of aspect ratios, more cases in the rule have to be specified:

So what happens if we use adaptive evolution to try to reach different possible target aspect ratios? Most of the time (at least up to aspect ratio ≈ 3) there’s some sequence of mutations that will do it—though often we can get stuck at a different aspect ratio:

If we look at the “best convergence” to a given target aspect ratio then we see that this improves as we increase the number of cases specified in the rule:

So what does the multiway graph look like for a fitness function associated with a particular aspect ratio? Here’s the result for aspect ratio 3:

The initial node involves patterns with aspect ratio 1—actually a fitness-neutral set of 263 of them. And as we go through the multiway graph, the aspect ratios get nearer to 3. The very closest they get, though, are for the patterns (whose locations are indicated on the graph):

But actually (as we saw in the lineup above), there is a rule that gives aspect ratio exactly 3:

But it turns out that this rule can’t be reached by adaptive evolution using single point mutations. In effect, adaptive evolution isn’t “strong enough” to achieve the exact aspect ratio we want; we can think of it as being “unpredictably prevented” by computationally irreducible “developmental constraints”.

OK, so what about the symmetric k = 3, r = 1 rules? Here’s how they’re distributed in width and height:

And, yes, in a typical “there are always surprises” story, there’s a strange height 265, width 173 pattern that shows up:

The overall possible aspect ratios are now

and their (log) distribution is:

The phenotypes with the largest aspect ratios are

while those with the smallest aspect ratios are:

Once again, to reach a larger range of aspect ratios, one has to specify more cases in the rule:

If we try to target a certain aspect ratio, there’s somewhat more of a tendency to get stuck than for k = 2, r = 2 rules—perhaps somewhat as a result of there now being fewer total rules (though more phenotypes) available:

Branching in the Multiway Evolution Graph

Looking at a typical multiway evolution graph such as

we see that different phenotypes can be quite separated in the graph—a bit like organisms on different branches of the tree of life in actual biology. But how can we characterize this separation? One approach is to compute the so-called dominator tree of the graph:

We can think of this as a way to provide a map of the least common ancestors of all nodes. The tree is set up so that given two nodes you just trace up the tree to find their common ancestor. Another interpretation of the tree is that it shows you what nodes you have no choice but to pass through in getting from the initial node to any given node—or, in other words, what phenotypes adaptive evolution has to produce on the way to a given phenotype.
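
Here is a sketch of how such a dominator tree can be computed directly from its definition: a node d dominates a node v if every path from the root to v passes through d, and the immediate dominator of v is the last such node before v on any path. This brute-force version (which enumerates all paths, and assumes every node is reachable from the root) is fine for graphs of the sizes shown here, though there are much more efficient algorithms:

dominatorTree[g_Graph, root_] :=
 Module[{idom},
  idom[v_] := Module[{paths = FindPath[g, root, v, Infinity, All], doms},
    doms = Intersection @@ paths;
    Last[Select[Most[First[paths]], MemberQ[doms, #] &]]];
  Graph[Table[idom[v] -> v, {v, DeleteCases[VertexList[g], root]}]]]

(* toy stand-in for a multiway evolution graph: *)
dominatorTree[Graph[{1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4, 4 -> 5, 4 -> 6}], 1]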

Here’s another rendering of the tree:

We can think of this as the analog of the biological tree of life, with successive branchings picking out finer and finer “taxonomic domains” (analogous to kingdoms, phyla, etc.)

The tree also shows us something else: how significant different links or nodes are—and how much of the tree one would “lop off” if they were removed. Or, put a different way, how much would be achieved by blocking a certain link or node—as one might imagine doing to try to block the evolution of bacteria or tumor cells?

What if we look at larger multiway evolution graphs, like the complete k = 2, r = 2 one? Once again we can construct a dominator tree:

It’s notable that there’s tremendous variance in the “fan out” here, with the phenotypes with largest successor counts being the rather undistinguished:

But what if one’s specifically trying to reach, say, one of the maximum lifetime (length 308) phenotypes? Well, then one has to follow the paths in a particular subgraph of the original multiway evolution graph

corresponding to the phenotype graph:

If one goes off this “narrow path” then one simply can’t reach the length-308 phenotype; one inevitably gets stuck in what amounts to another branch of the analog of the “tree of life”. So if one is trying to “guide evolution” to a particular outcome, this tells one that one needs to block off lots of “exit ramps”.

But what “fraction of the whole graph” is the subgraph that leads to the length-308 phenotype? The whole graph has 2409 vertices and 3878 edges, while the subgraph has 64 vertices and 119 edges, i.e. in both cases about 3%. A different measure is what fraction of all paths through the graph lead to the length-308 phenotype. The total number of paths is 606,081, while the number leading to the length-308 phenotype is 1260, or about 0.2%. Does this tell us what the probability of reaching that phenotype will be if we just make a random sequence of mutations? Not quite, because in the multiway evolution graph many equivalencings have been done, notably for fitness-neutral sets. And if we don’t do such equivalencings, it turns out (as we’ll discuss below) that the corresponding number is significantly smaller—about 0.007%.

Exact-Match Fitness Functions

The fitness functions we’ve been considering so far look only at coarse features of phenotype patterns—like their height, width and aspect ratio. But what happens if we have a fitness function that’s maximal only for a phenotype that exactly matches a particular pattern?

As an example, let’s consider k = 2, r = 1 cellular automata with phenotypes grown for a specific number of steps—and with a fitness function that counts the number of cells that agree with ones in a target:

Let’s say we start with the null rule, then adaptively evolve by making single point mutations to the rule (here just 8 bits). With a target of the rule 30 pattern, this is the multiway graph we get:

And what we see is that after a grand tour of nearly a third of all possible rules, we can successfully reach the rule 30 pattern. But we can also get stuck at rule 86 and rule 190 patterns—even though their fitness values are much lower:

If we consider all possible k = 2, r = 1 cellular automaton patterns as targets, it turns out that these can always be reached by adaptive evolution from the null rule—though a little less than half the time there are other possible endpoints (here specified by rule numbers) at which the evolution process can get stuck:
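
This whole setup is small enough to reconstruct directly. Here is a runnable sketch over all 256 k = 2, r = 1 rules, with phenotypes grown for a fixed number of steps, fitness counting cells that agree with the rule 30 target (one reading of the fitness described above), and edges joining phenotypes related by a single bit mutation that strictly increases fitness. Unlike the graphs above, this is not restricted to what is reachable from the null rule, and it ignores fitness-neutral equivalencing; the number of steps is an illustrative choice:

steps = 8;
phenotype[n_] := CellularAutomaton[n, {{1}, 0}, steps];
target = phenotype[30];
fitness[p_] := Total[1 - Abs[p - target], 2];  (* number of agreeing cells *)
mutants[n_] := Table[BitXor[n, 2^i], {i, 0, 7}];

edges = DeleteDuplicates@Flatten@Table[
    Cases[mutants[n], m_ /; fitness[phenotype[m]] > fitness[phenotype[n]] :>
      DirectedEdge[phenotype[n], phenotype[m]]], {n, 0, 255}];
g = Graph[edges];

(* phenotypes where evolution can get stuck: no outgoing fitness-increasing mutation *)
ArrayPlot /@ Select[VertexList[g], VertexOutDegree[g, #] == 0 &]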

So far we’ve been assuming that we have a fitness function that’s maximized by matching some pattern generated by a cellular automaton. But what if we pick some quite different pattern to match against? Say our pattern is:

With k = 2, r = 1 rules (running with wraparound in a finite-size region), we can construct a multiway graph

and find out that the maximum fitness endpoints are the not-very-good approximations:

We can also get to these by applying random mutations:

But what if we try a larger rule space, say k = 2, r = 2 rules? Our approximations to the “A” image get a bit better:

Going to k = 2, r = 3 leads to slightly better (but not great) final approximations:

If we try to do the same thing with our target instead being

we get for example

while with target

we get (even less convincing) results like:

What’s going on here? Basically it’s that if we try to set up too intricate a fitness function, then our rule spaces won’t contain rules that successfully maximize it, and our adaptive evolution process will end up with a variety of not-very-good approximations.

How Fitness Builds Up

When one looks at an evolution process like

one typically has the impression that successive phenotypes are achieving greater fitness by somehow progressively “building on the ideas” of earlier ones. And to get a more granular sense of this we can highlight cells at each step that are using “newly added cases” in the rule:

We can think of new rule cases as a bit like new genes in biology. So what we’re seeing here is the analog of new genes switching on (or coming into existence) as we progress through the process of biological evolution.
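
Here is a sketch of how this kind of highlighting can be done: run the cellular automaton by hand, and mark every new cell whose value came from a rule case on which the current genotype differs from its predecessor. Rules are again k = 2, r = 2 output lists; the helper names are illustrative, and oldRule and newRule stand for two successive breakthrough genotypes:

caseIndex[nbhd_List] := 32 - FromDigits[nbhd, 2];  (* position of a 5-cell neighborhood in the 32-long output list *)

markedEvolution[newRule_List, oldRule_List, steps_Integer] :=
 Module[{newCases, row = {1}, rows = {{1}}, marks = {{0}}, idxs},
  newCases = Flatten@Position[MapThread[UnsameQ, {newRule, oldRule}], True];
  Do[
   idxs = caseIndex /@ Partition[ArrayPad[row, 4], 5, 1];
   row = newRule[[idxs]];
   AppendTo[rows, row];
   AppendTo[marks, Boole[MemberQ[newCases, #]] & /@ idxs],
   {steps}];
  {rows, marks}]  (* the pattern itself, plus 1s wherever a newly changed rule case was used *)

(* usage: markedEvolution[newRule, oldRule, 50], given two successive 32-bit breakthrough genotypes;
   the ragged rows can be centered and padded for display with ArrayPlot *)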

Here’s what happens for some other paths of evolution:

What we see is quite variable. There are a few examples where new rule cases show up only at the end, as if a new “incrementally engineered” pattern was being “grafted on at the end”. But most of the time new rule cases show up sparsely dotted all over the pattern. And somehow those few “tweaks” lead to higher fitness—even though there’s no obvious reason why, and no obvious way to predict where they should be.

It’s interesting to compare this with actual biology, where it’s pretty common to see what appear to be “random gratuitous changes” between apparently very similar organisms. (And, yes, this can lead to all sorts of problems in things like comparing toxicity or drug effectiveness in model animals versus humans.)

There are many ways to consider quantitatively characterizing how “rule utilization” builds up. As just one example, here are plots, for successive phenotypes along the evolution paths shown above, of the stages in growth at which new rule cases show up:

But Is It Explainable?

Here are two “adaptively evolved” long-lifetime rules that we discussed at the beginning:

We can always run these rules and see what patterns they produce. But is there a way to explain what they do? And for example to analyze how they manage to yield long lifetimes? Or is what we’re seeing in these rules basically “pure computational irreducibility” where the only way to tell what patterns they will generate—and how long they’ll live—is just explicitly to run them step by step?

The second rule here seems to have a bit more regularity than the first, so let’s tackle it first. Let’s look at the “blade” part. Once such an object—of any width—has formed, its behavior will basically be repetitive, and it’s easy to predict what will happen:

The left-hand edge moves by 1 position every 7 steps, and the right-hand edge by 4 positions every 12 steps. And since , however wide the initial configuration is, it’ll always die out, after a number of steps that’s roughly times the initial width.

But OK, how does a configuration like this get produced? Well, that’s far from obvious. Here’s what happens with a sequence of few-cell initial conditions …:

So, yes, it doesn’t always directly make the “blade”. Sometimes, for example, it instead makes things like these, some of which basically just become repetitive, and live forever:

And even if it starts with a “blade texture” unexpected things can happen:

There are repetitive patterns that can persist—and indeed the “blade” uses one of these:

Starting from a random initial condition one sees various kinds of behavior, with the blade being fairly common:

But none of this really makes much of a dent in “explaining” why with this rule, starting from a single red cell, we get a long-lived pattern. Yes, once the “blade” forms, we know it’ll take a while to come to a point. But beyond this little pocket of computational reducibility we can’t say much in general about what the rule does—or why, for example, a blade forms with this initial condition.

So what about our other rule? There’s no obvious interesting pocket of reducibility there at all. Looking at a sequence of few-cell initial conditions we get:

And, yes, there’s all sorts of different behavior that can occur:

The first of these patterns is basically periodic, simply shifting 2 cells to the left every 56 steps. The third one dies out after 369 steps, and the fourth one becomes basically periodic (with period 56) after 1023 steps:

If we start from a random initial condition we see a few places where things die out in a repeatable pattern. But mostly everything just looks very complicated:

As always happens, the rule supports regions of repetitive behavior, but they don’t normally extend far enough to introduce any significant computational reducibility:

So what’s the conclusion? Basically it’s that these rules—like pretty much all others we’ve seen here—behave in essentially computationally irreducible ways. Why do they have long lifetimes? All we can really say is “because they do”. Yes, we can always run them and see what happens. But we can’t make any kind of “explanatory theory”, for example of the kind we’re used to in mathematical approaches to physics.

Distribution in Morphospace

We can think of the pattern of growth seen in each phenotype as defining what we might call in biology its “morphology”. So what happens if we try to operate as “pure taxonomists”, laying out different phenotypes in “morphospace”? Here’s a result based on using machine learning and FeatureSpacePlot:

And, yes, this tends to group “visually similar” phenotypes together. But how does proximity in morphospace relate to proximity in genotypes? Here is the same arrangement of phenotypes as above, but now indicating the transformations associated with single mutations in genotype:

If for example we consider maximizing height, only some of the phenotypes are picked out:

For width, a somewhat different set are picked out:

And here is what happens if our fitness function is based on aspect ratio:

In other words, different fitness functions “select out” different regions in morphospace.
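
As a rough sketch of how this kind of morphospace layout can be produced: rasterize each phenotype pattern and let FeatureSpacePlot arrange the resulting images by visual similarity. The random sample of rules here is purely for illustration:

patterns = Table[Image[ArrayPlot[CellularAutomaton[{n, 2, 2}, {{1}, 0}, 30]]],
   {n, RandomInteger[2^32 - 1, 60]}];
FeatureSpacePlot[patterns]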

We can also construct a morphospace not just for symmetric but for all k = 2, r = 2 rules:

The detailed pattern here is not particularly significant, and, more than anything, just reflects the method of dimension reduction that we’ve used. What is more meaningful, however, is how different fitness functions select out different regions in morphospace. This shows the results for fitness functions based on height and on width—with points colored according to the actual values of height and width for those phenotypes:

Here are the corresponding results for fitness functions based on different aspect ratios, where now the coloring is based on closeness to the target aspect ratio:

What’s the main conclusion here? We might have expected that different fitness functions would cleanly select visibly different parts of morphospace. But at least with our machine-learning-based way of laying out morphospace that’s not what we’re seeing. And it seems likely that this is actually a general result—and that there is no layout procedure that can make any “easy to describe” fitness function “geometrically simple” in morphospace. And once again, this is presumably a consequence of underlying computational irreducibility—and of the fact that we can’t expect any morphospace layout procedure to be able to provide a way to “untangle the irreducibility” that will work for all fitness functions.

Probabilities and the Time Course of Evolution

In what we’ve done so far, we’ve mostly been concerned with things like what sequences of phenotypes can ever be produced by adaptive evolution. But in making analogies to actual biological evolution—and particularly to how it’s captured in the fossil record—it’s also relevant to discuss time, and to ask not only what phenotypes can be produced, but also when, and how frequently.

For example, let’s assume there’s a constant rate of point mutations in time. Then starting from a given rule (like the null rule) there’ll be a certain rate at which transitions to other rules occur. Some of these transitions will lead to rules that are selected out. Others will be kept, but will yield the same phenotype. And still others will lead to transitions to different phenotypes.

We can represent this by a “phenotype transition diagram” in which the thickness of each outgoing edge from a given phenotype indicates the fraction of all possible mutations that lead to the transition associated with that edge:

Gray self-loops in this diagram represent transitions that lead back to the same phenotype (because they change cases in the rule that don’t matter). Pink self-loops correspond to transitions that lead to rules that are selected out. We don’t show rules that have been selected out here; instead we assume that in this case we just “wait at the original phenotype” and don’t make a transition.

We can annotate the whole symmetric k = 2, r = 2 multiway evolution graph with transition probabilities:

Underlying this graph is a matrix of transition probabilities between all 2^19 possible symmetric k = 2, r = 2 rules (where the structure reflects the fact that many rules transform to rules which differ only by one bit):

Keeping only distinct phenotypes and ordering by lifetime, we can then make a matrix of phenotype transition probabilities:

Treating the transitions as a Markov process, this allows us to compute the expected frequency of each phenotype as a function of time (i.e. number of mutations):

What’s basically happening here is that there’s steady evolution away from the single-cell phenotype. There are some intermediate phenotypes that come and go, but in the end, everything “flows” to the final (“leaf”) phenotypes on the multiway evolution graph—leading to a limiting “equilibrium” probability distribution:

Stacking the different curves, we get an alternative visualization of the evolution of phenotype frequencies:

If we were “running evolution” with enough separate individuals, these would be the limiting curves we’d get. If we reduced the number of individuals, we’d start to see fluctuations—and there’d be a certain probability, for example, for a particular phenotype to end up with zero individuals, and effectively go extinct.
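
The Markov computation itself is simple: given a (row-stochastic) matrix of per-mutation phenotype-to-phenotype transition probabilities, the expected frequencies after t mutations are just the initial distribution multiplied by the t-th matrix power. A sketch, with a small toy matrix standing in for the real one:

m = {{0.8, 0.15, 0.05}, {0., 0.9, 0.1}, {0., 0., 1.}};
init = {1., 0., 0.};  (* everything starts at the first phenotype *)
freqs[t_] := init . MatrixPower[m, t];
ListLinePlot[Transpose[Table[freqs[t], {t, 0, 60}]], PlotRange -> All]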

So what happens with a different fitness function? Here’s the result using width instead of height:

And here are results for fitness functions based on a sequence of targets for aspect ratio:

And, yes, the fitness function definitely influences the time course of our adaptive evolution process.

So far we’ve been looking only at symmetric k = 2, r = 2 rules. If we look at the space of all possible k = 2, r = 2 rules, the behavior we see is similar. For example, here’s the time evolution of possible phenotypes based on our standard height fitness function:

And this is what we see if we look only at the longest-lifetime (i.e. largest-height) cases:

As the scale here indicates, such long-lived phenotypes are quite rare—though most still occur with nonzero frequency even after arbitrarily large times (which is inevitable given that they appear as “maximal fitness” terminal nodes in the multiway graph).

And indeed if we plot the final frequencies of phenotypes against their lifetimes we see that there are a wide range of different cases:

The phenotypes with the highest “equilibrium” frequencies are

with some having fairly small lifetimes, and others larger.

The Macroscopic Flow of Evolution

In the previous section, we looked at the time course of evolution with various different—but fixed—fitness functions. But what if we had a fitness function that changes with time—say analogous to an environment for biological evolution that changes with time?

Here’s what happens if we have an aspect ratio fitness function whose target value increases linearly with time:

The behavior we see is quite complex, with certain phenotypes “winning for a while” but then dying out, often quite precipitously—with others coming to take their place.

If instead the target aspect ratio decreases with time, we see rather different behavior:

(The discontinuous derivatives here are basically associated with the sudden appearance of new phenotypes at particular target aspect ratio values.)

It’s also possible to give a “shock to the system” by suddenly changing the target aspect ratio:

And what we see is that sometimes this shock leads to fewer surviving phenotypes, and sometimes to more.

We can think of a changing fitness function as being something that applies a “macroscopic driving force” to our system. Things happen quickly down at the level of individual mutation and selection events—but the fitness function defines overall “goals” for the system that in effect change only slowly. (It’s a bit like a fluid where there are fast molecular-scale processes, but typically slow changes of macroscopic parameters like pressure.)

But if the fitness function defines a goal, how well does the system manage to meet it? Here’s a comparison between an aspect ratio goal (here, linearly increasing) and the distribution of actual aspect ratios achieved, with the darker curve indicating the mean aspect ratio obtained by a weighted average over phenotypes, and the lighter blue area indicating the standard deviation:

And, yes, as we might have expected from earlier results, the system doesn’t do particularly well at achieving the goal. Its behavior is ultimately not “well sculpted” by the forces of a fitness function; instead it is mostly dominated by the intrinsic (computationally irreducible) dynamics of the underlying adaptive evolution process.

One important thing to note however is that our results depend on the value of a parameter: essentially the rate at which underlying mutations occur relative to the rate of change of the fitness function. In the picture above 5000 mutations occur over the time the fitness function goes from minimum to maximum value. This is what happens if we change the number of mutations that occur (or, in effect, the “mutation rate”):

Generally—and not surprisingly—adaptive evolution does better at achieving the target when the mutation rate is higher, though in both the cases shown here, nothing gets terribly close to the target.

In their general character our results here seem reminiscent of what one might expect in typical studies of continuum systems, say based on differential equations. And indeed one can imagine that there might be “continuum equations of adaptive evolution” that govern situations like the ones we’ve seen here. But it’s important to understand that it’s far from self-evident that this is possible. Because underneath everything is a multiway evolution graph with a definite and complicated structure. And one might think that the details of this structure would matter to the overall “continuum evolution process”. And indeed sometimes they will.

But—as we have seen throughout our Physics Project—underlying computational irreducibility leads to a certain inevitable simplicity when looking at phenomena perceived by computationally bounded observers. And we can expect that something similar can happen with biological evolution (and indeed adaptive evolution in general). Assuming that our fitness functions (and their process of change) are computationally bounded, then we can expect that their “aggregate effects” will follow comparatively simple laws—which we can perhaps think of as laws for the “flow of evolution” in response to external input.

Can Evolution Be Reversed?

In the previous section we saw that with different fitness functions, different time series of phenotypes appear, with some phenotypes, for example, sometimes “going extinct”. But let’s say evolution has proceeded to a certain point with a particular fitness function—and certain phenotypes are now present. Then one question we can ask is whether it’s possible to “reverse” that evolution, and revert to phenotypes that were present before. In other words, if we change the fitness function, can we make evolution “go backwards”?

We’ve often discussed a fitness function based on maximizing total (finite) lifetime. But what if, after using this fitness function for a while, we “reverse it”, now minimizing total lifetime?

Consider the multiway evolution graph for symmetric k = 2, r = 2 rules starting from the null rule, with the fitness function once again being maximization of lifetime:

But what if we now say the fitness function minimizes lifetime? If we start from the longest-lifetime phenotype we get the “lifetime minimization” multiway graph:

We can compare this “reversed graph” to the “forward graph” based on all paths from the null rule to the maximum-lifetime rule:

And in this case we see that the phenotypes that occur are almost the same, with the exception of the fact that can appear in the reverse case.

So what happens when we look at all k = 2, r = 2 rules? Here’s the “reverse graph” starting from the longest-lifetime phenotype:

A total of 345 phenotypes appear here, eventually leading all the way back to . In the overall “forward graph” (which has to start from rather than ) a total of 2409 phenotypes appear, though (as we saw above) only 64 occur in paths that eventually lead to the maximum lifetime phenotype:

And what we see here is that the forward and reverse graphs look quite different. But could we perhaps construct a fitness function for the reverse graph that will successfully corral the evolution process to precisely retrace the steps of the forward graph?

In general, this isn’t something we can expect to be able to do. Because to do so would in effect require “breaking the computational irreducibility” of the system. It would require having a fitness function that can in essence predict every detail of the evolution process—and in so doing be in a position to direct it. But to achieve this, the fitness function would in a sense have to be computationally as sophisticated as the evolution process itself.

It’s a variant of an argument we’ve used several times here. Realistic fitness functions are computationally bounded (and in practice often very coarse). And that means that they can’t expect to match the computational irreducibility of the underlying evolution process.

There’s an analogy to the Second Law of thermodynamics. Just as the microscopic collisions of individual molecules are in principle easy to reverse, so potentially are individual transitions in the evolution graph. But putting many collisions or many transitions together leads to a process that is computationally sophisticated enough that the fairly coarse means at our disposal can’t “decode” and reverse it.

Put another way, there is in practice a certain inevitable irreversibility to both molecular dynamics and biological evolution. Yes, with enough computational effort—say carefully controlling the fitness function for every individual organism—it might in principle be possible to precisely “reverse evolution”. But in practice the kinds of fitness functions that exist in nature—or that one can readily set up in a lab—are computationally much too weak. And as a result one can’t expect to be able to get evolution to precisely retrace its steps.

Random or Selected? Can One Tell?

Given only a genotype, is there a way to tell whether it’s “just random” or whether it’s actually the result of some long and elaborate process of adaptive evolution? From the genotype one can in principle use the rules it defines to “grow” the corresponding phenotype—and then look at whether it has an “unusually large” fitness. But the question is whether it’s possible to tell anything directly from the genotype, without going through the computational effort of generating the phenotype.

At some level it’s like asking whether, say, from a cellular automaton rule, one can predict the ultimate behavior of the cellular automaton. And a core consequence of computational irreducibility is that one can’t in general expect to do this. Still, one might imagine that one could at least make a “reasonable guess” about whether a genotype is “likely” to have been chosen “purely randomly” or to have been “carefully selected”.

To explore this, we can look at the genotypes for symmetric k = 2, r = 2 rules, say ordered by their lifetime-based fitness—with black and white here representing “required” rule cases, and gray representing undetermined ones (which can all independently be either black or white):

On the right is a summary of how many white, black and undetermined (gray) outcomes are present in each genotype. And as we have seen several times, to achieve high fitness all or almost all of the outcomes must be determined—so that in a sense all or almost all of the genome is “being used”. But we still need to ask whether, given a certain actual pattern of outcomes, we can successfully guess whether or not a genotype is the result of selection.

To get more of a sense of this, we can look at plots of the probabilities for different outcomes for each case in the rule, first (trivially) for all combinatorially possible genotypes, then for all genotypes that give viable (i.e. in our case, finite-lifetime) phenotypes, and then for “selected genotypes”:

Certain cases are always completely determined for all viable genomes—but rather trivially so, because, for example, if then the pattern generated will expand at maximum speed forever, and so cannot have a finite lifetime.

So what happens for all k = 2, r = 2 rules? Here are the actual genomes that lead to particular fitness levels:

And now here are the corresponding probabilities for different outcomes for each case in the rule:

And, yes, given a particular setup we could imagine working out from results like these at least an approximation to the likelihood for a given randomly chosen genome to be a selected one. But what’s true in general? Is there something that can be determined with bounded computational effort (i.e. without explicitly computing phenotypes and their fitnesses) that gives a good estimate of whether a genome is selected? There are good reasons to believe that computational irreducibility will make this impossible.

It’s a different story, of course, if one’s given a “fully computed” phenotype. But at the genome level—without that computation—it seems unlikely that one can expect to distinguish random from “selected-somehow” genotypes.

Adaptive Evolution of Initial Conditions

In making our idealized model of biological evolution we’ve focused (as biology seems to) on the adaptive evolution of the genotype—or, in our case, the underlying rule for our cellular automata. But what if instead of changing the underlying rule, we change the initial condition used to “grow each organism”?

For example, let’s say that we start with the “single cell” we’ve been using so far, but then at each step in adaptive evolution we change the value of one cell in the initial condition (say within a certain distance of our original cell)—then keep any initial condition that does not lead to a shorter lifetime:

The sequence of lifetimes (“fitness values”) obtained in this process of adaptive evolution is

and the “breakthrough” initial conditions are:

The basic setup is similar to what we’ve seen repeatedly in the adaptive evolution of rules rather than initial conditions. But one immediate difference is that, at least in the example we’ve just seen, changing initial conditions does not as obviously “introduce new ideas” for how to increase lifetime; instead, it gives more of an impression of just directly extending “existing ideas”.
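
The initial-condition variant of the adaptive loop can be sketched in direct analogy to the rule-mutation version: the rule is now fixed, and each mutation flips one cell of the initial condition within a fixed-width window. Here ruleNumber is a placeholder for whichever k = 2, r = 2 rule one is studying, and the other names are illustrative:

icLifetime[ruleNumber_Integer, init_List, maxSteps_ : 300] :=
 Module[{evol = CellularAutomaton[{ruleNumber, 2, 2}, {init, 0}, maxSteps]},
  SelectFirst[Range[Length[evol]], Max[evol[[#]]] == 0 &, Infinity] - 1]

mutateIC[init_List] :=
 Module[{i = RandomInteger[{1, Length[init]}]}, ReplacePart[init, i -> 1 - init[[i]]]]

evolveIC[ruleNumber_Integer, width_Integer, n_Integer] :=
 NestList[
  Module[{cand = mutateIC[#], f},
    f = icLifetime[ruleNumber, cand];
    If[f < Infinity && f >= icLifetime[ruleNumber, #], cand, #]] &,
  CenterArray[{1}, width], n]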

So what happens more generally? Rules with k = 2, r = 1 tend to show either infinite growth or no growth—with finite lifetimes arising only from direct “erosion” of initial conditions (here for rules 104 and 164):

For k = 2, r = 2 rules the story is more complicated, even in the symmetric case. Here are the sequences of longest lifetime patterns obtained with all possible progressively wider initial conditions with various rules:

Again, there is a certain lack of “fundamentally new ideas” in evidence, though there are definitely “mechanisms” that get progressively extended with larger initial conditions. (One notable regularity is that the maximum lifetimes of patterns often seem roughly proportional to the width of initial condition allowed.)

Can adaptive evolution “discover more”? Typically, when it’s just modifying initial conditions in a fixed region, it doesn’t seem so—again it seems to be more about “extending existing mechanisms” than introducing new ones:

2D Cellular Automata

Everything we’ve done so far has been for 1D cellular automata. So what happens if we go to 2D? In the end, the story is going to be very similar to 1D—except that the rule spaces even for quite minimal neighborhoods are vastly larger.

With k = 2 colors, it turns out that with a 5-cell neighborhood one can’t “escape from the null rule” by single point mutations. The issue is that any single case one adds in the rule will either do nothing, or will lead only to unbounded growth. And even with a 9-cell neighborhood one can’t get rules that show growth that is neither limited nor infinite with a single-cell initial condition. But with a larger (multicell) initial condition this is possible, and for example here is a sequence of phenotype patterns generated by adaptive evolution using lifetime as a fitness function:

Here’s what these patterns look like when “viewed from above”:

And here’s how the fitness progressively increases in this case:

There are a total of 2^512 ≈ 10^154 possible 9-neighbor rules, and in this vast rule space it’s easy for adaptive evolution to find rules with long finite lifetimes. (By the way, I’ve no idea what the absolute maximum “busy beaver” lifetime in this space is.)
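
Here is a sketch of how this 2D setup can be put together, with a genotype given as an explicit map from 3×3 (9-cell) neighborhoods to new cell values. The all-zero neighborhood is excluded from mutation so the background stays white, and the 3×3 block used as the initial condition is just an illustrative stand-in for the multicell initial condition mentioned above; none of this is optimized, and random mutation sequences may take many steps to get anywhere:

nbhds2D = Tuples[{0, 1}, {3, 3}];
nullRule2D = AssociationThread[nbhds2D -> ConstantArray[0, Length[nbhds2D]]];
init2D = ConstantArray[1, {3, 3}];

step2D[rule_Association, state_] :=
 Module[{p = ArrayPad[state, 2]},
  Table[rule[p[[i - 1 ;; i + 1, j - 1 ;; j + 1]]],
   {i, 2, Length[p] - 1}, {j, 2, Length[First[p]] - 1}]]

lifetime2D[rule_Association, init_, maxSteps_ : 50] :=
 Module[{state = init, t = 0},
  While[Max[state] == 1 && t < maxSteps, state = step2D[rule, state]; t++];
  If[Max[state] == 1, Infinity, t]]

mutate2D[rule_Association] :=
 Module[{n = RandomChoice[DeleteCases[nbhds2D, ConstantArray[0, {3, 3}]]]},
  Append[rule, n -> 1 - rule[n]]]

evolve2D[n_Integer] :=
 NestList[
  Module[{cand = mutate2D[#], f},
    f = lifetime2D[cand, init2D];
    If[f < Infinity && f >= lifetime2D[#, init2D], cand, #]] &, nullRule2D, n]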

Just as in 1D, there’s a fair amount of variation in the behavior one sees. Here are some examples of the “final rules” for various instances of adaptive evolution:

In a few cases one can readily “see the mechanism” for the lifetime—say associated with collisions between localized structures. But mostly, as in the other examples we’ve seen, there’s no realistic “narrative explanation” for how these rules achieve long yet finite lifetimes.

The Turing Machine Case

OK, so we’ve now looked at 2D as well as 1D cellular automata. But what about systems that aren’t cellular automata at all? Will we still see the same core phenomena of adaptive evolution that we’ve identified in cellular automata? The Principle of Computational Equivalence would certainly lead one to expect that we would. But to check at least one example let’s look at Turing machines.

Here’s a Turing machine with s = 3 states for its head, and k = 2 colors for cells on its tape:

The Turing machine is set up to halt if it ever reaches a case in the rule where the output is . Starting from a blank initial condition, this particular Turing machine halts after 19 steps.

So what happens if we try to adaptively evolve Turing machines with long lifetimes (i.e. that take many steps to halt)? Say we start from a “null rule” that halts in all cases, and then we make a sequence of single point mutations in the rule, keeping ones that don’t lead the Turing machine to halt in fewer steps than before. Here’s an example where the adaptive evolution eventually reaches a Turing machine that takes 95 steps to halt:

The sequence of (“breakthrough”) mutations involved here is

corresponding to a fitness curve of the form:

And, yes, all of this is very analogous to what we’ve seen in cellular automata. But one difference is that with Turing machines there are routinely much larger jumps in halting times. And the basic reason for this is just that Turing machines have much less going on at any particular step than typical cellular automata do—so it can take them much longer to achieve some particular state, like a halting state.
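
Here is a sketch of this Turing machine version, with rules represented as maps from {state, tape color} to either {new state, new color, offset} or a halt marker, and fitness given by the number of steps to halt (Infinity, treated as non-viable, if there is no halt within some cap). The encoding and names are illustrative, not the actual setup behind the pictures:

runTM[rule_Association, maxSteps_ : 10000] :=
 Module[{tape = <||>, pos = 0, state = 1, t = 0, out, halted = False},
  While[t < maxSteps && ! halted,
   out = rule[{state, Lookup[tape, pos, 0]}];
   If[out === "Halt", halted = True,
    tape[pos] = out[[2]]; state = out[[1]]; pos += out[[3]]; t++]];
  If[halted, t, Infinity]]

s = 3; k = 2;  (* head states and tape colors *)
cases = Tuples[{Range[s], Range[0, k - 1]}];
outcomes = Append[Tuples[{Range[s], Range[0, k - 1], {-1, 1}}], "Halt"];
nullTM = AssociationThread[cases -> ConstantArray["Halt", Length[cases]]];

mutateTM[rule_Association] := Append[rule, RandomChoice[cases] -> RandomChoice[outcomes]];

evolveTM[n_Integer] :=
 NestList[
  Module[{cand = mutateTM[#], f},
    f = runTM[cand];
    If[f < Infinity && f >= runTM[#], cand, #]] &, nullTM, n]

ListStepPlot[runTM /@ evolveTM[2000]]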

Here’s an example of adaptive evolution in the space of s = 3, k = 3 Turing machines—and in this case the final halting time is long enough that we’ve had to squash the image vertically (by a factor of 5):

The fitness curve in this case is best viewed on a logarithmic scale:

But while the largest-lifetime cellular automata that we saw above typically seemed to have very complex behavior, the largest-lifetime Turing machine here seems, at least on the face of it, to operate in a much more “systematic” and “mechanical” way. And indeed this becomes even more evident if we compress our visualization by looking only at steps on which the Turing machine head reverses its direction:

Long-lifetime Turing machines found by adaptive evolution are not always so simple, though they still tend to show more regularity than long-lifetime cellular automata:

But—presumably because Turing machines are “less efficient” than cellular automata—the very longest possible lifetimes can be very large. It’s not clear whether rules with such lifetimes can be found by adaptive evolution—not least because even to evaluate the fitness function for any particular candidate rule could take an unbounded time. And indeed among s = 3, k = 3 rules the very longest possible is about 10^17 steps—achieved by the rule

with the following “very pedantic behavior”:

So what about multiway evolution graphs? There are a total of 20,736 s = 2, k = 2 Turing machines with halting states allowed. From these there are 37 distinct finite-lifetime phenotypes:

Just as in other cases we’ve investigated, there are fitness-neutral sets such as:

Taking just one representative from each of these 18 sets, we can then construct a multiway evolution graph for 2,2 Turing machines with lifetime as our fitness function:

Here’s the analogous result for 3,2 Turing machines—with 2250 distinct phenotypes, and a maximum lifetime of 21 steps (the patterns produced by the machines are shown here just as “slabs”):

We could pick other fitness functions (like maximum pattern width, number of head reversals, etc.) But the basic structure and consequences of adaptive evolution seem to work very much the same in Turing machines as in cellular automata—much as we expect from the Principle of Computational Equivalence.

Multiway Turing Machines

Ordinary Turing machines (as well as ordinary cellular automata) in effect always follow a single path of history, producing a definite sequence of states based on their underlying rule. But it’s also possible to study multiway Turing machines in which many paths of history can be followed. Consider for example the rule:

One of the cases in this rule has two possible outcomes—so this is a multiway system, and to represent its behavior we need a multiway graph:

From a biological point of view, we can potentially think of such a multiway system as an idealized model for a process of adaptive evolution. So now we can ask: can we evolve this evolution? Or, in other words, can we apply adaptive evolution to systems like multiway Turing machines?

As an example, let’s assume that we make single point mutation changes to just one case in a multiway Turing machine rule:

Many multiway Turing machines won’t halt, or at least won’t halt on all their branches. But for our fitness function let’s assume we require multiway Turing machines to halt on all branches (or at least go into loops that revisit the same states), and then let’s take the fitness to be the total number of nodes in the multiway graph when everything has halted. (And, yes, this is a direct generalization of our lifetime fitness function for ordinary Turing machines.)

So with this setup here are some examples of sequences of “breakthroughs” in adaptive evolution processes:


But what about looking at all possible paths of evolution for multiway Turing machines? Or, in other words, what about making a multiway graph of the evolution of multiway Turing machines?

Here’s an example of what we get by doing this (showing at each node just a single example of a fitness-neutral set):

So what’s really going on here? We’ve got a multiway graph of multiway graphs. But it’s worth understanding that the inner and outer multiway graphs are a bit different. The outer one is effectively a rulial multiway graph, in which different parts correspond to following different rules. The inner one is effectively a branchial multiway graph, in which different parts correspond to different ways of applying a particular rule. Ultimately, though, we can at least in principle expect to encode branchial transformations as rulial ones, and vice versa.

So we can think of the adaptive evolution of multiway Turing machines as a first step in exploring “higher-order evolution”: the evolution of evolution, etc. And ultimately in exploring inevitable limits of recursive evolution in the ruliad—and how these might relate to the formation of observers in the ruliad.

Some Conclusions

What does all this mean for the foundations of biological evolution? First and foremost, it reinforces the idea of computational irreducibility as a dominant force in biology. One might have imagined that what we see in biology must have been “carefully sculpted” by fitness constraints (say imposed by the environment). But what we’ve found here suggests that instead much of what we see is actually just a direct reflection of computational irreducibility. And in the end, more than anything else, what biological evolution seems to be doing is to “recruit” lumps of irreducible computation, and set them up so as to achieve “fitness objectives”.

It is, as I recently discovered, very similar to what happens in machine learning. And in both cases this picture implies that there’s a limit to the kind of explanations one can expect to get. If one asks why something has the form it does, the answer will often just be: “because that’s the lump of irreducible computation that happened to be picked up”. And there isn’t any reason to think that there’ll be a “narrative explanation” of the kind one might hope for in traditional science.

The simplicity of our models makes it possible to study not just particular possible paths of adaptive evolution, but complete multiway graphs of all possible paths. And what we’ve seen here is that fitness functions in effect define a kind of traversal order or (roughly) foliation for such multiway graphs. If such foliations could be arbitrarily complex, then they could potentially pick out specific outcomes for evolution—in effect successfully “sculpting biology” through the details of natural selection and fitness functions.

But the point is that fitness functions and resulting foliations of multiway evolution graphs don’t get arbitrarily complex. And even as the underlying processes by which phenotypes develop are full of computational irreducibility, the fitness functions that are applied are computationally bounded. And in effect the complexity that is perhaps the single most striking immediate feature of biological systems is therefore a consequence of the interplay between the computational boundedness of selection processes, and the computational irreducibility of underlying processes of growth and development.

All of this relies on the fundamental idea that biological evolution—and biology—are at their core computational phenomena. And given this interpretation, there’s then a remarkable unification that’s emerging.

It begins with the ruliad—the abstract object corresponding to the entangled limit of all possible computational processes. We’ve talked about the ruliad as the ultimate foundation for physics, and for mathematics. And we now see that we can think of it as the ultimate foundation for biology too.

In physics what’s crucial is that observers like us “parse” the ruliad in certain ways—and that these ways lead us to have a perception of the ruliad that follows core known laws of physics. And similarly, when observers like us do mathematics, we can think of ourselves as “extracting that mathematics” from the way we parse the ruliad. And now what we’re seeing is that biology emerges because of the way selection from the environment, etc. “parses” the ruliad.

And what makes this view powerful is that we have to assume surprisingly little about how selection works to still be able to deduce important things about biology. In particular, if we assume that the selection operates in a computationally bounded way, then just from the inevitable underlying computational irreducibility “inherited” from the ruliad, we immediately know that biology must have certain features.

In physics, the Second Law of thermodynamics arises from the interplay of underlying computational irreducibility of mechanical processes involving many particles or other objects, and our computational boundedness as observers. We have the impression that “randomness is increasing” because as computationally bounded observers we can’t “decrypt” the underlying computational irreducibility.

What’s the analog of this in biology? Much as we can’t expect to “say what happens” in a system that follows the Second Law, so we can’t expect to “explain by selection” what happens in a biological system. Or, put another way, much of what we see in biology is just the way it is because of computational irreducibility—and try as we might it won’t be “explainable” by some fitness criterion that we can describe.

But that doesn’t mean that we can’t expect to deduce “general laws of biology”, much as there are general laws about gases whose detailed structure follows the Second Law. And in what we’ve done here we can begin to see some hints of what those general laws might look like.

They’ll be things like bulk statements about possible paths of evolution, and the effect of changing the constraints on them—a bit like laws of fluid mechanics but now applied to the rulial space of possible genotypes. But if there’s one thing that’s clear it’s that the minimal model we’ve developed of biological evolution has remarkable richness and potential. In the past it’s been possible to say things about what amounts to the pure combinatorics of evolution; now we can start talking in a structured way about what evolution actually does. And in doing this we go in the direction of finally giving biology a foundation as a theoretical science.

There’s So Much More to Study!

Even though this is my second long piece about my minimal model of biological evolution, I’ve barely scratched the surface of what can be done with it. First and foremost there are many detailed connections to be made with actual phenomena that have been observed—or could be observed—in biology. But there are also many things to be investigated directly about the model itself—and in effect much ruliology to be done on it. And what’s particularly notable is how accessible a lot of that ruliology is. (And, yes, you can click any picture here to get the Wolfram Language code that generates it.) What are some obvious things to do? Here are a few. Investigate other fitness functions. Other rule spaces. Other initial conditions. Other evolution strategies. Investigate evolving both rules and initial conditions. Investigate different kinds of changes of fitness functions during evolution. Investigate the effect of having a much larger rule space. Investigate robustness (or not) to perturbations.

In what I’ve done here, I’ve effectively aggregated identical genotypes (and phenotypes). But one could also investigate what happens if one in effect “traces every individual organism”. The result will be abstract structures that generalize the multiway systems we’ve shown here—and that are associated with higher levels of abstract formalism capable of describing phenomena that in effect go “below species”.

For historical notes see here »

Thanks

Thanks to Wolfram Institute fellows Richard Assar and Nik Murzin for their help, as well as to the supporters of the new Wolfram Institute initiative in theoretical biology. Thanks also to Brad Klee for his help. Related student projects were done at our Summer Programs this year by Brian Mboya, Tadas Turonis, Ahama Dalmia and Owen Xuan.

Since writing my first piece about biological evolution in March, I’ve had occasion to attend two biology conferences: SynBioBeta and WISE (“Workshop on Information, Selection, and Evolution” at the Carnegie Institution). I thank many attendees at both conferences for their enthusiasm and input. Curiously, before the WISE conference in October 2024 the last conference I had attended on biological evolution was more than 40 years earlier: the June 1984 Mountain Lake Conference on Evolution and Development.

On the Nature of Time

2024-10-09 05:41:58

The Computational View of Time

Time is a central feature of human experience. But what actually is it? In traditional scientific accounts it’s often represented as some kind of coordinate much like space (though a coordinate that for some reason is always systematically increasing for us). But while this may be a useful mathematical description, it’s not telling us anything about what time in a sense “intrinsically is”.

We get closer as soon as we start thinking in computational terms. Because then it’s natural for us to think of successive states of the world as being computed one from the last by the progressive application of some computational rule. And this suggests that we can identify the progress of time with the “progressive doing of computation by the universe”.

But does this just mean that we are replacing a “time coordinate” with a “computational step count”? No. Because of the phenomenon of computational irreducibility. With the traditional mathematical idea of a time coordinate one typically imagines that this coordinate can be “set to any value”, and that then one can immediately calculate the state of the system at that time. But computational irreducibility implies that it’s not that easy. Because it says that there’s often essentially no better way to find what a system will do than by explicitly tracing through each step in its evolution.

In the pictures on the left there’s computational reducibility, and one can readily see what the state will be after any number of steps t. But in the pictures on the right there’s (presumably) computational irreducibility, so that the only way to tell what will happen after t steps is effectively to run all those steps:
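As a minimal Wolfram Language sketch of the kind of contrast being described (using rule 250 as a computationally reducible stand-in and rule 30 as a presumably irreducible one; these particular rule choices are illustrative, not necessarily the ones in the original pictures):

  (* left: a computationally reducible rule; right: a presumably computationally irreducible one *)
  GraphicsRow[{
    ArrayPlot[CellularAutomaton[250, {{1}, 0}, 100]],
    ArrayPlot[CellularAutomaton[30, {{1}, 0}, 100]]}]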

And what this implies is that there’s a certain robustness to time when viewed in these computational terms. There’s no way to “jump ahead” in time; the only way to find out what will happen in the future is to go through the irreducible computational steps to get there.

There are simple idealized systems (say with purely periodic behavior) where there’s computational reducibility, and where there isn’t any robust notion of the progress of time. But the point is that—as the Principle of Computational Equivalence implies—our universe is inevitably full of computational irreducibility which in effect defines a robust notion of the progress of time.

The Role of the Observer

That time is a reflection of the progress of computation in the universe is an important starting point. But it’s not the end of the story. For example, here’s an immediate issue. If we have a computational rule that determines each successive state of a system it’s at least in principle possible to know the whole future of the system. So given this why then do we have the experience of the future only “unfolding as it happens”?

It’s fundamentally because of the way we are as observers. If the underlying system is computationally irreducible, then to work out its future behavior requires an irreducible amount of computational work. But it’s a core feature of observers like us that we are computationally bounded. So we can’t do all that irreducible computational work to “know the whole future”—and instead we’re effectively stuck just doing computation alongside the system itself, never able to substantially “jump ahead”, and only able to see the future “progressively unfold”.

In essence, therefore, we experience time because of the interplay between our computational boundedness as observers, and the computational irreducibility of underlying processes in the universe. If we were not computationally bounded, we could “perceive the whole of the future in one gulp” and we wouldn’t need a notion of time at all. And if there wasn’t underlying computational irreducibility there wouldn’t be the kind of “progressive revealing of the future” that we associate with our experience of time.

A notable feature of our everyday perception of time is that it seems to “flow only in one direction”—so that for example it’s generally much easier to remember the past than to predict the future. And this is closely related to the Second Law of thermodynamics, which (as I’ve argued at length elsewhere) is once again a result of the interplay between underlying computational irreducibility and our computational boundedness. Yes, the microscopic laws of physics may be reversible (and indeed if our system is simple—and computationally reducible—enough of this reversibility may “shine through”). But the point is that computational irreducibility is in a sense a much stronger force.

Imagine that we prepare a state to have orderly structure. If its evolution is computationally irreducible then this structure will effectively be “encrypted” to the point where a computationally bounded observer can’t recognize the structure. Given underlying reversibility, the structure is in some sense inevitably “still there”—but it can’t be “accessed” by a computationally bounded observer. And as a result such an observer will perceive a definite flow from orderliness in what is prepared to disorderliness in what is observed. (In principle one might think it should be possible to set up a state that will “behave antithermodynamically”—but the point is that to do so would require predicting a computationally irreducible process, which a computationally bounded observer can’t do.)

One of the longstanding confusions about the nature of time has to do with its “mathematical similarity” to space. And indeed ever since the early days of relativity theory it’s seemed convenient to talk about “spacetime” in which notions of space and time are bundled together.

But in our Physics Project that’s not at all how things fundamentally work. At the lowest level the state of the universe is represented by a hypergraph which captures what can be thought of as the “spatial relations” between discrete “atoms of space”. Time then corresponds to the progressive rewriting of this hypergraph.

And in a sense the “atoms of time” are the elementary “rewriting events” that occur. If the “output” from one event is needed to provide “input” to another, then we can think of the first event as preceding the second event in time—and the events as being “timelike separated”. And in general we can construct a causal graph that shows the dependencies between different events.

So how does this relate to time—and spacetime? As we’ll discuss below, our everyday experience of time is that it follows a single thread. And so we tend to want to “parse” the causal graph of elementary events into a series of slices that we can view as corresponding to “successive times”. As in standard relativity theory, there typically isn’t a unique way to assign a sequence of such “simultaneity surfaces”, with the result that there are different “reference frames” in which the identifications of space and time are different.

The complete causal graph bundles together what we usually think of as space with what we usually think of as time. But ultimately the progress of time is always associated with some choice of successive events that “computationally build on each other”. And, yes, it’s more complicated because of the possibilities of different choices. But the basic idea of the progress of time as “the doing of computation” is very much the same. (In a sense time represents “computational progress” in the universe, while space represents the “layout of its data structure”.)

Very much as in the derivation of the Second Law (or of fluid mechanics from molecular dynamics), the derivation of Einstein’s equations for the large-scale behavior of spacetime from the underlying causal graph of hypergraph rewriting depends on the fact that we are computationally bounded observers. But even though we’re computationally bounded, we still have to “have something going on inside”, or we wouldn’t record—or sense—any “progress in time”.

It seems to be the essence of observers like us—as captured in my recent Observer Theory—that we equivalence many different states of the world to derive our internal perception of “what’s going on outside”. And at some rough level we might imagine that we’re sensing time passing by the rate at which we add to those internal perceptions. If we’re not adding to the perceptions, then in effect time will stop for us—as happens if we’re asleep, anesthetized or dead.

It’s worth mentioning that in some extreme situations it’s not the internal structure of the observer that makes perceived time stop; instead it’s the underlying structure of the universe itself. As we’ve mentioned, the “progress of the universe” is associated with successive rewriting of the underlying hypergraph. But when there’s been “too much activity in the hypergraph” (which physically corresponds roughly to too much energy-momentum), one can end up with a situation in which “there are no more rewrites that can be done”—so that in effect some part of the universe can no longer progress, and “time stops” there. It’s analogous to what happens at a spacelike singularity (normally associated with a black hole) in traditional general relativity. But now it has a very direct computational interpretation: one’s reached a “fixed point” at which there’s no more computation to do. And so there’s no progress to make in time.

Multiple Threads of Time

Our strong human experience is that time progresses as a single thread. But now our Physics Project suggests that at an underlying level time is actually in effect multithreaded, or, in other words, that there are many different “paths of history” that the universe follows. And it is only because of the way we as observers sample things that we experience time as a single thread.

At the level of a particular underlying hypergraph the point is that there may be many different updating events that can occur, and each sequence of such updating events defines a different “path of history”. We can summarize all these paths of history in a multiway graph in which we merge identical states that arise:
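As a minimal sketch of this construction (using a simple string substitution system of my own choosing, not necessarily the one in the original picture), one can generate all single-rewrite successors of each state and let Graph merge identical states:

  rules = {"A" -> "AB", "B" -> "A"};
  states[0] = {"A"};
  states[t_] := states[t] = Union[Flatten[StringReplaceList[#, rules] & /@ states[t - 1]]];
  (* edges from each state to all of its single-rewrite successors *)
  edges = Union[Flatten[Table[Thread[s -> Union[StringReplaceList[s, rules]]], {t, 0, 3}, {s, states[t]}]]];
  Graph[edges, VertexLabels -> Automatic]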

But given this underlying structure, why is it that we as observers believe that time progresses as a single thread? It all has to do with the notion of branchial space, and our presence within branchial space. The presence of many paths of history is what leads to quantum mechanics; the fact that we as observers ultimately perceive just one path is associated with the traditionally-quite-mysterious phenomenon of “measurement” in quantum mechanics.

When we talked about causal graphs above, we said that we could “parse” them as a series of “spacelike” slices corresponding to instantaneous “states of space”—represented by spatial hypergraphs. And by analogy we can similarly imagine breaking multiway graphs into “instantaneous slices”. But now these slices don’t represent states of ordinary space; instead they represent states of what we call branchial space.

Ordinary space is “knitted together” by updating events that have causal effects on other events that can be thought of as “located at different places in space”. (Or, said differently, space is knitted together by the overlaps of the elementary light cones of different events.) Now we can think of branchial space as being “knitted together” by updating events that have effects on events that end up on different branches of history.

(In general there is a close analogy between ordinary space and branchial space, and we can define a multiway causal graph that includes both “spacelike” and “branchlike” directions—with the branchlike direction supporting not light cones but what we can call entanglement cones.)

So how do we as observers parse what’s going on? A key point is that we are inevitably part of the system we’re observing. So the branching (and merging) that’s going on in the system at large is also going on in us. So that means we have to ask how a “branching mind” will perceive a branching universe. Underneath, there are lots of branches, and lots of “threads of history”. And there’s lots of computational irreducibility (and even what we can call multicomputational irreducibility). But computationally bounded observers like us have to equivalence most of those details to wind up with something that “fits in our finite minds”.

We can make an analogy to what happens in a gas. Underneath, there are lots of molecules bouncing around (and behaving in computationally irreducible ways). But observers like us are big compared to molecules, and (being computationally bounded) we don’t get to perceive their individual behavior, but only their aggregate behavior—from which we extract a thin set of computationally reducible “fluid-dynamics-level” features.

And it’s basically the same story with the underlying structure of space. Underneath, there’s an elaborately changing network of discrete atoms of space. But as large, computationally bounded observers we can only sample aggregate features in which many details have been equivalenced, and in which space tends to seem continuous and describable in basically computationally reducible ways.

So what about branchial space? Well, it’s basically the same story. Our minds are “big”, in the sense that they span many individual branches of history. And they’re computationally bounded so they can’t perceive the details of all those branches, but only certain aggregated features. And in a first approximation what then emerges is in effect a single aggregated thread of history.

With sufficiently careful measurements we can sometimes see “quantum effects” in which multiple threads of history are in evidence. But at a direct human level we always seem to aggregate things to the point where what we perceive is just a single thread of history—or in effect a single thread of progression in time.

It’s not immediately obvious that any of these “aggregations” will work. It could be that important effects we perceive in gases would depend on phenomena at the level of individual molecules. Or that to understand the large-scale structure of space we’d continually be having to think about detailed features of atoms of space. Or, similarly, that we’d never be able to maintain a “consistent view of history”, and that instead we’d always be having to trace lots of individual threads of history.

But the key point is that for us to stay as computationally bounded observers we have to pick out only features that are computationally reducible—or in effect boundedly simple to describe.

Closely related to our computational boundedness is the important assumption we make that we as observers have a certain persistence. At every moment in time, we are made from different atoms of space and different branches in the multiway graph. Yet we believe we are still “the same us”. And the crucial physical fact (that has to be derived in our model) is that in ordinary circumstances there’s no inconsistency in doing this.

So the result is that even though there are many “threads of time” at the lowest level—representing many different “quantum branches”—observers like us can (usually) successfully still view there as being a single consistent perceived thread of time.

But there’s another issue here. It’s one thing to say that a single observer (say a single human mind or a single measuring device) can perceive history to follow a single, consistent thread. But what about different human minds, or different measuring devices? Why should they perceive any kind of consistent “objective reality”?

Essentially the answer, I think, is that they’re all sufficiently nearby in branchial space. If we think about physical space, observers in different parts of the universe will clearly “see different things happening”. The “laws of physics” may be the same—but what star (if any) is nearby will be different. Yet (at least for the foreseeable future) for all of us humans it’s always the same star that’s nearby.

And so it is, presumably, in branchial space. There’s some small patch in which we humans—with our shared origins—exist. And it’s presumably because that patch is small relative to all of branchial space that all of us perceive a consistent thread of history and a common objective reality.

There are many subtleties to this, many of which aren’t yet fully worked out. In physical space, we know that effects can in principle spread at the speed of light. And in branchial space the analog is that effects can spread at the maximum entanglement speed (whose value we don’t know, though it’s related by Planck unit conversions to the elementary length and elementary time). But in maintaining our shared “objective” view of the universe it’s crucial that we’re not all going off in different directions at the speed of light. And of course the reason that doesn’t happen is that we don’t have zero mass. And indeed presumably nonzero mass is a critical part of being observers like us.

In our Physics Project it’s roughly the density of events in the hypergraph that determines the density of energy (and mass) in physical space (with their associated gravitational effects). And similarly it’s roughly the density of events in the multiway graph (or in branchial graph slices) that determines the density of action—the relativistically invariant analog of energy—in branchial space (with its associated effects on quantum phase). And though it’s not yet completely clear how this works, it seems likely that once again when there’s mass, effects don’t just “go off at the maximum entanglement speed in all directions”, but instead stay nearby.

There are definitely connections between “staying at the same place”, believing one is persistent, and being computationally bounded. But these are what seem necessary for us to have our typical view of time as a single thread. In principle we can imagine observers very different from us—say with minds (like the inside of an idealized quantum computer) capable of experiencing many different threads of history. But the Principle of Computational Equivalence suggests that there’s a high bar for such observers. They need not only to be able to deal with computational irreducibility but also multicomputational irreducibility, in which one includes both the process of computing new states, and the process of equivalencing states.

And so for observers that are “anything like us” we can expect that once again time will tend to be as we normally experience it, following a single thread, consistent between observers.

(It’s worth mentioning that all of this only works for observers like us “in situations like ours”. For example, at the “entanglement horizon” for a black hole—where branchially-oriented edges in the multiway causal graph get “trapped”—time as we know it in some sense “disintegrates”, because an observer won’t be able to “knit together” the different branches of history to “form a consistent classical thought” about what happens.)

Time in the Ruliad

In what we’ve discussed so far we can think of the progress of time as being associated with the repeated application of rules that progressively “rewrite the state of the universe”. In the previous section we saw that these rules can be applied in many different ways, leading to many different underlying threads of history.

But so far we’ve imagined that the rules that get applied are always the same—leaving us with the mystery of “Why those rules, and not others?” But this is where the ruliad comes in. Because the ruliad involves no such seemingly arbitrary choices: it’s what you get by following all possible computational rules.

One can imagine many bases for the ruliad. One can make it from all possible hypergraph rewritings. Or all possible (multiway) Turing machines. But in the end it’s a single, unique thing: the entangled limit of all possible computational processes. There’s a sense in which “everything can happen somewhere” in the ruliad. But what gives the ruliad structure is that there’s a definite (essentially geometrical) way in which all those different things that can happen are arranged and connected.

So what is our perception of the ruliad? Inevitably we’re part of the ruliad—so we’re observing it “from the inside”. But the crucial point is that what we perceive about it depends on what we are like as observers. And my big surprise in the past few years has been that assuming even just a little about what we’re like as observers immediately implies that what we perceive of the ruliad follows the core laws of physics we know. In other words, by assuming what we’re like as observers, we can in effect derive our laws of physics.

The key to all this is the interplay between the computational irreducibility of underlying behavior in the ruliad, and our computational boundedness as observers (together with our related assumption of our persistence). And it’s this interplay that gives us the Second Law in statistical mechanics, the Einstein equations for the structure of spacetime, and (we think) the path integral in quantum mechanics. In effect what’s happening is that our computational boundedness as observers makes us equivalence things to the point where we are sampling only computationally reducible slices of the ruliad, whose characteristics can be described using recognizable laws of physics.

So where does time fit into all of this? A central feature of the ruliad is that it’s unique—and everything about it is “abstractly necessary”. Much as given the definition of numbers, addition and equality it’s inevitable that one gets 1 + 1 = 2, so similarly given the definition of computation it’s inevitable that one gets the ruliad. Or, in other words, there’s no question about whether the ruliad exists; it’s just an abstract construct that inevitably follows from abstract definitions.

And so at some level this means that the ruliad inevitably just “exists as a complete thing”. And so if one could “view it from outside” one could think of it as just a single timeless object, with no notion of time.

But the crucial point is that we don’t get to “view it from the outside”. We’re embedded within it. And, what’s more, we must view it through the “lens” of our computational boundedness. And this is why we inevitably end up with a notion of time.

We observe the ruliad from some point within it. If we were not computationally bounded then we could immediately compute what the whole ruliad is like. But in actuality we can only discover the ruliad “one computationally bounded step at a time”—in effect progressively applying bounded computations to “move through rulial space”.

So even though in some abstract sense “the whole ruliad is already there” we only get to explore it step by step. And that’s what gives us our notion of time, through which we “progress”.

Inevitably, there are many different paths that we could follow through the ruliad. And indeed every mind (and every observer like us)—with its distinct inner experience—presumably follows a different path. But much as we described for branchial space, the reason we have a shared notion of “objective reality” is presumably that we are all very close together in rulial space; we form in a sense a tight “rulial flock”.

It’s worth pointing out that not every sampling of the ruliad that may be accessible to us conveniently corresponds to exploration of progressive slices of time. Yes, that kind of “progression in time” is characteristic of our physical experience, and our typical way of describing it. But what about our experience, say, of mathematics?

The first point to make is that just as the ruliad contains all possible physics, it also contains all possible mathematics. If we construct the ruliad, say from hypergraphs, the nodes are now not “atoms of space”, but instead abstract elements (that in general we call emes) that form pieces of mathematical expressions and mathematical theorems. We can think of these abstract elements as being laid out now not in physical space, but in some abstract metamathematical space.

In our physical experience, we tend to remain localized in physical space, branchial space, etc. But in “doing mathematics” it’s more as if we’re progressively expanding in metamathematical space, carving out some domain of “theorems we assume are true”. And while we could identify some kind of “path of expansion” to let us define some analog of time, it’s not a necessary feature of the way we explore the ruliad.

Different places in the ruliad in a sense correspond to describing things using different rules. And by analogy to the concept of motion in physical space, we can effectively “move” from one place to another in the ruliad by translating the computations done by one set of rules to computations done by another. (And, yes, it’s nontrivial to even have the possibility of “pure motion”.) But if we indeed remain localized in the ruliad (and can maintain what we can think of as our “coherent identity”) then it’s natural to think of there being a “path of motion” along which we progress “with time”. But when we’re just “expanding our horizons” to encompass more paradigms and to bring more of rulial space into what’s covered by our minds (so that in effect we’re “expanding in rulial space”), it’s not really the same story. We’re not thinking of ourselves as “doing computation in order to move”. Instead, we’re just identifying equivalences and using them to expand our definition of ourselves, which is something that we can at least approximate (much like in “quantum measurement” in traditional physics) as happening “outside of time”. Ultimately, though, everything that happens must be the result of computations that occur. It’s just that we don’t usually “package” these into what we can describe as a definite thread of time.

So What in the End Is Time?

From the paradigm (and Physics Project ideas) that we’ve discussed here, the question “What is time?” is at some level simple: time is what progresses when one applies computational rules. But what’s critical is that time can in effect be defined abstractly, independent of the details of those rules, or the “substrate” to which they’re applied. And what makes this possible is the Principle of Computational Equivalence, and the ubiquitous phenomenon of computational irreducibility that it implies.

To begin with, the fact that time can robustly be thought of as “progressing”, in effect in a linear chain, is a consequence of computational irreducibility—because computational irreducibility is what tells us that computationally bounded observers like us can’t in general ever “jump ahead”; we just have to follow a linear chain of steps.

But there’s something else as well. The Principle of Computational Equivalence implies that there’s in a sense just one (ubiquitous) kind of computational irreducibility. So when we look at different systems following different irreducible computational rules, there’s inevitably a certain universality to what they do. In effect they’re all “accumulating computational effects” in the same way. Or in essence progressing through time in the same way.

There’s a close analogy here with heat. It could be that there’d be detailed molecular motion that even on a large scale worked noticeably differently in different materials. But the fact is that we end up being able to characterize any such motion just by saying that it represents a certain amount of heat, without getting into more details. And that’s very much the same kind of thing as being able to say that such-and-such an amount of time has passed, without having to get into the details of how some clock or other system that reflects the passage of time actually works.

And in fact there’s more than a “conceptual analogy” here. Because the phenomenon of heat is again a consequence of computational irreducibility. And the fact that there’s a uniform, “abstract” characterization of it is a consequence of the universality of computational irreducibility.

It’s worth emphasizing again, though, that just as with heat, a robust concept of time depends on us being computationally bounded observers. If we were not, then we’d be able to break the Second Law by doing detailed computations of molecular processes, and we wouldn’t just describe things in terms of randomness and heat. And similarly, we’d be able to break the linear flow of time, either jumping ahead or following different threads of time.

But as computationally bounded observers of computationally irreducible processes, it’s basically inevitable that—at least to a good approximation—we’ll view time as something that forms a single one-dimensional thread.

In traditional mathematically based science there’s often a feeling that the goal should be to “predict the future”—or in effect to “outrun time”. But computational irreducibility tells us that in general we can’t do this, and that the only way to find out what will happen is just to run the same computation as the system itself, essentially step by step. But while this might seem like a letdown for the power of science, we can also see it as what gives meaning and significance to time. If we could always jump ahead then at some level nothing would ever fundamentally be achieved by the passage of time (or, say, by the living of our lives); we’d always be able to just say what will happen, without “living through” how we got there. But computational irreducibility gives time and the process of it passing a kind of hard, tangible character.

So what does all this imply for the various classic issues (and apparent paradoxes) that arise in the way time is usually discussed?

Let’s start with the question of reversibility. The traditional laws of physics basically apply both forwards and backwards in time. And the ruliad is inevitably symmetrical between “forward” and “backward” rules. So why is it then that in our typical experience time always seems to “run in the same direction”?

This is closely related to the Second Law, and once again it’s a consequence of our computational boundedness interacting with underlying computational irreducibility. In a sense what defines the direction of time for us is that we (typically) find it much easier to remember the past than to predict the future. Of course, we don’t remember every detail of the past. We only remember what amounts to certain “filtered” features that “fit in our finite minds”. And when it comes to predicting the future, we’re limited by our inability to “outrun” computational irreducibility.

Let’s recall how the Second Law works. It basically says that if we set up some state that’s “ordered” or “simple” then this will tend to “degrade” to one that’s “disordered” or “random”. (We can think of the evolution of the system as effectively “encrypting” the specification of our starting state to the point where we—as computationally bounded observers—can no longer recognize its ordered origins.) But because our underlying laws are reversible, this degradation (or “encryption”) must happen when we go both forwards and backwards in time:

But the point is that our “experiential” definition of the direction of time (in which the “past” is what we remember, and the “future” is what we find hard to predict) is inevitably aligned with the “thermodynamic” direction of time we observe in the world at large. And the reason is that in both cases we’re defining the past to be something that’s computationally bounded (while the future can be computationally irreducible). In the experiential case the past is computationally bounded because that’s what we can remember. In the thermodynamic case it’s computationally bounded because those are the states we can prepare. In other words, the “arrows of time” are aligned because in both cases we are in effect “requiring the past to be simpler”.

So what about time travel? It’s a concept that seems natural—and perhaps even inevitable—if one imagines that “time is just like space”. But it becomes a lot less natural when we think of time in the way we’re doing here: as a process of applying computational rules.

Indeed, at the lowest level, these rules are by definition just sequentially applied, producing one state after another—and in effect “progressing in one direction through time”. But things get more complicated if we consider not just the raw, lowest-level rules, but what we might actually observe of their effects. For example, what if the rules lead to a state that’s identical to one they’ve produced before (as happens, for example, in a system with periodic behavior)? If we equivalence the state now and the state before (so we represent both as a single state) then we can end up with a loop in our causal graph (a “closed timelike curve”). And, yes, in terms of the raw sequence of applying rules these states can be considered different. But the point is that if they are identical in every feature then any observer will inevitably consider them the same.

But will such equivalent states ever actually occur? As soon as there’s computational irreducibility it’s basically inevitable that the states will never perfectly match up. And indeed for the states to contain an observer like us (with “memory”, etc.) it’s basically impossible that they can match up.

But can one imagine an observer (or a “timecraft”) that would lead to states that match up? Perhaps somehow it could carefully pick particular sequences of atoms of space (or elementary events) that would lead it to states that have “happened before”. And indeed in a computationally simple system this might be possible. But as soon as there’s computational irreducibility, this simply isn’t something one can expect any computationally bounded observer to be able to do. And, yes, this is directly analogous to why one can’t have a “Maxwell’s demon” observer that “breaks the Second Law”. Or why one can’t have something that carefully navigates the lowest-level structure of space to effectively travel faster than light.

But even if there can’t be time travel in which “time for an observer goes backwards”, there can still be changes in “perceived time”, say as a result of relativistic effects associated with motion. For example, one classic relativistic effect is time dilation, in which “time goes slower” when objects go faster. And, yes, given certain assumptions, there’s a straightforward mathematical derivation of this effect. But in our effort to understand the nature of time we’re led to ask what its physical mechanism might be. And it turns out that in our Physics Project it has a surprisingly direct—and almost “mechanical”—explanation.

One starts from the fact that in our Physics Project space and everything in it is represented by a hypergraph which is continually getting rewritten. And the evolution of any object through time is then defined by these rewritings. But if the object moves, then in effect it has to be “re-created at a different place in space”—and this process takes up a certain number of rewritings, leaving fewer for the intrinsic evolution of the object itself, and thus causing time to effectively “run slower” for it. (And, yes, while this is a qualitative description, one can make it quite formal and precise, and recover the usual formulas for relativistic time dilation.)

Something similar happens with gravitational fields. In our Physics Project, energy-momentum (and thus gravity) is effectively associated with greater activity in the underlying hypergraph. And the presence of this greater activity in effect uses up rewritings, leaving fewer for the intrinsic evolution of any object in that region of space, and causing time to effectively “run slower” for it (corresponding to traditional gravitational time dilation and redshift).

More extreme versions of this occur in the context of black holes. (Indeed, one can roughly think of spacelike singularities as places where time in effect “slowed to a stop”, with no more rewrites left to do.) And in general—as we discussed above—there are many “relativistic effects” in which notions of space and time get mixed in various ways.

But even at a much more mundane level there’s a certain crucial relationship between space and time for observers like us. The key point is that observers like us tend to “parse” the world into a sequence of “states of space” at successive “moments in time”. But the fact that we do this depends on some quite specific features of us, and in particular our effective physical scale in space as compared to time.

In our everyday life we’re typically looking at scenes involving objects that are perhaps tens of meters away from us. And given the speed of light that means photons from these objects get to us in less than a microsecond. But it takes our brains milliseconds to register what we’ve seen. And this disparity of timescales is what leads us to view the world as consisting of a sequence of states of space at successive moments in time.

If our brains “ran” a million times faster (i.e. at the speed of digital electronics) we’d perceive photons arriving from different parts of a scene at different times, and we’d presumably no longer view the world in terms of overall states of space existing at successive times.

The same kind of thing would happen if we kept the speed of our brains the same, but dealt with scenes of a much larger scale (as we already do in dealing with spacecraft, astronomy, etc.).

But while this affects what it is that we think time is “acting on”, it doesn’t ultimately affect the nature of time itself. Time remains that computational process by which successive states of the world are produced. Computational irreducibility gives time a certain rigid character, at least for computationally bounded observers like us. And the Principle of Computational Equivalence allows there to be a robust notion of time independent of the “substrate” that’s involved: whether us as observers, the everyday physical world, or, for that matter, the whole universe.

Nestedly Recursive Functions

2024-09-28 01:50:59

Yet Another Ruliological Surprise

Integers. Addition. Subtraction. Maybe multiplication. Surely that’s not enough to be able to generate any serious complexity. In the early 1980s I had made the very surprising discovery that very simple programs based on cellular automata could generate great complexity. But how widespread was this phenomenon?

At the beginning of the 1990s I had set about exploring this. Over and over I would consider some type of system and be sure it was too simple to “do anything interesting”. And over and over again I would be wrong. And so it was that on the night of August 13, 1993, I thought I should check what could happen with integer functions defined using just addition and subtraction.

I knew, of course, about defining functions by recursion, like Fibonacci:
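The definition referred to here is presumably just the standard one, which in Wolfram Language reads:

  f[1] = f[2] = 1;
  f[n_] := f[n - 1] + f[n - 2]
  Table[f[n], {n, 10}]
  (* {1, 1, 2, 3, 5, 8, 13, 21, 34, 55} *)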

But could I find something like this that would have complex behavior? I did the analog of what I have done so many times, and just started (symbolically) enumerating possible definitions. And immediately I saw cases with nested functions, like:

(For some reason I wanted to keep the same initial conditions as Fibonacci: f[1] = f[2] = 1.) What would functions like this do? My original notebook records the result in this case:

Nestedly recursive function

But a few minutes later I found something very different: a simple nestedly recursive function with what seemed like highly complex behavior:

Simple nestedly recursive function with complex behavior

I remembered seeing a somewhat similarly defined function discussed before. But the behavior I’d seen reported for that function, while intricate, was nested and ultimately highly regular. And, so far as I could tell, much like with rule 30 and all the other systems I’d investigated, nobody had ever seen serious complexity in simple recursive functions.

It was a nice example. But it was one among many. And when I published A New Kind of Science in 2002, I devoted just four pages (and 7 notes) to “recursive sequences”—even though the gallery I made of their behavior became a favorite page of mine:

Recursive sequences gallery

A year after the book was published we held our first Wolfram Summer School, and as an opening event I decided to do a live computer experiment—in which I would try to make a real-time science discovery. The subject I chose was nestedly recursive functions. It took a few hours. But then, yes, we made a discovery! We found that there was a nestedly recursive function simpler than the ones I’d discussed in A New Kind of Science that already seemed to have very complex behavior:

Over the couple of decades that followed I returned many times to nestedly recursive functions—particularly in explorations I did with high school and other students, or in suggestions I made for student projects. Then recently I used them several times as “intuition-building examples” in various investigations.

I’d always felt my work with nestedly recursive functions was unfinished. Beginning about five years ago—particularly energized by our Physics Project—I started looking at harvesting seeds I’d sown in A New Kind of Science and before. I’ve been on quite a roll, with a few pages or even footnotes repeatedly flowering into rich book-length stories. And finally—particularly after my work last year on “Expression Evaluation and Fundamental Physics”—I decided it was time to try to finish my exploration of nestedly recursive functions.

Our modern Wolfram Language tools—as well as ideas from our Physics Project—provided some new directions to explore. But I still thought I pretty much knew what we’d find. And perhaps after all these years I should have known better. Because somehow in the computational universe—and in the world of ruliology—there are always surprises.

And here, yet again, there was indeed quite a surprise.

The Basic Idea

Consider the definition (later we’ll call this “P312”)

which we can also write as:
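Based on the designation introduced below (P312, i.e. f[n_] := 3 + f[n – f[n – 2]]) and the initial condition f[n] = 1 for all n ≤ 0 used throughout, a minimal runnable sketch of this definition is:

  f[n_ /; n <= 0] = 1;
  f[n_] := f[n] = 3 + f[n - f[n - 2]]   (* memoized *)
  Table[f[n], {n, 8}]
  (* {4, 7, 4, 4, 7, 10, 4, 4} *)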

The first few values for f[n] generated from this definition are:

Continuing further we get:

But how are these values actually computed? To see that, we can make an “evaluation graph” in which we show how each value of f[n] is computed from ones with smaller values of n, here starting from f[20]:

The gray nodes represent initial conditions: places where f[n] was sampled for n ≤ 0. The two different colors of edges correspond to the two different computations done in evaluating each f[n]:
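One minimal way to reconstruct this kind of evaluation graph (a sketch, not the code behind the original pictures) is to instrument the recursion so that each f[n] records an edge to the f[n – 2] it uses and an edge to the f[n – f[n – 2]] it uses:

  edges = {};
  g[n_ /; n <= 0] = 1;
  g[n_] := g[n] = Module[{inner = g[n - 2]},
     AppendTo[edges, n -> n - 2];       (* dependence on f[n - 2] *)
     AppendTo[edges, n -> n - inner];   (* dependence on f[n - f[n - 2]] *)
     3 + g[n - inner]];
  g[20];
  Graph[Union[edges], VertexLabels -> Automatic, GraphLayout -> "LayeredDigraphEmbedding"]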

Continuing to f[30] we get:

But what’s the structure of this graph? If we pull out the “red” graph on its own, we can see that it breaks into two path graphs that consist of the sequences of the f[n] for odd and even n, respectively:

The “blue” graph, on the other hand, breaks into four components—each always a tree—leading respectively to the four different initial conditions:

And for example we can now plot f[n], showing which tree each f[n] ends up being associated with:

We’ll be using this same basic setup throughout, though for different functions. We’ll mostly consider recursive definitions with a single term (i.e. with a single “outermost f”, not two, as in Fibonacci recurrences).

The specific families of recursive functions we’ll be focusing on are:

And with this designation, the function we just introduced is P312.

A Closer Look at P312 ( f[n_] := 3 + f[n – f[n – 2]] )

Let’s start off by looking in more detail at the function we just introduced. Here’s what it does up to n = 500:

It might seem as if it’s going to go on “seemingly randomly” forever. But if we take it further, we get a surprise: it seems to “resolve itself” to something potentially simpler:

What’s going on? Let’s plot this again, but now showing which “blue graph tree” each value is associated with:

And now what we see is that the f[–3] and f[–2] trees stop contributing to f[n] when n is (respectively) 537 and 296, and these trees are finite (and have sizes 53 and 15):

The overall structures of the “remaining” trees—here shown up to f[5000]—eventually start to exhibit some regularity:

We can home in on this regularity by arranging these trees in layers, starting from the root, then plotting the number of nodes in each successive layer:

Looking at these pictures suggests that there should be some kind of more-or-less direct “formula” for f[n], at least for large n. They also suggest that such a formula should have some kind of mod-6 structure. And, yes, there does turn out to be essentially a “formula”. Though the “formula” is quite complicated—and reminiscent of several other “strangely messy” formulas in other ruliological cases—like Turing machine 600720 discussed in A New Kind of Science or combinator s[s[s]][s][s][s][s].

Later on, we’ll see the much simpler recursive function P111 (f[n_] := 1 + f[n – f[n – 1]]). The values for this function form a sequence in which successive blocks of length k have value k:

P312 has the same kind of structure, but much embellished. First, it has 6 separate riffled (“mod”) subsequences. Each subsequence then consists of a sequence of blocks. Given a value n, this computes which subsequence this is on, which block for that subsequence it’s in, and where it is within that block:

So, for example, here are results for multiples of 1000:

For n = 1000 we’re not yet in the “simple” regime, so we can’t describe the sequence in any simple way, and our “indices” calculation is meaningless. For n = 2000 it so happens that we are at block 0 for the mod-1 subsequence. And the way things are set up, we just start by giving exactly the form of block 0 for each mod. So for mod 1 the block is:

But now n = 2000 has offset 16 within this block, so the final value of f[2000] is simply the 16th value from this list, or 100. f[2001] is then simply the next element within this block, or 109. And so on—until we reach the end of the block.

But what if we’re not dealing with block 0? For example, according to the table above, f[3000] is determined by mod-3 block 1. It turns out there’s a straightforward, if messy, way to compute any block b (for mod m):

So now we have a way to compute the value, say of f[3000], effectively just by “evaluating a formula”:

And what’s notable is that this evaluation doesn’t involve any recursion. In other words, at the cost of “messiness” we’ve—somewhat surprisingly—been able to unravel all the recursion in P312 to arrive at a “direct formula” for the value of f[n] for any n.

So what else can we see about the behavior of f[n] for P312? One notable feature is its overall growth rate. For large n, it turns out that f[n] grows like √(6 n), as can be seen by substituting a trial form c √n into the recursive definition and taking a limit.
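Here’s a quick check of that substitution (a sketch of my own, assuming the trial form c √n with c > 0):

  (* substitute f[n] -> c Sqrt[n] into both sides of f[n] == 3 + f[n - f[n - 2]] and compare at large n *)
  cond = Limit[(3 + c Sqrt[n - c Sqrt[n]]) - c Sqrt[n], n -> Infinity, Assumptions -> c > 0]
  (* 3 - c^2/2 *)
  Solve[cond == 0 && c > 0, c]
  (* {{c -> Sqrt[6]}} *)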

One thing this means is that our evaluation graph eventually has a roughly conical form:

This can be compared to the very regular cone generated by P111 (whose values grow asymptotically like √(2 n)):

If one just looks at the form of the recursive definition for P312 it’s far from obvious “how far back” it will need to probe, or, in other words, what values of f[n] one will need to specify as initial conditions. As it turns out, though, the only values needed are f[–3], f[–2], f[–1] and f[0].

How can one see this? In 3 + f[n – f[n – 2]] it’s only the outer f that can probe “far back” values. But how far it actually goes back depends on how much larger f[n – 2] gets compared to n. Plotting f[n – 2] and n together we have:

And the point is that only for very few values of n does f[n – 2] exceed n—and it’s these values that probe back. Meanwhile, for larger n, there can never be additional “lookbacks”, because f[n] only grows like √(6 n).

So does any P312 recursion always have the same lookback? So far, we’ve considered specifically the initial condition f[n] = 1 for all n ≤ 0. But what if we change the value of f[0]? Here are plots of f[n] for different cases:

And it turns out that with f[0] = z, the lookback goes to –z for z ≥ 3, and to z – 4 for 1 ≤ z ≤ 2.

(If z ≤ 0 the function f[n] is basically not defined, because the recursion is trying to compute f[n] from f[n], f[n + 1], etc., so never “makes progress”.)

The case f[0] = 2 (i.e. z = 2) is the one that involves the least lookback—and a total of 3 initial values. Here is the evaluation graph in this case:

By comparison, here is the evaluation graph for the case f[0] = 5, involving 6 initial values:

If we plot the value of f[n] as a function of f[0] we get the following:

For n up to about f[0], f[n] always has simple behavior, and is essentially periodic in n with period 3:

And it turns out that for any specified initial configuration of values, there is always only bounded lookback—with the bound apparently being determined by the largest of the initial values.

So what about the behavior of f[n] for large n? Just like in our original f[0] = 1 case, we can construct “blue graph trees” rooted at each of the initial conditions. In the case f[0] = 1 we found that of the 4 trees only two continue to grow as n increases. As we vary f[0], the number of “surviving trees” varies quite erratically:

What if instead of just changing f[0], and keeping all other f[–k] = 1, we set f[n] = s for all n ≤ 0? The result is somewhat surprising:

For s ≥ 2, the behavior turns out to be simple—and similar to the behavior of P111.

So what can P312 be made to do if we change its initial conditions? With f[n] = 2 for n < 0 the behavior remains “tame”, but as f[0] increases it starts showing its typical complexity:

One question to ask is what set of values f[n] takes on. Given that the initial values have certain residues mod 3, all subsequent values must have the same residues. But apart from this constraint, it seems that all values are eventually obtained—which is not surprising given that f[n] grows only like √n.

The “P Family”: f[n_] := a + f[n – b f[n – c]]

P312 is just one example of the “P family” of sequences defined by:
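Reading the family off the section heading (f[n_] := a + f[n – b f[n – c]], with the same f[n] = 1 for n ≤ 0 initial condition as before), a parametric sketch is:

  pValues[{a_, b_, c_}, max_] := Module[{f},
    f[n_ /; n <= 0] = 1;
    f[n_] := f[n] = a + f[n - b f[n - c]];
    Table[f[n], {n, max}]]
  ListStepPlot[pValues[{3, 1, 2}, 500]]   (* P312 *)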

Here is the behavior of some other Pabc sequences:

And here are their evaluation graphs:

P312 is the first “seriously complex” example.

P111 (as mentioned earlier) has a particularly simple form

which corresponds to the simple formula:
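One closed form consistent with the block structure described above is f[n] = Round[√(2 (n + 1))]; this particular expression is my own reconstruction (the original may write it differently), but it can be checked directly against the P111 recursion with f[0] = 1:

  Module[{f},
    f[0] = 1;
    f[n_] := f[n] = 1 + f[n - f[n - 1]];
    Table[f[n], {n, 0, 100}] === Table[Round[Sqrt[2 (n + 1)]], {n, 0, 100}]]
  (* True *)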

The evaluation graph in this case is just:

Only a single initial condition f[0] = 1 is used, and there is only a single “blue graph tree” with a simple form:

Another interesting case is P123:

Picking out only odd values of n we get:

This might look just like the behavior of P111. But it’s not. The lengths of the successive “plateaus” are now

with differences:

But this turns out to be exactly a nested sequence generated by joining together the successive steps in the evolution of the substitution system:

P123 immediately “gets into its final behavior”, even for small n. But—as we saw rather dramatically with P312—there can be “transient behavior” that doesn’t “resolve” until n is large. A smaller case of this phenomenon occurs with P213. Above n = 68 it shows a simple “square root” pattern of behavior, basically like P111. But for smaller n it’s a bit more complicated:

And in this case the transients aren’t due to “blue graph trees” that stop growing. Instead, there are only two trees (associated with f[0] and f[–1]), but both of them soon end up growing in very regular ways:

The “T Family”: f[n_] := a f[n – b f[n – c]]

What happens if our outermost operation is not addition, but multiplication?

Here are some examples of the behavior one gets. In each case we’re plotting on a log scale—and we’re not including T1xx cases, which are always trivial:

We see that some sequences have regular and readily predictable behavior, but others do not. And this is reflected in the evaluation graphs for these functions:

The first “complicated case” is T212:
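Concretely (reading T212 off the family designation as f[n_] := 2 f[n – f[n – 2]], with f[n] = 1 for n ≤ 0), a minimal sketch plotted on a log scale is:

  Module[{f},
    f[n_ /; n <= 0] = 1;
    f[n_] := f[n] = 2 f[n - f[n - 2]];
    ListLogPlot[Table[f[n], {n, 300}], Joined -> True]]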

The evaluation graph for f[50] in this case has the form:

And something that’s immediately notable is that in addition to “looking back” to the values of f[0] and f[–1], this also looks back to the value of f[–24]. Meanwhile, the evaluation graph for f[51] looks back not only to f[0] and f[–1] but also to f[–3] and f[–27]:

How far back does it look in general? Here’s a plot showing which lookbacks are made as a function of n (with the roots of the “blue graph trees” highlighted):

There’s alternation between behaviors for even and odd n. But apart from that, additional lookbacks are just steadily added as n increases—and indeed the total number of lookbacks seems to follow a simple pattern:

But—just for once—if one looks in more detail, it’s not so simple. The lengths of the successive “blocks” are:

So, yes, the lookbacks are quite “unpredictable”. But the main point here is that—unlike for the P family—the number of lookbacks isn’t limited. In a sense, to compute T212 for progressively larger n, progressively more information about its initial conditions is needed.

When one deals with ordinary, unnested recurrence relations, one’s always dealing with a fixed lookback. And the number of initial conditions then just depends on the lookback. (So, for example, the Fibonacci recurrence has lookback 2, so needs two initial conditions, while the standard factorial recurrence has lookback 1, so needs only one initial condition.)

But for the nested recurrence relation T212 we see that this is no longer true; there can be an unboundedly large lookback.

OK, but let’s look back at the actual T212 sequence. Here it is up to larger values of n:

Or, plotting each point as a dot:

Given the recursive definition of f[n], the values of f[n] must always be powers of 2. This shows where each successive power of 2 is first reached as a function of n:

Meanwhile, this shows the accumulated average of f[n] as a function of n:

This is well fit by 0.38 Log[n], implying that, at least with this averaging, f[n] asymptotically approximates n^0.26. And, yes, it is somewhat surprising that what seems like a very “exponential” recursive definition should lead to an f[n] that increases only like a power. But, needless to say, this is the kind of surprise one has to expect in the computational universe.

It's worth noticing that f[n] fluctuates greatly as a function of n. The values are very close to exponentially distributed—for example, here is the distribution of logarithmic values of f[n] for n between 9 million and 10 million:

What else can we say about this sequence? Let's say we take the base-2 exponent of each f[n] and reduce it mod 2. Then we get a sequence which starts:

This is definitely not "uniformly random". But if one looks at blocks of sequential values, one can plot the n at which each of the 2^b possible configurations of a length-b block first appears:

And eventually it seems as if all length-b blocks for any given b will appear.

By the way, whereas in the P family, there were always a limited number of “blue graph trees” (associated with the limited number of initial conditions), for T212 the number of such trees increases with n, as more initial conditions are used. So, for example, here are the trees for f[50] and f[51]:

We've so far discussed T212 only with the initial condition f[n] = 1 for n ≤ 0. The fact that f[n] is always a power of 2 relies on every initial value also being a power of 2. But here's what happens, for example, if f[n] = 2^s for n ≤ 0:

In general, one can think of T212 as transforming an ultimately infinite sequence of initial conditions into an infinite sequence of function values, with different forms of initial conditions potentially giving very different sequences of function values:

(Note that not all choices of initial conditions are possible; some lead to "f[n] = f[n]" or "f[n] = f[n + 1]" situations, where the evaluation of the function can't "make progress".)

The “Summer School” Sequence T311 (f[n_] := 3 f[n – f[n – 1]])

Having explored T212, let’s now look at T311—the original one-term nestedly recursive function discovered at the 2003 Wolfram Summer School:

Here’s its basic behavior:

And here is its evaluation graph—which immediately reveals a lot more lookback than T212:

Plotting lookbacks as a function of n we get:

Much as with T212, the total number of lookbacks varies with n in a fairly simple way (roughly as 0.44 n):

Continuing the T311 sequence further, it looks qualitatively very much like T212:

And indeed T311—despite its larger number of lookbacks—seems to basically behave like T212. In a story typical of the Principle of Computational Equivalence, T212 seems to have already “filled out the computational possibilities”, so T311 “doesn’t have anything to add”.

The “S Family”: f[n_] := n – f[f[n – a] – b]

As another (somewhat historically motivated) example of nestedly recursive functions, consider what we’ll call the “S family”, defined by:

Let’s start with the very minimal case S10 (or “S1”):

Our standard initial condition f[n] = 1 for n ≤ 0 doesn’t work here, because it implies that f[1] = 1 – f[1]. But if we take f[n] = 1 for n ≤ 1 we get:

Meanwhile, with f[n] = 1 for n ≤ 3 we get:

The first obvious feature of both these results is their overall slope: 1/ϕ ≈ 0.618, where ϕ is the golden ratio. It's not too hard to see why one gets this slope. Assume that for large n we can take f[n] = σ n. Then substitute this form into both sides of the recursive definition for the S family to get σ n == n – σ (σ (n – a) – b). For large n all that survives is the condition for the coefficients of n, σ == 1 – σ^2, which has solution σ = 1/ϕ.

Plotting f[n] – n/ϕ for the case f[n] = 1 for n ≤ 1 we get:
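A minimal sketch of this computation (fS1 is just an illustrative name for the memoized form of the definition):

Clear[fS1];
fS1[n_] := fS1[n] = If[n <= 1, 1, n - fS1[fS1[n - 1]]];  (* S1 with f[n] = 1 for n <= 1 *)
ListPlot[Table[N[fS1[n] - n/GoldenRatio], {n, 2000}]]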

The evaluation graph in this case has a fairly simple form

as we can see even more clearly with a different graph layout:

It’s notable that only the initial condition f[1] = 1 is used—leading to a single “blue graph tree” that turns out to have a very simple “Fibonacci tree” form (which, as we’ll discuss below, has been known since the 1970s):

From this it follows that f[n] is related to the "Fibonacci-like" substitution system

and in fact the sequence of values of f[n] can be computed just as:

And indeed it turns out that in this case f[n] is given exactly by:
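The closed form in question is presumably the one long known for this recurrence (which is essentially Hofstadter's G sequence): Floor[(n + 1)/GoldenRatio]. A quick check against the recursive definition:

Clear[fS1];
fS1[n_] := fS1[n] = If[n <= 1, 1, n - fS1[fS1[n - 1]]];
AllTrue[Range[5000], fS1[#] == Floor[(# + 1)/GoldenRatio] &]  (* should give True *)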

What about when f[n] = 1 not just for n ≤ 1 but beyond? For n ≤ 2 the results are essentially the same as for n ≤ 1. But for n ≤ 3 there’s a surprise: the behavior is considerably more complicated—as we can see if we plot f[n] – n/ϕ:

Looking at the evaluation graph in this case we see that the only initial conditions sampled are f[1] = 1 and f[3] = 1 (with f[2] only being reached if one specifically starts with f[2]):

And continuing the evaluation graph we see a mixture of irregularity and comparative regularity:

The plot of f[n] has a strange "hand-drawn" appearance, with overall regularity but detailed apparent randomness. The most obvious large-scale feature is "bursting" behavior (interspersed in an audio rendering with an annoying hum). The bursts all seem to have approximately (though not exactly) the same structure—and they get systematically larger. The lengths of successive "regions of calm" between bursts (characterized by runs in which Abs[f[n] – n/ϕ] stays small) are:

What happens to S1 with other initial conditions? Here are a few examples:

So how does Sa depend on a? Sometimes there’s at least a certain amount of clear regularity; sometimes it’s more complicated:

As is very common, adding the parameter b in the definition doesn’t seem to lead to fundamentally new behavior—though for b > 0 the initial condition f[n] = 1, n ≤ 0 can be used:

In all cases, only a limited number of initial conditions are sampled (bounded by the value of a + b in the original definition). But as we can see, the behavior can either be quite simple, or can be highly complex.

More Complicated Rules

Highly complex behavior arises even from very simple rules. It’s a phenomenon one sees all over the computational universe. And we’re seeing it here in nestedly recursive functions. But if we make the rules (i.e. definitions) for our functions more complicated, will we see fundamentally different behavior, or just more of the same?

The Principle of Computational Equivalence (as well as many empirical observations of other systems) suggests that it’ll be “more of the same”: that once one’s passed a fairly low threshold the computational sophistication—and complexity—of behavior will no longer change.

And indeed this is what one sees in nestedly recursive functions. But below the threshold different kinds of things can happen with different kinds of rules.

There are several directions in which we can make rules more complicated. One that we won’t discuss here is to use operations (conditional, bitwise, etc.) that go beyond arithmetic. Others tend to involve adding more instances of f in our definitions.

An obvious way to do this is to take f[n_] to be given by a sum of terms, "Fibonacci style". There are various specific forms one can consider. As a first example—a family indexed by two parameters a and b—let's look at:

The value of a doesn’t seem to matter much. But changing b we see:

The a = 1, b = 2 case has unbounded lookback (at least starting with f[n] = 1 for n ≤ 0), but for larger b the lookback is bounded. In both the b = 3 and b = 5 cases there is continuing large-scale structure (here visible in log plots)

though this does not seem to be reflected in the corresponding evaluation graphs:

As another level of Fibonacci-style definition, we can consider a second two-parameter family:

But the typical behavior here does not seem much different from what we already saw with one-term definitions involving only two f’s:

(Note that the cases with both parameters equal reduce to the corresponding one-parameter cases. Cases like a = 1, b = 3 lead after a transient to pure exponential growth.)

A somewhat more unusual case is a family with three parameters a, b and c:

Subtracting overall linear trends we get:

For the a = b = c = 1 case, using initial conditions f[1] = f[2] = 1 and plotting f[n] – n/2 we get

which has a nested structure that is closely related to the result of concatenating binary digit sequences of successive integers:

But despite the regularity in the sequence of values, the evaluation graph for this function is not particularly simple:

So how else might we come up with more complicated rules? One possibility is that instead of “adding f’s by adding terms” we can add f’s by additional nesting. So, for example, we can consider what we can call S31 (here shown with initial condition f[n] = 1 for n ≤ 3):

We can estimate the overall slope here by solving for x in x == 1 – x^3 to get ≈ 0.682. Subtracting this off we get:

We can also consider deeper nestings. At depth d the slope is the solution to x == 1 – x^d. Somewhat remarkably, in all cases the only initial conditions probed are f[1] = 1 and f[3] = 1:

As another example of "higher nesting" we can consider a class of functions indexed by a single parameter a:

Subtracting a constant 1/ϕ slope we get:

The evaluation graph for the a = 1 case is complicated, but has some definite structure:

What happens if we nest even more deeply, say defining functions of the form:

With depth-d nesting, we can estimate the overall slope of f[n] by solving for x in

or

so that for the d = 3 case here the overall slope is the real root of x + x^2 + x^3 == 1, or about 0.544. Subtracting out this overall slope we get:

And, yes, the sine-curve-like form of the a = 5 case is very odd. Continuing 10x longer, though, things are "squaring off":

What happens if we continue nesting deeper? At first things stay fairly tame:

However, other cases already allow for more complicated behavior:

And for different values of a there are different regularities:

There are all sorts of other extensions and generalizations one might consider. Some involve alternate functional forms; others involve introducing additional functions, or allowing multiple arguments to our function f.

An Aside: The Continuous Case

In talking about recursive functions f[n] we’ve been assuming—as one normally does—that n is always an integer. But can we generalize what we’re doing to functions f[x] where x is a continuous real number?

Consider for example a continuous analog of the Fibonacci recurrence:

This produces a staircase-like function whose steps correspond to the usual Fibonacci numbers:
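Here's a minimal sketch of such a definition, assuming the interval initial condition f[x] = 1 for x ≤ 1, and memoizing on whatever real arguments are actually encountered:

Clear[fc];
fc[x_?NumericQ] := fc[x] = If[x <= 1, 1, fc[x - 1] + fc[x - 2]];  (* continuous Fibonacci analog *)
Plot[fc[x], {x, 0, 15}]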

Adjusting the initial condition produces a slightly different result:

We can think of these as being solutions to a kind of “Fibonacci delay equation”—where we’ve given initial conditions not at discrete points, but instead on an interval.

So what happens with nestedly recursive functions? We can define an analog of S1 as:

Plotting this along with the discrete result we get:

In more detail, we get

where now the plateaus occur at the "Wythoff numbers" Floor[k ϕ].

Changing the initial condition to be x ≤ 3 we get:

Removing the overall slope by subtracting x/ϕ gives:

One feature of the continuous case is that one can continuously change initial conditions—though the behavior one gets typically breaks into "domains" with discontinuous boundaries, as in this case, where we're plotting the value of f[x] as a function both of x and of the "cutoff" x0 in the initial conditions f[x] = 1 for x ≤ x0:

So what about other rules? A rule like P312 (f[n_] := 3 + f[n – f[n – 2]]) given "constant" initial conditions effectively just copies and translates the initial interval, and gives a simple order-0 interpolation of the discrete case. With initial condition f[x] = x some segments get "tipped":

All the cases we’ve considered here don’t “look back” to negative values, in either the discrete or continuous case. But what about a rule like T212 (f[n_] := 2 f[n – 1 f[n – 2]]) that progressively “looks back further”? With the initial condition f[x] = 1 for x ≤ 0, one gets the same result as in the discrete case:

But if one uses the initial condition f[x] = Abs[x – 1] for x ≤ 0 (the Abs[x – 1] is needed to avoid ending up with f[x] depending on f[y] for y > x) one instead has

yielding the rather different result:

Continuing for larger x (on a log scale) we get:

Successively zooming in on one of the first “regions of noise” we see that it ultimately consists just of a large number of straight segments:

What’s going on here? If we count the number of initial conditions that are used for different values of x we see that this has discontinuous changes, leading to disjoint segments in f[x]:

Plotting over a larger range of x values the number of initial conditions used is:

And plotting the actual values of those initial conditions we get:

If we go to later, "more intense" regions of noise, we see more fragmentation—and presumably in the limit x → ∞ we get the analog of an essential singularity in f[x]:

For the S family, with its overall n/ϕ trend, even constant initial conditions—say for S1—already lead to tipping, here shown compared to the discrete case:

How Do You Actually Compute Recursive Functions?

Let’s say we have a recursive definition—like the standard Fibonacci one:
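In Wolfram Language terms this might read (the exact form in the original figure may differ slightly):

f[1] = f[2] = 1;
f[n_] := f[n - 1] + f[n - 2]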

How do we actually use this to compute the value of, say, f[7]? Well, we can start from f[7], then use the definition to write this as f[6] + f[5], then write f[6] as f[5] + f[4], and so on. And we can represent this using an evaluation graph, in the form:

But this computation is in a sense very wasteful; for example, it independently computes f[3] five separate times (and of course gets the same answer each time). But what if we just stored each f[n] as soon as we computed it, and then retrieved that stored ("cached") value whenever we needed it again?

In the Wolfram Language, it’s a very simple change to our original definition:
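Concretely, the memoized ("caching") form is just (again, a sketch of the standard idiom):

f[1] = f[2] = 1;
f[n_] := f[n] = f[n - 1] + f[n - 2]  (* store each value the first time it's computed *)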

And now our evaluation graph becomes much simpler:

And indeed it’s this kind of minimal evaluation graph that we’ve been using in everything we’ve discussed so far.

What’s the relationship between the “tree” evaluation graph, and this minimal one? The tree graph is basically an “unrolled” version of the minimal graph, in which all the possible paths that can be taken from the root node to the initial condition nodes have been treed out.

In general, the number of edges that come out of a single node in an evaluation graph will be equal to the number of instances of the function f that appear on the right-hand side of the recursive definition we're using (i.e. 2 in the case of the standard Fibonacci definition). So this means that if the maximum length of path from the root to the initial conditions is s, the maximum number of nodes that can appear in the "unrolled" graph is 2^s. And whenever there is a fixed set of initial conditions (i.e. if there's always the same lookback), the maximum path length is essentially n—implying in the end that the maximum possible number of nodes in the unrolled graph will be 2^n.

(In the actual case of the Fibonacci recurrence, the number of nodes in the unrolled graph grows like ϕ^n, or about 1.6^n.)

But if we actually evaluate f[7]—say in the Wolfram Language—what is the sequence of f[n]’s that we’ll end up computing? Or, in effect, how will the evaluation graph be traversed? Here are the results for the unrolled and minimal evaluation graphs—i.e. without and with caching:
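One way to see such evaluation sequences for oneself is to instrument the definitions with Sow and collect the arguments with Reap (the names fPlain and fCached here are purely illustrative):

Clear[fPlain, fCached];
fPlain[n_] := (Sow[n]; If[n <= 2, 1, fPlain[n - 1] + fPlain[n - 2]]);
fCached[n_] := fCached[n] = (Sow[n]; If[n <= 2, 1, fCached[n - 1] + fCached[n - 2]]);
Reap[fPlain[7]][[2, 1]]   (* without caching *)
Reap[fCached[7]][[2, 1]]  (* with caching: each argument appears only once *)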

Particularly in the first case this isn't the only conceivable result we could have gotten. It's the way it is here because of the particular "leftmost innermost" evaluation order that the Wolfram Language uses by default. In effect, we're traversing the graph in a depth-first way. In principle we could use other traversal orders, leading to f[n]'s being evaluated in different orders. But unless we allow other operations (like f[3] + f[3] → 2 f[3]) to be interspersed with f evaluations, we'll still always end up with the same number of f evaluations for a given evaluation graph.

But which is the “correct” evaluation graph? The unrolled one? Or the minimal one? Well, it depends on the computational primitives we’re prepared to use. With a pure stack machine, the unrolled graph is the only one possible. But if we allow (random-access) memory, then the minimal graph becomes possible.

OK, so what happens with nestedly recursive functions? Here, for example, are unrolled and minimal graphs for T212:

Here are the sequences of f[n]’s that are computed:

And here's a comparison of the number of nodes (i.e. f evaluations) from unrolled and minimal evaluation graphs (growing roughly like 1.2^n and 0.5 n, respectively):

Different recursive functions lead to different patterns of behavior. The differences are less obvious in evaluation graphs, but can be quite obvious in the actual sequence of f[n]’s that are evaluated:

But although looking at evaluation sequences from unrolled evaluation graphs can be helpful as a way of classifying behavior, the exponentially larger number of steps involved in the unrolled graph typically makes this impractical.

Primitive Recursive or Not?

Recursive functions have a fairly long history, which we'll be discussing below. And for nearly a hundred years there's been a distinction made between "primitive recursive functions" and "general recursive functions". Primitive recursive functions are basically ones where there's a "known-in-advance" pattern of computation that has to be done; general recursive functions are ones that may in effect make one have to "search arbitrarily far" to get what one needs.

In Wolfram Language terms, primitive recursive functions are roughly ones that can be constructed directly using functions like Nest and Fold (perhaps nested); general recursive functions can also involve functions like NestWhile and FoldWhile.

So, for example, with the Fibonacci definition

the function f[n] is primitive recursive and can be written, say, as:
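One possible Fold-based form (a sketch, not necessarily the exact construction in the original) keeps a pair of successive Fibonacci values as its state:

fib[n_] := First[Fold[{Last[#1], Total[#1]} &, {1, 1}, Range[n - 1]]];
Table[fib[n], {n, 10}]  (* {1, 1, 2, 3, 5, 8, 13, 21, 34, 55} *)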

Lots of the functions one encounters in practice are similarly primitive recursive—including most “typical mathematical functions” (Plus, Power, GCD, Prime, …). And for example functions that give the results of n steps in the evolution of a Turing machine, cellular automaton, etc. are also primitive recursive. But functions that for example test whether a Turing machine will ever halt (or give the state that it achieves if and when it does halt) are not in general primitive recursive.

On the face of it, our nestedly recursive functions seem like they must be primitive recursive, since they don’t for example appear to be “searching for anything”. But things like the presence of longer and longer lookbacks raise questions. And then there’s the potential confusion of the very first example (dating from the late 1920s) of a recursively defined function known not to be primitive recursive: the Ackermann function.

The Ackermann function has three (or sometimes two) arguments—and, notably, its definition (here given in its classic form) includes nested recursion:

This is what the evaluation graphs look like for some small cases:

Looking at these graphs we can begin to see a pattern. And in fact there's a simple interpretation: f[m, x, y] for successive m is doing progressively more nested iterations of integer successor operations. f[0, x, y] computes x + y; f[1, x, y] does "repeated addition", i.e. computes x × y; f[2, x, y] does "repeated multiplication", i.e. computes x^y; f[3, x, y] does "tetration", i.e. computes the "power tower" Nest[x^# &, 1, y]; etc.
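Here's a minimal definition consistent with this interpretation (a sketch; the particular argument conventions are an assumption, and this isn't necessarily Ackermann's exact original formulation):

Clear[ack];
ack[0, x_, y_] := x + y;                             (* addition *)
ack[m_ /; m >= 1, x_, 0] := If[m == 1, 0, 1];        (* base cases for higher levels *)
ack[m_, x_, y_] := ack[m - 1, x, ack[m, x, y - 1]];  (* the nested recursion *)
Table[ack[m, 2, 3], {m, 0, 3}]  (* {5, 6, 8, 16}: addition, multiplication, power, tetration *)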

Or, alternatively, these can be given explicitly in successively more nested form:

And at least in this form f[m, x, y] involves m nestings. But a given primitive recursive function can involve only a fixed number of nestings. It might be conceivable that we could rewrite f[m, x, y] in certain cases to involve only a fixed number of nestings. But if we look at f[m, m, m] then this turns out to inevitably grow too rapidly to be represented by a fixed number of nestings—and thus cannot be primitive recursive.

But it turns out that the fact that this can happen depends critically on the Ackermann function having more than one argument—so that one can construct the “diagonal” f[m, m, m].

So what about our nestedly recursive functions? Well, at least in the form that we’ve used them, they can all be written in terms of Fold. The key idea is to accumulate a list of values so far (conveniently represented as an association)—sampling whichever parts are needed—and then at the end take the last element. So for example the “Summer School function” T311

can be written:
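A sketch along these lines (t311 is an illustrative name, and this isn't necessarily the exact form used for the original figures): accumulate the values computed so far in an association, and use Lookup with default 1 so that anything not yet filled in counts as 1:

t311[n_] := Last[Fold[
   Append[#1, #2 -> 3 Lookup[#1, #2 - Lookup[#1, #2 - 1, 1], 1]] &,
   <||>, Range[n]]];
Table[t311[k], {k, 10}]  (* {3, 3, 3, 9, 3, 9, 3, 9, 3, 9} *)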

An important feature here is that we’re getting Lookup to give 1 if the value it’s trying to look up hasn’t been filled in yet, implementing the fact that f[n] = 1 for n ≤ 0.

So, yes, our recursive definition might look back further and further. But it always just finds value 1—which is easy for us to represent without, for example, any extra nesting, etc.

The ultimate (historical) definition of primitive recursion, though, doesn’t involve subsets of the Wolfram Language (the definition was given almost exactly 100 years too early!). Instead, it involves a specific set of simple primitives:

(An alternative, equivalent definition for recursion—explicitly involving Fold—is r[g_, h_] := Fold[{u, v} |-> h[u, v, ##2], g[##2], Range[0, #1 – 1]] &.)

So can our nestedly recursive functions be written purely in terms of these primitives? The answer is yes, though it's seriously complicated. A simple function like Plus can for example be written as r[p[1], s], so that e.g. r[p[1], s][2, 3] gives 5. Times can be written as r[z, c[Plus, p[1], p[3]]] or r[z, c[r[p[1], s], p[1], p[3]]], while Factorial can be written as r[c[s, z], c[Times, p[1], c[s, p[2]]]]. But even Fibonacci, for example, seems to require a very much longer specification.
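To make this concrete, here's one possible way to set these primitives up in Wolfram Language: a sketch whose conventions (in particular the argument ordering in r) are an assumption, chosen so that the examples above come out as stated:

z[___] := 0;                                        (* zero *)
s[x_, ___] := x + 1;                                (* successor; ignores extra arguments *)
p[i_][args___] := {args}[[i]];                      (* projections *)
c[f_, gs__][args___] := f @@ (#[args] & /@ {gs});   (* composition *)
r[g_, h_][0, args___] := g[args];                   (* primitive recursion: base case *)
r[g_, h_][n_Integer?Positive, args___] := h[r[g, h][n - 1, args], n - 1, args];
{r[p[1], s][2, 3], r[z, c[Plus, p[1], p[3]]][3, 4], r[c[s, z], c[Times, p[1], c[s, p[2]]]][4]}
(* {5, 12, 24} *)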

In writing “primitive-recursive-style” definitions in Wolfram Language we accumulated values in lists and associations. But in the ultimate definition of primitive recursion, there are no such constructs; the only form of “data” is positive integers. But for our definitions of nestedly recursive functions we can use a “tupling function” that “packages up” any list of integer values into a single integer (and an untupling function that unpacks it). And we can do this say based on a pairing (2-element-tupling) function like:
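One classic choice (purely for illustration; not necessarily the pairing function used in the original figure) is the Cantor pairing function, together with its inverse:

pair[{a_, b_}] := (a + b) (a + b + 1)/2 + b;
unpair[n_] := Module[{w = Floor[(Sqrt[8 n + 1] - 1)/2], t},
   t = w (w + 1)/2; {w - (n - t), n - t}];
unpair[pair[{4, 7}]]  (* {4, 7} *)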

But what about the actual If[n ≤ 0, 1, ...] lookback test itself? Well, If can be written in primitive recursive form too: for example, r[c[s, z], c[f, c[s, p[2]]]][n] is equivalent to If[n ≤ 0, 1, f[n]].

So our nestedly recursive functions as we’re using them are indeed primitive recursive. Or, more strictly, finding values f[n] is primitive recursive. Asking questions like “For what n does f[n] reach 1000?” might not be primitive recursive. (The obvious way of answering them involves a FoldWhile-style non-primitive-recursive search, but proving that there’s no primitive recursive way to answer the question is likely very much harder.)

By the way, it’s worth commenting that while for primitive recursive functions it’s always possible to compute a value f[n] for any n, that’s not necessarily true for general recursive functions. For example, if we ask “For what n does f[n] reach 1000?” there might simply be no answer to this; f[n] might never reach 1000. And when we look at the computations going on underneath, the key distinction is that in evaluating primitive recursive functions, the computations always halt, while for general recursive functions, they may not.

So, OK. Our nestedly recursive functions can be represented in “official primitive recursive form”, but they’re very complicated in that form. So that raises the question: what functions can be represented simply in this form? In A New Kind of Science I gave some examples, each minimal for the output it produces:

And then there’s the most interesting function I found:

It’s the simplest primitive recursive function whose output has no obvious regularity:

Because it’s primitive recursive, it’s possible to express it in terms of functions like Fold—though it’s two deep in those, making it in some ways more complicated (at least as far as the Grzegorczyk hierarchy that counts “Fold levels” is concerned) than our nestedly recursive functions:

But there's still an issue to address with nestedly recursive functions and primitive recursion. When we have functions (like T212) that "reach back" progressively further as n increases, there's a question of what they'll find. We've simply assumed f[n] = 1 for n ≤ 0. But what if there was something more complicated there? Even if f[–m] was given by some primitive recursive function, say p[m], it seems possible that in computing f[n] one could end up somehow "bouncing back and forth" between positive and negative arguments, and in effect searching for an m for which p[m] has some particular value, and in doing that searching one could find oneself outside the domain of primitive recursive functions.

And this raises yet another question: are all definitions we can give of nestedly recursive functions consistent? Consider for example:

Now ask: what is f[1]? We apply the recursive definition. But it gives us f[1] = 1 – f[f[0]] or f[1] = 1 – f[1], or, in other words, an inconsistency. There are many such inconsistencies that seem to “happen instantly” when we apply definitions. But it seems conceivable that there could be “insidious inconsistencies” that show up only after many applications of a recursive definition. And it’s also conceivable that one could end up with “loops” like f[i] = f[i]. And things like this could be reasons that f[n] might not be a “total function”, defined for all n.

We’ve seen all sorts of complex behavior in nestedly recursive functions. And what the Principle of Computational Equivalence suggests is that whenever one sees complex behavior, one must in some sense be dealing with computations that are “as sophisticated as any computation can be”. And in particular one must be dealing with computations that can somehow support computation universality.

So what would it mean for a nestedly recursive function to be universal? For a start, one would need some way to “program” the function. There seem to be a couple of possibilities. First, one could imagine packing both “code” and “data” into the argument n of f[n]. So, for example, one might use some form of tupling function to take a description of a rule and an initial state for a Turing machine, together with a specification of a step number, then package all these things into an integer n that one feeds into one’s universal nestedly recursive function f. Then the idea would be that the value computed for f[n] could be decoded to give the state of the Turing machine at the specified step. (Such a computation by definition always halts—but much as one computes with Turing machines by successively asking for the next steps in their evolution, one can imagine setting up a “harness” that just keeps asking for values of f[n] at an infinite progression of values n.)

Another possible approach to making a universal nestedly recursive function is to imagine feeding in a “program” through the initial conditions one gives for the function. There might well need to be decoding involved, but in some sense what one might hope is that just by changing its initial conditions one could get a nestedly recursive function with a specific recursive definition to emulate a nestedly recursive function with any other recursive definition (or, say, for a start, any linear recurrence).

Perhaps one could construct a complicated nestedly recursive function that would have this property. But what the Principle of Computational Equivalence suggests is that it should be possible to find the property even in “naturally occurring cases”—like P312 or T212.

The situation is probably going to be quite analogous to what happened with the rule 110 cellular automaton or the s = 2, k = 3 Turing machine 596440. By looking at the actual typical behavior of the system one got some intuition about what was likely to be going on. And then later, with great effort, it became possible to actually prove computation universality.

In the case of nestedly recursive functions, we've seen here examples of just how diverse the behavior generated by changing initial conditions can be. It's not clear how to harness this diversity to extract some form of universality. But it seems likely that the "raw material" is there, and that nestedly recursive functions will show themselves able to join so many other systems in fitting into the framework defined by the Principle of Computational Equivalence.

Some History

Once one has the concept of functions and the concept of recursion, nestedly recursive functions aren’t in some sense a “complicated idea”. And between this fact and the fact that nestedly recursive functions haven’t historically had a clear place in any major line of mathematical or other development it’s quite difficult to be sure one’s accurately tracing their history. But I’ll describe here at least what I currently know.

The concept of something like recursion is very old. It’s closely related to mathematical induction, which was already being used for proofs by Euclid around 300 BC. And in a quite different vein, around the same time (though not recorded in written form until many centuries later) Fibonacci numbers arose in Indian culture in connection with the enumeration of prosody (“How many different orders are there in which to say the Sanskrit words in this veda?”).

Then in 1202 Leonardo Fibonacci, at the end of his calculational math book Liber Abaci (which was notable for popularizing Hindu-Arabic numerals in the West) stated—more or less as a recreational example—his “rabbit problem” in recursive form, and explicitly listed the Fibonacci numbers up to 377. But despite this early appearance, explicit recursively defined sequences remained largely a curiosity until as late as the latter part of the twentieth century.

The concept of an abstract function began to emerge with calculus in the late 1600s, and became more solidified in the 1700s—but basically always in the context of continuous arguments. A variety of specific examples of recurrence relations—for binomial coefficients, Bernoulli numbers, etc.—were in fairly widespread use. But there didn’t seem to have yet been a sense that there was a general mathematical structure to study.

In the course of the 1800s there had been an increasing emphasis on rigor and abstraction in mathematics, leading by the latter part of the century to a serious effort to axiomatize concepts associated with numbers. Starting with concepts like the recursive definition of integers by repeated application of the successor operation, by the time of Peano’s axioms for arithmetic in 1891 there was a clear general notion (particularly related to the induction axiom) that (integer) functions could be defined recursively. And when David Hilbert’s program of axiomatizing mathematics got underway at the beginning of the 1900s, it was generally assumed that all (integer) functions of interest could actually be defined specifically using primitive recursion.

The notation for recursively specifying functions gradually got cleaner, making it easier to explore more elaborate examples. And in 1927 Wilhelm Ackermann (a student of Hilbert’s) introduced (in completely modern notation) a “reasonable mathematical function” that—as we discussed above—he showed was not primitive recursive. And right there, in his paper, without any particular comment, is a nestedly recursive function definition:

Ackermann nestedly recursive function paper

Ackermann nestedly recursive function definition

In 1931 Kurt Gödel further streamlined the representation of recursion, and solidified the notion of general recursion. There soon developed a whole field of recursion theory—though most of it was concerned with general issues, not with specific, concrete recursive functions. A notable exception was the work of Rózsa Péter (Politzer), beginning in the 1930s, and leading in 1957 to her book Recursive Functions—which contains a chapter on “Nested Recursion” (here in English translation):

Nested recursion book chapter

But despite the many specific (mostly primitive) recursive functions discussed in the rest of the book, this chapter doesn’t stray far from the particular function Ackermann defined (or at least Péter’s variant of it).

What about the recreational mathematics literature? By the late 1800s there were all sorts of publications involving numbers, games, etc. that at least implicitly involved recursion (an example being Édouard Lucas’s 1883 Tower of Hanoi puzzle). But—perhaps because problems tended to be stated in words rather than mathematical notation—it doesn’t seem as if nestedly recursive functions ever showed up.

In the theoretical mathematics literature, a handful of somewhat abstract papers about “nested recursion” did appear, an example being one in 1961 by William Tait, then at Stanford:

Nested recursion paper by William Tait

But, meanwhile, the general idea of recursion was slowly beginning to go from purely theoretical to more practical. John McCarthy—who had coined the term “artificial intelligence”—was designing LISP as “the language for AI” and by 1960 was writing papers with titles like “Recursive Functions of Symbolic Expressions and Their Computation by Machine”.

In 1962 McCarthy came to Stanford to found the AI Lab there, bringing with him enthusiasm for both AI and recursive functions. And by 1968 these two topics had come together in an effort to use “AI methods” to prove properties of programs, and in particular programs involving recursive functions. And in doing this, John McCarthy came up with an example he intended to be awkward—that’s exactly a nestedly recursive function:

John McCarthy nestedly recursive function example

In our notation, it would be:
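A sketch of the standard form, presumably essentially what the original shows:

f[n_] := If[n > 100, n - 10, f[f[n + 11]]]
Table[f[n], {n, 95, 105}]  (* {91, 91, 91, 91, 91, 91, 91, 92, 93, 94, 95} *)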

And it became known as "McCarthy's 91-function" because, yes, for many n, f[n] = 91. These days it's trivial to evaluate this function—and to find out that f[n] = 91 only for n ≤ 101:

But even the evaluation graph is somewhat large

and in pure recursive evaluation the recursion stack can get deep—which back then was a struggle for LISP systems to handle.

There were efforts at theoretical analysis, for example by Zohar Manna, who in 1974 published Mathematical Theory of Computation, which—in a section entitled "Fixpoints of Functionals"—presents the 91-function and other nestedly recursive functions, particularly in the context of evaluation-order questions.

In the years that followed, a variety of nestedly recursive functions were considered in connection with proving theorems about programs, and with practical assessments of LISP systems, a notable example being Ikuo Takeuchi’s 1978 triple recursive function:

Ikuo Takeuchi triple recursive function example

But in all these cases the focus was on how these functions would be evaluated, not on what their behavior would be (and it was typically very simple).

But now we have to follow another thread in the story. Back in 1961, right on the Stanford campus, a then-16-year-old Douglas Hofstadter was being led towards nestedly recursive functions. As Doug tells it, it all started with him seeing that squares are interspersed with gaps of 1 or 2 between triangular numbers, and then noticing patterns in those gaps (and later realizing that they showed nesting). Meanwhile, at Stanford he had access to a computer running Algol, a language which (like LISP and unlike Fortran) supported recursion (though this wasn’t particularly advertised, since recursion was still generally considered quite obscure).

And as Doug tells it, within a year or two he was using Algol to do things like recursively create trees representing English sentences. Meanwhile—in a kind of imitation of the Eleusis "guess-a-card-rule" game—Doug was apparently challenging his fellow students to a "function game" based on guessing a math function from specified values. And, as he tells it, he found that functions that were defined recursively were the ones people found hardest to guess.

That was all in the early 1960s, but it wasn’t until the mid-1970s that Doug Hofstadter returned to such pursuits. After various adventures, Doug was back at Stanford—writing what became his book Gödel, Escher, Bach. And in 1977 he sent a letter to Neil Sloane, creator of the 1973 A Handbook of Integer Sequences (and what’s now the Online Encyclopedia of Integer Sequences, or OEIS):

Douglas Hofstadter letter to Neil Sloane

As suggested by the accumulation of “sequence ID” annotations on the letter, Doug’s “eta sequences” had actually been studied in number theory before—in fact, since at least the 1920s (they are now usually called Beatty sequences). But the letter went on, now introducing some related sequences—that had nestedly recursive definitions:

Sequences with nestedly recursive definitions

As Doug pointed out, these particular sequences (which were derived from golden ratio versions of his “eta sequences”) have a very regular form—which we would now call nested. And it was the properties of this form that Doug seemed most concerned about in his letter. But actually, as we saw above, just a small change in initial conditions in what I’m calling S1 would have led to much wilder behavior. But that apparently wasn’t something Doug happened to notice. A bit later in the letter, though, there was another nestedly recursive sequence—that Doug described as a “horse of an entirely nother color”: the “absolutely CRAZY” Q sequence:

Crazy Q sequence

Two years later, Doug’s Gödel, Escher, Bach book was published—and in it, tucked away at the bottom of page 137, a few pages after a discussion of recursive generation of text (with examples such as “the strange bagels that the purple cow without horns gobbled”), there was the Q sequence:

Chaotic Q sequence

Strangely, though, there was no picture of it, and Doug listed only 17 terms (which, until I was writing this, was all I assumed he had computed):

17 Q-sequence terms

So now nestedly recursive sequences were out in the open—in what quickly became a very popular book. But I don’t think many people noticed them there (though, as I’ll discuss, I did). Gödel, Escher, Bach is primarily a playful book focused on exposition—and not the kind of place you’d expect to find a new, mathematical-style result.

Still—quite independent of the book—Neil Sloane showed Doug’s 1977 letter to his Bell Labs colleague Ron Graham, who within a year made a small mention of the Q sequence in a staid academic math publication (in a characteristic “state-it-as-a-problem” Erdös form):

Erdös and Graham math paper

Erdös and Graham math paper continued

There was a small and tight-knit circle of serious mathematicians—essentially all of whom, as it happens, I personally knew—who would chase these kinds of easy-to-state-but-hard-to-solve problems. Another was Richard Guy, who soon included the sequence as part of problem E31 in his Unsolved Problems in Number Theory, and mentioned it again a few years later.

But for most of the 1980s little was heard about the sequence. As it later turned out, a senior British applied mathematician named Brian Conolly (who wasn't part of the aforementioned tight-knit circle) had—presumably as a kind of hobby project—made some progress, and in 1986 had written to Guy about it. Guy apparently misplaced the letter, but later told Conolly that John Conway and Sol Golomb had worked on similar things.

Conway presumably got the idea from Hofstadter’s work (though he had a habit of obfuscating his sources). But in any case, on July 15, 1988, Conway gave a talk at Bell Labs entitled “Some Crazy Sequences” (note the word “crazy”, just like in Hofstadter’s letter to Sloane) in which he discussed the regular-enough-to-be-mathematically-interesting sequence (which we’re calling G3111 here):

Despite its visual regularity, Conway couldn’t mathematically prove certain features of the wiggles in the sequence—and in his talk offered a $10,000 prize for anyone who could. By August a Bell Labs mathematician named Colin Mallows had done it. Conway claimed (later to be contradicted by video evidence) that he’d only offered $1000—and somehow the whole affair landed as a story in the August 30 New York Times under the heading “Intellectual Duel: Brash Challenge, Swift Response”. But in any case, this particular nestedly recursive sequence became known as “Conway’s Challenge Sequence”.

So what about Sol Golomb? It turns out he’d started writing a paper—though never finished it:

Discrete Chaos paper

Discrete Chaos paper continued

He’d computed 280 terms of the Q sequence (he wasn’t much of a computer user) and noticed a few coincidences. But he also mentioned another kind of nestedly recursive sequence, no doubt inspired by his work on feedback shift registers:

As he noted, the behavior depends greatly on the initial conditions, though it is always eventually periodic—with his student Unjeng Cheng having found long-period examples.

OK, so by 1988 nestedly recursive functions had at least some notoriety. So what happened next? Not so much. There’s a modest academic literature that’s emerged over the last few decades, mostly concentrated very specifically around “Conway’s Challenge Sequence”, Hofstadter’s Q function, or very similar “meta Fibonacci” generalizations of them. And so far as I know, even the first published large-scale picture of the Q sequence only appeared in 1998 (though I had pictures of it many years earlier):

Klaus Pinn Q-sequence paper

Klaus Pinn Q-sequence paper continued

Why wasn’t more done with nestedly recursive functions? At some level it’s because they tend to have too much computational irreducibility—making it pretty difficult to say much about them in traditional mathematical ways. But perhaps more important, studying them broadly is really a matter of ruliology: it requires the idea of exploring spaces of rules, and of expecting the kinds of behavior and phenomena that are characteristic of systems in the computational universe. And that’s something that’s still not nearly as widely understood as it should be.

My Personal Story with Nestedly Recursive Functions

I think 1979 was the year when I first took recursion seriously. I’d heard about the Fibonacci sequence (though not under that name) as a young child a decade earlier. I’d implicitly (and sometimes explicitly) encountered recursion (sometimes through error messages!) in computer algebra systems I’d used. In science, I’d studied fractals quite extensively (Benoit Mandelbrot’s book having appeared in 1977), and I’d been exposed to things like iterated maps. And I’d quite extensively studied cascade processes, notably of quarks and gluons in QCD.

As I think about it now, I realize that for several years I’d written programs that made use of recursion (and I had quite a lot of exposure to LISP, and the culture around it). But it was in 1979—having just started using C—that I first remember writing a program (for doing percolation theory) where I explicitly thought “this is using recursion”. But then, in late 1979, I began to design SMP (“Symbolic Manipulation Program”), the forerunner of the modern Wolfram Language. And in doing this I quickly solidified my knowledge of mathematical logic and the (then-fledgling) field of theoretical computer science.

My concept of repeated transformations for symbolic expressions—which is still the core of Wolfram Language today—is somehow fundamentally recursive. And by the time we had the first signs of life for our SMP system, Fibonacci was one of our very first tests. We soon tried the Ackermann function too. And in 1980 I became very interested in the problem of evaluation order, particularly for recursive functions—and the best treatment I found of it (though at the time not very useful to me) was in none other than the book by Zohar Manna that I mentioned above. (In a strange twist, I was at that time also studying gauge choices in physics—and it was only last year that I realized that they’re fundamentally the same thing as evaluation orders.)

It was soon after it came out in 1979 that I first saw Douglas Hofstadter’s book. At the time I wasn’t too interested in its Lewis-Carroll-like aspects, or its exposition; I just wanted to know what the “science meat” in it was. And somehow I found the page about the Q sequence, and filed it away as “something interesting”.

I’m not sure when I first implemented the Q sequence in SMP, but by the time we released Version 1.0 in July 1981, there it was: an external package (hence the “X” prefix) for evaluating “Hofstadter’s recursive function”, elegantly using memoization—with the description I gave saying (presumably because that’s what I’d noticed) that its values “have several properties of randomness”:

Hofstadter recursive function

Firing up a copy of SMP today—running on a virtual machine that still thinks it’s 1986—I can run this code, and easily compute the function:

SMP evaluation

I can even plot it—though without an emulator for a 1980s-period storage-tube display, only the ASCIIfied rendering works:

ASCIIfied rendering

So what did I make of the function back in 1981? I was interested in how complexity and randomness could occur in nature. But at the time, I didn’t have enough of a framework to understand the connection. And, as it was, I was just starting to explore cellular automata, which seemed a lot more “nature like”—and which soon led me to things like rule 30 and the phenomenon of computational irreducibility.

Still, I didn’t forget the Q sequence. And when I was building Mathematica I again used it as a test (the .tq file extension came from the brief period in 1987 when we were trying out “Technique” as the name of the system):

Combinatorial functions

Combinatorial functions continued

When Mathematica 1.0 was released on June 23, 1988, the Q sequence appeared again, this time as an example in the soon-in-every-major-bookstore Mathematica book:

Q sequence in Mathematica book

Q sequence in Mathematica book continued

I don’t think I was aware of Conway’s lecture that occurred just 18 days later. And for a couple of years I was consumed with tending to a young product and a young company. But by 1991, I was getting ready to launch into basic science again. Meanwhile, the number theorist (and today horologist) Ilan Vardi—yet again from Stanford—had been working at Wolfram Research and writing a book entitled Computational Recreations in Mathematica, which included a long section on the analysis of Takeuchi’s nested recursive function (“TAK”). My email archive records an exchange I had with him:

Wolfram–Vardi email

He suggested a “more symmetrical” nested recursive function. I responded, including a picture that made it fairly clear that this particular function would have only nested behavior, and not seem “random”:

Wolfram–Vardi followup email

Nested recursive function graphic

By the summer of 1991 I was in the thick of exploring different kinds of systems with simple rules, discovering the remarkable complexity they could produce, and filling out what became Chapter 3 of A New Kind of Science: “The World of Simple Programs”. But then came Chapter 4: “Systems Based on Numbers”. I had known since the mid-1980s about the randomness in things like digit sequences produced by successive arithmetic operations. But what about randomness in pure sequences of integers? I resolved to find out just what it would take to produce randomness there. And so it was that on August 13, 1993, I came to be enumerating possible symbolic forms for recursive functions—and selecting ones that could generate at least 10 terms:

Symbolic forms for recursive functions

As soon as I plotted the “survivors” I could see that interesting things were going to happen:

Recursive function graphs

Was this complexity somehow going to end? I checked out to 10 million terms. And soon I started collecting my “prize specimens” and making a gallery of them:

Recursive functions gallery

I had some one-term recurrences, and some two-term ones. Somewhat shortsightedly I was always using “Fibonacci-like” initial conditions f[1] = f[2] = 1—and I rejected any recurrence that tried to “look back” to f[0], f[–1], etc. And at the time, with this constraint, I only found “really interesting” behavior in two-term recurrences.

In 1994 I returned briefly to recursive sequences, adding a note “solving” a few of them, and discussing the evaluation graphs of others:

Properties of sequences

Evaluation graphs

When I finally finished A New Kind of Science in 2002, I included a list of historical “Close approaches” to its core discoveries, one of them being Douglas Hofstadter’s work on recursive sequences:

Douglas Hofstadter work on recursive sequences

In retrospect, back in 1981 I should have been able to take the “Q sequence” and recognize in it the essential “rule 30 phenomenon”. But as it was, it took another decade—and many other explorations in the computational universe—for me to build up the right conceptual framework to see this. In A New Kind of Science I studied many kinds of systems, probing them far enough, I hoped, to see their most important features. But recursive functions were an example where I always felt there was more to do; I felt I’d only just scratched the surface.

In June 2003—a year after A New Kind of Science was published—we held our first summer school. And as a way to introduce methodology—and be sure that people knew I was fallible and approachable—I decided on the first day of the summer school to do a “live experiment”, and try to stumble my way to discovering something new, live and in public.

A few minutes before the session started, I picked the subject: recursive functions. I began with some examples I knew. Then it was time to go exploring. At first lots of functions “didn’t work” because they were looking back too far. But then someone piped up “Why not just say that f[n] is 1 whenever n isn’t a positive integer?” Good idea! And very easy to try.

Soon we had the “obvious” functions written (today Apply[Plus, ...] could be just Total[...], but otherwise there’s nothing “out of date” here):

In a typical story of Wolfram-Language-helps-one-think-clearly, the obvious function was also very general, and allowed a recurrence with any number of terms. So why not start with just one term? And immediately, there it was—what we’re now calling T311:

T311

And then a plot (yes, after Version 6 one didn’t need the Show or the trailing “;”):

RSValues plot

Of course, as is the nature of computational constructions, this is something timeless—that looks the same today as it did 21 years ago (well, except that now our plots display with color by default).

I thought this was a pretty neat discovery. And I just couldn’t believe that years earlier I’d failed to see the obvious generalization of having “infinite” initial conditions.

The next week I did a followup session, this time talking about how one would write up a discovery like this. We started off with possible titles (including audience suggestions):

Suggested titles

And, yes, the first title listed is exactly the one I’ve now used here. In the notebook I created back then, there were first some notes (some of which should still be explored!):

Title notes

Three hours later (on the afternoon of July 11, 2003) there’s another notebook, with the beginnings of a writeup:

Initial recursive functions writeup

By the way, another thing we came up with at the summer school was the (non-nestedly) recursive function:

Plotting g[n + 1] – g[n] gives:

And, yes, bizarrely (and reminiscent of McCarthy’s 91-function) for n ≥ 396, g[n + 1] – g[n] is always 97, and g[n] = 38606 + 97 (n – 396).

But in any case, a week or so after my “writeups” session, the summer school was over. In January 2004 I briefly picked the project up again, and made some pictures that, yes, show interesting structure that perhaps I should investigate now:

f[n - f[n - 1]]

In the years that followed, I would occasionally bring nestedly recursive functions out again—particularly in interacting with high school and other students. And at our summer programs I suggested projects on them for a number of students.

In 2008, they seemed like an “obvious interesting thing” to add to our Demonstrations Project:

NKS summer school live experiment

But mostly, they languished. Until, that is, my burst of “finish this” intellectual energy that followed the launch of our Physics Project in 2020. So here now, finally, after a journey of 43 years, I feel like I’ve been able to do some justice to nestedly recursive functions, and provided a bit more illumination to yet another corner of the computational universe.

(Needless to say, there are many, many additional questions and issues to explore. Different primitives, e.g. Mod, Floor, etc. Multiple functions that refer to each other. Multiway cases. Functions based on rational numbers. And endless potential approaches to analysis, identifying pockets of regularity and computational reducibility.)

Thanks

Thanks to Brad Klee for extensive help. Thanks also to those who’ve worked on nestedly recursive functions as students at our summer programs over the years, including Roberto Martinez (2003), Eric Rowland (2003), Chris Song (2021) and Thomas Adler (2024). I’ve benefitted from interactions about nestedly recursive functions with Ilan Vardi (1991), Tal Kubo (1993), Robby Villegas (2003), Todd Rowland (2003 etc.), Jim Propp (2004), Matthew Szudzik (2005 etc.), Joseph Stocke (2021 etc.), Christopher Gilbert (2024) and Max Niedermann (2024). Thanks to Doug Hofstadter for extensive answers to questions about history for this piece. It’s perhaps worth noting that I’ve personally known many of the people mentioned in the history section here (with the dates I met them indicated): John Conway (1984), Paul Erdös (1986), Sol Golomb (1981), Ron Graham (1983), Benoit Mandelbrot (1986), John McCarthy (1981) and Neil Sloane (1983).

Bibliography of Nestedly Recursive Functions »