2024-12-10 02:38:15
Note: As of today, copies of Wolfram Version 14.1 are being auto-updated to allow subscription access to the capabilities described here. [For additional installation information see here.]
Nearly a year and a half ago—just a few months after ChatGPT burst on the scene—we introduced the first version of our Chat Notebook technology to integrate LLM-based chat into Wolfram Notebooks. For the past year and a half we’ve been building on those foundations. And today I’m excited to be able to announce that we’re releasing the fruits of those efforts: the first version of our Wolfram Notebook Assistant.
There are all sorts of gimmicky AI assistants out there. But Notebook Assistant isn’t one of them. It’s a serious, deep piece of new technology, and what’s more important, it’s really, really useful! In fact, I think it’s so useful as to be revolutionary. Personally, I thought I was a pretty efficient user of Wolfram Language—but Notebook Assistant has immediately made me not only significantly more efficient, but also more ambitious in what I try to do. I hadn’t imagined just how useful Notebook Assistant was going to be. But seeing it now I can say for sure that it’s going to raise the bar for what everyone can do. And perhaps most important of all, it’s going to open up computational language and computational thinking to a vast range of new people, who in the past assumed that those things just weren’t accessible to them.
Leveraging the decades of work we’ve done on the design and implementation of the Wolfram Language (and Wolfram|Alpha), Notebook Assistant lets people just say in their own words what they want to do; then it does its best to crispen it up and give a computational implementation. Sometimes it goes all the way and just delivers the answer. But even when there’s no immediate “answer” it does remarkably well at building up structures where things can be represented computationally and tackled concretely. People really don’t need to know anything about computational language or computational thinking to get started; Notebook Assistant will take their ideas, rough as they may be, and frame them in computational language terms.
I’ve long seen Wolfram Language as uniquely providing the infrastructure and “notation” to enable “computational X” for all fields X. I’m excited to say that I think Notebook Assistant now bridges “the last mile” to let anyone—at almost any level—access the power of computational language, and “do computational X”. In its original conception, Wolfram Notebook Assistant was just intended to be “useful”. But it’s emerging as something much more than that; something positively revolutionary.
Many times I’ll think: “I can’t believe it’ll do anything useful with that”. But then I’ll try it. And, very often, something amazing will happen. Something that gets me past some sticking point or over some confusion. Something that gives me an unexpected new building block—or new idea—for what I’m trying to do. And that uses the medium of our computational language to take me beyond where I would ever have reached before.
So how does one use Notebook Assistant? Once you’ve signed up you can just go to the toolbar of any notebook, and open a Notebook Assistant chat window:
Now tell Notebook Assistant what you want to do. The more precise and explicit you are, the better. But you don’t have to have thought things through. Just type what comes into your mind. Imagine you’ve been working in a notebook, and (somehow) you’ve got a picture of some cats. You wonder “How can I find the cats in this picture?” Well, just ask Notebook Assistant!
Notebook Assistant gives some narrative text, and then a piece of Wolfram Language code—which you can just run in your notebook (by pressing Shift+Enter):
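The exact code varies from run to run, but it’ll typically be along these lines (a representative sketch of the kind of code it returns, not its verbatim output; img stands for the cat picture already in your notebook):

    ImageCases[img, "cat"]                               (* extract sub-images of the cats *)
    HighlightImage[img, ImageBoundingBoxes[img, "cat"]]  (* or highlight where they are in the picture *)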
It seems a bit like magic. You say something vague, and Notebook Assistant turns it into something precise and computational—which you can then run. It’s not always as straightforward as in this example. But the important thing is that in practice (at least in my rather demanding experience) Notebook Assistant essentially always does spectacularly well at being useful—and at telling me things that move forward what I’m trying to do.
Imagine that sitting next to you, you had someone very knowledgeable about Wolfram Language and about computational thinking in general. Think what you might ask them. That’s what you should ask Notebook Assistant. And if there’s one thing to communicate here, it’s “Just try it!” You might think what you’re thinking about is too vague, or too specific, or too technical. But just try asking Notebook Assistant. In my experience, you’ll be amazed at what it’s able to do, and how helpful it’s able to be.
Maybe you’re an experienced Wolfram Language user who “knows there must be a way to do something”, but can’t quite remember how. Just ask Notebook Assistant. And not only will it typically be able to find the function (or whatever) you need; it’ll also usually be able to create a code fragment that does the very specific thing you asked about. And, by the way, it’ll save you lots of typing (and debugging) by filling in those fiddly options and so on just how you need them. And even if it doesn’t quite nail it, it’ll have given a skeleton of what you need, that you can then readily edit. (And, yes, the fact that it’s realistic to edit it relies on the fact that Wolfram Language represents it in a way that humans can readily read as well as write.)
What if you’re a novice, who’s never used Wolfram Language before, and never really been exposed to computational thinking, or for that matter, “techie stuff” at all? Well, the remarkable thing is that Notebook Assistant will still be able to help you—a lot. You can ask it something very vague, that doesn’t even seem particularly computational. It does remarkably well at “computationalizing things”. Taking what you’ve said, and finding a way to address it computationally—and to lead you into the kind of computational thinking that’ll be needed for the particular thing you’re trying to do.
In what follows, we’ll see a whole range of different ways to use Notebook Assistant. In fact, even as I’ve been writing this, I’ve discovered quite a few new ways to use it that I’d never thought of before.
There are some general themes, though. The most important is the way Notebook Assistant pivotally relies on the Wolfram Language. In a sense, the main mission of Notebook Assistant is to make things computational. And the whole reason it can so successfully do that is that it has the Wolfram Language as its target. It’s leveraging the unique nature of the Wolfram Language as a full-scale computational language, able to coherently represent abstract and real-world things in a computational way.
One might think that the Wolfram Language would in the end be mainly an “implementation layer”—serving to make what Notebook Assistant produces runnable. But in reality it’s very, very much more than that. In particular, it’s basically the medium—the language—in which computational ideas are communicated. When Notebook Assistant generates Wolfram Language, it’s not just something for the computer to run; it’s also something for humans to read. Yes, Notebook Assistant can produce text, and that’s useful, especially for contextualizing things. But the most concentrated and pointed communication comes in the Wolfram Language it produces. Want the TL;DR? Just look at the Wolfram Language code!
Part of how Wolfram Language code manages to communicate so much so efficiently is that it’s precise. You can just mention the name of a function, and you know precisely what it does. You don’t have to “scaffold” it with text to make its meaning clear.
But there’s something else as well. With its symbolic character—and with all the coverage and consistency that we’ve spent so much effort on over the decades—the Wolfram Language is uniquely able to “communicate in fragments”. Any fragment of Wolfram Language code can be run, and more important, it can smoothly fit into a larger structure. And that means that even small fragments of code that Notebook Assistant generates can be used as building blocks.
So a typical workflow is this: Notebook Assistant produces Wolfram Language code. You read the code (and it’s critical that it’s set up to be read). You figure out if it’s what you want. (And if it’s not, you edit it, or ask Notebook Assistant to do that.) Then you can use that code as a robust building block in whatever structure—large or small—that you might be building.
In practice, a critical feature is that you don’t have to foresee how Notebook Assistant is going to respond to what you asked. It might nail the whole thing. Or it might just take steps in the right direction. But then you just look at what it produced, and decide what to do next. Maybe in the end you’ll have to “break the problem down” to get Notebook Assistant to deal with it. But there’s no need to do that in advance—and Notebook Assistant will often surprise you by how far it’s able to get on its own.
You might imagine that Notebook Assistant would usually need you to break down what you’re asking into “pure computational questions”. But in effect it has good enough “general knowledge” that it doesn’t. And in fact it will usually do better the more context you give it about why you’re asking it to do something. (Is it for chemical engineering, or for sports analytics, or what?)
But how ambitious can what you ask Notebook Assistant be? What if you ask it something “too big”? Yes, it won’t be able to solve that 100-year-old problem or build a giant software system in its immediate output. But it does remarkably well at identifying pieces that it can say something about, and that can help you understand how to get started. So, as with many things about Notebook Assistant, you shouldn’t assume that it won’t be helpful; just try it and see what happens! And, yes, the more you use Notebook Assistant, the more you’ll learn just what kind of thing it does best, and how to get the most out of it.
So how should you ultimately think about Notebook Assistant? Mainly you should think of it like a very knowledgeable and hardworking expert. But at a more mundane level it can serve as a super-enhanced documentation lookup system or code completion system. It can also take something vague you might ask it, and somehow undauntedly find the “closest formalizable construct”—that it can then compute with.
An important feature is that it is—in human terms—almost infinitely patient and hardworking. Where a human might think: “it’s too much trouble to write out all those details”, Notebook Assistant just goes ahead and does it. And, yes, it saves you huge amounts of typing. But, more important, it makes it “cheap” to do things more perfectly and more completely. So that means you actually end up labeling those plot axes, or adding a comment to your code, or coming up with meaningful names for your variables.
One of the overarching points about Notebook Assistant is that it lowers the barrier to getting help. You don’t have to think carefully about formulating your question. You don’t have to go clicking through lots of links. And you don’t have to worry that it’s too trivial to waste a coworker’s time on the question. You can just ask Notebook Assistant. Oh, and it’ll give you a response immediately. (And you can go back and forth with it, and ask it to clarify and refine things.)
At least for me it’s very common: you have something in your mind that you want to do, but you don’t quite know how to achieve it in the Wolfram Language. Well, now you can just ask Notebook Assistant!
I’ll show various examples here. It’s worth emphasizing that these examples typically won’t look exactly the same if you run them again. Notebook Assistant has a certain amount of “AI-style random creativity”—and it also routinely makes use of what you’ve done earlier in a session, etc. It also has to be said that Notebook Assistant will sometimes make mistakes—or will misunderstand what you’re asking it. But if you don’t like what it did, you can always have it regenerate a new response.
Let’s start off with a basic computational operation:
For an experienced user of Wolfram Language, a simple “do it with FoldList” would already have been enough. But Notebook Assistant goes all the way—generating specific code for exactly what I asked. Courtesy of Wolfram Language, the code is very short and easy to read. But Notebook Assistant does something else as well: it produces an example of the code in action—which lets one check that it really does what one wanted. Oh, and then it goes even further, and tells me about a function in the Wolfram Function Repository (that I, for one, had never heard of; wait, did I write it?) that directly does the operation I want.
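(For reference, FoldList is the quintessential “accumulate step by step” function: it keeps the intermediate results of repeatedly applying a function across a list. For instance:)

    FoldList[Plus, 0, {1, 2, 3, 4}]
    (* -> {0, 1, 3, 6, 10} *)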
OK, so that was a basic computational operation. Now let’s try something a little more elaborate:
This involves several steps, but Notebook Assistant nails it, giving a nice example. (And, yes, it’s reading the Wolfram Language documentation, so often its examples are based on that.)
But even after giving an A+ result right at the top, Notebook Assistant goes on, talking about various options and extensions. And despite being (I think) quite an expert on what the Wolfram Language can do, I was frankly surprised by what it came up with; I didn’t know about these capabilities!
There’s an incredible amount of functionality built into the Wolfram Language (yes, four decades worth of it). And quite often things you want to do can be done with just a single Wolfram Language function. But which one? One of the great things about Notebook Assistant is that it’s very good at taking “raw thoughts”, sloppily worded, and figuring out what function you need. Like here, bam, “use LineGraph!”
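(In case you haven’t run into it, LineGraph constructs the graph whose vertices are the edges of a given graph; a tiny example:)

    LineGraph[StarGraph[4]]   (* the line graph of a 4-vertex star is a triangle *)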
You can ask Notebook Assistant “fairly basic” questions, and it’ll respond with nice, synthesized-on-the-spot “custom documentation”:
You can also ask it about obscure and technical things; it knows about every Wolfram Language function, with all its details and options:
Notebook Assistant is surprisingly good at writing quite minimal code that does sophisticated things:
If you ask it open-ended questions, it’ll often answer with what amount to custom-synthesized computational essays:
Notebook Assistant is pretty good at “pedagogically explaining what you can do”:
In everything we’ve seen so far, the workflow is that you ask Notebook Assistant something, then it generates a result, and then you use it. But everything can be much more interactive, and you can go back and forth with Notebook Assistant—say refining what you want it to do.
Here I had something in mind, but I was quite sloppy in describing it. And although Notebook Assistant came up with a reasonable interpretation of what I asked, it wasn’t really what I had in mind:
So I went back and edited what I asked (right there in the Notebook Assistant window), and tried again:
The result was better, but still not right. But all I had to do was to tell it to make a change, and lo and behold, I got what I was thinking of:
By the way, you can also perfectly well ask about deployment to the web:
And while I might have some minor quibbles (why use a string for the molecule name, not "Chemical"; why not use CloudPublish; etc.) what Notebook Assistant produces works, and provides an excellent scaffold for further development. And, as it often does, Notebook Assistant adds a kind of “by the way, did you know?” at the end, showing how one could use ARPublish to produce output for augmented reality.
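The deployment amounts to something of roughly this shape (a schematic reconstruction from the description above, not the Assistant’s literal output): a web form that takes a molecule name as a string and returns a 3D structure:

    CloudDeploy[
      FormPage[{"molecule" -> "String"},
        MoleculePlot3D[Molecule[#molecule]] &],
      Permissions -> "Public"]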
Here’s one last example: creating a user interface element. I want to make a slider-like control that goes around (like an analog clock):
Well, actually, I had in mind something more minimal:
Impressive. Even if maybe it got that from some documentation or other example. But what if I wanted to tweak it? Well, actually, Notebook Assistant does seem to understand what it has:
What we’ve seen so far are a few examples of asking Notebook Assistant to tell us how to do things. But you can also just ask Notebook Assistant to do things for you, in effect producing “finished goods”:
Pretty impressive! And it even just went ahead and made the picture. By the way, if I wanted the code packaged up into a single line, I can just ask for that:
Notebook Assistant can generate interactive content too. And—very usefully—you don’t have to give precise specifications up front: Notebook Assistant will automatically pick “sensible defaults” (that, yes, you can trivially edit later, or just tell Notebook Assistant to change for you):
Here’s an example that requires putting together several different ideas and functions. But Notebook Assistant manages it just fine—and in fact the code it produces is interesting and clarifying to read:
Notebook Assistant knows about every area of Wolfram Language functionality—here synthetic geometry:
And here chemistry:
It also knows about things like the Wolfram Function Repository, here running a function from there that generates a video:
Here’s something that again leverages Notebook Assistant’s encyclopedic knowledge of Wolfram Language capabilities, now pulling in real-time data:
I can’t resist trying a few more examples:
Let’s try something involving more sophisticated math:
(I would have used RegularPolygon[5], and I don’t think DiscretizeRegion is necessary … but what Notebook Assistant did is still very impressive.)
Or here’s some more abstract math:
OK, so Notebook Assistant provides a very powerful way to go from words to computational results. So what then is the role of computational language and of “raw Wolfram Language”? First of all, it’s the Wolfram Language that makes everything we’ve seen here work; it’s what the words are being turned into so that they can be computed from. But there’s something much more than that. The Wolfram Language isn’t just for computers to compute with. It’s also for humans to think with. And it’s an incredibly powerful medium for that thinking. Like a great generalization of mathematical notation from the distant past, it provides a streamlined way to broadly formalize things in computational terms—and to systematically build things up.
Notebook Assistant is great for getting started with things, and for producing a first level of results. But words aren’t ultimately an efficient way to say how to build up from there. You need the crisp, formal structure of computational language, in which even the tiny amounts of code you write can be incredibly powerful.
Now that I’ve been using Notebook Assistant for a while I think I can say that on quite a few occasions it’s helped me launch things, it’s helped me figure out details, and it’s helped me debug things that have gone wrong. But the backbone of my computational progress has been me writing Wolfram Language myself (though quite often starting from something Notebook Assistant wrote). Notebook Assistant is an important new part of the “on ramp” to Wolfram Language; but it’s raw Wolfram Language that lets one really zoom forward to build new structures and achieve what’s computationally possible.
Computational thinking is an incredibly powerful approach. But sometimes it’s hard to get started with, particularly if you’re not used to it. And although one might not imagine it, Notebook Assistant can be very useful here, essentially helping one brainstorm about what direction to take.
I was explaining this to our head of Sales, and tried:
I really didn’t expect this to do anything terribly useful … and I was frankly amazed at what happened. Pushing my luck I tried:
Obviously this isn’t the end of the story, but it’s a remarkably good beginning—going from a vague request to something that’s set up to be thought about computationally.
Here’s another example. I’m trying to invent a good system for finding books in my library. I just took a picture of a shelf of books behind my desk:
Once again, a very impressive result. Not the final answer, but a surprisingly good start. That points me in the direction of image processing and segmentation. At first, it’s running too slowly, so it downsamples the image. Then it tells me I might need to tweak the parameters. So I just ask it to create a tool to do that:
And then:
It’s very impressive how much Notebook Assistant can help one go “from zero to computation”. And when one gets used to using it, it starts to be quite natural to just try it on all sorts of things one’s thinking about. But if the starting point is just “quick, think of something to compute”, it’s usually much harder to come up with anything.
And that reminds me of the very first time I ever saw a computer in real life. It was 1969 and I was 9 years old (and the computer was an IBM mainframe). The person who was showing me the computer asked me: “So what do you want to compute?” I really had no idea at that time “what one might compute”. Rather lamely I said “the weight of a dinosaur”. So, 55 years later, let’s try that again:
And let’s try going further:
Something I find very useful with Notebook Assistant is having it “tweak the details” of something I’ve already generated. For example, let’s say I have a basic plot of a sine curve in a notebook:
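Concretely, say the notebook contains something like this (any basic sine plot will do):

    Plot[Sin[x], {x, 0, 10}]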
Assuming I have that notebook in focus, Notebook Assistant will “see” what’s there. So then I can tell it to modify my sine curve—and what it will do is produce new code with extra details added:
That’s a good result. But as a Wolfram Language aficionado I notice that the code is a bit more complicated than it needs to be. So what can I do about it? Well, I can just ask Notebook Assistant to simplify it:
I can keep going, asking it to further “embellish” the plot:
Let’s push our luck and try going even further:
Oops. Something went wrong. No callouts, and a pink “error” box. I tried regenerating a few times. Often that helps. But this time it didn’t seem to. So I decided to give Notebook Assistant a suggestion:
And now it basically got it. And with a little more back and forth I can expect to get exactly what I want.
In the Wolfram Language, functions (like Plot) are set up to have good automatic defaults. But when you want, for example, to achieve some particular, detailed look, you often have to end up specifying all sorts of additional settings. And Notebook Assistant is very good at doing this, and in effect, patiently typing out all those option settings, etc.
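Here’s the kind of “fully specified” plot one often ends up wanting (a hand-written illustration of the idea; every option here is standard, but typing them all out is exactly the sort of thing it’s nice to delegate):

    Plot[Sin[x], {x, 0, 10},
      PlotStyle -> Directive[Thick, Red],
      Frame -> True,
      FrameLabel -> {"x", "Sin[x]"},
      GridLines -> Automatic,
      PlotLabel -> "A sine curve",
      Filling -> Axis]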
Let’s say you wrote some Wolfram Language (or perhaps Notebook Assistant did it for you). And let’s say it doesn’t work. Maybe it just produces the wrong output. Or maybe it generates all sorts of messages when it runs. Either way, you can just ask the Assistant “What went wrong?”
Here the Assistant rather patiently and clearly explained the message that was generated, then suggested “correct code”:
The Assistant tends to be remarkably helpful in situations like this—even for an experienced Wolfram Language user like me. In a sense, though, it has an “unfair advantage”. Not only has it learned “what’s reasonable” from seeing large amounts of Wolfram Language code; it also has access to “internal information”—like a stream of telemetry about messages that were generated (as well as stack traces, etc.).
In general, Notebook Assistant is rather impressive at “spotting errors” even in long and sophisticated pieces of Wolfram Language code—and in suggesting possible fixes. And I can say that this is a way in which using Notebook Assistant has immediately saved me significant time in doing things with Wolfram Language.
Notebook Assistant doesn’t just know how to write Wolfram Language code; it knows how to write good Wolfram Language code. And in fact if you give it even a sloppy “outline” of Wolfram Language code, the Assistant is usually quite good at making it clean and complete. And that’s important not only in being able to produce code that will run correctly; it’s also important in making code that’s clear enough that you can understand it (courtesy of the readability of good Wolfram Language code).
Here’s an example starting with a rather horrible piece of Wolfram Language code on the right:
The code on the right is quite buggy (it doesn’t initialize list, for example). But Notebook Assistant guesses what it’s supposed to do, and then makes nice “Wolfram style” versions, explaining what it’s doing.
If the code you’re dealing with is long and complicated, Notebook Assistant may (like a person) get confused. But you can always select a particular part, then ask Notebook Assistant specifically about that. And the symbolic nature—and coherence—of the Wolfram Language will typically mean that Notebook Assistant will be able to act “modularly” on the piece that you’ve selected.
Something I’ve found rather useful is to have Notebook Assistant refactor code for me. Here I’m starting from a sequence of separate inputs (yes, itself generated by Notebook Assistant) and I’m turning it into a single function:
Now we can use the function however we want:
Going the other way is useful too. And Notebook Assistant is surprisingly good at grokking what a piece of code is “about”, and coming up with reasonable names for variables, functions, etc.:
Yet another thing Notebook Assistant is good at is knowing all sorts of tricks to make code run faster:
“What does that piece of code actually do?” Good Wolfram Language code—like good prose or good mathematical formalism—can succinctly communicate ideas, in its case in computational terms, precisely grounded in the definition of the language. But (as with prose and math) you sometimes need a more detailed exploration. And providing narrative explanations of code is something else that Notebook Assistant is good at. Here it’s taking a single line of (rather elegant) Wolfram Language code and writing a whole essay about what the code is doing:
What if you have a long piece of code, and you just want to explain some small part of it? Well, since Notebook Assistant sees selections you make, you can just select one part of your code, and Notebook Assistant will know that’s what you want to explain.
The Wolfram Language is carefully designed to have built-in functions that just “do what you need”, without having to use idioms or set up repeated boilerplate. But there are situations where there’s inevitably a certain amount of “bureaucracy” to do. For example, let’s say you’re writing a function to deploy to the Function Repository. You enter the definition for the function into a Function Resource Definition Notebook. But now you have to fill in documentation, examples, etc. And in fact that’s often the part that takes the longest. But now you can ask Notebook Assistant to do it for you. Here I put the cursor in the Examples section:
It’s always a good idea to set up tests for functions you define. And this is another thing Notebook Assistant can help with:
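The tests it writes are ordinary VerificationTest expressions; for a (hypothetical) little function they might look like:

    digitSum[n_Integer] := Total[IntegerDigits[n]]   (* a hypothetical function to test *)

    VerificationTest[digitSum[1234], 10]
    VerificationTest[digitSum[0], 0]
    VerificationTest[digitSum[999], 27]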
All the examples of interacting with Notebook Assistant that we’ve seen so far involve using the Notebook Assistant window, which you can open with the button on the notebook toolbar. But another method involves a second toolbar button, which we’ve been calling the “inspiration button”.
When you use the Notebook Assistant window, the Assistant will always try to figure out what you’re talking about. For example, if you say “Plot that” it’ll use what it knows about what notebook you’re using, and where you are in it, to try to work out what you mean by “that”. But when you use the inspiration button it’ll specifically try to “provide inspiration at your current selection”.
Let’s say you’ve typed Plot[Sin[x]. Press the inspiration button and it’ll suggest a possible completion:
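A plausible completion of the sort it suggests (not necessarily the literal suggestion) is:

    Plot[Sin[x], {x, 0, 2 Pi}]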
After using that suggestion, you can keep going:
You can think of the inspiration button as providing a sophisticated meaning-aware autocomplete.
It also lets you do things like code simplification. Imagine you’ve written the (rather grotesque):
If you want to get rid of the For loops, just select them and press the inspiration button to get a much simpler version:
Want to go even further? Select that result and Notebook Assistant manages to get to a one-liner:
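To give the flavor with a made-up example (not the code shown above): procedural code like the first version below is the kind of thing that can typically be collapsed, in stages, to the one-liner at the end:

    (* "grotesque" procedural version *)
    squares = {};
    For[i = 1, i <= 10, i++, AppendTo[squares, i^2]];
    total = 0;
    For[j = 1, j <= Length[squares], j++, total = total + squares[[j]]];
    total
    (* -> 385 *)

    (* the one-liner it can be reduced to *)
    Total[Range[10]^2]
    (* -> 385 *)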
At some level it seems bizarre. Write a text cell that describes code to follow it. Then start an Input cell, press the inspiration button, and Notebook Assistant will try to magically write that code!
You can go the other way as well. Start with the code, then start a CodeText cell above it, and it’ll “magically” write a caption:
If you start a heading cell, it’ll try to make up a heading:
Start a Text cell, and it’ll try to “magically” write relevant textual content:
You can go even further: just put the cursor underneath the existing content, and press the inspiration button—and Notebook Assistant will start suggesting how you can go on:
As I write this, of course I had to try it: what does Notebook Assistant think I should write next? Here’s what it suggests (and, yes, in this case, those aren’t such bad ideas):
One of the objectives for Notebook Assistant is to have it provide “hassle-free” access to AI and LLM technology integrated into the Wolfram System. And indeed, once you’ve set up your subscription (within your Wolfram Account), everything “just works”. Under the hood, there’s all sorts of technology, servers, etc. But you don’t have to worry about any of that; you can just use Notebook Assistant as a built-in part of the Wolfram Notebook experience.
As you work with Notebook Assistant, you’ll get progressively better intuition about where it can best help you. (And, yes, we’ll be continually updating Notebook Assistant, so it’ll often be worth trying things again if a bit of time has passed.) Notebook Assistant—like any AI-based system—has definite human-like characteristics, including sometimes making mistakes. Often those mistakes will be obvious (e.g. code with incorrect syntax colored red); sometimes they may be more difficult to spot. But the great thing about Notebook Assistant is that it’s firmly anchored to the “solid ground” of Wolfram Language. And any time it writes Wolfram Language code that you can see does what you want, you can always confidently use it.
There are some things that will help Notebook Assistant do its best for you. Particularly important is giving it the best view of the “context” for what you ask it. Notebook Assistant will generally look at whatever has already been said in a particular chat. So if you’re going to change the subject, it’s best to use the button to start a new chat, so Notebook Assistant will focus on the new subject, and not get confused by what you (or it) said before.
When you open the Notebook Assistant chat window you’ll often want to talk about—or refer to—material in some other notebook. Generally Notebook Assistant will assume that the notebook you last used is the one that’s relevant—and that any selection you have in that notebook is the thing to concentrate on the most. If you want Notebook Assistant to focus exclusively on what you’re saying in the chat window, one way to achieve that is to start a blank notebook. Another approach is to use the menu, which provides more detailed control over what material Notebook Assistant will consider. (For now, it just deals with notebooks you have open—but external files, URLs, etc. are coming soon.)
Notebook Assistant will by default store all your chat sessions. You can see your chat history (with chats automatically assigned names by the Assistant) by pressing the History button. You can delete chats from your history here. You can also “pop out” chats, creating standalone notebooks that you can save, send to other people, etc.
So what’s inside Notebook Assistant? It’s quite a tower of technology. The core of its “linguistic interface” is an LLM (actually, several different LLMs)—trained on extensive Wolfram Language material, and with access to a variety of tools, especially Wolfram Language evaluators. Also critical to Notebook Assistant is its access to a variety of RAGs based on vector databases, that it uses for immediate semantic search of material such as Wolfram Language documentation. Oh, and then there’s a lot of technology to connect Notebook Assistant to the symbolic internal structure of notebooks, etc.
So when you use Notebook Assistant, where is it actually running? Its larger LLM tasks are currently running on cloud servers. But a substantial part of its functionality is running right on your computer—using Wolfram Language (notably the Wolfram machine learning framework, vector database system, etc.). And because these things are running locally, the Assistant can request access to local information on your computer—as well as avoiding the latency of accessing cloud-based systems.
Much of the time, you want your interactions with Notebook Assistant to be somehow “off on the side”—say in the Notebook Assistant window, or in the inspiration button menu. But sometimes you want your interactions to be right in your main notebook.
And for this you’ll soon (in Version 14.2) be able to use an enhanced version of the Chat Notebook technology that we developed last year, not just in a separate “Chat Notebook”, but fully integrated into any notebook.
At the beginning of a cell in any notebook, just press the ' key. You get a chat cell that communicates with Notebook Assistant:
And now the output from that chat cell is placed directly below in the notebook—so you can create a notebook that mixes standard notebook content with chat content.
It all works basically just like a fully integrated version of our Chat Notebook technology. (And this functionality is already available in Version 14.1 if you explicitly create a chat notebook with File > New > Chat Notebook.) As in Chat Notebooks, you use a chat break (with ~) to start a new chat within the same notebook. (In general, when you use a chat cell in an ordinary notebook to access Notebook Assistant, the assistant will see only material that occurs before the chat, and within the same chat block.)
In mid-2023 we introduced LLMFunction, LLMSynthesize and related functions (as well as ChatEvaluate, ImageSynthesize, etc.) to let you access LLM functionality directly within the Wolfram Language. Until now these functions required connection to an external LLM provider. But along with Notebook Assistant we’re introducing today LLM Kit—which allows you to access all LLM functionality in the Wolfram Language directly through a subscription within your Wolfram Account.
It’s all very easy: as soon as you enable your subscription, not only Notebook Assistant but also all LLM functionality will just work, going through our LLM service. (And, yes, Notebook Assistant is basically built on top of LLM Kit and the LLM service access it defines.)
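So, for example, once the subscription is enabled, programmatic calls like these just work (typical usage patterns; the outputs will of course vary):

    LLMSynthesize["Give a one-sentence description of cellular automata"]

    summarize = LLMFunction["Summarize the following in ten words or fewer: ``"];
    summarize["The Wolfram Language is a symbolic language built up over four decades..."]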
When you’ve enabled your Notebook Assistant + LLM Kit subscription, this is what you’ll see in the Preferences panel:
Our LLM service is primarily aimed at “human speed” LLM usage, in other words, things like responding to what you ask the Notebook Assistant. But the service also seamlessly supports programmatic things like LLMFunction. And for anything beyond small-scale uses of LLMFunction, etc. you’ll probably want to upgrade from the basic “Essentials” subscription level to the “Pro” level. And if you want to go “industrial scale” in your LLM usage, you can do that by explicitly purchasing Wolfram Service Credits.
Everything is set up to be easy if you use our Wolfram LLM service—and that’s what Notebook Assistant is based on. But for Chat Notebooks and programmatic LLM functionality, our Wolfram Language framework also supports connection to a wide range of external LLM service providers. You have to have your own external subscription to whatever external service you want to use. But once you have the appropriate access key you’ll be able to set things up so that you can pick that LLM provider interactively in Chat Notebooks, programmatically through LLMConfiguration, or in the Preferences panel.
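Programmatically, pointing at an external provider is a matter of an LLMConfiguration; the service and model names below are just placeholders for whatever you happen to have set up:

    conf = LLMConfiguration[<|"Model" -> {"OpenAI", "gpt-4o"}|>];   (* placeholder service/model *)
    LLMSynthesize["Suggest a name for a graph-layout function", LLMEvaluator -> conf]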
(By the way, we’re continually monitoring the performance of different LLMs on Wolfram Language generation; you can see weekly benchmark results at the Wolfram LLM Benchmark Project website—or get the data behind that from the Wolfram Data Repository.)
There’s really never been anything quite like it before: a way of automatically taking what can be quite vague human thoughts and ideas, and making them crisp and structured—by expressing them computationally. And, yes, this is made possible now by the unexpectedly effective linguistic interface that LLMs give us. But ultimately what makes it possible is that the LLMs have a target: the Wolfram Language in all its breadth and depth.
For me it’s an exciting moment. Because it’s a moment where everything we’ve been building these past four decades is suddenly much more broadly accessible. Expert users of Wolfram Language will be able to make use of all sorts of amazing nooks of functionality they never knew about. And people who’ve never used Wolfram Language before—or never even formulated anything computationally—will suddenly be able to do so.
And it’s remarkable what kinds of things one can “make computational”. Let’s say you ask Wolfram Notebook Assistant to make up a story. Like pretty much anything today with LLMs inside, it’ll dutifully do that:
But how can one make something like this computational? Well, just ask Notebook Assistant:
And what it does is rather remarkable: it uses Wolfram Language to create an interactive agent-based computational game version of the story!
Computation is the great paradigm of our times. And the development of “computational X” for all X seems destined to be the future of pretty much every field. The whole tower of ideas and technology that is the modern Wolfram Language was built precisely to provide the computational language that is needed. But now Notebook Assistant is dramatically broadening access to that—making it possible to get “computational language superpowers” using just ordinary (and perhaps even vague) natural language.
And even though I’ve now been living the computational language story for more than four decades, Notebook Assistant keeps on surprising me with what it manages to make computational. It’s incredibly powerful to be able to “go computational”. And even if you can’t imagine how it could work in what you’re doing, you should still just try it! Notebook Assistant may well surprise you—and in that moment show you a path to leverage the great power of the computational paradigm in ways that you’d never imagined.
2024-12-06 07:13:27
This is a follow-on to Why Does Biological Evolution Work? A Minimal Model for Biological Evolution and Other Adaptive Processes [May 3, 2024].
A few months ago I introduced an extremely simple “adaptive cellular automaton” model that seems to do remarkably well at capturing the essence of what’s happening in biological evolution. But over the past few months I’ve come to realize that the model is actually even richer and deeper than I’d imagined. And here I’m going to describe some of what I’ve now figured out about the model—and about the often-surprising things it implies for the foundations of biological evolution.
The starting point for the model is to view biological systems in abstract computational terms. We think of an organism as having a genotype that’s represented by a program, that’s then run to produce its phenotype. So, for example, the cellular automaton rules on the left correspond to a genotype, which is then run to produce the phenotype on the right (starting from a “seed” of a single red cell):
The key idea in our model is to adaptively evolve the genotype rules—say by making single “point mutations” to the list of outcomes from the rules:
At each step in the adaptive evolution we “accept” a mutation if it leads to a phenotype that has a higher—or at least equal—fitness relative to what we had before. So, for example, taking our fitness function to be the height (i.e. lifetime) of the phenotype pattern (with patterns that are infinite being assigned zero fitness), a sequence of (randomly chosen) adaptive evolution steps that go from the null rule to the rule above might be:
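To make the setup concrete, here is a minimal Wolfram Language sketch of the adaptive loop just described. It uses general k = 2, r = 2 rules encoded as 32-bit rule numbers, so a point mutation flips one bit of the outcome table (the symmetric rules in the pictures flip mirror-image pairs together), and it uses lifetime (height) as the fitness; the helper names caLifetime, mutate and adaptiveEvolve are just for this sketch:

    (* fitness: lifetime of the pattern grown from a single cell; patterns that
       haven't died out by tmax are treated as infinite and get fitness 0 *)
    caLifetime[rule_Integer, tmax_ : 200] :=
      Module[{evol = CellularAutomaton[{rule, 2, 2}, {{1}, 0}, tmax], dead},
        dead = Position[Total /@ evol, 0];
        If[dead === {}, 0, dead[[1, 1]] - 1]]

    (* a single point mutation: flip one bit of the rule's outcome table *)
    mutate[rule_Integer] := BitXor[rule, 2^RandomInteger[{0, 31}]]

    (* accept a mutation only if the fitness doesn't decrease *)
    adaptiveEvolve[rule0_Integer, nsteps_Integer] :=
      FoldList[
        Function[{state, step},
          Module[{new = mutate[First[state]], f},
            f = caLifetime[new];
            If[f >= Last[state], {new, f}, state]]],
        {rule0, caLifetime[rule0]}, Range[nsteps]]

    (* run one random adaptive sequence from the null rule and show the final phenotype *)
    final = First[Last[adaptiveEvolve[0, 500]]];
    ArrayPlot[CellularAutomaton[{final, 2, 2}, {{1}, 0}, caLifetime[final]]]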
What if we make a different sequence of randomly chosen adaptive evolution steps? Here are a few examples of what happens—each in a sense “using a different idea” for how to achieve high fitness:
And, yes, one can’t help but be struck by how “lifelike” this all looks—both in the complexity of these patterns, and in their diversity. But what is ultimately responsible for what we’re seeing? It’s long been a core question about biological evolution. Are the forms it produces the result of careful “sculpting” by the environment (and by the fitness functions it implies)—or are their most important features somehow instead a consequence of something more intrinsic and fundamental that doesn’t depend on details of fitness functions?
Well, let’s say we pick a different fitness function—for example, not the height of a phenotype pattern, but instead its width (or, more specifically, the width of its bounding box). Here are some results of adaptive evolution in this case:
And, yes, the patterns we get are now ones that achieve larger “bounding box width”. But somehow there’s still a remarkable similarity to what we saw with a rather different fitness function above. And, for example, in both cases, high fitness, it seems, is normally achieved in a complicated and hard-to-understand way. (The last pattern is a bit of an exception; as can also happen in biology, this is a case where for once there’s a “mechanism” in evidence that we can understand.)
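In the sketch above, switching to this fitness function is just a matter of swapping in a different fitness helper (again a sketch, reusing caLifetime from before; to use it, replace caLifetime with caWidth inside adaptiveEvolve):

    (* fitness: width of the bounding box of the (finite) pattern *)
    caWidth[rule_Integer, tmax_ : 200] :=
      Module[{evol = CellularAutomaton[{rule, 2, 2}, {{1}, 0}, tmax]},
        If[caLifetime[rule, tmax] == 0, 0,              (* unbounded patterns: fitness 0 *)
          Count[Total[evol], _?Positive]]]              (* number of columns that are ever nonzero *)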
So what in the end is going on? As I discussed when I introduced the model a few months ago, it seems that the “dominant force” is not selection according to fitness functions, but instead the fundamental computational phenomenon of computational irreducibility. And what we’ll find here is that in fact what we see is, more than anything, the result of an interplay between the computational irreducibility of the process by which our phenotypes develop, and the computational boundedness of typical forms of fitness functions.
The importance of such an interplay is something that’s very much come into focus as a result of our Physics Project. And indeed it now seems that the foundations of both physics and mathematics are—more than anything—reflections of this interplay. And now it seems that’s true of biological evolution as well.
In studying our model, there are many detailed phenomena we’ll encounter—most of which seem to have surprisingly direct analogs in actual biological evolution. For example, here’s what happens if we plot the behavior of the fitness function for our first example above over the course of the adaptive evolution process:
We see a sequence of “plateaus”, punctuated by jumps in fitness that reflect some “breakthrough” being made. In the picture, each red dot represents the fitness associated with a genotype that was tried. Many fall below the line of “best results so far”. But there are also plenty of red dots that lie right on the line. And these correspond to genotypes that yield the same fitness that’s already been achieved. But here—as in actual biological evolution—it’s important that there can be “fitness-neutral evolution”, where genotypes change, but the fitness does not. Usually such changes of genotype yield not just the same fitness, but also the exact same phenotype. Sometimes, however, there can be multiple phenotypes with the same fitness—and indeed this happens at one stage in the example here
and at multiple stages in the second example we showed above:
In the previous section we saw examples of the results of a few particular random sequences of mutations. But what if we were to look at all possible sequences of mutations? As I discussed when I introduced the model, it’s possible to construct a multiway graph that represents all possible mutation paths. Here’s what one gets for symmetric k = 2, r = 2 rules—starting from the null rule, and using height as a fitness function:
The way this graph is constructed, there are arrows from a given phenotype to all phenotypes with larger (finite) height that can be reached by a single mutation.
But what if our fitness function is width rather than height? Well, then we get a different multiway graph in which arrows go to phenotypes not with larger height but instead with larger width:
So what’s really going on here? Ultimately one can think of there being an underlying graph (that one might call the “mutation graph”) in which every edge represents a transformation between two phenotypes that can be achieved by a single mutation in the underlying genotype:
At this level, the transformations can go either way, so this graph is undirected. But the crucial point is that as soon as one imposes a fitness function, it defines a particular direction for each transformation (at least, each transformation that isn’t fitness neutral for this fitness function). And then if one starts, say, from the null rule, one will pick out a certain “evolution cone” subgraph of the original mutation graph.
So, for example, with width as the fitness function, the subgraph one gets is what’s highlighted here:
There are several subtleties here. First, we simplified the multiway graph by doing transitive reduction and drawing only the minimal edges necessary to define the connectivity of the graph. If we want to see all possible single-mutation transformations between phenotypes we need to do transitive completion, in which case for the width fitness function the multiway graph we get is:
But now there’s another subtlety. The edges in the multiway graph represent fitness-changing transformations. But there are also fitness-neutral transformations. And occasionally these can even lead to different (though equal-fitness) phenotypes, so that really each node in the graph above (say, the transitively reduced one) should sometimes be associated with multiple phenotypes
which can “fitness neutrally” transform into each other, as in:
But even this isn’t the end of the subtleties. Fitness-neutral sets typically contain many genotypes differing by changes of rule cases that don’t affect the phenotype they produce. But it may be that just one or a few of these genotypes are “primed” to be able to generate another phenotype with just one additional mutation. Or, in other words, each node in the multiway graph above represents a whole class of genotypes “equivalent under fitness-neutral transformations”, and when we draw an arrow it indicates that some genotype in that class can be transformed by a single mutation to some genotype in the class associated with a different phenotype:
But beyond the subtleties, the key point is that particular fitness functions in effect just define particular orderings on the underlying mutation graph. It’s somewhat like choices of reference frames or families of simultaneity surfaces in physics. Different choices of fitness function in effect define different ways in which the underlying mutation graph can be “navigated” by evolution over the course of time.
As it happens, the results are not so different between height and width fitness functions. Here’s a combined multiway graph, indicating transformations variously allowed by these different fitness functions:
Homing in on a small part of this graph, we see that there are different “flows” associated with maximizing height and maximizing width:
With a single fitness function that for any two phenotypes systematically treats one phenotype as fitter than another, the multiway graph must always define a definite flow. But as soon as one considers changing fitness functions in the course of evolution, it’s possible to get cycles in the multiway graph, as in the example above—so that, in effect, “evolution can repeat itself”.
We’ve looked at fitness functions based on maximizing height and on maximizing width. But what if we try to combine these? Here’s a plot of the widths and heights of all phenotypes that occur in the symmetric k = 2, r = 2 case we studied above:
We could imagine a variety of ways to define “fitness frontiers” here. But as a specific example, let’s consider fitness functions that are based on trying to achieve specific aspect ratios—i.e. phenotypes that are as close as possible to a particular constant-aspect-ratio line in the plot above.
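In terms of the earlier sketch (which uses the general k = 2, r = 2 rule encoding rather than the symmetric rules used here), one way to set up such a fitness function is as minus the distance of a phenotype’s aspect ratio from the target; the precise form isn’t critical, and this is just one plausible choice:

    (* higher is better: closeness of width/height to the target aspect ratio;
       dead or unbounded patterns get the worst possible fitness *)
    caAspectFitness[rule_Integer, target_?NumericQ, tmax_ : 200] :=
      Module[{h = caLifetime[rule, tmax], w},
        If[h == 0, -Infinity,
          w = caWidth[rule, tmax];
          -Abs[N[w/h] - target]]]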
With the symmetric k = 2, r = 2 rules we’re using here, only a certain set of aspect ratios can ever be obtained:
The corresponding phenotypes (with their aspect ratios) are:
As we change the aspect ratio that we’re trying to achieve, the evolution multiway graph will change:
In all cases we’re starting from the null rule. For target aspect ratio 1.0 this rule itself already achieves that aspect ratio—so the multiway graph in that case is trivial. But in general, different aspect ratios yield evolution multiway graphs that are different subgraphs of the complete mutation graph we saw above.
So if we follow all possible paths of evolution, how close can we actually get to any given target aspect ratio? This plot shows what final aspect ratios can be achieved as a function of target aspect ratio:
And in a sense this is a summary of the effect of “developmental constraints” for “adaptive cellular automaton organisms” like this. If there were no constraints then for every target aspect ratio it’d be possible to get an “organism” with that aspect ratio—so in the plot there’d be a point lying on the red line. But in actuality the process of cellular automaton growth imposes constraints—that in particular allows only certain phenotypes, with certain aspect ratios, to exist. And beyond that, which phenotypes can actually be reached by adaptive evolution depends on the evolution multiway graph, with “different turns” on the graph leading to different fitness (i.e. different aspect ratio) phenotypes.
But what the plot above shows overall is that for a certain range of target aspect ratios, adaptive evolution is successfully able to get at least close to those aspect ratios. If the target aspect ratio gets out of that range, however, “developmental constraints” come in that prevent the target from being reached.
With “larger genomes”, i.e. rules with larger numbers of cases to specify, it’s possible to do better, and to more accurately achieve particular aspect ratios, over larger ranges of values. And indeed we can see some version of this effect even for symmetric k = 2, r = 2 rules by plotting aspect ratios that can be achieved as a function of the number of cases that need to be specified in the rule:
As an alternative visualization, we can plot the “best convergence to the target” as a function of the number of rule cases—and once again we see that larger numbers of rule cases let us get closer to target aspect ratios:
It’s worth mentioning that—just as we discussed for height and width fitness functions above—there are subtleties here associated with fitness-neutral sets. For example, here are sets of phenotypes that all have the specified aspect ratios—with phenotypes that can be reached by single point mutations being joined:
In the evolution multiway graphs above, we included only one phenotype for each fitness-neutral set. But here’s what we get for target aspect ratio 0.7 if we show all phenotypes with a given fitness:
Note that on the top line, we don’t just get the null rule. Instead, we get four phenotypes, all of which, like the null rule, have aspect ratio 1, and so are equally far from the target aspect ratio 0.7.
The picture above is only the transitively reduced graph. But if we include all possible transformations associated with single point mutations, we get instead:
Based on this graph, we can now make what amounts to a foliation, showing collections of phenotypes reached by a certain minimum number of mutations, progressively approaching our target aspect ratio (here 0.7):
Here’s what we get from the range of target aspect ratios shown above (where, as above, “terminal phenotypes” are highlighted):
In a sense these sequences show us what phenotypes can appear at progressive stages in the “fossil record” for different (aspect-ratio) fitness functions in our very simple model. The highlighted cases are “evolutionary dead ends”. The others can evolve further.
Our model takes the process of adaptive evolution to never “go backwards”, or, in other words, to never evolve from a particular genotype to one with lower fitness. But this means that starting with a certain genotype (say the null rule) there may be genotypes (and hence phenotypes) that will never be reached.
With height as a fitness function, there are just two single (“orphan”) phenotypes that can’t be reached:
And with width as the fitness function, it turns out the very same phenotypes also can’t be reached:
But if we use a fitness function that, for example, tries to achieve aspect ratio 0.7, we get many more phenotypes that can’t be reached starting from the null rule:
In the original mutation graph all the phenotypes appear. But when we foliate (or, more accurately, order) that graph using a particular fitness function, some phenotypes become unreachable by evolutionarily-possible transformations—in a rough analogy to the way some events in physics can become unreachable in the presence of an event horizon.
So far we’ve discussed multiway graphs here only for symmetric k = 2, r = 2 rules. There are a total of 524,288 (= 2^19) possible such rules, producing 77 distinct phenotypes. But what about larger classes of rules? As an example, we can consider all k = 2, r = 2 rules, without the constraint of symmetry. There are 2,147,483,648 (= 2^31) possible such rules, and there turn out to be 3137 distinct phenotypes.
For the height fitness function, the complete multiway graph in this case is
or, annotated with actual phenotypes:
If instead we just show bounding boxes, it’s easier to see where long-lifetime phenotypes occur:
With a different graph layout the evolution multiway graph (with initial node indicated) becomes:
One subtlety here is that the null rule has no successors under single point mutations. When we were talking about symmetric k = 2, r = 2 rules, we took a “single point mutation” always to change both a particular rule case and its mirror image. But if we don’t have the symmetry requirement, a single point mutation really can just change a single rule case. And if we start from the null rule and look at the results of changing just one bit (i.e. the output of just one rule case) in all possible ways we find that we either get the same pattern as with the null rule, or we get a pattern that grows without bound:
Or, put another way, we can’t get anywhere with single bit mutations starting purely from the null rule. So what we’ve done is instead to start our multiway graph from k = 2, r = 2 rule 20, which has two bits “on”, and gives phenotype:
But starting from this, just one mutation (together with a sequence of fitness-neutral mutations) is sufficient to give 94 phenotypes—or 49 after removing mirror images:
The total number of new phenotypes we can reach after successively more (non-fitness-neutral) mutations is
while the successive longest-lifetime patterns are:
And what we see here is that it’s in principle possible to achieve long lifetimes even with fairly few mutations. But when the mutations are done at random, it can still take a very large number of steps to successfully “random walk” to long lifetime phenotypes.
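As a rough sketch of how one can explore this computationally (at the genotype level, following only fitness-increasing single-bit mutations, and without the collapsing of fitness-neutral sets into single phenotype nodes used for the graphs shown here), one can grow a graph outward from rule 20 with NestGraph, reusing caLifetime from above:

    (* successors: all single-bit mutants with strictly larger (finite) lifetime *)
    successors[rule_Integer] :=
      Module[{f0 = caLifetime[rule]},
        Select[Table[BitXor[rule, 2^i], {i, 0, 31}], caLifetime[#] > f0 &]]

    (* two rounds of fitness-increasing mutations starting from rule 20 *)
    g = NestGraph[successors, 20, 2]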
And out of a total of 2407 distinct phenotypes, 984 are “dead ends” where no further evolution is possible. Some of these dead ends have long lifetimes
but others have very short lifetimes:
There’s much more to explore in this multiway graph—and we’ll continue a bit below. But for now let’s look at another evolution multiway graph of accessible size: the one for symmetric k = 3, r = 1 rules:
Showing only bounding boxes of patterns this becomes:
Unlike the k = 2, r = 2 case, we can now start this whole graph with the null rule. However, if we look at all possible symmetric k = 3, r = 1 rules, there turn out to be 6 “isolates” that can’t be reached from the null rule by adaptive evolution with the height fitness function:
Starting from the null rule, the number of phenotypes reached after successively more (non-fitness-neutral) mutations is
and the successive longest-lived of these phenotypes are:
Just as we looked at fitness functions based on aspect ratio above for symmetric k = 2, r = 2 rules, so now we can do this for the whole space of all possible k = 2, r = 2 rules. Here’s a plot of the heights and widths of patterns that can be achieved with these rules:
These are the possible aspect ratios this implies:
And here’s their distribution (on a log scale):
The range of possible values extends much further than for symmetric k = 2, r = 2 rules. The patterns now with the largest aspect ratios are
while those with the smallest aspect ratios are:
Note that just as for symmetric k = 2, r = 2 rules, to reach a wider range of aspect ratios, more cases in the rule have to be specified:
So what happens if we use adaptive evolution to try to reach different possible target aspect ratios? Most of the time (at least up to aspect ratio ≈ 3) there’s some sequence of mutations that will do it—though often we can get stuck at a different aspect ratio:
If we look at the “best convergence” to a given target aspect ratio then we see that this improves as we increase the number of cases specified in the rule:
So what does the multiway graph look like for a fitness function associated with a particular aspect ratio? Here’s the result for aspect ratio 3:
The initial node involves patterns with aspect ratio 1—actually a fitness-neutral set of 263 of them. And as we go through the multiway graph, the aspect ratios get nearer to 3. The very closest they get, though, are for the patterns (whose locations are indicated on the graph):
But actually (as we saw in the lineup above), there is a rule that gives aspect ratio exactly 3:
But it turns out that this rule can’t be reached by adaptive evolution using single point mutations. In effect, adaptive evolution isn’t “strong enough” to achieve the exact aspect ratio we want; we can think of it as being “unpredictably prevented” by computationally irreducible “developmental constraints”.
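For reference, here is a minimal sketch of the kind of aspect-ratio measurement such fitness functions are based on. Taking the aspect ratio to be the width of the pattern’s bounding box divided by its height, and capping the run at maxSteps, are both assumptions of this sketch:

    patternAspectRatio[rule_, maxSteps_ : 500] := Module[{evol, pos},
      evol = CellularAutomaton[{rule, 2, 2}, {{1}, 0}, maxSteps];
      If[FreeQ[Total /@ evol, 0], Indeterminate,   (* pattern did not die out within maxSteps *)
        pos = Position[evol, 1];                   (* {step, column} positions of all black cells *)
        N[(Max[pos[[All, 2]]] - Min[pos[[All, 2]]] + 1)/
          (Max[pos[[All, 1]]] - Min[pos[[All, 1]]] + 1)]]]

A fitness function targeting a particular aspect ratio r0 could then be taken to be, for example, -Abs[patternAspectRatio[rule] - r0] (one of many possible choices).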
OK, so what about the symmetric k = 3, r = 1 rules? Here’s how they’re distributed in width and height:
And, yes, in a typical “there are always surprises” story, there’s a strange height 265, width 173 pattern that shows up:
The overall possible aspect ratios are now
and their (log) distribution is:
The phenotypes with the largest aspect ratios are
while those with the smallest aspect ratios are:
Once again, to reach a larger range of aspect ratios, one has to specify more cases in the rule:
If we try to target a certain aspect ratio, there’s somewhat more of a tendency to get stuck than for k = 2, r = 2 rules—perhaps somewhat as a result of there now being fewer total rules (though more phenotypes) available:
Looking at a typical multiway evolution graph such as
we see that different phenotypes can be quite separated in the graph—a bit like organisms on different branches of the tree of life in actual biology. But how can we characterize this separation? One approach is to compute the so-called dominator tree of the graph:
We can think of this as a way to provide a map of the least common ancestors of all nodes. The tree is set up so that given two nodes you just trace up the tree to find their common ancestor. Another interpretation of the tree is that it shows you what nodes you have no choice but to pass through in getting from the initial node to any given node—or, in other words, what phenotypes adaptive evolution has to produce on the way to a given phenotype.
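Such a tree can be computed directly from the definition. Here is a brute-force sketch (adequate for graphs of the sizes shown here, though not the efficient algorithm one would use for large graphs): a node d dominates v exactly when deleting d disconnects v from the initial node, and the immediate dominator of v is the deepest of its dominators:

    dominators[g_, root_, v_] := Prepend[
      Select[DeleteCases[VertexList[g], root | v],
        GraphDistance[VertexDelete[g, #], root, v] === Infinity &], root]

    immediateDominator[g_, root_, v_] :=
      First @ MaximalBy[dominators[g, root, v], Length @ dominators[g, root, #] &]

    dominatorTree[g_, root_] := Graph[
      DirectedEdge[immediateDominator[g, root, #], #] & /@ DeleteCases[VertexList[g], root]]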
Here’s another rendering of the tree:
We can think of this as the analog of the biological tree of life, with successive branchings picking out finer and finer “taxonomic domains” (analogous to kingdoms, phyla, etc.)
The tree also shows us something else: how significant different links or nodes are—and how much of the tree one would “lop off” if they were removed. Or, put a different way, how much would be achieved by blocking a certain link or node—as one might imagine doing to try to block the evolution of bacteria or tumor cells?
What if we look at larger multiway evolution graphs, like the complete k = 2, r = 2 one? Once again we can construct a dominator tree:
It’s notable that there’s tremendous variance in the “fan out” here, with the phenotypes with largest successor counts being the rather undistinguished:
But what if one’s specifically trying to reach, say, one of the maximum lifetime (length 308) phenotypes? Well, then one has to follow the paths in a particular subgraph of the original multiway evolution graph
corresponding to the phenotype graph:
If one goes off this “narrow path” then one simply can’t reach the length-308 phenotype; one inevitably gets stuck in what amounts to another branch of the analog of the “tree of life”. So if one is trying to “guide evolution” to a particular outcome, this tells one that one needs to block off lots of “exit ramps”.
But what “fraction of the whole graph” is the subgraph that leads to the length-308 phenotype? The whole graph has 2409 vertices and 3878 edges, while the subgraph has 64 vertices and 119 edges, i.e. in both cases about 3%. A different measure is what fraction of all paths through the graph lead to the length-308 phenotype. The total number of paths is 606,081, while the number leading to the length-308 phenotype is 1260, or about 0.2%. Does this tell us what the probability of reaching that phenotype will be if we just make a random sequence of mutations? Not quite, because in the multiway evolution graph many equivalencings have been done, notably for fitness-neutral sets. And if we don’t do such equivalencings, it turns out (as we’ll discuss below) that the corresponding number is significantly smaller—about 0.007%.
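Path counts like these can be found without enumerating paths one by one. For an acyclic graph with adjacency matrix a, the entries of Inverse[IdentityMatrix[n] - a] give the total number of directed paths between each pair of vertices (since the kth matrix power of a counts paths of length k). Here is a sketch, assuming the multiway evolution graph g is acyclic:

    pathCount[g_, s_, t_] := Module[{a = Normal @ AdjacencyMatrix[g]},
      Inverse[IdentityMatrix[VertexCount[g]] - a][[VertexIndex[g, s], VertexIndex[g, t]]]]

The fractions quoted above then come from ratios of such counts (summed over the appropriate endpoints).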
The fitness functions we’ve been considering so far look only at coarse features of phenotype patterns—like their height, width and aspect ratio. But what happens if we have a fitness function that’s maximal only for a phenotype that exactly matches a particular pattern?
As an example, let’s consider k = 2, r = 1 cellular automata with phenotypes grown for a specific number of steps—and with a fitness function that counts the number of cells that agree with ones in a target:
Let’s say we start with the null rule, then adaptively evolve by making single point mutations to the rule (here just 8 bits). With a target of the rule 30 pattern, this is the multiway graph we get:
And what we see is that after a grand tour of nearly a third of all possible rules, we can successfully reach the rule 30 pattern. But we can also get stuck at rule 86 and rule 190 patterns—even though their fitness values are much lower:
If we consider all possible k = 2, r = 1 cellular automaton patterns as targets, it turns out that these can always be reached by adaptive evolution from the null rule—though a little less than half the time there are other possible endpoints (here specified by rule numbers) at which the evolution process can get stuck:
So far we’ve been assuming that we have a fitness function that’s maximized by matching some pattern generated by a cellular automaton pattern. But what if we pick some quite different pattern to match against? Say our pattern is:
With k = 2, r = 1 rules (running with wraparound in a finite-size region), we can construct a multiway graph
and find out that the maximum fitness endpoints are the not-very-good approximations:
We can also get to these by applying random mutations:
But what if we try a larger rule space, say k = 2, r = 2 rules? Our approximations to the “A” image get a bit better:
Going to k = 2, r = 3 leads to slightly better (but not great) final approximations:
If we try to do the same thing with our target instead being
we get for example
while with target
we get (even less convincing) results like:
What’s going on here? Basically it’s that if we try to set up too intricate a fitness function, then our rule spaces won’t contain rules that successfully maximize it, and our adaptive evolution process will end up with a variety of not-very-good approximations.
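Computationally, the fitness function used here is simple to state: run the candidate rule in a cyclic (wraparound) region the width of the target, for as many steps as the target has rows, and count agreeing cells. Here is a minimal sketch for plain elementary (k = 2, r = 1) rules; the single-cell initial condition and the representation of the target as a 0/1 matrix are assumptions of the sketch:

    matchFitness[rule_, target_] := Module[{w = Length @ First @ target, init, evol},
      init = ReplacePart[ConstantArray[0, w], Ceiling[w/2] -> 1];  (* single black cell *)
      evol = CellularAutomaton[rule, init, Length[target] - 1];    (* explicit list => cyclic boundary *)
      Total[1 - Abs[evol - target], 2]]                            (* number of agreeing cells *)

For the larger rule spaces one would use a rule specification like {rule, 2, 2} or {rule, 2, 3} in place of the plain elementary rule number.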
When one looks at an evolution process like
one typically has the impression that successive phenotypes are achieving greater fitness by somehow progressively “building on the ideas” of earlier ones. And to get a more granular sense of this we can highlight cells at each step that are using “newly added cases” in the rule:
We can think of new rule cases as a bit like new genes in biology. So what we’re seeing here is the analog of new genes switching on (or coming into existence) as we progress through the process of biological evolution.
Here’s what happens for some other paths of evolution:
What we see is quite variable. There are a few examples where new rule cases show up only at the end, as if a new “incrementally engineered” pattern was being “grafted on at the end”. But most of the time new rule cases show up sparsely dotted all over the pattern. And somehow those few “tweaks” lead to higher fitness—even though there’s no obvious reason why, and no obvious way to predict where they should be.
It’s interesting to compare this with actual biology, where it’s pretty common to see what appear to be “random gratuitous changes” between apparently very similar organisms. (And, yes, this can lead to all sorts of problems in things like comparing toxicity or drug effectiveness in model animals versus humans.)
There are many ways to consider quantitatively characterizing how “rule utilization” builds up. As just one example, here are plots for successive phenotypes along the evolution paths shown above of what stages in growth new rule cases show up:
Here are two “adaptively evolved” long-lifetime rules that we discussed at the beginning:
We can always run these rules and see what patterns they produce. But is there a way to explain what they do? And for example to analyze how they manage to yield lifetimes? Or is what we’re seeing in these rules basically “pure computational irreducibility” where the only way to tell what patterns they will generate—and how long they’ll live—is just explicitly to run them step by step?
The second rule here seems to have a bit more regularity than the first, so let’s tackle it first. Let’s look at the “blade” part. Once such an object—of any width—has formed, its behavior will basically be repetitive, and it’s easy to predict what will happen:
The left-hand edge moves by 1 position every 7 steps, and the right-hand edge by 4 positions every 12 steps. And since these two speeds are different, however wide the initial configuration is, it’ll always die out, after a number of steps roughly proportional to the initial width.
But OK, how does a configuration like this get produced? Well, that’s far from obvious. Here’s what happens with a sequence of few-cell initial conditions:
So, yes, it doesn’t always directly make the “blade”. Sometimes, for example, it instead makes things like these, some of which basically just become repetitive, and live forever:
And even if it starts with a “blade texture” unexpected things can happen:
There are repetitive patterns that can persist—and indeed the “blade” uses one of these:
Starting from a random initial condition one sees various kinds of behavior, with the blade being fairly common:
But none of this really makes much of a dent in “explaining” why with this rule, starting from a single red cell, we get a long-lived pattern. Yes, once the “blade” forms, we know it’ll take a while to come to a point. But beyond this little pocket of computational reducibility we can’t say much in general about what the rule does—or why, for example, a blade forms with this initial condition.
So what about our other rule? There’s no obvious interesting pocket of reducibility there at all. Looking at a sequence of few-cell initial conditions we get:
And, yes, there’s all sorts of different behavior that can occur:
The first of these patterns is basically periodic, simply shifting 2 cells to the left every 56 steps. The third one dies out after 369 steps, and the fourth one becomes basically periodic (with period 56) after 1023 steps:
If we start from a random initial condition we see a few places where things die out in a repeatable pattern. But mostly everything just looks very complicated:
As always happens, the rule supports regions of repetitive behavior, but they don’t normally extend far enough to introduce any significant computational reducibility:
So what’s the conclusion? Basically it’s that these rules—like pretty much all others we’ve seen here—behave in essentially computationally irreducible ways. Why do they have long lifetimes? All we can really say is “because they do”. Yes, we can always run them and see what happens. But we can’t make any kind of “explanatory theory”, for example of the kind we’re used to in mathematical approaches to physics.
We can think of the pattern of growth seen in each phenotype as defining what we might call in biology its “morphology”. So what happens if we try to operate as “pure taxonomists”, laying out different phenotypes in “morphospace”? Here’s a result based on using machine learning and FeatureSpacePlot:
And, yes, this tends to group “visually similar” phenotypes together. But how does proximity in morphospace relate to proximity in genotypes? Here is the same arrangement of phenotypes as above, but now indicating the transformations associated with single mutations in genotype:
If for example we consider maximizing for height, only some of the phenotypes are picked out:
For width, a somewhat different set are picked out:
And here is what happens if our fitness function is based on aspect ratio:
In other words, different fitness functions “select out” different regions in morphospace.
We can also construct a morphospace not just for symmetric but for all k = 2, r = 2 rules:
The detailed pattern here is not particularly significant, and, more than anything, just reflects the method of dimension reduction that we’ve used. What is more meaningful, however, is how different fitness functions select out different regions in morphospace. This shows the results for fitness functions based on height and on width—with points colored according to the actual values of height and width for those phenotypes:
Here are the corresponding results for fitness functions based on different aspect ratios, where now the coloring is based on closeness to the target aspect ratio:
What’s the main conclusion here? We might have expected that different fitness functions would cleanly select visibly different parts of morphospace. But at least with our machine-learning-based way of laying out morphospace that’s not what we’re seeing. And it seems likely that this is actually a general result—and that there is no layout procedure that can make any “easy to describe” fitness function “geometrically simple” in morphospace. And once again, this is presumably a consequence of underlying computational irreducibility—and to the fact that we can’t expect any morphospace layout procedure to be able to provide a way to “untangle the irreducibility” that will work for all fitness functions.
In what we’ve done so far, we’ve mostly been concerned with things like what sequences of phenotypes can ever be produced by adaptive evolution. But in making analogies to actual biological evolution—and particularly to how it’s captured in the fossil record—it’s also relevant to discuss time, and to ask not only what phenotypes can be produced, but also when, and how frequently.
For example, let’s assume there’s a constant rate of point mutations in time. Then starting from a given rule (like the null rule) there’ll be a certain rate at which transitions to other rules occur. Some of these transitions will lead to rules that are selected out. Others will be kept, but will yield the same phenotype. And still others will lead to transitions to different phenotypes.
We can represent this by a “phenotype transition diagram” in which the thickness of each outgoing edge from a given phenotype indicates the fraction of all possible mutations that lead to the transition associated with that edge:
Gray self-loops in this diagram represent transitions that lead back to the same phenotype (because they change cases in the rule that don’t matter). Pink self-loops correspond to transitions that lead to rules that are selected out. We don’t show rules that have been selected out here; instead we assume that in this case we just “wait at the original phenotype” and don’t make a transition.
We can annotate the whole symmetric k = 2, r = 2 multiway evolution graph with transition probabilities:
Underlying this graph is a matrix of transition probabilities between all 2^19 possible symmetric k = 2, r = 2 rules:
Keeping only distinct phenotypes and ordering by lifetime, we can then make a matrix of phenotype transition probabilities:
Treating the transitions as a Markov process allows us to compute the expected frequency of each phenotype as a function of time (i.e. number of mutations):
What’s basically happening here is that there’s steady evolution away from the single-cell phenotype. There are some intermediate phenotypes that come and go, but in the end, everything “flows” to the final (“leaf”) phenotypes on the multiway evolution graph—leading to a limiting “equilibrium” probability distribution:
Stacking the different curves, we get an alternative visualization of the evolution of phenotype frequencies:
If we were “running evolution” with enough separate individuals, these would be the limiting curves we’d get. If we reduced the number of individuals, we’d start to see fluctuations—and there’d be a certain probability, for example, for a particular phenotype to end up with zero individuals, and effectively go extinct.
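Given the matrix of phenotype transition probabilities, these curves are straightforward to compute. Here is a minimal sketch, in which m is assumed to be the row-stochastic phenotype transition matrix (self-loops included) and p0 the initial distribution, concentrated on the starting phenotype:

    phenotypeFrequencies[m_, p0_, tmax_] := NestList[# . m &, p0, tmax]  (* distributions for 0 .. tmax mutations *)

    limitingFrequencies[m_, p0_] := p0 . MatrixPower[N[m], 100000]       (* approximate "equilibrium" distribution *)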
So what happens with a different fitness function? Here’s the result using width instead of height:
And here are results for fitness functions based on a sequence of targets for aspect ratio:
And, yes, the fitness function definitely influences the time course of our adaptive evolution process.
So far we’ve been looking only at symmetric k = 2, r = 2 rules. If we look at the space of all possible k = 2, r = 2 rules, the behavior we see is similar. For example, here’s the time evolution of possible phenotypes based on our standard height fitness function:
And this is what we see if we look only at the longest-lifetime (i.e. largest-height) cases:
As the scale here indicates, such long-lived phenotypes are quite rare—though most still occur with nonzero frequency even after arbitrarily large times (which is inevitable given that they appear as “maximal fitness” terminal nodes in the multiway graph).
And indeed if we plot the final frequencies of phenotypes against their lifetimes we see that there are a wide range of different cases:
The phenotypes with the highest “equilibrium” frequencies are
with some having fairly small lifetimes, and others larger.
In the previous section, we looked at the time course of evolution with various different—but fixed—fitness functions. But what if we had a fitness function that changes with time—say analogous to an environment for biological evolution that changes with time?
Here’s what happens if we have an aspect ratio fitness function whose target value increases linearly with time:
The behavior we see is quite complex, with certain phenotypes “winning for a while” but then dying out, often quite precipitously—with others coming to take their place.
If instead the target aspect ratio decreases with time, we see rather different behavior:
(The discontinuous derivatives here are basically associated with the sudden appearance of new phenotypes at particular target aspect ratio values.)
It’s also possible to give a “shock to the system” by suddenly changing the target aspect ratio:
And what we see is that sometimes this shock leads to fewer surviving phenotypes, and sometimes to more.
We can think of a changing fitness function as being something that applies a “macroscopic driving force” to our system. Things happen quickly down at the level of individual mutation and selection events—but the fitness function defines overall “goals” for the system that in effect change only slowly. (It’s a bit like a fluid where there are fast molecular-scale processes, but typically slow changes of macroscopic parameters like pressure.)
But if the fitness function defines a goal, how well does the system manage to meet it? Here’s a comparison between an aspect ratio goal (here, linearly increasing) and the distribution of actual aspect ratios achieved, with the darker curve indicating the mean aspect ratio obtained by a weighted average over phenotypes, and the lighter blue area indicating the standard deviation:
And, yes, as we might have expected from earlier results, the system doesn’t do particularly well at achieving the goal. Its behavior is ultimately not “well sculpted” by the forces of a fitness function; instead it is mostly dominated by the intrinsic (computationally irreducible) dynamics of the underlying adaptive evolution process.
One important thing to note however is that our results depend on the value of a parameter: essentially the rate at which underlying mutations occur relative to the rate of change of the fitness function. In the picture above 5000 mutations occur over the time the fitness function goes from minimum to maximum value. This is what happens if we change the number of mutations that occur (or, in effect, the “mutation rate”):
Generally—and not surprisingly—adaptive evolution does better at achieving the target when the mutation rate is higher, though in both the cases shown here, nothing gets terribly close to the target.
In their general character our results here seem reminiscent of what one might expect in typical studies of continuum systems, say based on differential equations. And indeed one can imagine that there might be “continuum equations of adaptive evolution” that govern situations like the ones we’ve seen here. But it’s important to understand that it’s far from self evident that this is possible. Because underneath everything is a multiway evolution graph with a definite and complicated structure. And one might think that the details of this structure would matter to the overall “continuum evolution process”. And indeed sometimes they will.
But—as we have seen throughout our Physics Project—underlying computational irreducibility leads to a certain inevitable simplicity when looking at phenomena perceived by computationally bounded observers. And we can expect that something similar can happen with biological evolution (and indeed adaptive evolution in general). Assuming that our fitness functions (and their process of change) are computationally bounded, then we can expect that their “aggregate effects” will follow comparatively simple laws—which we can perhaps think of as laws for the “flow of evolution” in response to external input.
In the previous section we saw that with different fitness functions, different time series of phenotypes appear, with some phenotypes, for example, sometimes “going extinct”. But let’s say evolution has proceeded to a certain point with a particular fitness function—and certain phenotypes are now present. Then one question we can ask is whether it’s possible to “reverse” that evolution, and revert to phenotypes that were present before. In other words, if we change the fitness function, can we make evolution “go backwards”?
We’ve often discussed a fitness function based on maximizing total (finite) lifetime. But what if, after using this fitness function for a while, we “reverse it”, now minimizing total lifetime?
Consider the multiway evolution graph for symmetric k = 2, r = 2 rules starting from the null rule, with the fitness function yet again being maximizing lifetime:
But what if we now say the fitness function minimizes lifetime? If we start from the longest-lifetime phenotype we get the “lifetime minimization” multiway graph:
We can compare this “reversed graph” to the “forward graph” based on all paths from the null rule to the maximum-lifetime rule:
And in this case we see that the phenotypes that occur are almost the same, with the exception of one phenotype that appears only in the reverse case.
So what happens when we look at all k = 2, r = 2 rules? Here’s the “reverse graph” starting from the longest-lifetime phenotype:
A total of 345 phenotypes appear here, eventually leading all the way back to the shortest-lifetime phenotype. In the overall “forward graph” (which, as we discussed above, has to start from the rule 20 phenotype rather than the null rule) a total of 2409 phenotypes appear, though (as we saw above) only 64 occur in paths that eventually lead to the maximum lifetime phenotype:
And what we see here is that the forward and reverse graphs look quite different. But could we perhaps construct a fitness function for the reverse graph that will successfully corral the evolution process to precisely retrace the steps of the forward graph?
In general, this isn’t something we can expect to be able to do. Because to do so would in effect require “breaking the computational irreducibility” of the system. It would require having a fitness function that can in essence predict every detail of the evolution process—and in so doing be in a position to direct it. But to achieve this, the fitness function would in a sense have to be computationally as sophisticated as the evolution process itself.
It’s a variant of an argument we’ve used several times here. Realistic fitness functions are computationally bounded (and in practice often very coarse). And that means that they can’t expect to match the computational irreducibility of the underlying evolution process.
There’s an analogy to the Second Law of thermodynamics. Just as the microscopic collisions of individual molecules are in principle easy to reverse, so potentially are individual transitions in the evolution graph. But putting many collisions or many transitions together leads to a process that is computationally sophisticated enough that the fairly coarse means at our disposal can’t “decode” and reverse it.
Put another way, there is in practice a certain inevitable irreversibility to both molecular dynamics and biological evolution. Yes, with enough computational effort—say carefully controlling the fitness function for every individual organism—it might in principle be possible to precisely “reverse evolution”. But in practice the kinds of fitness functions that exist in nature—or that one can readily set up in a lab—are computationally much too weak. And as a result one can’t expect to be able to get evolution to precisely retrace its steps.
Given only a genotype, is there a way to tell whether it’s “just random” or whether it’s actually the result of some long and elaborate process of adaptive evolution? From the genotype one can in principle use the rules it defines to “grow” the corresponding phenotype—and then look at whether it has an “unusually large” fitness. But the question is whether it’s possible to tell anything directly from the genotype, without going through the computational effort of generating the phenotype.
At some level it’s like asking, whether, say, from a cellular automaton rule, one can predict the ultimate behavior of the cellular automaton. And a core consequence of computational irreducibility is that one can’t in general expect to do this. Still, one might imagine that one could at least make a “reasonable guess” about whether a genotype is “likely” to have been chosen “purely randomly” or to have been “carefully selected”.
To explore this, we can look at the genotypes for symmetric k = 2, r = 2 rules, say ordered by their lifetime-based fitness—with black and white here representing “required” rule cases, and gray representing undetermined ones (which can all independently be either black or white):
On the right is a summary of how many white, black and undetermined (gray) outcomes are present in each genotype. And as we have seen several times, to achieve high fitness all or almost all of the outcomes must be determined—so that in a sense all or almost all of the genome is “being used”. But we still need to ask whether, given a certain actual pattern of outcomes, we can successfully guess whether or not a genotype is the result of selection.
To get more of a sense of this, we can look at plots of the probabilities for different outcomes for each case in the rule, first (trivially) for all combinatorially possible genotypes, then for all genotypes that give viable (i.e. in our case, finite-lifetime) phenotypes, and then for “selected genotypes”:
Certain cases are always completely determined for all viable genomes—but rather trivially so, because, for example, if a particular case has the “wrong” outcome, the pattern generated will expand at maximum speed forever, and so cannot have a finite lifetime.
So what happens for all k = 2, r = 2 rules? Here are the actual genomes that lead to particular fitness levels:
And now here are the corresponding probabilities for different outcomes for each case in the rule:
And, yes, given a particular setup we could imagine working out from results like these at least an approximation to the likelihood for a given randomly chosen genome to be a selected one. But what’s true in general? Is there something that can be determined with bounded computational effort (i.e. without explicitly computing phenotypes and their fitnesses) that gives a good estimate of whether a genome is selected? There are good reasons to believe that computational irreducibility will make this impossible.
It’s a different story, of course, if one’s given a “fully computed” phenotype. But at the genome level—without that computation—it seems unlikely that one can expect to distinguish random from “selected-somehow” genotypes.
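Incidentally, per-case outcome probabilities like the ones plotted above are easy to compute from any list of genomes. Here is a minimal sketch for k = 2, r = 2 rules, with genomes assumed to be given as 32-bit rule numbers:

    (* fraction of genomes with a 1 ("black") outcome for each of the 32 rule cases *)
    caseOutcomeProbabilities[genomes_List] := N @ Mean[IntegerDigits[#, 2, 32] & /@ genomes]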
In making our idealized model of biological evolution we’ve focused (as biology seems to) on the adaptive evolution of the genotype—or, in our case, the underlying rule for our cellular automata. But what if instead of changing the underlying rule, we change the initial condition used to “grow each organism”?
For example, let’s say that we start with the “single cell” we’ve been using so far, but then at each step in adaptive evolution we change the value of one cell in the initial condition (say within a certain distance of our original cell)—then keep any initial condition that does not lead to a shorter lifetime:
The sequence of lifetimes (“fitness values”) obtained in this process of adaptive evolution is
and the “breakthrough” initial conditions are:
The basic setup is similar to what we’ve seen repeatedly in the adaptive evolution of rules rather than initial conditions. But one immediate difference is that, at least in the example we’ve just seen, changing initial conditions does not as obviously “introduce new ideas” for how to increase lifetime; instead, it gives more of an impression of just directly extending “existing ideas”.
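Here is a minimal sketch of this kind of initial-condition evolution for a fixed k = 2, r = 2 rule: flip one randomly chosen cell in a fixed-width window, and keep the change whenever the (finite) lifetime does not decrease. The window width (the length of the init list) and the cap maxSteps are assumptions of the sketch:

    icLifetime[rule_, init_, maxSteps_ : 500] := Module[{evol},
      evol = CellularAutomaton[{rule, 2, 2}, {init, 0}, maxSteps];
      FirstPosition[Total /@ evol, 0, {Infinity}][[1]] - 1]

    mutateInit[init_] := MapAt[1 - # &, init, RandomInteger[{1, Length[init]}]]  (* flip one cell *)

    evolveInit[rule_, init0_, n_] := NestList[
      Function[init, With[{new = mutateInit[init]},
        If[icLifetime[rule, new] < Infinity && icLifetime[rule, new] >= icLifetime[rule, init], new, init]]],
      init0, n]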
So what happens more generally? Rules with k = 2, r = 1 tend to show either infinite growth or no growth—with finite lifetimes arising only from direct “erosion” of initial conditions (here for rules 104 and 164):
For k = 2, r = 2 rules the story is more complicated, even in the symmetric case. Here are the sequences of longest lifetime patterns obtained with all possible progressively wider initial conditions with various rules:
Again, there is a certain lack of “fundamentally new ideas” in evidence, though there are definitely “mechanisms” that get progressively extended with larger initial conditions. (One notable regularity is that the maximum lifetimes of patterns often seem roughly proportional to the width of initial condition allowed.)
Can adaptive evolution “discover more”? Typically, when it’s just modifying initial conditions in a fixed region, it doesn’t seem so—again it seems to be more about “extending existing mechanisms” than introducing new ones:
Everything we’ve done so far has been for 1D cellular automata. So what happens if we go to 2D? In the end, the story is going to be very similar to 1D—except that the rule spaces even for quite minimal neighborhoods are vastly larger.
With k = 2 colors, it turns out that with a 5-cell neighborhood one can’t “escape from the null rule” by single point mutations. The issue is that any single case one adds in the rule will either do nothing, or will lead only to unbounded growth. And even with a 9-cell neighborhood one can’t get rules that show growth that is neither limited nor infinite with a single-cell initial condition. But with a larger initial condition this is possible, and for example here is a sequence of phenotype patterns generated by adaptive evolution using lifetime as a fitness function:
Here’s what these patterns look like when “viewed from above”:
And here’s how the fitness progressively increases in this case:
There are a total of 2^512 ≈ 10^154 possible 9-neighbor rules, and in this vast rule space it’s easy for adaptive evolution to find rules with long finite lifetimes. (By the way, I’ve no idea what the absolute maximum “busy beaver” lifetime in this space is.)
Just as in 1D, there’s a fair amount of variation in the behavior one sees. Here are some examples of the “final rules” for various instances of adaptive evolution:
In a few cases one can readily “see the mechanism” for the lifetime—say associated with collisions between localized structures. But mostly, as in the other examples we’ve seen, there’s no realistic “narrative explanation” for how these rules achieve long yet finite lifetimes.
OK, so we’ve now looked at 2D as well as 1D cellular automata. But what about systems that aren’t cellular automata at all? Will we still see the same core phenomena of adaptive evolution that we’ve identified in cellular automata? The Principle of Computational Equivalence would certainly lead one to expect that we would. But to check at least one example let’s look at Turing machines.
Here’s a Turing machine with s = 3 states for its head, and k = 2 colors for cells on its tape:
The Turing machine is set up to halt if it ever reaches a case in the rule whose output is the designated “halt” outcome. Starting from a blank initial condition, this particular Turing machine halts after 19 steps.
So what happens if we try to adaptively evolve Turing machines with long lifetimes (i.e. that take many steps to halt)? Say we start from a “null rule” that halts in all cases, and then we make a sequence of single point mutations in the rule, keeping ones that don’t lead the Turing machine to halt in fewer steps than before. Here’s an example where the adaptive evolution eventually reaches a Turing machine that takes 95 steps to halt:
The sequence of (“breakthrough”) mutations involved here is
corresponding to a fitness curve of the form:
And, yes, all of this is very analogous to what we’ve seen in cellular automata. But one difference is that with Turing machines there are routinely much larger jumps in halting times. And the basic reason for this is just that Turing machines have much less going on at any particular step than typical cellular automata do—so it can take them much longer to achieve some particular state, like a halting state.
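As a concrete way to experiment with this, here is a minimal sketch of a halting-time fitness function. The representation is an assumption of the sketch: the rule is given as an association from {state, tape color} to {new state, new color, head offset}, with every such pair present, and with new state 0 taken to mean halt:

    tmHaltingTime[rules_Association, maxSteps_ : 100000] :=
      Module[{tape = <||>, pos = 0, state = 1, t = 0, out},
        Catch[
          While[t < maxSteps,
            out = rules[{state, Lookup[tape, pos, 0]}];
            If[out[[1]] == 0, Throw[t]];   (* a halt case was reached after t steps *)
            tape[pos] = out[[2]]; pos += out[[3]]; state = out[[1]]; t++];
          Infinity]]

Adaptive evolution then proceeds just as for cellular automata, but with tmHaltingTime as the fitness function.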
Here’s an example of adaptive evolution in the space of s = 3, k = 3 Turing machines—and in this case the final halting time is long enough that we’ve had to squash the image vertically (by a factor of 5):
The fitness curve in this case is best viewed on a logarithmic scale:
But while the largest-lifetime cellular automata that we saw above typically seemed to have very complex behavior, the largest-lifetime Turing machine here seems, at least on the face of it, to operate in a much more “systematic” and “mechanical” way. And indeed this becomes even more evident if we compress our visualization by looking only at steps on which the Turing machine head reverses its direction:
Long-lifetime Turing machines found by adaptive evolution are not always so simple, though they still tend to show more regularity than long-lifetime cellular automata:
But—presumably because Turing machines are “less efficient” than cellular automata—the very longest possible lifetimes can be very large. It’s not clear whether rules with such lifetimes can be found by adaptive evolution—not least because even to evaluate the fitness function for any particular candidate rule could take an unbounded time. And indeed among s = 3, k = 3 rules the very longest possible is about 10^17 steps—achieved by the rule
with the following “very pedantic behavior”:
So what about multiway evolution graphs? There are a total of 20,736 s = 2, k = 2 Turing machines with halting states allowed. From these there are 37 distinct finite-lifetime phenotypes:
Just as in other cases we’ve investigated, there are fitness-neutral sets such as:
Taking just one representative from each of these 18 sets, we can then construct a multiway evolution graph for 2,2 Turing machines with lifetime as our fitness function:
Here’s the analogous result for 3,2 Turing machines—with 2250 distinct phenotypes, and a maximum lifetime of 21 steps (and the patterns produced by the machines just shown as “slabs”):
We could pick other fitness functions (like maximum pattern width, number of head reversals, etc.) But the basic structure and consequences of adaptive evolution seem to work very much the same in Turing machines as in cellular automata—much as we expect from the Principle of Computational Equivalence.
Ordinary Turing machines (as well as ordinary cellular automata) in effect always follow a single path of history, producing a definite sequence of states based on their underlying rule. But it’s also possible to study multiway Turing machines in which many paths of history can be followed. Consider for example the rule:
One case in this rule has two possible outcomes—so this is a multiway system, and to represent its behavior we need a multiway graph:
From a biological point of view, we can potentially think of such a multiway system as an idealized model for a process of adaptive evolution. So now we can ask: can we evolve this evolution? Or, in other words, can we apply adaptive evolution to systems like multiway Turing machines?
As an example, let’s assume that we make single point mutation changes to just one case in a multiway Turing machine rule:
Many multiway Turing machines won’t halt, or at least won’t halt on all their branches. But for our fitness function let’s assume we require multiway Turing machines to halt on all branches (or at least go into loops that revisit the same states), and then let’s take the fitness to be the total number of nodes in the multiway graph when everything has halted. (And, yes, this is a direct generalization of our lifetime fitness function for ordinary Turing machines.)
So with this setup here are some examples of sequences of “breakthroughs” in adaptive evolution processes:
But what about looking at all possible paths of evolution for multiway Turing machines? Or, in other words, what about making a multiway graph of the evolution of multiway Turing machines?
Here’s an example of what we get by doing this (showing at each node just a single example of a fitness-neutral set):
So what’s really going on here? We’ve got a multiway graph of multiway graphs. But it’s worth understanding that the inner and outer multiway graphs are a bit different. The outer one is effectively a rulial multiway graph, in which different parts correspond to following different rules. The inner one is effectively a branchial multiway graph, in which different parts correspond to different ways of applying a particular rule. Ultimately, though, we can at least in principle expect to encode branchial transformations as rulial ones, and vice versa.
So we can think of the adaptive evolution of multiway Turing machines as a first step in exploring “higher-order evolution”: the evolution of evolution, etc. And ultimately in exploring inevitable limits of recursive evolution in the ruliad—and how these might relate to the formation of observers in the ruliad.
What does all this mean for the foundations of biological evolution? First and foremost, it reinforces the idea of computational irreducibility as a dominant force in biology. One might have imagined that what we see in biology must have been “carefully sculpted” by fitness constraints (say imposed by the environment). But what we’ve found here suggests that instead much of what we see is actually just a direct reflection of computational irreducibility. And in the end, more than anything else, what biological evolution seems to be doing is to “recruit” lumps of irreducible computation, and set them up so as to achieve “fitness objectives”.
It is, as I recently discovered, very similar to what happens in machine learning. And in both cases this picture implies that there’s a limit to the kind of explanations one can expect to get. If one asks why something has the form it does, the answer will often just be: “because that’s the lump of irreducible computation that happened to be picked up”. And there isn’t any reason to think that there’ll be a “narrative explanation” of the kind one might hope for in traditional science.
The simplicity of models makes it possible to study not just particular possible paths of adaptive evolution, but complete multiway graphs of all possible paths. And what we’ve seen here is that fitness functions in effect define a kind of traversal order or (roughly) foliation for such multiway graphs. If such foliations could be arbitrarily complex, then they could potentially pick out specific outcomes for evolution—in effect successfully “sculpting biology” through the details of natural selection and fitness functions.
But the point is that fitness functions and resulting foliations of multiway evolution graphs don’t get arbitrarily complex. And even as the underlying processes by which phenotypes develop are full of computational irreducibility, the fitness functions that are applied are computationally bounded. And in effect the complexity that is perhaps the single most striking immediate feature of biological systems is therefore a consequence of the interplay between the computational boundedness of selection processes, and the computational irreducibility of underlying processes of growth and development.
All of this relies on the fundamental idea that biological evolution—and biology—are at their core computational phenomena. And given this interpretation, there’s then a remarkable unification that’s emerging.
It begins with the ruliad—the abstract object corresponding to the entangled limit of all possible computational processes. We’ve talked about the ruliad as the ultimate foundation for physics, and for mathematics. And we now see that we can think of it as the ultimate foundation for biology too.
In physics what’s crucial is that observers like us “parse” the ruliad in certain ways—and that these ways lead us to have a perception of the ruliad that follows core known laws of physics. And similarly, when observers like us do mathematics, we can think of ourselves as “extracting that mathematics” from the way we parse the ruliad. And now what we’re seeing is that biology emerges because of the way selection from the environment, etc. “parses” the ruliad.
And what makes this view powerful is that we have to assume surprisingly little about how selection works to still be able to deduce important things about biology. In particular, if we assume that the selection operates in a computationally bounded way, then just from the inevitable underlying computational irreducibility “inherited” from the ruliad, we immediately know that biology must have certain features.
In physics, the Second Law of thermodynamics arises from the interplay of underlying computational irreducibility of mechanical processes involving many particles or other objects, and our computational boundedness as observers. We have the impression that “randomness is increasing” because as computationally bounded observers we can’t “decrypt” the underlying computational irreducibility.
What’s the analog of this in biology? Much as we can’t expect to “say what happens” in a system that follows the Second Law, so we can’t expect to “explain by selection” what happens in a biological system. Or, put another way, much of what we see in biology is just the way it is because of computational irreducibility—and try as we might it won’t be “explainable” by some fitness criterion that we can describe.
But that doesn’t mean that we can’t expect to deduce “general laws of biology”, much as there are general laws about gases whose detailed structure follows the Second Law. And in what we’ve done here we can begin to see some hints of what those general laws might look like.
They’ll be things like bulk statements about possible paths of evolution, and the effect of changing the constraints on them—a bit like laws of fluid mechanics but now applied to the rulial space of possible genotypes. But if there’s one thing that’s clear it’s that the minimal model we’ve developed of biological evolution has remarkable richness and potential. In the past it’s been possible to say things about what amounts to the pure combinatorics of evolution; now we can start talking in a structured way about what evolution actually does. And in doing this we go in the direction of finally giving biology a foundation as a theoretical science.
Even though this is my second long piece about my minimal model of biological evolution, I’ve barely scratched the surface of what can be done with it. First and foremost there are many detailed connections to be made with actual phenomena that have been observed—or could be observed—in biology. But there are also many things to be investigated directly about the model itself—and in effect much ruliology to be done on it. And what’s particularly notable is how accessible a lot of that ruliology is. (And, yes, you can click any picture here to get the Wolfram Language code that generates it.) What are some obvious things to do? Here are a few. Investigate other fitness functions. Other rule spaces. Other initial conditions. Other evolution strategies. Investigate evolving both rules and initial conditions. Investigate different kinds of changes of fitness functions during evolution. Investigate the effect of having a much larger rule space. Investigate robustness (or not) to perturbations.
In what I’ve done here, I’ve effectively aggregated identical genotypes (and phenotypes). But one could also investigate what happens if one in effect “traces every individual organism”. The result will be abstract structures that generalize the multiway systems we’ve shown here—and that are associated with higher levels of abstract formalism capable of describing phenomena that in effect go “below species”.
For historical notes see here »
Thanks to Wolfram Institute fellows Richard Assar and Nik Murzin for their help, as well as to the supporters of the new Wolfram Institute initiative in theoretical biology. Thanks also to Brad Klee for his help. Related student projects were done at our Summer Programs this year by Brian Mboya, Tadas Turonis, Ahama Dalmia and Owen Xuan.
Since writing my first piece about biological evolution in March, I’ve had occasion to attend two biology conferences: SynBioBeta and WISE (“Workshop on Information, Selection, and Evolution” at the Carnegie Institution). I thank many attendees at both conferences for their enthusiasm and input. Curiously, before the WISE conference in October 2024 the last conference I had attended on biological evolution was more than 40 years earlier: the June 1984 Mountain Lake Conference on Evolution and Development.
2024-10-09 05:41:58
Time is a central feature of human experience. But what actually is it? In traditional scientific accounts it’s often represented as some kind of coordinate much like space (though a coordinate that for some reason is always systematically increasing for us). But while this may be a useful mathematical description, it’s not telling us anything about what time in a sense “intrinsically is”.
We get closer as soon as we start thinking in computational terms. Because then it’s natural for us to think of successive states of the world as being computed one from the last by the progressive application of some computational rule. And this suggests that we can identify the progress of time with the “progressive doing of computation”.
But does this just mean that we are replacing a “time coordinate” with a “computational step count”? No. Because of the phenomenon of computational irreducibility. With the traditional mathematical idea of a time coordinate one typically imagines that this coordinate can be “set to any value”, and that then one can immediately calculate the state of the system at that time. But computational irreducibility implies that it’s not that easy. Because it says that there’s often essentially no better way to find what a system will do than by explicitly tracing through each step in its evolution.
In the pictures on the left there’s computational reducibility, and one can readily see what the state will be after any number of steps t. But in the pictures on the right there’s (presumably) computational irreducibility, so that the only way to tell what will happen after t steps is effectively to run all those steps:
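As a small concrete illustration (a sketch, not taken from the pictures above): rule 90 is computationally reducible enough that its pattern after t steps can be written down directly in terms of binomial coefficients mod 2, whereas for rule 30 one, so far as anyone knows, just has to run all t steps:

    (* reducible case: row t of rule 90 from a single cell is Pascal's triangle mod 2 *)
    rule90After[t_] := Table[If[EvenQ[t + x], Mod[Binomial[t, (t + x)/2], 2], 0], {x, -t, t}]

    (* (presumably) irreducible case: just run all t steps explicitly *)
    rule30After[t_] := Last @ CellularAutomaton[30, {{1}, 0}, t]

    (* check: rule90After[20] === Last @ CellularAutomaton[90, {{1}, 0}, 20]  ->  True *)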
And what this implies is that there’s a certain robustness to time when viewed in these computational terms. There’s no way to “jump ahead” in time; the only way to find out what will happen in the future is to go through the irreducible computational steps to get there.
There are simple idealized systems (say with purely periodic behavior) where there’s computational reducibility, and where there isn’t any robust notion of the progress of time. But the point is that—as the Principle of Computational Equivalence implies—our universe is inevitably full of computational irreducibility which in effect defines a robust notion of the progress of time.
That time is a reflection of the progress of computation in the universe is an important starting point. But it’s not the end of the story. For example, here’s an immediate issue. If we have a computational rule that determines each successive state of a system it’s at least in principle possible to know the whole future of the system. So given this why then do we have the experience of the future only “unfolding as it happens”?
It’s fundamentally because of the way we are as observers. If the underlying system is computationally irreducible, then to work out its future behavior requires an irreducible amount of computational work. But it’s a core feature of observers like us that we are computationally bounded. So we can’t do all that irreducible computational work to “know the whole future”—and instead we’re effectively stuck just doing computation alongside the system itself, never able to substantially “jump ahead”, and only able to see the future “progressively unfold”.
In essence, therefore, we experience time because of the interplay between our computational boundedness as observers, and the computational irreducibility of underlying processes in the universe. If we were not computationally bounded, we could “perceive the whole of the future in one gulp” and we wouldn’t need a notion of time at all. And if there wasn’t underlying computational irreducibility there wouldn’t be the kind of “progressive revealing of the future” that we associate with our experience of time.
A notable feature of our everyday perception of time is that it seems to “flow only in one direction”—so that for example it’s generally much easier to remember the past than to predict the future. And this is closely related to the Second Law of thermodynamics, which (as I’ve argued at length elsewhere) is once again a result of the interplay between underlying computational irreducibility and our computational boundedness. Yes, the microscopic laws of physics may be reversible (and indeed if our system is simple—and computationally reducible—enough of this reversibility may “shine through”). But the point is that computational irreducibility is in a sense a much stronger force.
Imagine that we prepare a state to have orderly structure. If its evolution is computationally irreducible then this structure will effectively be “encrypted” to the point where a computationally bounded observer can’t recognize the structure. Given underlying reversibility, the structure is in some sense inevitably “still there”—but it can’t be “accessed” by a computationally bounded observer. And as a result such an observer will perceive a definite flow from orderliness in what is prepared to disorderliness in what is observed. (In principle one might think it should be possible to set up a state that will “behave antithermodynamically”—but the point is that to do so would require predicting a computationally irreducible process, which a computationally bounded observer can’t do.)
One of the longstanding confusions about the nature of time has to do with its “mathematical similarity” to space. And indeed ever since the early days of relativity theory it’s seemed convenient to talk about “spacetime” in which notions of space and time are bundled together.
But in our Physics Project that’s not at all how things fundamentally work. At the lowest level the state of the universe is represented by a hypergraph which captures what can be thought of as the “spatial relations” between discrete “atoms of space”. Time then corresponds to the progressive rewriting of this hypergraph.
And in a sense the “atoms of time” are the elementary “rewriting events” that occur. If the “output” from one event is needed to provide “input” to another, then we can think of the first event as preceding the second event in time—and the events as being “timelike separated”. And in general we can construct a causal graph that shows the dependencies between different events.
So how does this relate to time—and spacetime? As we’ll discuss below, our everyday experience of time is that it follows a single thread. And so we tend to want to “parse” the causal graph of elementary events into a series of slices that we can view as corresponding to “successive times”. As in standard relativity theory, there typically isn’t a unique way to assign a sequence of such “simultaneity surfaces”, with the result that there are different “reference frames” in which the identifications of space and time are different.
The complete causal graph bundles together what we usually think of as space with what we usually think of as time. But ultimately the progress of time is always associated with some choice of successive events that “computationally build on each other”. And, yes, it’s more complicated because of the possibilities of different choices. But the basic idea of the progress of time as “the doing of computation” is very much the same. (In a sense time represents “computational progress” in the universe, while space represents the “layout of its data structure”.)
Very much as in the derivation of the Second Law (or of fluid mechanics from molecular dynamics), the derivation of Einstein’s equations for the large-scale behavior of spacetime from the underlying causal graph of hypergraph rewriting depends on the fact that we are computationally bounded observers. But even though we’re computationally bounded, we still have to “have something going on inside”, or we wouldn’t record—or sense—any “progress in time”.
It seems to be the essence of observers like us—as captured in my recent Observer Theory—that we equivalence many different states of the world to derive our internal perception of “what’s going on outside”. And at some rough level we might imagine that we’re sensing time passing by the rate at which we add to those internal perceptions. If we’re not adding to the perceptions, then in effect time will stop for us—as happens if we’re asleep, anesthetized or dead.
It’s worth mentioning that in some extreme situations it’s not the internal structure of the observer that makes perceived time stop; instead it’s the underlying structure of the universe itself. As we’ve mentioned, the “progress of the universe” is associated with successive rewriting of the underlying hypergraph. But when there’s been “too much activity in the hypergraph” (which physically corresponds roughly to too much energy-momentum), one can end up with a situation in which “there are no more rewrites that can be done”—so that in effect some part of the universe can no longer progress, and “time stops” there. It’s analogous to what happens at a spacelike singularity (normally associated with a black hole) in traditional general relativity. But now it has a very direct computational interpretation: one’s reached a “fixed point” at which there’s no more computation to do. And so there’s no progress to make in time.
Our strong human experience is that time progresses as a single thread. But now our Physics Project suggests that at an underlying level time is actually in effect multithreaded, or, in other words, that there are many different “paths of history” that the universe follows. And it is only because of the way we as observers sample things that we experience time as a single thread.
At the level of a particular underlying hypergraph the point is that there may be many different updating events that can occur, and each sequence of such updating events defines a different “path of history”. We can summarize all these paths of history in a multiway graph in which we merge identical states that arise:
But given this underlying structure, why is it that we as observers believe that time progresses as a single thread? It all has to do with the notion of branchial space, and our presence within branchial space. The presence of many paths of history is what leads to quantum mechanics; the fact that we as observers ultimately perceive just one path is associated with the traditionally-quite-mysterious phenomenon of “measurement” in quantum mechanics.
When we talked about causal graphs above, we said that we could “parse” them as a series of “spacelike” slices corresponding to instantaneous “states of space”—represented by spatial hypergraphs. And by analogy we can similarly imagine breaking multiway graphs into “instantaneous slices”. But now these slices don’t represent states of ordinary space; instead they represent states of what we call branchial space.
Ordinary space is “knitted together” by updating events that have causal effects on other events that can be thought of as “located at different places in space”. (Or, said differently, space is knitted together by the overlaps of the elementary light cones of different events.) Now we can think of branchial space as being “knitted together” by updating events that have effects on events that end up on different branches of history.
(In general there is a close analogy between ordinary space and branchial space, and we can define a multiway causal graph that includes both “spacelike” and “branchlike” directions—with the branchlike direction supporting not light cones but what we can call entanglement cones.)
So how do we as observers parse what’s going on? A key point is that we are inevitably part of the system we’re observing. So the branching (and merging) that’s going on in the system at large is also going on in us. So that means we have to ask how a “branching mind” will perceive a branching universe. Underneath, there are lots of branches, and lots of “threads of history”. And there’s lots of computational irreducibility (and even what we can call multicomputational irreducibility). But computationally bounded observers like us have to equivalence most of those details to wind up with something that “fits in our finite minds”.
We can make an analogy to what happens in a gas. Underneath, there are lots of molecules bouncing around (and behaving in computationally irreducible ways). But observers like us are big compared to molecules, and (being computationally bounded) we don’t get to perceive their individual behavior, but only their aggregate behavior—from which we extract a thin set of computationally reducible “fluid-dynamics-level” features.
And it’s basically the same story with the underlying structure of space. Underneath, there’s an elaborately changing network of discrete atoms of space. But as large, computationally bounded observers we can only sample aggregate features in which many details have been equivalenced, and in which space tends to seem continuous and describable in basically computationally reducible ways.
So what about branchial space? Well, it’s basically the same story. Our minds are “big”, in the sense that they span many individual branches of history. And they’re computationally bounded so they can’t perceive the details of all those branches, but only certain aggregated features. And in a first approximation what then emerges is in effect a single aggregated thread of history.
With sufficiently careful measurements we can sometimes see “quantum effects” in which multiple threads of history are in evidence. But at a direct human level we always seem to aggregate things to the point where what we perceive is just a single thread of history—or in effect a single thread of progression in time.
It’s not immediately obvious that any of these “aggregations” will work. It could be that important effects we perceive in gases would depend on phenomena at the level of individual molecules. Or that to understand the large-scale structure of space we’d continually be having to think about detailed features of atoms of space. Or, similarly, that we’d never be able to maintain a “consistent view of history”, and that instead we’d always be having to trace lots of individual threads of history.
But the key point is that for us to stay as computationally bounded observers we have to pick out only features that are computationally reducible—or in effect boundedly simple to describe.
Closely related to our computational boundedness is the important assumption we make that we as observers have a certain persistence. At every moment in time, we are made from different atoms of space and different branches in the multiway graph. Yet we believe we are still “the same us”. And the crucial physical fact (that has to be derived in our model) is that in ordinary circumstances there’s no inconsistency in doing this.
So the result is that even though there are many “threads of time” at the lowest level—representing many different “quantum branches”—observers like us can (usually) successfully still view there as being a single consistent perceived thread of time.
But there’s another issue here. It’s one thing to say that a single observer (say a single human mind or a single measuring device) can perceive history to follow a single, consistent thread. But what about different human minds, or different measuring devices? Why should they perceive any kind of consistent “objective reality”?
Essentially the answer, I think, is that they’re all sufficiently nearby in branchial space. If we think about physical space, observers in different parts of the universe will clearly “see different things happening”. The “laws of physics” may be the same—but what star (if any) is nearby will be different. Yet (at least for the foreseeable future) for all of us humans it’s always the same star that’s nearby.
And so it is, presumably, in branchial space. There’s some small patch in which we humans—with our shared origins—exist. And it’s presumably because that patch is small relative to all of branchial space that all of us perceive a consistent thread of history and a common objective reality.
There are many subtleties to this, many of which aren’t yet fully worked out. In physical space, we know that effects can in principle spread at the speed of light. And in branchial space the analog is that effects can spread at the maximum entanglement speed (whose value we don’t know, though it’s related by Planck unit conversions to the elementary length and elementary time). But in maintaining our shared “objective” view of the universe it’s crucial that we’re not all going off in different directions at the speed of light. And of course the reason that doesn’t happen is that we don’t have zero mass. And indeed presumably nonzero mass is a critical part of being observers like us.
In our Physics Project it’s roughly the density of events in the hypergraph that determines the density of energy (and mass) in physical space (with their associated gravitational effects). And similarly it’s roughly the density of events in the multiway graph (or in branchial graph slices) that determines the density of action—the relativistically invariant analog of energy—in branchial space (with its associated effects on quantum phase). And though it’s not yet completely clear how this works, it seems likely that once again when there’s mass, effects don’t just “go off at the maximum entanglement speed in all directions”, but instead stay nearby.
There are definitely connections between "staying at the same place", believing one is persistent, and being computationally bounded. And together these seem to be what's necessary for us to have our typical view of time as a single thread. In principle we can imagine observers very different from us—say with minds (like the inside of an idealized quantum computer) capable of experiencing many different threads of history. But the Principle of Computational Equivalence suggests that there's a high bar for such observers. They need not only to be able to deal with computational irreducibility but also with multicomputational irreducibility, in which one includes both the process of computing new states, and the process of equivalencing states.
And so for observers that are “anything like us” we can expect that once again time will tend to be as we normally experience it, following a single thread, consistent between observers.
(It’s worth mentioning that all of this only works for observers like us “in situations like ours”. For example, at the “entanglement horizon” for a black hole—where branchially-oriented edges in the multiway causal graph get “trapped”—time as we know it in some sense “disintegrates”, because an observer won’t be able to “knit together” the different branches of history to “form a consistent classical thought” about what happens.)
In what we’ve discussed so far we can think of the progress of time as being associated with the repeated application of rules that progressively “rewrite the state of the universe”. In the previous section we saw that these rules can be applied in many different ways, leading to many different underlying threads of history.
But so far we’ve imagined that the rules that get applied are always the same—leaving us with the mystery of “Why those rules, and not others?” But this is where the ruliad comes in. Because the ruliad involves no such seemingly arbitrary choices: it’s what you get by following all possible computational rules.
One can imagine many bases for the ruliad. One can make it from all possible hypergraph rewritings. Or all possible (multiway) Turing machines. But in the end it’s a single, unique thing: the entangled limit of all possible computational processes. There’s a sense in which “everything can happen somewhere” in the ruliad. But what gives the ruliad structure is that there’s a definite (essentially geometrical) way in which all those different things that can happen are arranged and connected.
So what is our perception of the ruliad? Inevitably we’re part of the ruliad—so we’re observing it “from the inside”. But the crucial point is that what we perceive about it depends on what we are like as observers. And my big surprise in the past few years has been that assuming even just a little about what we’re like as observers immediately implies that what we perceive of the ruliad follows the core laws of physics we know. In other words, by assuming what we’re like as observers, we can in effect derive our laws of physics.
The key to all this is the interplay between the computational irreducibility of underlying behavior in the ruliad, and our computational boundedness as observers (together with our related assumption of our persistence). And it’s this interplay that gives us the Second Law in statistical mechanics, the Einstein equations for the structure of spacetime, and (we think) the path integral in quantum mechanics. In effect what’s happening is that our computational boundedness as observers makes us equivalence things to the point where we are sampling only computationally reducible slices of the ruliad, whose characteristics can be described using recognizable laws of physics.
So where does time fit into all of this? A central feature of the ruliad is that it’s unique—and everything about it is “abstractly necessary”. Much as given the definition of numbers, addition and equality it’s inevitable that one gets 1 + 1 = 2, so similarly given the definition of computation it’s inevitable that one gets the ruliad. Or, in other words, there’s no question about whether the ruliad exists; it’s just an abstract construct that inevitably follows from abstract definitions.
And so at some level this means that the ruliad inevitably just “exists as a complete thing”. And so if one could “view it from outside” one could think of it as just a single timeless object, with no notion of time.
But the crucial point is that we don’t get to “view it from the outside”. We’re embedded within it. And, what’s more, we must view it through the “lens” of our computational boundedness. And this is why we inevitably end up with a notion of time.
We observe the ruliad from some point within it. If we were not computationally bounded then we could immediately compute what the whole ruliad is like. But in actuality we can only discover the ruliad “one computationally bounded step at a time”—in effect progressively applying bounded computations to “move through rulial space”.
So even though in some abstract sense “the whole ruliad is already there” we only get to explore it step by step. And that’s what gives us our notion of time, through which we “progress”.
Inevitably, there are many different paths that we could follow through the ruliad. And indeed every mind (and every observer like us)—with its distinct inner experience—presumably follows a different path. But much as we described for branchial space, the reason we have a shared notion of “objective reality” is presumably that we are all very close together in rulial space; we form in a sense a tight “rulial flock”.
It’s worth pointing out that not every sampling of the ruliad that may be accessible to us conveniently corresponds to exploration of progressive slices of time. Yes, that kind of “progression in time” is characteristic of our physical experience, and our typical way of describing it. But what about our experience, say, of mathematics?
The first point to make is that just as the ruliad contains all possible physics, it also contains all possible mathematics. If we construct the ruliad, say from hypergraphs, the nodes are now not “atoms of space”, but instead abstract elements (that in general we call emes) that form pieces of mathematical expressions and mathematical theorems. We can think of these abstract elements as being laid out now not in physical space, but in some abstract metamathematical space.
In our physical experience, we tend to remain localized in physical space, branchial space, etc. But in “doing mathematics” it’s more as if we’re progressively expanding in metamathematical space, carving out some domain of “theorems we assume are true”. And while we could identify some kind of “path of expansion” to let us define some analog of time, it’s not a necessary feature of the way we explore the ruliad.
Different places in the ruliad in a sense correspond to describing things using different rules. And by analogy to the concept of motion in physical space, we can effectively “move” from one place to another in the ruliad by translating the computations done by one set of rules to computations done by another. (And, yes, it’s nontrivial to even have the possibility of “pure motion”.) But if we indeed remain localized in the ruliad (and can maintain what we can think of as our “coherent identity”) then it’s natural to think of there being a “path of motion” along which we progress “with time”. But when we’re just “expanding our horizons” to encompass more paradigms and to bring more of rulial space into what’s covered by our minds (so that in effect we’re “expanding in rulial space”), it’s not really the same story. We’re not thinking of ourselves as “doing computation in order to move”. Instead, we’re just identifying equivalences and using them to expand our definition of ourselves, which is something that we can at least approximate (much like in “quantum measurement” in traditional physics) as happening “outside of time”. Ultimately, though, everything that happens must be the result of computations that occur. It’s just that we don’t usually “package” these into what we can describe as a definite thread of time.
From the paradigm (and Physics Project ideas) that we’ve discussed here, the question “What is time?” is at some level simple: time is what progresses when one applies computational rules. But what’s critical is that time can in effect be defined abstractly, independent of the details of those rules, or the “substrate” to which they’re applied. And what makes this possible is the Principle of Computational Equivalence, and the ubiquitous phenomenon of computational irreducibility that it implies.
To begin with, the fact that time can robustly be thought of as “progressing”, in effect in a linear chain, is a consequence of computational irreducibility—because computational irreducibility is what tells us that computationally bounded observers like us can’t in general ever “jump ahead”; we just have to follow a linear chain of steps.
But there’s something else as well. The Principle of Computational Equivalence implies that there’s in a sense just one (ubiquitous) kind of computational irreducibility. So when we look at different systems following different irreducible computational rules, there’s inevitably a certain universality to what they do. In effect they’re all “accumulating computational effects” in the same way. Or in essence progressing through time in the same way.
There’s a close analogy here with heat. It could be that there’d be detailed molecular motion that even on a large scale worked noticeably differently in different materials. But the fact is that we end up being able to characterize any such motion just by saying that it represents a certain amount of heat, without getting into more details. And that’s very much the same kind of thing as being able to say that such-and-such an amount of time has passed, without having to get into the details of how some clock or other system that reflects the passage of time actually works.
And in fact there’s more than a “conceptual analogy” here. Because the phenomenon of heat is again a consequence of computational irreducibility. And the fact that there’s a uniform, “abstract” characterization of it is a consequence of the universality of computational irreducibility.
It’s worth emphasizing again, though, that just as with heat, a robust concept of time depends on us being computationally bounded observers. If we were not, then we’d able to break the Second Law by doing detailed computations of molecular processes, and we wouldn’t just describe things in terms of randomness and heat. And similarly, we’d be able to break the linear flow of time, either jumping ahead or following different threads of time.
But as computationally bounded observers of computationally irreducible processes, it’s basically inevitable that—at least to a good approximation—we’ll view time as something that forms a single one-dimensional thread.
In traditional mathematically based science there’s often a feeling that the goal should be to “predict the future”—or in effect to “outrun time”. But computational irreducibility tells us that in general we can’t do this, and that the only way to find out what will happen is just to run the same computation as the system itself, essentially step by step. But while this might seem like a letdown for the power of science, we can also see it as what gives meaning and significance to time. If we could always jump ahead then at some level nothing would ever fundamentally be achieved by the passage of time (or, say, by the living of our lives); we’d always be able to just say what will happen, without “living through” how we got there. But computational irreducibility gives time and the process of it passing a kind of hard, tangible character.
So what does all this imply for the various classic issues (and apparent paradoxes) that arise in the way time is usually discussed?
Let’s start with the question of reversibility. The traditional laws of physics basically apply both forwards and backwards in time. And the ruliad is inevitably symmetrical between “forward” and “backward” rules. So why is it then that in our typical experience time always seems to “run in the same direction”?
This is closely related to the Second Law, and once again it's a consequence of our computational boundedness interacting with underlying computational irreducibility. In a sense what defines the direction of time for us is that we (typically) find it much easier to remember the past than to predict the future. Of course, we don't remember every detail of the past. We only remember what amounts to certain "filtered" features that "fit in our finite minds". And when it comes to predicting the future, we're limited by our inability to "outrun" computational irreducibility.
Let’s recall how the Second Law works. It basically says that if we set up some state that’s “ordered” or “simple” then this will tend to “degrade” to one that’s “disordered” or “random”. (We can think of the evolution of the system as effectively “encrypting” the specification of our starting state to the point where we—as computationally bounded observers—can no longer recognize its ordered origins.) But because our underlying laws are reversible, this degradation (or “encryption”) must happen when we go both forwards and backwards in time:
But the point is that our “experiential” definition of the direction of time (in which the “past” is what we remember, and the “future” is what we find hard to predict) is inevitably aligned with the “thermodynamic” direction of time we observe in the world at large. And the reason is that in both cases we’re defining the past to be something that’s computationally bounded (while the future can be computationally irreducible). In the experiential case the past is computationally bounded because that’s what we can remember. In the thermodynamic case it’s computationally bounded because those are the states we can prepare. In other words, the “arrows of time” are aligned because in both cases we are in effect “requiring the past to be simpler”.
So what about time travel? It’s a concept that seems natural—and perhaps even inevitable—if one imagines that “time is just like space”. But it becomes a lot less natural when we think of time in the way we’re doing here: as a process of applying computational rules.
Indeed, at the lowest level, these rules are by definition just sequentially applied, producing one state after another—and in effect “progressing in one direction through time”. But things get more complicated if we consider not just the raw, lowest-level rules, but what we might actually observe of their effects. For example, what if the rules lead to a state that’s identical to one they’ve produced before (as happens, for example, in a system with periodic behavior)? If we equivalence the state now and the state before (so we represent both as a single state) then we can end up with a loop in our causal graph (a “closed timelike curve”). And, yes, in terms of the raw sequence of applying rules these states can be considered different. But the point is that if they are identical in every feature then any observer will inevitably consider them the same.
But will such equivalent states ever actually occur? As soon as there’s computational irreducibility it’s basically inevitable that the states will never perfectly match up. And indeed for the states to contain an observer like us (with “memory”, etc.) it’s basically impossible that they can match up.
But can one imagine an observer (or a “timecraft”) that would lead to states that match up? Perhaps somehow it could carefully pick particular sequences of atoms of space (or elementary events) that would lead it to states that have “happened before”. And indeed in a computationally simple system this might be possible. But as soon as there’s computational irreducibility, this simply isn’t something one can expect any computationally bounded observer to be able to do. And, yes, this is directly analogous to why one can’t have a “Maxwell’s demon” observer that “breaks the Second Law”. Or why one can’t have something that carefully navigates the lowest-level structure of space to effectively travel faster than light.
But even if there can’t be time travel in which “time for an observer goes backwards”, there can still be changes in “perceived time”, say as a result of relativistic effects associated with motion. For example, one classic relativistic effect is time dilation, in which “time goes slower” when objects go faster. And, yes, given certain assumptions, there’s a straightforward mathematical derivation of this effect. But in our effort to understand the nature of time we’re led to ask what its physical mechanism might be. And it turns out that in our Physics Project it has a surprisingly direct—and almost “mechanical”—explanation.
One starts from the fact that in our Physics Project space and everything in it is represented by a hypergraph which is continually getting rewritten. And the evolution of any object through time is then defined by these rewritings. But if the object moves, then in effect it has to be “re-created at a different place in space”—and this process takes up a certain number of rewritings, leaving fewer for the intrinsic evolution of the object itself, and thus causing time to effectively “run slower” for it. (And, yes, while this is a qualitative description, one can make it quite formal and precise, and recover the usual formulas for relativistic time dilation.)
Something similar happens with gravitational fields. In our Physics Project, energy-momentum (and thus gravity) is effectively associated with greater activity in the underlying hypergraph. And the presence of this greater activity leads to more rewritings, causing “time to run faster” for any object in that region of space (corresponding to the traditional “gravitational redshift”).
More extreme versions of this occur in the context of black holes. (Indeed, one can roughly think of spacelike singularities as places where “time ran so fast that it ended”.) And in general—as we discussed above—there are many “relativistic effects” in which notions of space and time get mixed in various ways.
But even at a much more mundane level there’s a certain crucial relationship between space and time for observers like us. The key point is that observers like us tend to “parse” the world into a sequence of “states of space” at successive “moments in time”. But the fact that we do this depends on some quite specific features of us, and in particular our effective physical scale in space as compared to time.
In our everyday life we’re typically looking at scenes involving objects that are perhaps tens of meters away from us. And given the speed of light that means photons from these objects get to us in less than a microsecond. But it takes our brains milliseconds to register what we’ve seen. And this disparity of timescales is what leads us to view the world as consisting of a sequence of states of space at successive moments in time.
If our brains “ran” a million times faster (i.e. at the speed of digital electronics) we’d perceive photons arriving from different parts of a scene at different times, and we’d presumably no longer view the world in terms of overall states of space existing at successive times.
The same kind of thing would happen if we kept the speed of our brains the same, but dealt with scenes of a much larger scale (as we already do in dealing with spacecraft, astronomy, etc.).
But while this affects what it is that we think time is “acting on”, it doesn’t ultimately affect the nature of time itself. Time remains that computational process by which successive states of the world are produced. Computational irreducibility gives time a certain rigid character, at least for computationally bounded observers like us. And the Principle of Computational Equivalence allows there to be a robust notion of time independent of the “substrate” that’s involved: whether us as observers, the everyday physical world, or, for that matter, the whole universe.
2024-09-28 01:50:59
Integers. Addition. Subtraction. Maybe multiplication. Surely that’s not enough to be able to generate any serious complexity. In the early 1980s I had made the very surprising discovery that very simple programs based on cellular automata could generate great complexity. But how widespread was this phenomenon?
At the beginning of the 1990s I had set about exploring this. Over and over I would consider some type of system and be sure it was too simple to “do anything interesting”. And over and over again I would be wrong. And so it was that on the night of August 13, 1993, I thought I should check what could happen with integer functions defined using just addition and subtraction.
I knew, of course, about defining functions by recursion, like Fibonacci:
But could I find something like this that would have complex behavior? I did the analog of what I have done so many times, and just started (symbolically) enumerating possible definitions. And immediately I saw cases with nested functions, like:
(For some reason I wanted to keep the same initial conditions as Fibonacci: f[1] = f[2] = 1.) What would functions like this do? My original notebook records the result in this case:
But a few minutes later I found something very different: a simple nestedly recursive function with what seemed like highly complex behavior:
I remembered seeing a somewhat similarly defined function discussed before. But the behavior I’d seen reported for that function, while intricate, was nested and ultimately highly regular. And, so far as I could tell, much like with rule 30 and all the other systems I’d investigated, nobody had ever seen serious complexity in simple recursive functions.
It was a nice example. But it was one among many. And when I published A New Kind of Science in 2002, I devoted just four pages (and 7 notes) to “recursive sequences”—even though the gallery I made of their behavior became a favorite page of mine:
A year after the book was published we held our first Wolfram Summer School, and as an opening event I decided to do a live computer experiment—in which I would try to make a real-time science discovery. The subject I chose was nestedly recursive functions. It took a few hours. But then, yes, we made a discovery! We found that there was a nestedly recursive function simpler than the ones I’d discussed in A New Kind of Science that already seemed to have very complex behavior:
Over the couple of decades that followed I returned many times to nestedly recursive functions—particularly in explorations I did with high school and other students, or in suggestions I made for student projects. Then recently I used them several times as “intuition-building examples” in various investigations.
I’d always felt my work with nestedly recursive functions was unfinished. Beginning about five years ago—particularly energized by our Physics Project—I started looking at harvesting seeds I’d sown in A New Kind of Science and before. I’ve been on quite a roll, with a few pages or even footnotes repeatedly flowering into rich book-length stories. And finally—particularly after my work last year on “Expression Evaluation and Fundamental Physics”—I decided it was time to try to finish my exploration of nestedly recursive functions.
Our modern Wolfram Language tools—as well as ideas from our Physics Project—provided some new directions to explore. But I still thought I pretty much knew what we’d find. And perhaps after all these years I should have known better. Because somehow in the computational universe—and in the world of ruliology—there are always surprises.
And here, yet again, there was indeed quite a surprise.
Consider the definition (later we’ll call this “P312”)
which we can also write as:
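The definition itself appears as typeset input in the original; in textual form, and based on the way P312 is quoted later in this piece, it is presumably

  ClearAll[f];
  f[n_ /; n <= 0] = 1;                    (* initial condition: f[n] = 1 for n <= 0 *)
  f[n_] := f[n] = 3 + f[n - f[n - 2]];    (* P312, memoized so each value is computed only once *)

  Table[f[n], {n, 1, 20}]

with the memoization (the f[n] = ... on the right-hand side) added here just to keep repeated evaluation cheap.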
The first few values for f[n] generated from this definition are:
Continuing further we get:
But how are these values actually computed? To see that we can make an “evaluation graph” in which we show how each value of f[n] is computed from ones with smaller values of n, here starting from f[20]:
The gray nodes represent initial conditions: places where f[n] was sampled for n ≤ 0. The two different colors of edges correspond to the two different f evaluations in the definition: the inner f[n – 2] and the outer f[n – f[n – 2]].
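One rough way to build such an evaluation graph for oneself (a sketch under the assumptions above, not the code used to make the figures here) is to record an edge each time evaluating f[n] directly requires another value of f:

  ClearAll[f, evalEdges];
  evalEdges = {};
  f[n_ /; n <= 0] = 1;
  f[n_] := f[n] = Module[{inner},
      inner = f[n - 2];
      AppendTo[evalEdges, n -> n - 2];        (* dependence on f[n - 2] *)
      AppendTo[evalEdges, n -> n - inner];    (* dependence on f[n - f[n - 2]] *)
      3 + f[n - inner]];

  f[20];
  Graph[Union[evalEdges], VertexLabels -> Automatic]

Nodes with n ≤ 0 then play the role of the gray initial-condition nodes.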
Continuing to f[30] we get:
But what’s the structure of this graph? If we pull out the “red” graph on its own, we can see that it breaks into two path graphs, that consist of the sequences of the f[n] for odd and even n, respectively:
The “blue” graph, on the other hand, breaks into four components—each always a tree—leading respectively to the four different initial conditions:
And for example we can now plot f[n], showing which tree each f[n] ends up being associated with:
We’ll be using this same basic setup throughout, though for different functions. We’ll mostly consider recursive definitions with a single term (i.e. with a single “outermost f”, not two, as in Fibonacci recurrences).
The specific families of recursive functions we’ll be focusing on are:
And with this designation, the function we just introduced is P312.
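The family definitions themselves are shown as images in the original. From the cases quoted explicitly in the text (P312 as 3 + f[n – f[n – 2]], P111 as 1 + f[n – f[n – 1]], and T212 as 2 f[n – 1 f[n – 2]]), the naming convention is presumably that Pabc stands for f[n_] := a + f[n – b f[n – c]] and Tabc for f[n_] := a f[n – b f[n – c]]. Treating that as an assumption, a small generator might look like:

  (* assumes the Pabc naming inferred above; initial condition f[n] = 1 for n <= 0 *)
  ClearAll[makeP];
  makeP[a_, b_, c_] := Module[{f},
     f[n_ /; n <= 0] = 1;
     f[n_] := f[n] = a + f[n - b f[n - c]];
     f];

  p111 = makeP[1, 1, 1];
  Table[p111[n], {n, 1, 15}]    (* {2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6} *)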
Let’s start off by looking in more detail at the function we just introduced. Here’s what it does up to n = 500:
It might seem as if it’s going to go on “seemingly randomly” forever. But if we take it further, we get a surprise: it seems to “resolve itself” to something potentially simpler:
What’s going on? Let’s plot this again, but now showing which “blue graph tree” each value is associated with:
And now what we see is that the f[–3] and f[–2] trees stop contributing to f[n] when n is (respectively) 537 and 296, and these trees are finite (and have sizes 53 and 15):
The overall structures of the “remaining” trees—here shown up to f[5000]—eventually start to exhibit some regularity:
We can home in on this regularity by arranging these trees in layers, starting from the root, then plotting the number of nodes in each successive layer:
Looking at these pictures suggests that there should be some kind of more-or-less direct “formula” for f[n], at least for large n. They also suggest that such a formula should have some kind of mod-6 structure. And, yes, there does turn out to be essentially a “formula”. Though the “formula” is quite complicated—and reminiscent of several other “strangely messy” formulas in other ruliological cases—like Turing machine 600720 discussed in A New Kind of Science or combinator s[s[s]][s][s][s][s].
Later on, we’ll see the much simpler recursive function P111 (f[n_] := 1 + f[n – f[n – 1]]). The values for this function form a sequence in which successive blocks of length k have value k:
P312 has the same kind of structure, but much embellished. First, it has 6 separate riffled (“mod”) subsequences. Each subsequence then consists of a sequence of blocks. Given a value n, this computes which subsequence this is on, which block for that subsequence it’s in, and where it is within that block:
So, for example, here are results for multiples of 1000:
For n = 1000 we’re not yet in the “simple” regime, we can’t describe the sequence in any simple way, and our “indices” calculation is meaningless. For n = 2000 it so happens that we are at block 0 for the mod-1 subsequence. And the way things are set up, we just start by giving exactly the form of block 0 for each mod. So for mod 1 the block is:
But now n = 2000 has offset 16 within this block, so the final value of f[2000] is simply the 16th value from this list, or 100. f[2001] is then simply the next element within this block, or 109. And so on—until we reach the end of the block.
But what if we’re not dealing with block 0? For example, according to the table above, f[3000] is determined by mod-3 block 1. It turns out there’s a straightforward, if messy, way to compute any block b (for mod m):
So now we have a way to compute the value, say of f[3000], effectively just by “evaluating a formula”:
And what’s notable is that this evaluation doesn’t involve any recursion. In other words, at the cost of “messiness” we’ve—somewhat surprisingly—been able to unravel all the recursion in P312 to arrive at a “direct formula” for the value of f[n] for any n.
So what else can we see about the behavior of f[n] for P312? One notable feature is its overall growth rate. For large n, it turns out that f[n] ≈ √(6 n) (as can be seen by substituting this form into the recursive definition and taking a limit).
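As a rough numerical check of this growth rate (again using the P312 form quoted later in the text, and filling the cache bottom-up to keep the recursion shallow):

  ClearAll[f];
  f[n_ /; n <= 0] = 1;
  f[n_] := f[n] = 3 + f[n - f[n - 2]];

  Do[f[n], {n, 1, 100000}];                                  (* build up cached values in order *)
  N[Mean[Table[f[n], {n, 90000, 100000}]]/Sqrt[6*95000]]     (* roughly 1, up to the fluctuations seen in the plots *)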
One thing this means is that our evaluation graph eventually has a roughly conical form:
This can be compared to the very regular cone generated by P111 (which has asymptotic form √(2 n)):
If one just looks at the form of the recursive definition for P312 it’s far from obvious “how far back” it will need to probe, or, in other words, what values of f[n] one will need to specify as initial conditions. As it turns out, though, the only values needed are f[–3], f[–2], f[–1] and f[0].
How can one see this? In 3 + f[n – f[n – 2]] it’s only the outer f that can probe “far back” values. But how far it actually goes back depends on how much larger f[n – 2] gets compared to n. Plotting f[n – 2] and n together we have:
And the point is that only for very few values of n does f[n – 2] exceed n—and it's these values that probe back. Meanwhile, for larger n, there can never be additional "lookbacks", because f[n] only grows like √(6 n).
So does the P312 recursion always have the same lookback? So far, we've considered specifically the initial condition f[n] = 1 for all n ≤ 0. But what if we change the value of f[0]? Here are plots of f[n] for different cases:
And it turns out that with f[0] = z, the lookback goes to –z for z ≥ 3, and to z – 4 for 1 ≤ z ≤ 2.
(If z ≤ 0 the function f[n] is basically not defined, because the recursion is trying to compute f[n] from f[n], f[n + 1], etc., so never “makes progress”.)
The case f[0] = 2 (i.e. z = 2) is the one that involves the least lookback—and a total of 3 initial values. Here is the evaluation graph in this case:
By comparison, here is the evaluation graph for the case f[0] = 5, involving 6 initial values:
If we plot the value of f[n] as a function of f[0] we get the following:
For n f[0], f[n] always has simple behavior, and is essentially periodic in n with period 3:
And it turns out that for any specified initial configuration of values, there is always only bounded lookback—with the bound apparently being determined by the largest of the initial values.
So what about the behavior of f[n] for large n? Just like in our original f[0] = 1 case, we can construct “blue graph trees” rooted at each of the initial conditions. In the case f[0] = 1 we found that of the 4 trees only two continue to grow as n increases. As we vary f[0], the number of “surviving trees” varies quite erratically:
What if instead of just changing f[0], and keeping all other f[–k] = 1, we set f[n] = s for all n ≤ 0? The result is somewhat surprising:
For s ≥ 2, the behavior turns out to be simple—and similar to the behavior of P111.
So what can P312 be made to do if we change its initial conditions? With f[n] = 2 for n < 0, the behavior remains "tame" for small f[0], but as f[0] increases it starts showing its typical complexity:
One question to ask is what set of values f[n] takes on. Given that the initial values have certain residues mod 3, all subsequent values must have the same residues. But apart from this constraint, it seems that all values for f[n] are obtained—which is not surprising given that f[n] grows only like √(6 n).
P312 is just one example of the “P family” of sequences defined by:
Here is the behavior of some other Pabc sequences:
And here are their evaluation graphs:
P312 is the first “seriously complex” example.
P111 (as mentioned earlier) has a particularly simple form
which corresponds to the simple formula:
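For what it's worth, one closed form that reproduces the values under the f[n] = 1 for n ≤ 0 initial condition (this is my own check; the original figure may state it in a different but equivalent way) is Round[Sqrt[2 n + 2]]:

  ClearAll[p111];
  p111[n_ /; n <= 0] = 1;
  p111[n_] := p111[n] = 1 + p111[n - p111[n - 1]];

  And @@ Table[p111[n] == Round[Sqrt[2 n + 2]], {n, 1, 1000}]    (* True *)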
The evaluation graph in this case is just:
Only a single initial condition f[0] = 1 is used, and there is only a single “blue graph tree” with a simple form:
Another interesting case is P123:
Picking out only odd values of n we get:
This might look just like the behavior of P111. But it’s not. The lengths of the successive “plateaus” are now
with differences:
But this turns out to be exactly a nested sequence generated by joining together the successive steps in the evolution of the substitution system:
P123 immediately “gets into its final behavior”, even for small n. But—as we saw rather dramatically with P312—there can be “transient behavior” that doesn’t “resolve” until n is large. A smaller case of this phenomenon occurs with P213. Above n = 68 it shows a simple “square root” pattern of behavior, basically like P111. But for smaller n it’s a bit more complicated:
And in this case the transients aren’t due to “blue graph trees” that stop growing. Instead, there are only two trees (associated with f[0] and f[–1]), but both of them soon end up growing in very regular ways:
What happens if our outermost operation is not addition, but multiplication?
Here are some examples of the behavior one gets. In each case we’re plotting on a log scale—and we’re not including T1xx cases, which are always trivial:
We see that some sequences have regular and readily predictable behavior, but others do not. And this is reflected in the evaluation graphs for these functions:
The first “complicated case” is T212:
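T212 is quoted explicitly later in the text as f[n_] := 2 f[n – 1 f[n – 2]], i.e. 2 f[n – f[n – 2]]; a minimal sketch with the standard f[n] = 1 for n ≤ 0 initial condition is:

  ClearAll[t212];
  t212[n_ /; n <= 0] = 1;
  t212[n_] := t212[n] = 2 t212[n - t212[n - 2]];

  Table[t212[n], {n, 1, 20}]    (* every value is a power of 2 *)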
The evaluation graph for f[50] in this case has the form:
And something that’s immediately notable is that in addition to “looking back” to the values of f[0] and f[–1], this also looks back to the value of f[–24]. Meanwhile, the evaluation graph for f[51] looks back not only to f[0] and f[–1] but also to f[–3] and f[–27]:
How far back does it look in general? Here’s a plot showing which lookbacks are made as a function of n (with the roots of the “blue graph trees” highlighted):
There’s alternation between behaviors for even and odd n. But apart from that, additional lookbacks are just steadily added as n increases—and indeed the total number of lookbacks seems to follow a simple pattern:
But—just for once—if one looks in more detail, it’s not so simple. The lengths of the successive “blocks” are:
So, yes, the lookbacks are quite “unpredictable”. But the main point here is that—unlike for the P family—the number of lookbacks isn’t limited. In a sense, to compute T212 for progressively larger n, progressively more information about its initial conditions is needed.
When one deals with ordinary, unnested recurrence relations, one’s always dealing with a fixed lookback. And the number of initial conditions then just depends on the lookback. (So, for example, the Fibonacci recurrence has lookback 2, so needs two initial conditions, while the standard factorial recurrence has lookback 1, so needs only one initial condition.)
But for the nested recurrence relation T212 we see that this is no longer true; there can be an unboundedly large lookback.
OK, but let’s look back at the actual T212 sequence. Here it is up to larger values of n:
Or, plotting each point as a dot:
Given the recursive definition of f[n], the values of f[n] must always be powers of 2. This shows where each successive power of 2 is first reached as a function of n:
Meanwhile, this shows the accumulated average of Log2[f[n]] as a function of n:
This is well fit by 0.38 Log[n], implying that, at least with this averaging, f[n] asymptotically approximates n^0.26. And, yes, it is somewhat surprising that what seems like a very "exponential" recursive definition should lead to an f[n] that increases only like a power. But, needless to say, this is the kind of surprise one has to expect in the computational universe.
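A rough way to reproduce this kind of fit (restating the t212 sketch from above so the fragment is self-contained) is:

  ClearAll[t212];
  t212[n_ /; n <= 0] = 1;
  t212[n_] := t212[n] = 2 t212[n - t212[n - 2]];

  nmax = 50000;
  Do[t212[n], {n, 1, nmax}];
  avg = Accumulate[Table[Log2[t212[n]], {n, 1, nmax}]]/Range[nmax];
  ListLogLinearPlot[avg]    (* compare with the 0.38 Log[n] fit described above *)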
It’s worth noticing that f[n] fluctuates very intensely as a function of n. The overall distribution of values is very close to exponentially distributed—for example with the distribution of logarithmic values of f[n] for n between 9 million and 10 million being:
What else can we say about this sequence? Let’s say we reduce mod 2 the powers of 2 for each f[n]. Then we get a sequence which starts:
This is definitely not "uniformly random". But if one looks at blocks of sequential values, one can plot at what n each of the 2^b possible configurations of a length-b block first appears:
And eventually it seems as if all length-b blocks for any given b will appear.
By the way, whereas in the P family, there were always a limited number of “blue graph trees” (associated with the limited number of initial conditions), for T212 the number of such trees increases with n, as more initial conditions are used. So, for example, here are the trees for f[50] and f[51]:
We've so far discussed T212 only with the initial condition f[n] = 1 for n ≤ 0. The fact that f[n] is always a power of 2 relies on every initial value also being a power of 2. But here's what happens, for example, if f[n] = 2^s for n ≤ 0:
In general, one can think of T212 as transforming an ultimately infinite sequence of initial conditions into an infinite sequence of function values, with different forms of initial conditions potentially giving very different sequences of function values:
(Note that not all choices of initial conditions are possible; some lead to self-referential requirements like "f[n] = f[n]" that can never be resolved.)
Having explored T212, let’s now look at T311—the original one-term nestedly recursive function discovered at the 2003 Wolfram Summer School:
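The definition is again shown as an image in the original; under the Tabc naming inferred earlier it would presumably be f[n_] := 3 f[n – f[n – 1]], which should be treated here as an assumption rather than a quotation:

  (* presumed T311, under the inferred Tabc convention; f[n] = 1 for n <= 0 *)
  ClearAll[t311];
  t311[n_ /; n <= 0] = 1;
  t311[n_] := t311[n] = 3 t311[n - t311[n - 1]];

  Table[t311[n], {n, 1, 20}]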
Here’s its basic behavior:
And here is its evaluation graph—which immediately reveals a lot more lookback than T212:
Plotting lookbacks as a function of n we get:
Much as with T212, the total number of lookbacks varies with n in a fairly simple way (~ 0.44 n):
Continuing the T311 sequence further, it looks qualitatively very much like T212:
And indeed T311—despite its larger number of lookbacks—seems to basically behave like T212. In a story typical of the Principle of Computational Equivalence, T212 seems to have already “filled out the computational possibilities”, so T311 “doesn’t have anything to add”.
As another (somewhat historically motivated) example of nestedly recursive functions, consider what we’ll call the “S family”, defined by:
Let’s start with the very minimal case S10 (or “S1”):
Our standard initial condition f[n] = 1 for n ≤ 0 doesn’t work here, because it implies that f[1] = 1 – f[1]. But if we take f[n] = 1 for n ≤ 1 we get:
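From the slope calculation below (σ n == n – σ (σ (n – a) – b)), the S family is evidently of the form f[n_] := n – f[f[n – a] – b], so that S1 is f[n_] := n – f[f[n – 1]]; this is also consistent with the remark that f[n] = 1 for n ≤ 0 would force f[1] = 1 – f[1]. Taking that form as the working assumption:

  ClearAll[s1];
  s1[n_ /; n <= 1] = 1;                       (* initial condition f[n] = 1 for n <= 1; change the 1 to 3 for the next case *)
  s1[n_] := s1[n] = n - s1[s1[n - 1]];

  Table[s1[n], {n, 1, 20}]
  Do[s1[n], {n, 2, 2000}]; N[s1[2000]/2000]   (* close to 1/GoldenRatio, i.e. 0.618... *)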
Meanwhile, with f[n] = 1 for n ≤ 3 we get:
The first obvious feature of both these results is their overall slope: 1/ϕ ≈ 0.618, where ϕ is the golden ratio. It's not too hard to see why one gets this slope. Assume that for large n we can take f[n] = σ n. Then substitute this form into both sides of the recursive definition for the S family to get σ n == n – σ (σ (n – a) – b). For large n all that survives is the condition for the coefficients σ == 1 – σ², which has solution σ = 1/ϕ.
Plotting f[n] – n/ϕ for the case f[n] = 1 for n ≤ 1 we get:
The evaluation graph in this case has a fairly simple form
as we can see even more clearly with a different graph layout:
It’s notable that only the initial condition f[1] = 1 is used—leading to a single “blue graph tree” that turns out to have a very simple “Fibonacci tree” form (which, as we’ll discuss below, has been known since the 1970s):
From this it follows that f[n] is related to the "Fibonacci-like" substitution system
and in fact the sequence of values of f[n] can be computed just as:
And indeed it turns out that in this case f[n] is given exactly by:
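A closed form consistent with this (my own check under the S1 sketch above; the original presumably shows an equivalent expression) is Floor[(n + 1)/ϕ]:

  ClearAll[s1];
  s1[n_ /; n <= 1] = 1;
  s1[n_] := s1[n] = n - s1[s1[n - 1]];

  And @@ Table[s1[n] == Floor[(n + 1)/GoldenRatio], {n, 1, 2000}]    (* True *)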
What about when f[n] = 1 not just for n ≤ 1 but beyond? For n ≤ 2 the results are essentially the same as for n ≤ 1. But for n ≤ 3 there’s a surprise: the behavior is considerably more complicated—as we can see if we plot f[n] – n/ϕ:
Looking at the evaluation graph in this case we see that the only initial conditions sampled are f[1] = 1 and f[3] = 1 (with f[2] only being reached if one specifically starts with f[2]):
And continuing the evaluation graph we see a mixture of irregularity and comparative regularity:
The plot of f[n] has a strange "hand-drawn" appearance, with overall regularity but detailed apparent randomness. The most obvious large-scale feature is "bursting" behavior (interspersed in an audio rendering with an annoying hum). The bursts all seem to have approximately (though not exactly) the same structure—and get systematically larger. The lengths of successive "regions of calm" between bursts (characterized by runs in which Abs[f[n] – n/ϕ] stays small) are:
What happens to S1 with other initial conditions? Here are a few examples:
So how does Sa depend on a? Sometimes there’s at least a certain amount of clear regularity; sometimes it’s more complicated:
As is very common, adding the parameter b in the definition doesn’t seem to lead to fundamentally new behavior—though for b > 0 the initial condition f[n] = 1, n ≤ 0 can be used:
In all cases, only a limited number of initial conditions are sampled (bounded by the value of a + b in the original definition). But as we can see, the behavior can either be quite simple, or can be highly complex.
Highly complex behavior arises even from very simple rules. It’s a phenomenon one sees all over the computational universe. And we’re seeing it here in nestedly recursive functions. But if we make the rules (i.e. definitions) for our functions more complicated, will we see fundamentally different behavior, or just more of the same?
The Principle of Computational Equivalence (as well as many empirical observations of other systems) suggests that it’ll be “more of the same”: that once one’s passed a fairly low threshold the computational sophistication—and complexity—of behavior will no longer change.
And indeed this is what one sees in nestedly recursive functions. But below the threshold different kinds of things can happen with different kinds of rules.
There are several directions in which we can make rules more complicated. One that we won’t discuss here is to use operations (conditional, bitwise, etc.) that go beyond arithmetic. Others tend to involve adding more instances of f in our definitions.
An obvious way to do this is to take f[n_] to be given by a sum of terms, “Fibonacci style”. There are various specific forms one can consider. As a first example—that we can call ab—let’s look at:
The value of a doesn’t seem to matter much. But changing b we see:
12 has unbounded lookback (at least starting with f[n] = 1 for n ≤ 0), but for larger b, 1b has bounded lookback. In both 13 and 15 there is continuing large-scale structure (here visible in log plots)
though this does not seem to be reflected in the corresponding evaluation graphs:
As another level of Fibonacci-style definition, we can consider ab:
But the typical behavior here does not seem much different from what we already saw with one-term definitions involving only two f’s:
(Note that aa is equivalent to a. Cases like 13 lead after a transient to pure exponential growth.)
A somewhat more unusual case is what we can call abc:
Subtracting overall linear trends we get:
For 111 using initial conditions f[1] = f[2] = 1 and plotting f[n] – n/2 we get
which has a nested structure that is closely related to the result of concatenating binary digit sequences of successive integers:
But despite the regularity in the sequence of values, the evaluation graph for this function is not particularly simple:
So how else might we come up with more complicated rules? One possibility is that instead of “adding f’s by adding terms” we can add f’s by additional nesting. So, for example, we can consider what we can call S31 (here shown with initial condition f[n] = 1 for n ≤ 3):
We can estimate the overall slope here by solving for x in x == 1 – x^3 to get ≈ 0.682. Subtracting this off we get:
We can also consider deeper nestings. At depth d the slope is the solution to x == 1 – x^d. Somewhat remarkably, in all cases the only initial conditions probed are f[1] = 1 and f[3] = 1:
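A quick way to get these slopes numerically (just solving the equation stated above for successive d; the d = 3 value is the 0.682 mentioned earlier):

  Table[First[Select[x /. NSolve[x == 1 - x^d, x, Reals], 0 < # < 1 &]], {d, 2, 6}]
     (* approximately {0.618, 0.682, 0.724, 0.755, 0.778} *)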
As another example of “higher nesting” we can consider the class of functions (that we call a):
Subtracting a constant 1/ϕ slope we get:
The evaluation graph for 1 is complicated, but has some definite structure:
What happens if we nest even more deeply, say defining a functions:
With depth-d nesting, we can estimate the overall slope of f[n] by solving for x in x + x^2 + … + x^d == 1, or equivalently x^(d+1) – 2 x + 1 == 0 (discarding the root x = 1), so that for the d = 3 case here the overall slope is the real root of x^3 + x^2 + x == 1, or about 0.544. Subtracting out this overall slope we get:
And, yes, the sine-curve-like form of 5 is very odd. Continuing 10x longer, though, things are “squaring off”:
What happens if we continue nesting deeper? stays fairly tame:
However, already allows for more complicated behavior:
And for different values of a there are different regularities:
There are all sorts of other extensions and generalizations one might consider. Some involve alternate functional forms; others involve introducing additional functions, or allowing multiple arguments to our function f.
In talking about recursive functions f[n] we’ve been assuming—as one normally does—that n is always an integer. But can we generalize what we’re doing to functions f[x] where x is a continuous real number?
Consider for example a continuous analog of the Fibonacci recurrence:
This produces a staircase-like function whose steps correspond to the usual Fibonacci numbers:
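The definition isn't reproduced in this text, but presumably it is something like the following, with the initial condition given on a whole interval rather than at discrete points (the exact form is an assumption here):

  ClearAll[f];
  f[x_ /; x <= 1] = 1;                  (* initial condition on an interval *)
  f[x_] := f[x] = f[x - 1] + f[x - 2];

  Plot[f[x], {x, 0, 12}]                (* a staircase whose step heights are Fibonacci numbers *)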
Adjusting the initial condition produces a slightly different result:
We can think of these as being solutions to a kind of “Fibonacci delay equation”—where we’ve given initial conditions not at discrete points, but instead on an interval.
So what happens with nestedly recursive functions? We can define an analog of S1 as:
Plotting this along with the discrete result we get:
In more detail, we get
where now the plateaus occur at the "Wythoff numbers" Floor[k ϕ].
Changing the initial condition to be x ≤ 3 we get:
Removing the overall slope by subtracting x/ϕ gives:
One feature of the continuous case is that one can continuously change initial conditions—though the behavior one gets typically breaks into "domains" with discontinuous boundaries, as in this case where we're plotting the value of f[x] as a function both of x and of the "cutoff" below which the initial conditions f[x] are specified:
So what about other rules? A rule like P312 (f[n_] := 3 + f[n – f[n – 2]]) given “constant” initial conditions effectively just copies and translates the initial interval, and gives a simple order-0 interpolation of the discrete case. With initial condition f[x] = x some segments get “tipped”:
All the cases we’ve considered here don’t “look back” to negative values, in either the discrete or continuous case. But what about a rule like T212 (f[n_] := 2 f[n – 1 f[n – 2]]) that progressively “looks back further”? With the initial condition f[x] = 1 for x ≤ 0, one gets the same result as in the discrete case:
But if one uses the initial condition f[x] = Abs[x – 1] for x ≤ 0 (the Abs[x – 1] is needed to avoid ending up with f[x] depending on f[y] for y > x) one instead has
yielding the rather different result:
Continuing for larger x (on a log scale) we get:
Successively zooming in on one of the first “regions of noise” we see that it ultimately consists just of a large number of straight segments:
What’s going on here? If we count the number of initial conditions that are used for different values of x we see that this has discontinuous changes, leading to disjoint segments in f[x]:
Plotting over a larger range of x values the number of initial conditions used is:
And plotting the actual values of those initial conditions we get:
If we go to later, "more intense" regions of noise, we see more fragmentation—and presumably in the limit x → ∞ we get the analog of an essential singularity in f[x]:
For the S family, with its overall n/ϕ trend, even constant initial conditions—say for S1—already lead to tipping, here shown compared to the discrete case:
Let’s say we have a recursive definition—like the standard Fibonacci one:
How do we actually use this to compute the value of, say, f[7]? Well, we can start from f[7], then use the definition to write this as f[6] + f[5], then write f[6] as f[5] + f[4], and so on. And we can represent this using an evaluation graph, in the form:
But this computation is in a sense very wasteful; for example, it's independently computing f[3] five separate times (and of course getting the same answer each time). But what if we just stored each f[n] as soon as we computed it, and then retrieved that stored ("cached") value whenever we needed it again?
In the Wolfram Language, it’s a very simple change to our original definition:
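Presumably the change in question is the standard Wolfram Language caching idiom, in which the right-hand side both stores and returns the value:

  ClearAll[f];
  f[1] = f[2] = 1;
  f[n_] := f[n] = f[n - 1] + f[n - 2]   (* each value is stored the first time it is computed *)

  f[7]   (* 13 *)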
And now our evaluation graph becomes much simpler:
And indeed it’s this kind of minimal evaluation graph that we’ve been using in everything we’ve discussed so far.
What’s the relationship between the “tree” evaluation graph, and this minimal one? The tree graph is basically an “unrolled” version of the minimal graph, in which all the possible paths that can be taken from the root node to the initial condition nodes have been treed out.
In general, the number of edges that come out of a single node in an evaluation graph will be equal to the number of instances of the function f that appear on the right-hand side of the recursive definition we're using (i.e. 2 in the case of the standard Fibonacci definition). So this means that if the maximum length of path from the root to the initial conditions is s, the maximum number of nodes that can appear in the "unrolled" graph is 2^s. And whenever there is a fixed set of initial conditions (i.e. if there's always the same lookback), the maximum path length is essentially n—implying in the end that the maximum possible number of nodes in the unrolled graph will be 2^n.
(In the actual case of the Fibonacci recurrence, the number of nodes in the unrolled graph is 2 f[n] – 1, or about 1.6^n.)
But if we actually evaluate f[7]—say in the Wolfram Language—what is the sequence of f[n]’s that we’ll end up computing? Or, in effect, how will the evaluation graph be traversed? Here are the results for the unrolled and minimal evaluation graphs—i.e. without and with caching:
Particularly in the first case this isn’t the only conceivable result we could have gotten. It’s the way it is here because of the particular “leftmost innermost” evaluation order that the Wolfram Language uses by default. In effect, we’re traversing the graph in a depth-first way. In principle we could use other traversal orders, leading to f[n]’s being evaluated in different orders. But unless we allow other operations (like f[3] + f[3] → 2 f[3]) to be interspersed with f evaluations, we’ll still always end up with the same number of f evaluations for a given evaluation graph.
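One way to watch this traversal directly is to instrument the definition so that it records each nontrivial f[n] it starts to evaluate (a quick sketch, without caching):

ClearAll[f]
f[1] = f[2] = 1;
f[n_] := (Sow[n]; f[n - 1] + f[n - 2])   (* record n whenever the recursive rule fires *)
Reap[f[7]][[2, 1]]
(* {7, 6, 5, 4, 3, 3, 4, 3, 5, 4, 3, 3} *)

The depth-first, leftmost-innermost order is visible in the way the recorded values descend and then backtrack.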
But which is the “correct” evaluation graph? The unrolled one? Or the minimal one? Well, it depends on the computational primitives we’re prepared to use. With a pure stack machine, the unrolled graph is the only one possible. But if we allow (random-access) memory, then the minimal graph becomes possible.
OK, so what happens with nestedly recursive functions? Here, for example, are unrolled and minimal graphs for T212:
Here are the sequences of f[n]’s that are computed:
And here’s a comparison of the number of nodes (i.e. f evaluations) from unrolled and minimal evaluation graphs (roughly 1.2^n and 0.5 n, respectively):
Different recursive functions lead to different patterns of behavior. The differences are less obvious in evaluation graphs, but can be quite obvious in the actual sequence of f[n]’s that are evaluated:
But although looking at evaluation sequences from unrolled evaluation graphs can be helpful as a way of classifying behavior, the exponentially larger number of evaluations involved in the unrolled graph typically makes this impractical.
Recursive functions have a fairly long history, that we’ll be discussing below. And for nearly a hundred years there’s been a distinction made between “primitive recursive functions” and “general recursive functions”. Primitive recursive functions are basically ones where there’s a “known-in-advance” pattern of computation that has to be done; general recursive functions are ones that may in effect make one have to “search arbitrarily far” to get what one needs.
In Wolfram Language terms, primitive recursive functions are roughly ones that can be constructed directly using functions like Nest and Fold (perhaps nested); general recursive functions can also involve functions like NestWhile and FoldWhile.
So, for example, with the Fibonacci definition
the function f[n] is primitive recursive and can be written, say, as:
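Here’s one way such a Fold-based form might look (a sketch; many equivalent formulations are possible):

fib[n_] := Last[Fold[{Last[#1], Total[#1]} &, {1, 1}, Range[n - 2]]]   (* step the pair {f[k-1], f[k]} along n - 2 times *)

fib[7]   (* 13 *)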
Lots of the functions one encounters in practice are similarly primitive recursive—including most “typical mathematical functions” (Plus, Power, GCD, Prime, …). And for example functions that give the results of n steps in the evolution of a Turing machine, cellular automaton, etc. are also primitive recursive. But functions that for example test whether a Turing machine will ever halt (or give the state that it achieves if and when it does halt) are not in general primitive recursive.
On the face of it, our nestedly recursive functions seem like they must be primitive recursive, since they don’t for example appear to be “searching for anything”. But things like the presence of longer and longer lookbacks raise questions. And then there’s the potential confusion of the very first example (dating from the late 1920s) of a recursively defined function known not to be primitive recursive: the Ackermann function.
The Ackermann function has three (or sometimes two) arguments—and, notably, its definition (here given in its classic form) includes nested recursion:
This is what the evaluation graphs look like for some small cases:
Looking at these graphs we can begin to see a pattern. And in fact there’s a simple interpretation: f[m, x, y] for successive m is doing progressively more nested iterations of integer successor operations. f[0, x, y] computes x + y; f[1, x, y] does “repeated addition”, i.e. computes x × y; f[2, x, y] does “repeated multiplication”, i.e. computes x^y; f[3, x, y] does “tetration”, i.e. computes the “power tower” Nest[x^# &, 1, y]; etc.
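As a check on this interpretation, here’s a sketch of a three-argument definition with exactly these properties (the base cases here are my own choices, made to reproduce the behavior just described; the classic form shown above may differ in detail):

f[0, x_, y_] := x + y                          (* addition *)
f[1, x_, 0] := 0                               (* base case for repeated addition; my choice *)
f[m_, x_, 0] := 1                              (* base case for higher levels; my choice *)
f[m_, x_, y_] := f[m - 1, x, f[m, x, y - 1]]   (* the nested recursion *)

f[2, 3, 4]   (* 81, i.e. 3^4 *)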
Or, alternatively, these can be given explicitly in successively more nested form:
And at least in this form f[m, x, y] involves m nestings. But a given primitive recursive function can involve only a fixed number of nestings. It might be conceivable that we could somehow rewrite f[m, x, y] to get around this. But in fact one can show that no primitive recursive function can keep up with its growth, so the Ackermann function isn’t primitive recursive.
But it turns out that the fact that this can happen depends critically on the Ackermann function having more than one argument—so that one can construct the “diagonal” f[m, m, m].
So what about our nestedly recursive functions? Well, at least in the form that we’ve used them, they can all be written in terms of Fold. The key idea is to accumulate a list of values so far (conveniently represented as an association)—sampling whichever parts are needed—and then at the end take the last element. So for example the “Summer School function” T311
can be written:
An important feature here is that we’re getting Lookup to give 1 if the value it’s trying to look up hasn’t been filled in yet, implementing the fact that f[n] = 1 for n ≤ 0.
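In rough outline, and assuming (per the family naming used above) that T311 is f[n_] := 3 f[n – f[n – 1]] with f[n] = 1 for n ≤ 0, such a Fold-based construction might look like:

(* assumes T311 = 3 f[n - f[n - 1]], with f[n] = 1 for n <= 0 *)
fT311[n_] := Last[Fold[
   Append[#1, #2 -> 3 Lookup[#1, #2 - Lookup[#1, #2 - 1, 1], 1]] &,   (* Lookup defaults to 1 below the filled-in range *)
   <||>,   (* association accumulating the values computed so far *)
   Range[n]]]

fT311 /@ Range[20]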
So, yes, our recursive definition might look back further and further. But it always just finds value 1—which is easy for us to represent without, for example, any extra nesting, etc.
The ultimate (historical) definition of primitive recursion, though, doesn’t involve subsets of the Wolfram Language (the definition was given almost exactly 100 years too early!). Instead, it involves a specific set of simple primitives:
(An alternative, equivalent definition of the recursion primitive can be given explicitly in terms of Fold.)
So can our nestedly recursive functions be written purely in terms of these primitives? The answer is yes, though it’s seriously complicated. A simple function like Plus can for example be written as r[p[1], s], so that e.g. r[p[1], s][2, 3] → 5. Times can be written as r[z, c[Plus, p[1], p[3]]] or r[z, c[r[p[1], s], p[1], p[3]]], while Factorial can be written as r[c[s, z], c[Times, p[1], c[s, p[2]]]]. But even Fibonacci, for example, seems to require a very much longer specification.
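To make this concrete, here’s one possible Wolfram Language rendering of such primitives (the argument conventions are my own assumptions, chosen so that the examples just quoted come out right):

(* argument conventions assumed: h gets (previous value, counter, extra arguments) *)
z[___] := 0                                        (* constant zero *)
s[x_, ___] := x + 1                                (* successor; ignores extra arguments *)
p[i_][args___] := {args}[[i]]                      (* projection onto the i-th argument *)
c[g_, h___][args___] := g @@ Through[{h}[args]]    (* composition *)
r[g_, h_][0, args___] := g[args]                   (* primitive recursion: base case *)
r[g_, h_][y_Integer?Positive, args___] := h[r[g, h][y - 1, args], y - 1, args]

r[p[1], s][2, 3]                                   (* 5 *)
r[z, c[Plus, p[1], p[3]]][3, 4]                    (* 12 *)
r[c[s, z], c[Times, p[1], c[s, p[2]]]][5]          (* 120 *)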
In writing “primitive-recursive-style” definitions in Wolfram Language we accumulated values in lists and associations. But in the ultimate definition of primitive recursion, there are no such constructs; the only form of “data” is positive integers. But for our definitions of nestedly recursive functions we can use a “tupling function” that “packages up” any list of integer values into a single integer (and an untupling function that unpacks it). And we can do this, say, based on a pairing (2-element-tupling) function like:
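As an illustration of the kind of thing that works (this particular choice is mine; the function actually used here may differ), the classic Cantor pairing function packs two nonnegative integers into one, and can be inverted:

pair[x_, y_] := (x + y) (x + y + 1)/2 + y          (* pack {x, y} into a single integer *)
unpair[z_] := Module[{w, t},
  w = Floor[(Sqrt[8 z + 1] - 1)/2];
  t = w (w + 1)/2;
  {w - (z - t), z - t}]                            (* recover {x, y} *)

unpair[pair[17, 42]]   (* {17, 42} *)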
But what about the actual If[n ≤ 0, 1, ...] lookback test itself? Well, If can be written in primitive recursive form too: for example, r[c[s, z], c[f, c[s, p[2]]]][n] is equivalent to If[n ≤ 0, 1, f[n]].
So our nestedly recursive functions as we’re using them are indeed primitive recursive. Or, more strictly, finding values f[n] is primitive recursive. Asking questions like “For what n does f[n] reach 1000?” might not be primitive recursive. (The obvious way of answering them involves a FoldWhile-style non-primitive-recursive search, but proving that there’s no primitive recursive way to answer the question is likely very much harder.)
By the way, it’s worth commenting that while for primitive recursive functions it’s always possible to compute a value f[n] for any n, that’s not necessarily true for general recursive functions. For example, if we ask “For what n does f[n] reach 1000?” there might simply be no answer to this; f[n] might never reach 1000. And when we look at the computations going on underneath, the key distinction is that in evaluating primitive recursive functions, the computations always halt, while for general recursive functions, they may not.
So, OK. Our nestedly recursive functions can be represented in “official primitive recursive form”, but they’re very complicated in that form. So that raises the question: what functions can be represented simply in this form? In A New Kind of Science I gave some examples, each minimal for the output it produces:
And then there’s the most interesting function I found:
It’s the simplest primitive recursive function whose output has no obvious regularity:
Because it’s primitive recursive, it’s possible to express it in terms of functions like Fold—though it ends up two Folds deep, making it in some ways more complicated (at least as far as the Grzegorczyk hierarchy that counts “Fold levels” is concerned) than our nestedly recursive functions:
But there’s still an issue to address with nestedly recursive functions and primitive recursion. When we have functions (like T212) that “reach back” progressively further as n increases, there’s a question of what they’ll find. We’ve simply assumed f[n] = 1 for n ≤ 0. But what if there was something more complicated there? Even if f[–m] were given by some primitive recursive function, say p[m], it seems possible that in computing f[n] one could end up somehow “bouncing back and forth” between positive and negative arguments, in effect searching for an m for which p[m] has some particular value, and that search could take one outside the domain of primitive recursive functions.
And this raises yet another question: are all definitions we can give of nestedly recursive functions consistent? Consider for example:
Now ask: what is f[1]? We apply the recursive definition. But it gives us:
We’ve seen all sorts of complex behavior in nestedly recursive functions. And what the Principle of Computational Equivalence suggests is that whenever one sees complex behavior, one must in some sense be dealing with computations that are “as sophisticated as any computation can be”. And in particular one must be dealing with computations that can somehow support computation universality.
So what would it mean for a nestedly recursive function to be universal? For a start, one would need some way to “program” the function. There seem to be a couple of possibilities. First, one could imagine packing both “code” and “data” into the argument n of f[n]. So, for example, one might use some form of tupling function to take a description of a rule and an initial state for a Turing machine, together with a specification of a step number, then package all these things into an integer n that one feeds into one’s universal nestedly recursive function f. Then the idea would be that the value computed for f[n] could be decoded to give the state of the Turing machine at the specified step. (Such a computation by definition always halts—but much as one computes with Turing machines by successively asking for the next steps in their evolution, one can imagine setting up a “harness” that just keeps asking for values of f[n] at an infinite progression of values n.)
Another possible approach to making a universal nestedly recursive function is to imagine feeding in a “program” through the initial conditions one gives for the function. There might well need to be decoding involved, but in some sense what one might hope is that just by changing its initial conditions one could get a nestedly recursive function with a specific recursive definition to emulate a nestedly recursive function with any other recursive definition (or, say, for a start, any linear recurrence).
Perhaps one could construct a complicated nestedly recursive function that would have this property. But what the Principle of Computational Equivalence suggests is that it should be possible to find the property even in “naturally occurring cases”—like P312 or T212.
The situation is probably going to be quite analogous to what happened with the rule 110 cellular automaton or the s = 2, k = 3 Turing machine 596440. By looking at the actual typical behavior of the system one got some intuition about what was likely to be going on. And then later, with great effort, it became possible to actually prove computation universality.
In the case of nestedly recursive functions, we’ve seen here examples of just how diverse the behavior generated by changing initial conditions can be. It’s not clear how to harness this diversity to extract some form of universality. But it seems likely that the “raw material” is there. And that nestedly recursive functions will show themselves able to join so many other systems in fitting into the framework defined by the Principle of Computational Equivalence.
Once one has the concept of functions and the concept of recursion, nestedly recursive functions aren’t in some sense a “complicated idea”. And between this fact and the fact that nestedly recursive functions haven’t historically had a clear place in any major line of mathematical or other development it’s quite difficult to be sure one’s accurately tracing their history. But I’ll describe here at least what I currently know.
The concept of something like recursion is very old. It’s closely related to mathematical induction, which was already being used for proofs by Euclid around 300 BC. And in a quite different vein, around the same time (though not recorded in written form until many centuries later) Fibonacci numbers arose in Indian culture in connection with the enumeration of prosody (“How many different orders are there in which to say the Sanskrit words in this veda?”).
Then in 1202 Leonardo Fibonacci, at the end of his calculational math book Liber Abaci (which was notable for popularizing Hindu-Arabic numerals in the West) stated—more or less as a recreational example—his “rabbit problem” in recursive form, and explicitly listed the Fibonacci numbers up to 377. But despite this early appearance, explicit recursively defined sequences remained largely a curiosity until as late as the latter part of the twentieth century.
The concept of an abstract function began to emerge with calculus in the late 1600s, and became more solidified in the 1700s—but basically always in the context of continuous arguments. A variety of specific examples of recurrence relations—for binomial coefficients, Bernoulli numbers, etc.—were in fairly widespread use. But there didn’t seem to have yet been a sense that there was a general mathematical structure to study.
In the course of the 1800s there had been an increasing emphasis on rigor and abstraction in mathematics, leading by the latter part of the century to a serious effort to axiomatize concepts associated with numbers. Starting with concepts like the recursive definition of integers by repeated application of the successor operation, by the time of Peano’s axioms for arithmetic in 1891 there was a clear general notion (particularly related to the induction axiom) that (integer) functions could be defined recursively. And when David Hilbert’s program of axiomatizing mathematics got underway at the beginning of the 1900s, it was generally assumed that all (integer) functions of interest could actually be defined specifically using primitive recursion.
The notation for recursively specifying functions gradually got cleaner, making it easier to explore more elaborate examples. And in 1927 Wilhelm Ackermann (a student of Hilbert’s) introduced (in completely modern notation) a “reasonable mathematical function” that—as we discussed above—he showed was not primitive recursive. And right there, in his paper, without any particular comment, is a nestedly recursive function definition:
In 1931 Kurt Gödel further streamlined the representation of recursion, and solidified the notion of general recursion. There soon developed a whole field of recursion theory—though most of it was concerned with general issues, not with specific, concrete recursive functions. A notable exception was the work of Rózsa Péter (Politzer), beginning in the 1930s, and leading in 1957 to her book Recursive Functions—which contains a chapter on “Nested Recursion” (here in English translation):
But despite the many specific (mostly primitive) recursive functions discussed in the rest of the book, this chapter doesn’t stray far from the particular function Ackermann defined (or at least Péter’s variant of it).
What about the recreational mathematics literature? By the late 1800s there were all sorts of publications involving numbers, games, etc. that at least implicitly involved recursion (an example being Édouard Lucas’s 1883 Tower of Hanoi puzzle). But—perhaps because problems tended to be stated in words rather than mathematical notation—it doesn’t seem as if nestedly recursive functions ever showed up.
In the theoretical mathematics literature, a handful of somewhat abstract papers about “nested recursion” did appear, an example being one in 1961 by William Tait, then at Stanford:
But, meanwhile, the general idea of recursion was slowly beginning to go from purely theoretical to more practical. John McCarthy—who had coined the term “artificial intelligence”—was designing LISP as “the language for AI” and by 1960 was writing papers with titles like “Recursive Functions of Symbolic Expressions and Their Computation by Machine”.
In 1962 McCarthy came to Stanford to found the AI Lab there, bringing with him enthusiasm for both AI and recursive functions. And by 1968 these two topics had come together in an effort to use “AI methods” to prove properties of programs, and in particular programs involving recursive functions. And in doing this, John McCarthy came up with an example he intended to be awkward—that’s exactly a nestedly recursive function:
In our notation, it would be:
And it became known as “McCarthy’s 91-function” because, yes, for many n, f[n] = 91. These days it’s trivial to evaluate this function—and to find out that f[n] = 91 only up to n = 102:
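With the standard definition (restated here as a quick sketch), a short table shows where the values stop being 91:

f91[n_] := If[n > 100, n - 10, f91[f91[n + 11]]]   (* McCarthy's 91-function *)
Table[f91[n], {n, 95, 105}]
(* {91, 91, 91, 91, 91, 91, 91, 92, 93, 94, 95} *)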
But even the evaluation graph is somewhat large
and in pure recursive evaluation the recursion stack can get deep—which back then was a struggle for LISP systems to handle.
There were efforts at theoretical analysis, for example by Zohar Manna, who in 1974 published Mathematical Theory of Computation, which—in a section entitled “Fixpoints of Functionals”—presents the 91-function and other nestedly recursive functions, particularly in the context of evaluation-order questions.
In the years that followed, a variety of nestedly recursive functions were considered in connection with proving theorems about programs, and with practical assessments of LISP systems, a notable example being Ikuo Takeuchi’s 1978 triple recursive function:
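In one common rendering (a sketch; Takeuchi’s original returns y where McCarthy’s later benchmark variant “TAK” returns z) it reads:

(* Takeuchi-style triple recursion; the exact original form is assumed here *)
tak[x_, y_, z_] := If[x <= y, y,
  tak[tak[x - 1, y, z], tak[y - 1, z, x], tak[z - 1, x, y]]]

tak[4, 2, 0]   (* 4 *)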
But in all these cases the focus was on how these functions would be evaluated, not on what their behavior would be (and it was typically very simple).
But now we have to follow another thread in the story. Back in 1961, right on the Stanford campus, a then-16-year-old Douglas Hofstadter was being led towards nestedly recursive functions. As Doug tells it, it all started with him seeing that squares are interspersed with gaps of 1 or 2 between triangular numbers, and then noticing patterns in those gaps (and later realizing that they showed nesting). Meanwhile, at Stanford he had access to a computer running Algol, a language which (like LISP and unlike Fortran) supported recursion (though this wasn’t particularly advertised, since recursion was still generally considered quite obscure).
And as Doug tells it, within a year or two he was using Algol to do things like recursively create trees representing English sentences. Meanwhile—in a kind of imitation of the Eleusis “guess-a-card-rule” game—Doug was apparently challenging his fellow students to a “function game” based on guessing a math function from specified values. And, as he tells it, he found that functions that were defined recursively were the ones people found it hardest to guess.
That was all in the early 1960s, but it wasn’t until the mid-1970s that Doug Hofstadter returned to such pursuits. After various adventures, Doug was back at Stanford—writing what became his book Gödel, Escher, Bach. And in 1977 he sent a letter to Neil Sloane, creator of the 1973 A Handbook of Integer Sequences (and what’s now the Online Encyclopedia of Integer Sequences, or OEIS):
As suggested by the accumulation of “sequence ID” annotations on the letter, Doug’s “eta sequences” had actually been studied in number theory before—in fact, since at least the 1920s (they are now usually called Beatty sequences). But the letter went on, now introducing some related sequences—that had nestedly recursive definitions:
As Doug pointed out, these particular sequences (which were derived from golden ratio versions of his “eta sequences”) have a very regular form—which we would now call nested. And it was the properties of this form that Doug seemed most concerned about in his letter. But actually, as we saw above, just a small change in initial conditions in what I’m calling S1 would have led to much wilder behavior. But that apparently wasn’t something Doug happened to notice. A bit later in the letter, though, there was another nestedly recursive sequence—that Doug described as a “horse of an entirely nother color”: the “absolutely CRAZY” Q sequence:
Two years later, Doug’s Gödel, Escher, Bach book was published—and in it, tucked away at the bottom of page 137, a few pages after a discussion of recursive generation of text (with examples such as “the strange bagels that the purple cow without horns gobbled”), there was the Q sequence:
Strangely, though, there was no picture of it, and Doug listed only 17 terms (which, until I was writing this, was all I assumed he had computed):
So now nestedly recursive sequences were out in the open—in what quickly became a very popular book. But I don’t think many people noticed them there (though, as I’ll discuss, I did). Gödel, Escher, Bach is primarily a playful book focused on exposition—and not the kind of place you’d expect to find a new, mathematical-style result.
Still—quite independent of the book—Neil Sloane showed Doug’s 1977 letter to his Bell Labs colleague Ron Graham, who within a year made a small mention of the Q sequence in a staid academic math publication (in a characteristic “state-it-as-a-problem” Erdös form):
There was a small and tight-knit circle of serious mathematicians—essentially all of whom, as it happens, I personally knew—who would chase these kinds of easy-to-state-but-hard-to-solve problems. Another was Richard Guy, who soon included the sequence as part of problem E31 in his Unsolved Problems in Number Theory, and mentioned it again a few years later.
But for most of the 1980s little was heard about the sequence. As it later turned out, a senior British applied mathematician named Brian Conolly (who wasn’t part of the aforementioned tight-knit circle) had—presumably as a kind of hobby project—made some progress, and in 1986 had written to Guy about it. Guy apparently misplaced the letter, but later told Conolly that John Conway and Sol Golomb had worked on similar things.
Conway presumably got the idea from Hofstadter’s work (though he had a habit of obfuscating his sources). But in any case, on July 15, 1988, Conway gave a talk at Bell Labs entitled “Some Crazy Sequences” (note the word “crazy”, just like in Hofstadter’s letter to Sloane) in which he discussed the regular-enough-to-be-mathematically-interesting sequence (which we’re calling G3111 here):
Despite its visual regularity, Conway couldn’t mathematically prove certain features of the wiggles in the sequence—and in his talk offered a $10,000 prize for anyone who could. By August a Bell Labs mathematician named Colin Mallows had done it. Conway claimed (later to be contradicted by video evidence) that he’d only offered $1000—and somehow the whole affair landed as a story in the August 30 New York Times under the heading “Intellectual Duel: Brash Challenge, Swift Response”. But in any case, this particular nestedly recursive sequence became known as “Conway’s Challenge Sequence”.
So what about Sol Golomb? It turns out he’d started writing a paper—though never finished it:
He’d computed 280 terms of the Q sequence (he wasn’t much of a computer user) and noticed a few coincidences. But he also mentioned another kind of nestedly recursive sequence, no doubt inspired by his work on feedback shift registers:
As he noted, the behavior depends greatly on the initial conditions, though is always eventually periodic—with his student Unjeng Cheng having found long-period examples.
OK, so by 1988 nestedly recursive functions had at least some notoriety. So what happened next? Not so much. There’s a modest academic literature that’s emerged over the last few decades, mostly concentrated very specifically around “Conway’s Challenge Sequence”, Hofstadter’s Q function, or very similar “meta Fibonacci” generalizations of them. And so far as I know, even the first published large-scale picture of the Q sequence only appeared in 1998 (though I had pictures of it many years earlier):
Why wasn’t more done with nestedly recursive functions? At some level it’s because they tend to have too much computational irreducibility—making it pretty difficult to say much about them in traditional mathematical ways. But perhaps more important, studying them broadly is really a matter of ruliology: it requires the idea of exploring spaces of rules, and of expecting the kinds of behavior and phenomena that are characteristic of systems in the computational universe. And that’s something that’s still not nearly as widely understood as it should be.
I think 1979 was the year when I first took recursion seriously. I’d heard about the Fibonacci sequence (though not under that name) as a young child a decade earlier. I’d implicitly (and sometimes explicitly) encountered recursion (sometimes through error messages!) in computer algebra systems I’d used. In science, I’d studied fractals quite extensively (Benoit Mandelbrot’s book having appeared in 1977), and I’d been exposed to things like iterated maps. And I’d quite extensively studied cascade processes, notably of quarks and gluons in QCD.
As I think about it now, I realize that for several years I’d written programs that made use of recursion (and I had quite a lot of exposure to LISP, and the culture around it). But it was in 1979—having just started using C—that I first remember writing a program (for doing percolation theory) where I explicitly thought “this is using recursion”. But then, in late 1979, I began to design SMP (“Symbolic Manipulation Program”), the forerunner of the modern Wolfram Language. And in doing this I quickly solidified my knowledge of mathematical logic and the (then-fledgling) field of theoretical computer science.
My concept of repeated transformations for symbolic expressions—which is still the core of Wolfram Language today—is somehow fundamentally recursive. And by the time we had the first signs of life for our SMP system, Fibonacci was one of our very first tests. We soon tried the Ackermann function too. And in 1980 I became very interested in the problem of evaluation order, particularly for recursive functions—and the best treatment I found of it (though at the time not very useful to me) was in none other than the book by Zohar Manna that I mentioned above. (In a strange twist, I was at that time also studying gauge choices in physics—and it was only last year that I realized that they’re fundamentally the same thing as evaluation orders.)
It was soon after it came out in 1979 that I first saw Douglas Hofstadter’s book. At the time I wasn’t too interested in its Lewis-Carroll-like aspects, or its exposition; I just wanted to know what the “science meat” in it was. And somehow I found the page about the Q sequence, and filed it away as “something interesting”.
I’m not sure when I first implemented the Q sequence in SMP, but by the time we released Version 1.0 in July 1981, there it was: an external package (hence the “X” prefix) for evaluating “Hofstadter’s recursive function”, elegantly using memoization—with the description I gave saying (presumably because that’s what I’d noticed) that its values “have several properties of randomness”:
Firing up a copy of SMP today—running on a virtual machine that still thinks it’s 1986—I can run this code, and easily compute the function:
I can even plot it—though without an emulator for a 1980s-period storage-tube display, only the ASCIIfied rendering works:
So what did I make of the function back in 1981? I was interested in how complexity and randomness could occur in nature. But at the time, I didn’t have enough of a framework to understand the connection. And, as it was, I was just starting to explore cellular automata, which seemed a lot more “nature like”—and which soon led me to things like rule 30 and the phenomenon of computational irreducibility.
Still, I didn’t forget the Q sequence. And when I was building Mathematica I again used it as a test (the .tq file extension came from the brief period in 1987 when we were trying out “Technique” as the name of the system):
When Mathematica 1.0 was released on June 23, 1988, the Q sequence appeared again, this time as an example in the soon-in-every-major-bookstore Mathematica book:
I don’t think I was aware of Conway’s lecture that occurred just 18 days later. And for a couple of years I was consumed with tending to a young product and a young company. But by 1991, I was getting ready to launch into basic science again. Meanwhile, the number theorist (and today horologist) Ilan Vardi—yet again from Stanford—had been working at Wolfram Research and writing a book entitled Computational Recreations in Mathematica, which included a long section on the analysis of Takeuchi’s nested recursive function (“TAK”). My email archive records an exchange I had with him:
He suggested a “more symmetrical” nested recursive function. I responded, including a picture that made it fairly clear that this particular function would have only nested behavior, and not seem “random”:
By the summer of 1991 I was in the thick of exploring different kinds of systems with simple rules, discovering the remarkable complexity they could produce, and filling out what became Chapter 3 of A New Kind of Science: “The World of Simple Programs”. But then came Chapter 4: “Systems Based on Numbers”. I had known since the mid-1980s about the randomness in things like digit sequences produced by successive arithmetic operations. But what about randomness in pure sequences of integers? I resolved to find out just what it would take to produce randomness there. And so it was that on August 13, 1993, I came to be enumerating possible symbolic forms for recursive functions—and selecting ones that could generate at least 10 terms:
As soon as I plotted the “survivors” I could see that interesting things were going to happen:
Was this complexity somehow going to end? I checked it out to 10 million terms. And soon I started collecting my “prize specimens” and making a gallery of them:
I had some one-term recurrences, and some two-term ones. Somewhat shortsightedly I was always using “Fibonacci-like” initial conditions f[1] = f[2] = 1—and I rejected any recurrence that tried to “look back” to f[0], f[–1], etc. And at the time, with this constraint, I only found “really interesting” behavior in two-term recurrences.
In 1994 I returned briefly to recursive sequences, adding a note “solving” a few of them, and discussing the evaluation graphs of others:
When I finally finished A New Kind of Science in 2002, I included a list of historical “Close approaches” to its core discoveries, one of them being Douglas Hofstadter’s work on recursive sequences:
In retrospect, back in 1981 I should have been able to take the “Q sequence” and recognize in it the essential “rule 30 phenomenon”. But as it was, it took another decade—and many other explorations in the computational universe—for me to build up the right conceptual framework to see this. In A New Kind of Science I studied many kinds of systems, probing them far enough, I hoped, to see their most important features. But recursive functions were an example where I always felt there was more to do; I felt I’d only just scratched the surface.
In June 2003—a year after A New Kind of Science was published—we held our first summer school. And as a way to introduce methodology—and be sure that people knew I was fallible and approachable—I decided on the first day of the summer school to do a “live experiment”, and try to stumble my way to discovering something new, live and in public.
A few minutes before the session started, I picked the subject: recursive functions. I began with some examples I knew. Then it was time to go exploring. At first lots of functions “didn’t work” because they were looking back too far. But then someone piped up “Why not just say that f[n] is 1 whenever n isn’t a positive integer?” Good idea! And very easy to try.
Soon we had the “obvious” functions written (today Apply[Plus, ...] could be just Total[...], but otherwise there’s nothing “out of date” here):
In a typical story of Wolfram-Language-helps-one-think-clearly, the obvious function was also very general, and allowed a recurrence with any number of terms. So why not start with just one term? And immediately, there it was—what we’re now calling T311:
And then a plot (yes, after Version 6 one didn’t need the Show or the trailing “;”):
Of course, as is the nature of computational constructions, this is something timeless—that looks the same today as it did 21 years ago (well, except that now our plots display with color by default).
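Concretely, a modern rendering (again assuming, per the naming convention above, that T311 is f[n] = 3 f[n – f[n – 1]] with f[n] = 1 for n ≤ 0) might be:

ClearAll[f]
f[n_] := f[n] = If[n <= 0, 1, 3 f[n - f[n - 1]]]   (* memoized, with "infinite" initial conditions; definition assumed as above *)
ListLinePlot[Table[f[n], {n, 500}]]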
I thought this was a pretty neat discovery. And I just couldn’t believe that years earlier I’d failed to see the obvious generalization of having “infinite” initial conditions.
The next week I did a followup session, this time talking about how one would write up a discovery like this. We started off with possible titles (including audience suggestions):
And, yes, the first title listed is exactly the one I’ve now used here. In the notebook I created back then, there were first some notes (some of which should still be explored!):
Three hours later (on the afternoon of July 11, 2003) there’s another notebook, with the beginnings of a writeup:
By the way, another thing we came up with at the summer school was the (non-nestedly) recursive function:
Plotting g[n + 1] – g[n] gives:
And, yes, bizarrely (and reminiscent of McCarthy’s 91-function) for n ≥ 396, g[n + 1] – g[n] is always 97, and g[n] = 38606 + 97 (n – 396).
But in any case, a week or so after my “writeups” session, the summer school was over. In January 2004 I briefly picked the project up again, and made some pictures that, yes, show interesting structure that perhaps I should investigate now:
In the years that followed, I would occasionally bring nestedly recursive functions out again—particularly in interacting with high school and other students. And at our summer programs I suggested projects about them for a number of students.
In 2008, they seemed like an “obvious interesting thing” to add to our Demonstrations Project:
But mostly, they languished. Until, that is, my burst of “finish this” intellectual energy that followed the launch of our Physics Project in 2020. So here now, finally, after a journey of 43 years, I feel like I’ve been able to do some justice to nestedly recursive functions, and provided a bit more illumination to yet another corner of the computational universe.
(Needless to say, there are many, many additional questions and issues to explore. Different primitives, e.g. Mod, Floor, etc. Multiple functions that refer to each other. Multiway cases. Functions based on rational numbers. And endless potential approaches to analysis, identifying pockets of regularity and computational reducibility.)
Thanks to Brad Klee for extensive help. Thanks also to those who’ve worked on nestedly recursive functions as students at our summer programs over the years, including Roberto Martinez (2003), Eric Rowland (2003), Chris Song (2021) and Thomas Adler (2024). I’ve benefitted from interactions about nestedly recursive functions with Ilan Vardi (1991), Tal Kubo (1993), Robby Villegas (2003), Todd Rowland (2003 etc.), Jim Propp (2004), Matthew Szudzik (2005 etc.), Joseph Stocke (2021 etc.), Christopher Gilbert (2024) and Max Niedermann (2024). Thanks to Doug Hofstadter for extensive answers to questions about history for this piece. It’s perhaps worth noting that I’ve personally known many of the people mentioned in the history section here (with the dates I met them indicated): John Conway (1984), Paul Erdös (1986), Sol Golomb (1981), Ron Graham (1983), Benoit Mandelbrot (1986), John McCarthy (1981) and Neil Sloane (1983).
2024-08-30 00:31:46
Today is my birthday—for the 65th time. Five years ago, on my 60th birthday, I did a livestream where I talked about some of my plans. So… what happened? Well, what happened was great. And in fact I’ve just had the most productive five years of my life. Nine books. 3939 pages of writings (1,283,267 words). 499 hours of podcasts and 1369 hours of livestreams. 14 software product releases (with our great team). Oh, and a bunch of big—and beautiful—ideas and results.
It’s been wonderful. And unexpected. I’ve spent my life alternating between technology and basic science, progressively building a taller and taller tower of practical capabilities and intellectual concepts (and sharing what I’ve done with the world). Five years ago everything was going well, and making steady progress. But then there were the questions I never got to. Over the years I’d come up with a certain number of big questions. And some of them, within a few years, I’d answered. But others I never managed to get around to.
And five years ago, as I explained in my birthday livestream, I began to think “it’s now or never”. I had no idea how hard the questions were. Yes, I’d spent a lifetime building up tools and knowledge. But would they be enough? Or were the questions just not for our time, but only perhaps for some future century?
At several points before in my life I’d faced such issues—and things had worked out well (A New Kind of Science, Wolfram|Alpha, etc.). And from this, I had gotten a certain confidence about what might be possible. In addition, as a serious student of intellectual history, I had a sense of what kind of boldness was needed. Five years ago there wasn’t really anything that made me need to do something big and new. But I thought: “What the heck. I might as well try. I’ll never know what’s possible unless I try.”
A major theme of my work since the early 1980s had been exploring the consequences of simple computational rules. And I had found the surprising result that even extremely simple rules could lead to immensely complex behavior. So what about the universe? Could it be that at a fundamental level our whole universe is just following some simple computational rule?
I had begun my career in the 1970s as a teenager studying the frontiers of existing physics. And at first I couldn’t see how computational rules could connect to what is known in physics. But in the early 1990s I had an idea, and by the late 1990s I had developed it and gotten some very suggestive results. But when I published these in A New Kind of Science in 2002, even my friends in the physics community didn’t seem to care—and I decided to concentrate my efforts elsewhere (e.g. building Wolfram|Alpha, Wolfram Language, etc.).
But I didn’t stop thinking “one day I need to get back to my physics project”. And in 2019 I decided: “What the heck. Let’s try it now.” It helped that I’d made a piece of technical progress the year before, and that now two young physicists were enthusiastic to work with me on the project.
And so it was, soon after my birthday in 2019, that we embarked on our Physics Project. It was a mixture of computer experiments and big concepts. But before the end of 2019 it was clear: it was going to work! It was an amazing experience. Thing after thing in physics that had always been mysterious I suddenly understood. And it was beautiful—a theory of such strength built on a structure of such incredible simplicity and elegance.
We announced what we’d figured out in April 2020, right when the pandemic was in full swing. There was still much to do (and there still is today). But the overall picture was clear. I later learned that a century earlier many well-known physicists were beginning to think in a similar direction (matter is discrete, light is discrete; space must be too) but back then they hadn’t had the computational paradigm or the other tools needed to move this forward. And now the responsibility had fallen on us to do this. (Pleasantly enough, given our framework, many modern areas of mathematical physics seemed to fit right in.)
And, yes, figuring out the basic “machine code” for our universe was of course pretty exciting. But seeing an old idea of mine blossom like this had another very big effect on me. It made me think: “OK, what about all those other projects I’ve been meaning to do? Maybe it’s time to do those too.”
And something else had happened as well. In doing the Physics Project we’d developed a new way of thinking about things—not just computational, but “multicomputational”. And actually, the core ideas behind this were in A New Kind of Science too. But somehow I’d never taken them seriously enough before, and never extended my intuition to encompass them. But now with the Physics Project I was doing this. And I could see that the ideas could also go much further.
So, yes, I had a new and powerful conceptual framework for doing science. And I had all the technology of the modern Wolfram Language. But in 2020 I had another thing too—in effect, a new distribution channel for my ideas and efforts. Early in my career I had used academic papers as my “channel” (at one point in 1979 even averaging a paper every few weeks). But in the late 1980s I had a very different kind of channel: embodying my ideas in the design and implementation of Mathematica and what’s now the Wolfram Language. Then in the 1990s I had another channel: putting everything together into what became my book A New Kind of Science.
After that was published in 2002 I would occasionally write small posts—for the community site around the science in my book, for our corporate blog, etc. And in 2010 I started my own blog. At first I mostly just wrote small, fun pieces. But by 2015—partly driven by telling historical stories (200th anniversary of George Boole, 200th anniversary of Ada Lovelace, …)—the things I was writing were getting ever meatier. (There’d actually already been some meaty ones about personal analytics in 2012.)
And by 2020 my pattern was set and I would routinely write 50+ -page pieces, full of pictures (all with immediately runnable “click-to-copy” code) and intended for anyone who cared to read them. Finally I had a good channel again. And I started using it. As I’d found over the years—whether with language documentation or with A New Kind of Science—the very act of exposition was a critical part of organizing and developing my ideas.
And now I started producing pieces. Some were directly about specific topics around the Physics Project. But within two months I was already writing about a “spinoff”: “Exploring Rulial Space: The Case of Turing Machines”. I had realized that one of the places the ideas of the Physics Project should apply was to the foundations of mathematics, and to metamathematics. In a footnote to A New Kind of Science I had introduced the idea of “empirical metamathematics”. And in the summer of 2020, fuelled by my newfound “finish those old projects” mindset, I ended up writing an 80-page piece on “The Empirical Metamathematics of Euclid and Beyond”.
December 7, 1920 was the date a certain Moses Schönfinkel introduced what we now call combinators: the very first clear foundations for universal computation. I had always found combinators interesting (if hard to understand). I had used ideas from them back around 1980 in the predecessor of what’s now the Wolfram Language. And I had talked about them a bit in A New Kind of Science. But as the centenary approached, I decided to make a more definitive study, in particular using methods from the Physics Project. And, for good measure, even in the middle of the pandemic I tracked down the mysterious history of Moses Schönfinkel.
In March 2021, there was another centenary, this time of Emil Post’s tag system, and again I decided to finish what I’d started in A New Kind of Science, and write a definitive piece, this time running to about 75 pages.
One might have thought that the excursions into empirical metamathematics, combinators, tag systems, rulial and multiway Turing machines would be distractions. But they were not. Instead, they just deepened my understanding and intuition for the new ideas and methods that had come out of the Physics Project. As well as finishing projects that I’d wondered about for decades (and the world had had open for a century).
Perhaps not surprisingly given its fundamental nature, the Physics Project also engaged with some deep philosophical issues. People would ask me about them with some regularity. And in March 2021 I started writing a bit about them, beginning with a piece on consciousness. The next month I wrote “Why Does the Universe Exist? Some Perspectives from Our Physics Project”. (This piece of writing happened to coincide with the few days in my life when I’ve needed to do active cryptocurrency trading—so I was in the amusing position of thinking about a philosophical question about as deep as they come, interspersed with making cryptocurrency trades.)
Everything kept weaving together. These philosophical questions made me internalize just how important the nature of the observer is in our Physics Project. Meanwhile I started thinking about the relationship of methods from the Physics Project to distributed computing, and to economics. And in May 2021 that intersected with practical blockchain questions, which caused me to write about “The Problem of Distributed Consensus”—which would soon show up again in the science and philosophy of observers.
The fall of 2021 involved really leaning into the new multicomputational paradigm, among other things giving a long list of where it might apply: metamathematics, chemistry, molecular biology, evolutionary biology, neuroscience, immunology, linguistics, economics, machine learning, distributed computing. And, yes, in a sense this was my “to do” list. In many ways, half the battle was just defining this. And I’m happy to say that just three years later, we’ve already made a big dent in it.
While all of this was going on, I was also energetically pursuing my “day job” as CEO of Wolfram Research. Version 12.1 of the Wolfram Language had come out less than a month before the Physics Project was announced. Version 12.2 right after the combinator centenary. And in 2021 there were two new versions. In all 635 new functions, all of which I had carefully reviewed, and many of which I’d been deeply involved in designing.
It’s a pattern in the history of science (as well as technology): some new methodology or some new paradigm is introduced. And suddenly vast new areas are opened up. And there’s lots of juicy “low-hanging fruit” to be picked. Well, that’s what had happened with the ideas from our Physics Project, and the concept of multicomputation. There were many directions to go, and many people wanting to get involved. And in 2021 it was becoming clear that something organizational had to be done: this wasn’t a job for a company (even for one as terrific and innovative as ours is), it was a job for something like an institute. (And, yes, in 2022, we indeed launched what’s now the Wolfram Institute for Computational Foundations of Science.)
But back in 1986, I had started the very first institute concentrating on complexity and how it could arise from simple rules. Running it hadn’t been a good fit for me back then, and very quickly I started our company. In 2002, when A New Kind of Science was published, I’d thought again about starting an institute. But it didn’t happen. But now there really seemed to be no choice. I started reflecting on what had happened to “complexity”, and whether there was something to leverage from the institutional structure that had grown up around it. Nearly 20 years after the publication of A New Kind of Science, what should “complexity” be now?
I wrote “Charting a Course for ‘Complexity’: Metamodeling, Ruliology and More”—and in doing so, finally invented a word for the “pure basic science of what simple rules do”: ruliology.
My original framing of what became our Physics Project had been to try to “find a computational rule that gives our universe”. But I’d always found this unsatisfying. Because even if we had the rule, we’d still be left asking “why this one, and not another?” But in 2020 there’d been a dawning awareness of a possible answer.
Our Physics Project is based on the idea of applying rules to abstract hypergraphs that represent space and everything in it. But given a particular rule, there are in general many ways it can be applied. And a key idea in our Physics Project is that somehow it’s always applied in all these ways—leading to many separate threads of history, that branch and merge—and, importantly, giving us a way to understand quantum mechanics.
We talked about these different threads of history corresponding to different places in branchial space—and about how the laws of quantum mechanics are the direct analogs in branchial space (or branchtime) of the laws of classical mechanics (and gravity) in physical space (or spacetime). But what if instead of just applying a given rule in all possible ways, we applied all possible rules in all possible ways?
What would one get? In November 2021 I came up with a name for it: the ruliad. A year and a half earlier I’d already been starting to talk about rulial space—and the idea of us as observers perceiving the universe in terms of our particular sampling of rulial space. But naming the ruliad really helped to crystallize the concept. And I began to realize that I had come upon a breathtakingly broad intellectual arc.
The ruliad is the biggest computational thing there can be: it’s the entangled limit of all possible computations. It’s abstract and it’s unique—and it’s as inevitable in its structure as 2 + 2 = 4. It encompasses everything computational—including us. So what then is physics? Well, it’s a description of how observers like us embedded in the ruliad perceive the ruliad.
Back in 1984 I’d introduced what I saw as being the very central concept of computational irreducibility: the idea that there are many computational processes whose outcomes can be found only by following them step by step—with no possibility of doing what mathematical science was used to, and being able to “jump ahead” and make predictions without going through each step. At the beginning of the 1990s, when I began to work on A New Kind of Science, I’d invented the Principle of Computational Equivalence—the idea that systems whose behavior isn’t obviously simple will always tend to be equivalent in the sophistication of the computations they do.
Given the Principle of Computational Equivalence, computational irreducibility was inevitable. It followed from the fact that the observer could only be as computationally sophisticated as the system they were observing, and so would never be able to “jump ahead” and shortcut the computation. There’d come to be a belief that eventually science would always let one predict (and control) things. But here—from inside science—was a fundamental limitation on the power of science. All these things I’d known in some form since the 1980s, and with clarity since the 1990s.
But the ruliad took things to another level. For now I could see that the very laws of physics we know were determined by the way we are as observers. I’d always imagined that the laws of physics just are the way they are. But now I realized that we could potentially derive them from the inevitable structure of the ruliad, and very basic features of what we’re like as observers.
I hadn’t seen this philosophical twist coming. But somehow it immediately made sense. We weren’t getting our laws of physics from nothing; we were getting them from being the way we are. Two things seemed to be critical: that as observers we are computationally bounded, and that (somewhat relatedly) we believe we are persistent in time (i.e. we have a continuing thread of experience through time).
But even as I was homing in on the idea of the ruliad as it applied to physics, I was also thinking about another application: the foundations of mathematics. I’d been interested in the foundations of mathematics for a very long time; in fact, in the design of Mathematica (and what’s now the Wolfram Language) and its predecessor SMP, I’d made central use of ideas that I’d developed from thinking about the foundations of mathematics. And in A New Kind of Science, I’d included a long section on the foundations of mathematics, discussing things like the network of all possible theorems, and the space of all possible axiom systems.
But now I was developing a clearer picture. The ruliad represented not only all possible physics, but also all possible mathematics. And the actual mathematics that we perceive—like the actual physics—would be determined by our nature as observers, in this case mathematical observers. There were lots of technical details, and it wasn’t until March 2022 that I published “The Physicalization of Metamathematics and Its Implications for the Foundations of Mathematics”.
In some ways this finished what I’d started in the mid-1990s. But it went much further than I expected, in particular in providing a sweeping unification of the foundations of physics and mathematics. It talked about what the ultimate limit of mathematics would be like. And it talked about how “human-level mathematics”—where we can discuss things like the Pythagorean theorem rather than just the microdetails of underlying axioms—emerges for observers like us just like our human-level impression of physical space emerges from the underlying network of atoms of space.
One of the things I’d discovered in computational systems is how common computational irreducibility is, along with undecidability. And I had always wondered why undecidability wasn’t more common in typical mathematics. But now I had an answer: it just isn’t what mathematical observers like us “see” in the ruliad. At some level, this was a very philosophical result. But for me it also had practical implications, notably greatly validating the idea of using higher-level computational language to represent useful human-level mathematics, rather than trying to drill down to “axiomatic machine code”.
October 22, 2021 had marked a third of a century of Mathematica. And May 14, 2022 was the 20th anniversary of A New Kind of Science. And in contextualizing my activities, and planning for the future, I’ve increasingly found it useful to reflect on what I’ve done before, and how it’s worked out. And in both these cases I could see that seeds I’d planted many years earlier had blossomed, sometimes in ways I’d suspected they might, and sometimes in ways that far exceeded what I’d imagined.
What had I done right? The key, it seemed, was drilling down to find the essence of things, and then developing that. Even if I hadn’t been able to imagine quite what could be built on them, I’d been able to construct solid foundations, that successfully encapsulated things in the cleanest and simplest ways.
In talking about observers and the ruliad—and in fact our Physics Project in general—I kept on making analogies to the way that the gas laws and fluid dynamics emerge from the complicated underlying dynamics of molecules. And at the core of this is the Second Law of thermodynamics.
Well, as it happens, the very first foundational question in physics that I ever seriously studied was the origin of the Second Law. But that was when I was 12 years old, in 1972. For more than a century the Second Law had been quite mysterious. But when I discovered computational irreducibility in 1984 I soon realized that it might be the key to the Second Law. And in the summer of 2022—armed with a new perspective on the importance of observers—I decided I’d better once and for all write down how the Second Law works.
Once again, there were lots of technical details. And as a way to check my ideas I decided to go back and try to untangle the rather confused 150-year history of the Second Law. It was an interesting exercise, satisfying for seeing how my new ways of thinking clarified things, but cautionary in seeing how wrong turns had been taken—and solidified—in the past. But in the end, there it was: the Second Law was a consequence of the interplay between underlying computational irreducibility, and our limitations as observers.
It had taken half a century, but finally I had finished the project I’d started when I was 12 years old. I was on a roll finishing things. But I was also realizing that a bigger structure than I’d ever imagined was emerging. The Second Law project completed what I think is the most beautiful thing I’ve ever discovered: that all three of the core theories of twentieth century physics—general relativity, quantum mechanics and the Second Law (statistical mechanics)—have the same origin, the interplay between the underlying computational structure of the ruliad and our characteristics and limitations as observers.
And I knew it didn’t stop there. I’d already applied the same kind of thinking to the foundations of mathematics. And I was ready to start applying it to all sorts of deep questions in science, in philosophy, and beyond. But at the end of 2022, just as I was finishing my pieces about the Second Law, there was a surprise: ChatGPT.
I’d been following AI and neural nets for decades. I first simulated a neural net in 1981. My first company, started in 1981, had, to my chagrin, been labeled an “AI company”. And from the early 2010s we’d integrated neural nets into the Wolfram Language. But—like the creators of ChatGPT—I didn’t expect the capabilities that emerged in ChatGPT. And as soon as I saw ChatGPT I started trying to understand it. What was it really doing? What would its capabilities be?
In the world at large, there was a sense of shock: if AI can do this now, soon it’ll be able to do everything. But I immediately thought about computational irreducibility. It imposes limitations on us, and those limitations would inevitably apply to AIs as well. There would be things that couldn’t be “quickly figured out by pure thought”—by humans and AIs alike. And, by the way, I’d just spent four decades building a way to represent things computationally, and actually do systematic computations on them—because that was the point of the Wolfram Language.
So immediately I could see we were in a very interesting position. The Wolfram Language had the unique mission of creating a full-scale computational language. And now this was a crucial tool for AIs. The AIs could provide a very interesting and useful broad linguistic interface. But when it came to solid computation, they were—like humans—going to need a tool. Conveniently, Wolfram|Alpha already communicated in natural language. And it took only a few weeks to hook up Wolfram|Alpha—and Wolfram Language—to ChatGPT. We’d given “computational superpowers” to the AI.
ChatGPT was everywhere. And people kept asking me about it. And over and over again I ended up explaining things about it. So at the beginning of February 2023 I decided it’d be better for me just to write down once and for all what I knew. It took a little over a week (yes, I’m a fast writer)—and then I had an “explainer” of ChatGPT that ran altogether to 76 pages.
Partly it talked in general about how machine learning and neural nets work, and how ChatGPT in particular works. But what a lot of people wanted to know was not “how” but “why” ChatGPT works. Why was something like that possible? Well, in effect ChatGPT was showing us a new science discovery—about language. Everyone knows that there’s a certain syntactic grammar of language—like the fact that, in English, sentences typically have the form noun-verb-noun. But what ChatGPT was showing us is that there’s also a semantic grammar—some pattern of rules for what words can be put together and make sense.
I’ve thought about the foundations of language for a long time (which isn’t too surprising, given the four decades I’ve spent as a computational language designer). So in effect I was well primed to think about its interaction with ChatGPT. And it also helped that—as I’ll talk about below—one of my long-unfinished projects is precisely on a formal framework for capturing meaning that I call “symbolic discourse language”.
In technology and other things I always like best situations where basically nothing is known, and one has to invent everything from scratch. And that’s what was happening for functionality based on LLMs in the middle of 2023. How would LLM-based Wolfram Language functions work? How would a prompt repository work? How would LLMs interact with notebooks?
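One kind of answer that emerged is prompt-templated “LLM functions” in the Wolfram Language. As a minimal illustration (this uses the LLMFunction that now exists in the language, and needs an LLM connection configured in order to actually evaluate):

summarize = LLMFunction["Summarize the following in one sentence: `1`"];
summarize["Computational irreducibility means that some processes can only be predicted by effectively running them step by step."]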
Meanwhile, there was still lots of ferment in the world about the “AI shock”. Before the arrival of the Physics Project in 2019, I’d been quite involved in AI philosophy, AI ethics, etc. And in March 2023 I wrote a piece on “Will AIs Take All Our Jobs and End Human History—or Not?” In the end—after all sorts of philosophical arguments, and an analysis of actual historical data—the answer was: “It’s Complicated”. But along the way computational irreducibility and the ruliad were central elements: limiting the controllability of AIs, allowing for an infinite frontier of invention, and highlighting the inevitable meaninglessness of everything in the absence of human choice.
By this point (and actually, with remarkable speed) my explainer on ChatGPT had turned into a book—that proved extremely popular (and now, for example, exists in over 10 languages). It was nice that people found the book useful—and perhaps it helped remove some of the alarming mystique of AI. But I couldn’t help noticing that of all the many things I’d written, this had been one of the fastest to write, yet it was garnering one of the largest readerships.
One might have imagined that AI was pretty far from our Physics Project, the ruliad, etc. But actually it soon became clear that there were close connections, and that there were things to learn in both directions. In particular, I’d come to think of minds that work in different ways as occupying different positions in the ruliad. But how could one get intuition about what such minds would experience—or observe? Well, I realized, one could just look at generative AI. In July I wrote “Generative AI Space and the Mental Imagery of Alien Minds”. I called this the “cats in hats piece”, because, yes, it has lots of pictures of (often bizarrely distorted) cats (in hats)—used as examples of what happens if one moves a mind around in rulial space. But despite the whimsy of the cats, this piece provided a surprisingly useful window into what for me has been a very longstanding question of how other minds might perceive things.
And this fed quite directly into my piece on “Observer Theory” in December 2023. Ever since things like Turing machines we’ve had a formal model for the process of computation. My goal was to do the same kind of thing for the process of observation. In a sense, computation constructs sequences of new things, say with time. Observation, on the other hand, equivalences things together, so they fit in finite minds. And just what equivalencing is done—by our senses, our measuring devices, our thinking—determines what our ultimate perceptions will be. Or, put another way, if we can characterize well enough what we’re like as observers, it’ll show us how we sample the ruliad, and what we’ll perceive the laws of physics to be.
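As a minimal sketch of this kind of equivalencing (and only a sketch, not the formal observer-theory setup), one can imagine an observer that perceives a cellular automaton evolution only through coarse-grained blocks, keeping a single majority value for each block:

evolution = CellularAutomaton[30, {{1}, 0}, 200];
observed = Map[Round[Mean[Flatten[#]]] &, Partition[evolution, {8, 8}], {2}];
ArrayPlot[observed]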
When I started the Physics Project I wasn’t counting on it having any applications for hundreds of years. But quite soon it became clear that actually there were going to be all sorts of near-term applications, particularly of the formalism of multicomputation. And every time one used that formalism one could get more intuition about features of the Physics Project, particularly related to quantum mechanics. I ended up writing a variety of “ruliological” pieces, all, as it happens, expanding on footnotes in A New Kind of Science. There was “Multicomputation with Numbers” (October 2021), “Games and Puzzles as Multicomputational Systems” (June 2022) and “Aggregation and Tiling as Multicomputational Processes” (November 2023). And in September 2023 there was also “Expression Evaluation and Fundamental Physics”.
Back around 1980—when I was working on SMP—I’d become interested in the theory of expression evaluation. And finally now, four decades later—with the Physics Project, and my work on combinators and metamathematics—I had a principled way to study it (potentially with immediate applications in distributed computing and computational language design). And I could check off progress on another long-pending project.
I give many talks, and do many podcasts and livestreams—essentially all unprepared. But in October 2023 I agreed to give a TED talk. And I just didn’t see any way to fit a reasonable snapshot of my activities into 18 minutes without preparation. How was I to coherently explain the Physics Project, the ruliad and computational language in such a short time? I called the talk “How to Think Computationally about AI, the Universe and Everything”. And I began with what for me was a new condensation: “Human language. Mathematics. Logic. These are all ways to formalize the world. And in our century there’s a new and yet more powerful one: computation.”
Over the years I’d done all sorts of seemingly very different projects in science and in technology. But somehow it seemed like they were now all converging. Back in 1979, for example, I’d invented the idea of transformations for symbolic expressions as a foundation for computational language. But now—more than four decades later—our Physics Project was saying that those kinds of transformations (specifically on hypergraphs) were just what the “machine code of the universe” was made of.
Since the 1980s I’d thought that computation was a useful paradigm with which to think about the world. But now our Physics Project and the ruliad were saying that it wasn’t just useful; it was the underlying paradigm of the world. For some time I’d been viewing our whole Wolfram Language effort as a way to formalize computation for the purposes of both humans and machines. Four hundred years ago mathematical notation had streamlined mathematical thinking, allowing what became the mathematical sciences to develop. I saw what we were doing with our computational language as a way to streamline computational thinking, and allow “computational X” for all fields “X” to develop.
I began to see computational thinking as a way to “humanize” the ruliad; to pick out those parts that are meaningful to humans. And I began to see computational language as the bridge between the power of raw computation, and the kinds of things we humans think about.
But how did AI fit in? At the beginning of 2024, lots of people were still asking in effect “Can AI Solve Science?” So I decided to analyze that. I certainly didn’t expect AI to be able to “break computational irreducibility”. And it didn’t. Yes, it could automate much of what humans could do in a quick look. But formalized, irreducible computation: that was going to need computational language, not AI.
It’s easy to be original in the computational universe: if you pick a rule at random, it’s overwhelmingly likely nobody’s ever looked at it before. But will anyone care? They’ll care if in effect that part of the ruliad has been “colonized”; if there’s already a human connection to it. But what if you define some attribute that you want, then just “search out there” for a rule that exhibits it? That’s basically what biological evolution—or machine learning training—seems to do.
And as a kind of off-hand note I decided to just see if I could make a minimal model for that. I’d tried before—in the mid-1980s. And in the 1990s when I was writing A New Kind of Science I’d become convinced that computational irreducibility was in a sense a stronger force than adaptive evolution, and that when complex behavior was seen in biology, it was computational irreducibility that should take most of the credit.
But I decided to just do the experiment and see. And although computational irreducibility in a sense tells one to always “expect the unexpected”, in all these years I’ve never fully come to terms with that—and I’m still regularly surprised by what simple systems somehow “cleverly” manage to do. And so it was with my minimal model of biological evolution.
I’d always wondered why biological evolution managed to work at all, why it didn’t “get stuck”, and how it managed to come up with the ornate “solutions” it did. Well, now I knew: and it turned out it was, once again, a story of computational irreducibility. And I’d managed to finish another project that I started in the 1980s.
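To give a concrete, if toy, sense of what “defining an attribute and searching for rules that exhibit it” can look like (this is just an illustrative sketch, not the actual minimal model), one can pick random range-2 cellular automaton rules and keep those whose pattern from a single cell fills at least 40% of the final row after 100 steps:

attribute[rule_] := With[{last = Last[CellularAutomaton[{rule, 2, 2}, {{1}, 0}, 100]]},
  Total[last] >= 0.4 Length[last]];
found = Select[RandomInteger[{0, 2^32 - 1}, 200], attribute];
If[found =!= {}, ArrayPlot[CellularAutomaton[{First[found], 2, 2}, {{1}, 0}, 100]]]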
But then there was machine learning. And despite all the energy around it—as well as practical experience with it—it didn’t seem like there was a good foundational understanding of what it was doing or why it worked. For a couple of years I’d been asking all the machine learning experts I ran into what they knew. But mostly they confirmed that, yes, it wasn’t well understood. And in fact several of them suggested that I’d be the best person to figure it out.
So just a few weeks ago, starting with ideas from the biological evolution project, and mixing in some things I tried back in 1985, I decided to embark on exploring minimal models of machine learning. I just posted the results last week. And, yes, one seems to be able to see the essence of machine learning in systems vastly simpler than neural nets. In these systems one can visualize what’s going on—and it’s basically a story of finding ways to put together lumps of irreducible computation to do the tasks we want. Like stones one might pick up off the ground to put together into a stone wall, one gets something that works, but there’s no reason for there to be any understandable structure to it.
Like so many of the projects I’ve done in the past five years, I could in principle have done this project much earlier—even in the 1980s. But back then I didn’t have the intuition, the tools or the intellectual confidence to actually dive in and get the project done. And what’s been particularly exciting over the past five years is that I can feel—and even very tangibly see—how what I can do has grown. With every project I’ve done I’ve further honed my intuition, developed more tools (both conceptual and practical), and built my intellectual confidence. Could I have gotten here earlier in my life? I don’t think so. I think to get to where I am now required the kind of journey I’ve taken through science, technology and the other things I’ve done. A living example of the phenomenon of computational irreducibility.
I started my career young—and usually found myself the “youngest person in the room”. But shockingly fast all those years whizzed by, and now I’m usually the “oldest person in the room”. But somehow I always still seem to feel like a young whippersnapper—not settled into some expected pattern, and “pushing for the future”.
I’ve always done projects that are hard. Projects that many people thought were impossible. Projects that stretched my capabilities to the limit. And to do this has required a certain mixture of confidence and humility. Confidence that it’s worth me trying the project. Humility in not assuming that it’ll be easy for me.
I’ve learned a lot of fields by now, and with them a lot of different ways of thinking. But somehow it’s never enough to make the projects I do easy. Somehow the projects are always far enough out on the frontier that I have to learn new things and new ways of thinking to succeed at them. And so there I am, often the only person in the room whose project isn’t somehow easy for them. And who still has to be pushing, whippersnapper style.
At this point, a fair fraction of the projects I do are ones that I’ve thought about for a long time; a smaller fraction are opportunistic—coming into scope just now as a result of something I’ve done, or something that’s happened in the world at large. Before the past five years I had a lot of projects that had languished, often for decades. Yes, I thought they would be interesting, and I gradually collected information about them. But somehow I wasn’t quite in a place to tackle them.
But now I feel quite differently. In the past five years, I’ve gone back and finished a fair fraction of all those languishing projects. And it’s been great. Without exception, the projects turned out to be richer and more interesting than I expected. Often I realized I really couldn’t have done them without the tools and ideas (and infrastructure) I now have. And—often to my great surprise—the projects turned out to have very direct connections to big themes around the ruliad, the Physics Project and, for that matter, computational language.
Why was this happening? Partly it’s a tribute to the breadth of the computational (and now multicomputational) paradigm. But partly it has to do with the specific character of projects I was choosing—always seeking what seemed like the simplest, most foundational versions of things.
I’ve done quite a few big projects in my life, many seemingly very different. But as I look back, I realize that all my projects have a certain overall pattern to them. They’re all about taking something that seems complicated, then drilling down to find the foundations of what’s going on, and then building up from these—often with considerable engineering-style effort. And the methods and tools I’ve developed have in a sense implicitly been optimized for this pattern of work.
I suppose one gets used to the rhythm of it all. The time when one’s drilling down, slowly trying to understand things. The time when one’s doing all the work to build the big structure up. And yes, it’s all hard. But by now I know the signs of progress, and they’re always energizing to see.
At any given time, I’ll have many projects gestating—often for years or decades. But once a project becomes active, it’s usually the only one I’m working on. And I’ll work on it with great intensity, pushing hard to keep going until it’s done. Often I’ll be working with other people, usually much younger than me. And I think it’s always a surprise that I’ll routinely be the one who works with the greatest intensity—every day, at all hours.
I think I’m pretty efficient too. Of course, it helps that I have a tool—Wolfram Language—that I’ve been building for decades to support me. And it helps that I’ve developed all kinds of practices around how I organize code and notebooks I create, and how I set up my process of writing about things. Of course, it also helps that I have very capable people around me to make suggestions, explore additional directions, fill in details, check things, and get my write-ups produced and published.
As I have written about elsewhere, my life is in many ways set up to be quite simple and routine. I get up at the same time every day, eat the same thing for breakfast, and so on. But in a sense this frees me to concentrate on the intellectual things I’m doing—which are different every day, often in unexpected ways.
But how is it that I even get the time to do all these intellectual things? After all, I am—as I have been for the past 38 years—the CEO of a very active tech company. Two things I think help (in addition, of course, to the fact that I have such a great long-term team at the company). First, organization. And second, resolve. Every day I’ll have tightly scheduled meetings over the course of the working day. (And there are lots of details to this. I get up in the late morning, then do my first two meetings while walking, and so on.) But somehow—mostly on evenings and weekends—I find time to work intensely on my intellectual projects.
It’s not as if I ignore everything else in the world. But I do have a certain drive—and resolve—that fills any time available with my projects, and somehow seems to succeed in getting them done. (And, yes, there are many optimizations in the details of my life, saving me all sorts of time. And it probably helps that I’ve been a work-from-home CEO now for 33 years.)
One might have thought that CEOing would greatly detract from being able to do intellectual work. But I find the exact opposite. Because in my experience the discipline of strategy and decision making (as well as communicating thoughts and ideas to other people) that comes with CEOing is critical to being able to do incisive intellectual work. And, by the way, the kind of thinking that goes with intellectual work is also incredibly valuable in being an effective CEO.
There’s another critical part to my “formula”. And that has to do with exposition. For me, the exposition of a project is an integral part of the project. Part of it is that the very definition of the question is often one of the most important parts of a project. But more than that, it’s through exposition that I find I really understand things. It takes a certain discipline. It can be easy enough to make some highfalutin technical statement. But can one grind it down into truly simple pieces that one can immediately understand? Yes, that means other people will be able to understand it too. But for me, what’s critical is that that’s the way I can tell if I’m getting things right. And for me the exposition is what in the end defines the backbone of a project.
Normally I write quickly, and basically without revision. But whenever there’s a piece I’m finding unduly hard to write I know that’s where I’m muddled, and need to go back and understand what’s going on. Some of my projects (like creating this piece, for example) end up being essentially “pure writing”. But most are deeply computational—and full of computer experiments. And just as I put a lot of effort into making written exposition clear, I do the same for computational language, and for pictures. Indeed, many of my projects are in large measure driven by pictures. Usually these are what one can think of as “algorithmic diagrams”—created automatically with a structure optimized for exposition.
And the pictures aren’t just useful for presenting what I’ve done; they’re also critical to my own efforts to figure things out. And I’ve learned that it’s important to get the presentational details of pictures right as early as possible in a project—to give myself the best chance to notice things.
Often the projects I do require exploring large numbers of possible systems. And somehow with great regularity this leads to me ending up looking at large arrays of little pictures. Yes, there’s a lot of “looking” that can be automated. But in the end computational irreducibility means there’ll always be the unexpected, that I basically have to see for myself.
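The kind of “array of little pictures” involved is easy to generate; here, for example, are the first 64 elementary cellular automaton rules, each run for 30 steps from a single black cell:

GraphicsGrid[Partition[
  Table[ArrayPlot[CellularAutomaton[r, {{1}, 0}, 30], Frame -> False], {r, 0, 63}], 8]]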
A great thing about the Wolfram Language is that it’s been very stable ever since it was first released. And that means that I can take notebooks even from the 1980s and immediately run them today. And, yes, given all the “old” projects I’ve worked on in the past five years, that’s been very important.
But in addition to being very stable, the Wolfram Language is also very self-contained—and very much intended to be readable by humans. And the result is something that I’ve found increasingly important: every computational picture in everything I write has Wolfram Language code “behind it”, which you can get by clicking. All the time I find myself going back to previous things I’ve written, and picking up click-to-copy code to run for some new case, or to use as the basis for something new I’m doing.
And of course that click-to-copy code is open for anyone to use. Not only for its “computational content”, but also for the often-elaborate visuals it implements.
Most of my writings over the past five years have been about new basic science. But interspersed with this—along with pieces about technology and about philosophy—are pieces about history. And in fact many of my scientific pieces have had extensive historical sections as well.
Why do I put such effort into history? Partly I just find it fun to figure out. But mostly it’s to contextualize my understanding of things. Particularly in the past five years I’ve ended up working on a whole sequence of projects that are in a sense about changing longstanding directions in science. And to feel confident about making such changes, one has to know why people went in those directions in the first place. And that requires studying history.
Make no mistake: history—or at least good history—is hard. Often there’ll be a standard simple story about how some discovery was suddenly made, or how some direction was immediately defined. But the real story is usually much more complicated—and much more revealing of the true intellectual foundations of what was figured out. Almost never did someone discover something “one day”; almost always it took many years to build up the conceptual framework so that “one day” the key thing could even be noticed.
When I do history I always make a big effort to look at the original documents. And often I realize that’s critical—because it’s only with whatever new understanding I’ve developed that one would stand a chance of correctly interpreting what’s in the documents. And even if one’s mainly interested in the history of ideas, I’ve always found it’s crucial to also understand the people who were involved with them. What was their motivation? What was their practical situation? What kinds of things did they know about? What was their intellectual style in thinking about things?
It has helped me greatly that I’ve had my own experiences in making discoveries—that gives me an intuition for how the process of discovery works. And it also helps that I’ve had my fair share of “worldly” experiences. Still, often it’s at first a mystery how some idea developed or some discovery got made. But my consistent experience is that with enough effort one can almost always solve it.
Particularly for the projects I’ve done in recent years, this kind of historical work often leaves me with a strange feeling of connection. For in many cases I find out that the things I’ve now done can be viewed as direct follow-ons to ideas that were thought about a century or more ago, and for one reason or another ignored or abandoned since.
And I’m then usually left with a strong sense of responsibility. An idea that was someone’s great achievement had been buried and lost to the world. But now I have found it again, and it rests on me to bring it into the future.
In addition to writing about “other people’s history”, I’ve also been writing quite a bit about my own history. And in the last few years I’ve made a point of explaining my personal history around the science—and technology—I describe. In doing this, it helps a lot that I have excellent personal archives—which routinely let me track, to within minutes, discoveries I made even four decades ago.
My goal in describing my own history is to help other people contextualize things I write about. But I have to say that time and time again I’ve found the effort to piece together my own history extremely valuable just for me. As I go through life, I try to build up a repertoire of patterns for how things I do fit together. But often those patterns aren’t visible at the time. And it takes going back—often years later—to see them.
I do the projects I do first and foremost for myself. But I’ve always liked the idea that other people can get their own pleasure and benefit from my projects. And—basically starting with the Physics Project—I’ve tried to open to the world not just the results of my projects, but the process by which they’re done.
I post my working notebooks. Whenever practical I livestream my working meetings. And, perhaps taking things to an extreme, I record even my own solitary work, posting it in “video work logs”. (Except I just realized I forgot to record the writing I’m doing right now!)
A couple of years before the Physics Project I actually also opened up my technology development activities—livestreaming our software design reviews (692 hours of them in the past five years). (And, yes, I put a lot of work and effort into designing the Wolfram Language!)
At the beginning of the pandemic I thought: “There are all these kids out of school. Let me try to do a little bit of public service and livestream something about science and technology for them.” And that’s how I started my “Science & Technology Q&A for Kids & Others” livestreams, which I’ve now been doing for four and a half years. Along the way, I’ve added “History of Science & Technology Q&A”, “Future of Science & Technology Q&A”, and “Business, Innovation & Managing Life Q&A”. Altogether I’ve done 272 hours of these, which have generated 376 podcast episodes.
Twice a week I sit down in front of a camera, watch the feed of questions, and try to answer them. It’s always off the cuff, completely unprepared. And I find it a great experience. I can tell that over the time I’ve been doing this, I’ve become a better and more fluent explainer, which no doubt helps my written exposition too. Often in answering questions I’ll come up with a new way to explain something, that I’ve never thought of before. And often there’ll be questions that make me think about things I’ve never thought about at all before. Indeed, several of my recent projects actually got started as a result of questions people asked.
When I was younger I always just wanted to get on with research, create things, and so on; I wasn’t interested in education. But as I’ve gotten older I’ve come to really like education. Partly it’s because I feel I learn a lot myself from it, but mostly it’s because I find it fulfilling to use what I know and try to help people develop.
I’ve always been interested in people—a useful attribute in running a talent-rich company for four decades. (I’m particularly interested in how people develop through their lives—leading me recently, for example, to organize a 50-year reunion for my elementary school class.) I’ve had a long-time “hobby” of mentoring CEOs and kids (both being categories of people who tend to believe that anything is possible).
But my main educational efforts are concentrated in a few weeks of the year when we do our Wolfram Summer School (started in 2003) and our Wolfram High School Summer Research Program (started in 2012). All the students in these programs (775 of them over the past five years) do an original project, and one of my jobs is to come up with what all these projects should be. Over the course of the year I’ll accumulate ideas—though rather often when I actually meet a student I’ll invent something new.
I obviously do plenty of projects myself. But it’s always an interesting—and invigorating—experience to see so many projects get done with such intensity at our summer programs. Plus, I get lots of extra practice in framing projects, which helps when I come to frame my own.
At this point, I’ve spent years trying to organize my life to optimize it for what I want to get out of it. I need long stretches of time when I can concentrate coherently. But I like having a diversity of activities, and I’m pretty sure I wouldn’t have the energy and effectiveness I do without that. Over the years, I’ve added in little pieces. Like my weekly virtual sessions where I “do my homework” with a group of kids, working on something that I need to get done, but that doesn’t quite fit elsewhere. Or my weekly sessions with local kids, talking about things that make me and them think. Or, for that matter, my “call while driving” list of calls it’s good to make, but wouldn’t usually quite get the priority to happen.
Doing all the things I do is hard work. But it’s what I want to do. Yes, things can drag from time to time. But at this point I’m so used to the rhythm of projects that I don’t think I notice much. And, yes, I work basically every hour of every day I can. Do I have hobbies? Well, back when I was an academic, business was my main “hobby”. When I started CEOing, science became a “hobby”. Writing. Education. Livestreaming. These were all “hobbies” too. But somehow one of the patterns of my life is that nothing really stays quite as a “true hobby”.
The past five years have not only been my most productive ever, but they’ve also built more “productivity momentum” than I’ve had before. So, what’s next? I have a lot of projects currently “in motion”, or ready to “get into motion”. Then I have many more that are in gestation, for which the time may finally have come. But I know there’ll also be surprises: projects that suddenly occur to me, or that I suddenly realize are possible. And one of the great challenges is to be in a position to actually jump into such things.
It has to be said that there’s always a potentially complicated tradeoff. To what extent should one “tend” the things one’s already done, and to what extent should one do new things? Of course, there are some things that are never “done”—like the Wolfram Language, which I started building 38 years ago, and still (energetically) work on every day. Or the Physics Project, where there’s just so much to figure out. But one of the things that’s worked well in most of the basic science projects I’ve done in the past five years or so is that once I’ve written my piece about the project, I can usually consider the project “done for now”. It always takes a lot of effort to get a project to the point where I can write about it. But I work hard to make sure I only have to do it once; that I’ve “picked the low-hanging fruit”, so I don’t feel I have to come back “to add a little more”.
I put a lot of effort into the pieces I write about my projects. And I also give talks, do interviews, etc. (about 500 altogether in the past five years). But I certainly don’t “market” my efforts as much as I could. It’s a decision I’ve made: that at this point in my life—particularly with the burst of productivity I’m experiencing—I want to spend as much of my time as possible doing new things. And so I need to count on others to follow up and spread knowledge about what I’ve done, whether in the academic world, on Wikipedia, the web, etc. (And, yes, pieces I write and the pictures they contain are set up to be immediately reproducible wherever appropriate.)
OK, so what specific new things are currently in my pipeline? Well, there’s lots of science (and related intellectual things). And there’s also lots of technology. But let’s talk about science first.
A big story is the Physics Project—where there’s a lot to be done, in many different directions. There’s foundational theory to be developed. And there are experimental implications to be found.
It’d be great if we could find experimental evidence of the discreteness of space, or maximum entanglement speed, or a host of other unexpected phenomena in our models. A century or so ago it was something of a stroke of luck that atoms were big enough that they could be detected. And we don’t know if the discreteness of space is something we’ll be able to detect now—or only centuries from now.
There are phenomena—particularly associated with black holes—that might effectively serve as powerful “spacetime microscopes”. And there are phenomena like dimension fluctuations that could potentially show up in a variety of astrophysical settings. But one direction I’m particularly interested in exploring is what one might call “spacetime heat”—the effect of detailed microscopic dynamics in the hypergraph that makes up spacetime. Could “dark matter”, for example, not be “matter” at all, but instead be associated with spacetime heat?
Part of investigating this involves building practical simulation software to investigate our models on as large a scale as possible. And part of it involves “good, old-fashioned physics”, figuring out how to go from underlying foundational effects to observable phenomena.
And there’s a foundational piece to this too. How does one set up mathematics—and mathematical physics—when one’s starting from a hypergraph? A traditional manifold is ultimately built up from Euclidean space. But what kind of object is the limit of a hypergraph? To understand this, we need to construct what I’m calling infrageometry—and infracalculus alongside it. Infrageometry—as its name suggests—starts from something lower level than traditional geometry. And the challenge is in effect to build a “21st century Euclid”, then Newton, etc.—eventually finding generalizations of things like differential geometry and algebraic topology that answer questions like what 3-dimensional curvature tensors are like, or how we might distinguish local gauge degrees of freedom from spatial ones in a limiting hypergraph.
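One very simple probe in this general direction (using ordinary graphs, and falling far short of a full infrageometry) is to estimate an effective dimension from how fast geodesic-ball volumes grow with radius, on the assumption that V(r) ~ r^d. For the interior of a 3D grid graph this comes out close to 3; for a hypergraph generated by one of the models, the analogous measurement is one way to ask what “dimension of space” is being built up:

g = GridGraph[{20, 20, 20}];
center = First[GraphCenter[g]];
volume[r_] := VertexCount[NeighborhoodGraph[g, center, r]];
dims = Table[{r, Log[volume[r + 1]/volume[r]]/Log[(r + 1.)/r]}, {r, 2, 8}];
ListLinePlot[dims, AxesLabel -> {"r", "effective dimension"}]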
Another direction has to do with particles—like electrons. The fact is that existing quantum field theory in a sense only really deals with particles indirectly, by thinking of them as perturbations in a field—which in turn is full of (usually unobservable) zero-point fluctuations. In our models, the structure of everything—from spacetime up—is determined by the “fluctuating” structure of the underlying hypergraph (or, more accurately, by the whole multiway graph of “possible fluctuations”). And what this suggests is that there’s in a sense a much lower-level version of the Feynman diagrams we use in quantum field theory, in which we can discuss the “effect of particles” without ever having to say exactly what a particle “is”.
I must say that I expected we’d have to know what particles were even to talk about energy. But it turned out there was a “bulk” way to do that. And maybe similarly there’s an indirect way to talk about interactions between particles. My guess is that in our model particles are structures a bit like black holes—but we may be able to go a very long way without having to know the details.
One of the important features of our models is that quantum mechanics is “inevitable” in them. And one of the projects I’m hoping to do is to finally “really understand quantum mechanics”. In general terms, it’s connected to the way branching observers (like us) perceive branching universes. But how do we get intuition for this, and what effects can we expect? Several projects over the past years (like multiway Turing machines, multiway games, multiway aggregation, etc.) I’ve done in large part to bolster my intuition about branchial space and quantum mechanics.
I first worked on quantum computers back in 1980. And at the time, I thought that the measurement process (whose mechanism isn’t described in the standard formalism of quantum mechanics) would be a big problem for them. Years have gone by, and enthusiasm for quantum computers has skyrocketed. In our models there’s a rather clear picture that inside a quantum computer there are “many threads of history” that can in effect do computations in parallel. But for an observer like us to “know what the answer is” we have to knit those threads together. And in our models (particularly with my observer theory efforts) we start to be able to see how that might happen, and what the limitations might be.
Meanwhile, in the world at large there are all sorts of experimental quantum computers being built. But what are their limitations? I have a suspicion that there’s some as-yet-unknown fundamental physics associated with these limitations. It’s like building telescopes: you polish the mirror, and keep on making engineering tweaks. But unless you know about diffraction, you won’t understand why your resolution is limited. And I have a slight hope that even existing results on quantum computers may be enough to see limitations perhaps associated with maximum entanglement speed in our models. And the way our models work, knowing this speed, you can for example immediately deduce the discreteness scale of space.
Back in 1982, I and another physicist wrote two papers on “Properties of the Vacuum”. Part 1 was mechanical properties. Part 2 was electrodynamic. We announced a part 3, on gravitational properties. But we never wrote it. Well, finally, it looks as if our Physics Project shows us how to think about such properties. So perhaps it’s time to finally write “Part 3”, and respond to all those people who sent preprint request cards for it four decades ago.
One of the great conclusions of our Physics Project—and the concept of the ruliad—is that we have the laws of physics we do because we are observers of the kind we are. And just knowing very coarsely about us as observers seems to already imply the major laws of twentieth century physics. And to be able to say more, I think we need more characterization of us as observers. And my guess is, for example, that some feature of us that we probably consider completely obvious is what leads us to perceive space as (roughly) three dimensional. And indeed I increasingly suspect that the whole structure of our Physics Project can be derived—a bit like early derivations of special relativity—from certain axiomatic assumptions about our nature as observers, and fundamental features of computation.
There’s plenty to do on our Physics Project, and I’m looking forward to making progress with all of it. But the ideas of the Physics Project—and multicomputation in general—apply to lots of other fields too. And I have many projects planned on these.
Let’s talk first about chemistry. I never found chemistry interesting as a kid. But as we’ve added chemistry functionality in the Wolfram Language, I’ve understood more about it, and why it’s interesting. And I’ve also followed molecular computing since the 1980s. And now, largely inspired by thinking about multicomputation, I’ve become very interested in what one might call the foundations of chemistry. Actually, what I’m most interested in is what I’m calling “subchemistry”. I suppose one can think of it as having a similar kind of relation to chemistry as infrageometry has to geometry.
In ordinary chemistry, one thinks about reactions between different species of molecules. And to calculate rates of reactions, one multiplies concentrations of different species, implicitly assuming that there’s perfect randomness in which specific molecules interact. But what if one goes to a lower level, and starts talking about the interactions not of species of molecules, but individual molecules? From our Physics Project we get the idea of making causal graphs that represent the causal relations between different specific interaction events.
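As a toy sketch (with hypothetical “A + B -> AB”-style events, and nothing chemically realistic about it), one can track which earlier event created each molecule that a given event consumes, and then draw the resulting event-level causal graph:

SeedRandom[42];
creators = ConstantArray[0, 12]; (* 0 marks molecules present in the initial condition *)
edges = {};
Do[
 picks = RandomSample[Range[Length[creators]], 2];
 edges = Join[edges, Thread[creators[[picks]] -> event]];
 creators = Append[Delete[creators, List /@ picks], event],
 {event, 1, 8}];
Graph[DeleteCases[edges, 0 -> _], VertexLabels -> "Name"]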
In a gas the assumption of molecular-level randomness will probably be pretty good. But even in a liquid it’ll be more questionable. And in more exotic materials it’ll be a completely different story. And I suspect that there are “subchemical” processes that can potentially be important, perhaps in a sense finding a new “slice of computational reducibility” within the general computational irreducibility associated with the Second Law.
But the most important potential application of subchemistry is in biology. If we look at biological tissue, a basic question might be: “What phase of matter is it?” One of the major takeaways from molecular biology in the last few decades has been that in biological systems, molecules (or at least large ones) are basically never just “bouncing around randomly”. Instead, their motion is typically carefully orchestrated.
So when we look at biological tissue—or a biological system—we’re basically seeing the result of “bulk orchestration”. But what are the laws of bulk orchestration? We don’t know. But I want to find out. I think the “mechanoidal phase” that I identified in studying the Second Law is potentially a good test case.
If we look at a microprocessor, it’s not very useful to describe it as “containing a gas of electrons”. And similarly, it’s not useful to describe a biological cell as “being liquid inside”. But just what kind of theory is needed to have a more useful description we don’t know. And my guess is that there’ll be some new level of abstraction that’s needed to think about this (perhaps a bit like the new abstraction that was needed to formulate information theory).
Biology is not big on theory. Yes, there’s natural selection. And there’s the digital nature of biomolecules. But mostly biology has ended up just accumulating vast amounts of data (using ever better instrumentation) without any overarching theory. But I suspect that in fact there’s another foundational theory to be found in biology. And if we find it, a lot of the data that’s been collected will suddenly fall into place.
There’s the “frankly molecular” level of biology. And there’s the more “functional” level. And I was surprised recently to be able to find a very minimal model that seems to capture “functional” aspects of biological evolution. It’s a surprisingly rich model, and there’s much more to explore with it, notably about how different “ideas” get propagated and developed in the process of adaptive evolution—and what kinds of tree-of-life-style branchings occur.
And then there’s the question of self replication—a core feature of biology. Just how simple can a system be and still exhibit it in a “biologically relevant way”? I had thought that self replication was “just relevant for biology”. But in thinking about the problem of observers in the ruliad, I’ve come to realize that it’s also relevant at a foundational level there. It’s no good to just have one observer; you have to have a whole “rulial flock” of similar ones. And to get similar ones you need something like self replication.
Talking of “societies of observers” brings me to another area I want to study: economics. How does a coherent economic system emerge from all the microscopic transactions and other events in a society? I suspect it’s a story that’s in the end similar to the theories we’ve studied in physics—from the emergence of bulk properties in fluids, to the emergence of continuum spacetime, and so on. But now in economics we’re dealing not with fluid density or metric, but instead with things like price. I don’t yet know how it will work out. Maybe computational reducibility will be associated with value. Maybe computational irreducibility will be what determines robustness of value. But I suspect that there’s a way of thinking about “economic observers” in the ruliad—and figuring out what “natural laws” they’ll “inevitably observe”. And maybe some of those natural laws will be relevant in thinking about the kind of questions we humans care about in economics.
It’s rather amazing in how many different areas one seems to be able to apply the kind of approach that’s emerged from the Physics Project, the ruliad, etc. One that I’ve very recently tackled is machine learning. And in my effort to understand its foundations, I’ve ended up coming up with some very minimal models. My purpose was to understand the essence of machine learning. But—somewhat to my surprise—it looks as if these minimal models can actually be practical ways to do machine learning. Their hardware-level tradeoffs are somewhat different. But—given my interest in practical technology—I want to see if one can build out a practical machine-learning framework that’s based on these (fundamentally discrete) models.
And while I’m not currently planning to investigate this myself, I suspect that the approach I’ve used to study machine learning can also be applied to neuroscience, and perhaps to linguistics. And, yes, there’ll probably be a lot of computational irreducibility in evidence. And once again one has to hope that the pockets of computational reducibility that exist will give rise to “natural laws” that are useful for what we care about in these fields.
In addition to these “big” projects, I’m also hoping to do a variety of “smaller” projects. Many I started decades ago, and in fact mentioned in A New Kind of Science. But now I feel I have the tools, intuition and intellectual momentum to finally finish them. Nestedly recursive functions. Deterministic random tilings. Undecidability in the three-body problem. “Meta-engineering” in the Game of Life. These might on their own seem esoteric. But my repeated experience—particularly in the past five years—is that by solving problems like these one builds examples and intuition that have surprisingly broad application.
And then there are history projects. Just what did happen to theories of discrete space in the early twentieth century (and how close did people like Einstein get to the ideas of our Physics Project)? What was the “ancient history” of neural nets, and why did people come to assume they should be based on continuous real numbers? I fully expect that as I investigate these things, I’ll encounter all sorts of “if only” situations—where for example some unpublished note languishing in an archive (or attic) would have changed the course of science if it had seen the light of day long ago. And when I find something like this, it’s yet more motivation to actually finish those projects of mine that have been languishing so long in the filesystem of my computer.
There’s a lot I want to do “down in the computational trenches”, in physics, chemistry, biology, economics, etc. But there are also things at a more abstract level in the ruliad. There’s more to study about metamathematics, and about how mathematics that we humans care about can emerge from the ruliad. And there are also foundational questions in computer science. P vs. NP, for example, can be formulated as an essentially geometric problem in the ruliad—and conceivably there are mathematical methods (say from higher category theory) that might give insight into it.
Then there are questions about hyperruliads and hyporuliads. In a hyperruliad that’s based on hypercomputation, there will be hyperobservers. But is there a kind of “rulial relativity” that makes their perception of things just the same as “ordinary observers” in the ordinary ruliad? A way to get some insight into this may be to study hyporuliads—versions of the ruliad in which there are only limited levels of computation possible. A bit like the way a spacelike singularity associated with a black hole supports only limited time histories, or a decidable axiomatic theory supports only proofs of limited length, there will be limitations in the hyporuliad. And by studying them, there’s a possibility that we’ll be able to see more about issues like what kinds of mathematical axioms can be compatible with observers like us.
It’s worth commenting that our Physics Project—and the ruliad—have all sorts of connections and resonances with long-studied ideas in philosophy. “Didn’t Kant talk about that? Isn’t that similar to Leibniz?”, etc. I’ve wanted to try to understand these historical connections. But while I’ve done a lot of work on the historical development of ideas, the ideas in question have tended to be more focused, and more tied to concrete formalism than they usually are in philosophy. “Did Kant actually mean that, or something completely different?” You might have to understand all his works to know. And that’s more than I think I can do.
I invented the concept of the ruliad as a matter of science. But it’s now clear that the ruliad has all sorts of connections and resonances not only with philosophy but also with theology. Indeed, in a great many belief systems there’s always been the idea that somehow in the end “everything is one”. In cases where this gets slightly more formalized, there’s often some kind of combinatorial enumeration involved (think: I Ching, or various versions of “counting the names of God”).
There are all sorts of examples where long-surviving “ancient beliefs” end up having something to them, even if the specific methods of post-1600s science don’t have much to say about them. One example is the notion of a soul, which we might now see as an ancient premonition of the modern notion of abstract computation. And whenever there’s a belief that’s ancient, there’s likely to have been lots of thinking done around it over the millennia. So if we can, for example, see a connection to the ruliad, we can expect to leverage that thinking. And perhaps also be able to provide new input that can refine the belief system in interesting and valuable ways.
I’m always interested in different viewpoints about things—whether from science, philosophy, theology, wherever. And an extreme version of this is to think about how other “alien” minds might view things. Nowadays I think of different minds as effectively being at different places in the ruliad. Humans with similar backgrounds have minds that are close in rulial space. Cats and dogs have minds that are further away. And the weather (with its “mind of its own”) is still further.
Now that we have AIs we potentially have a way to study the correspondence—and communication—between “different minds”. I looked at one aspect of this in my “cats” piece. But my recent work on the foundations of machine learning suggests a broader approach, that can also potentially tell us things about the fundamental character of language, and about how it serves as a medium that can “transport thoughts” from one mind to another.
Many non-human animals seem to have at least some form of language—though mostly in effect just a few standalone words. But pretty unquestionably the greatest single invention of our species is language—and particularly compositional language where words and phrases can fit together in an infinite number of ways. But is there something beyond compositional language? And, for example, where might we get if our brains were bigger?
With the 100 billion neurons in our brains, we seem to be able to handle about 50,000 words. If we had a trillion neurons we’d probably be able to handle more words (though perhaps more slowly), in effect letting us describe more things more easily. But what about something fundamentally beyond compositional language? Something perhaps “higher order”?
With a word we are in effect conflating all instances of a certain concept into a single object that we can then work with. But typically with ordinary words we’re dealing with what we might call “static concepts”. So what about “ways of thinking”, or paradigms? They’re more like active, functional concepts. And it’s a bit like dogs versus us: dogs deal with a few standalone words; we “package” those together into whole sentences and beyond. And at the next level, we could imagine in effect packaging things like generators of meaningful sentences.
Interestingly enough, we have something of a preview of ideas like this—in computational language. And this is one of those places where my efforts in science—and philosophy—start to directly intersect with my efforts in technology.
The foundation of the Wolfram Language is the idea of representing everything in computational terms, and in particular in symbolic computational terms. And one feature of such a representation is that it can encompass both “data” and “code”—i.e. both things one might think about, and ways one might think about them.
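A minimal illustration of that duality: a symbolic expression with an undefined head is just a piece of “data” one can operate on, but give the head a meaning and the very same expression becomes runnable “code”:

expr = f[1, 2, 3];
Length[expr]           (* -> 3: treated as data *)
Cases[expr, _Integer]  (* -> {1, 2, 3} *)
f = Plus;
expr                   (* -> 6: the same expression, now treated as code *)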
I first started building Wolfram Language as a practical tool—though one very much informed by my foundational ideas. And now, four decades later, the Wolfram Language has emerged as the largest single project of my life, and something that, yes, I expect to always put immense effort into. It wasn’t long ago that we finally finished my 1991 to-do list for Wolfram Language—and we have many projects running now that will take years to complete. But the mission has always remained the same: to take the concept of computation and apply it as broadly as possible, through the medium of computational language.
Now, however, I have some additional context for that—viewing computational language as a bridge from what we humans think about to what’s possible in the computational universe. And this helps in framing some of the ways to expand the foundations of our computational language, for example to multicomputation, or to hypergraph-based representations. It also helps in understanding the character of current AI, and how it needs to interact with computational language.
In the Wolfram Language we’ve been steadily trying to create a representation for everything. And when it comes to definitive, objective things we’ve gotten a long way. But there’s more than that in everyday discourse. For example, I might say “I’m going to drink a glass of orange juice.” Well, we do just fine at representing “a glass of orange juice” in the Wolfram Language, and we can compute lots of things—like nutrition content—about it. But what about “I’m going to drink…”? For that we need something different.
And, actually, I’ve been thinking for a shockingly long time about what one might need. I first considered the question in the early 1980s, in connection with “extending SMP to AI”. I learned about the attempts to make “philosophical languages” in the 1600s, and about some of the thinking around modern conlangs (constructed languages). Something that always held me back, though, was use cases. Yes, I could see how one could use things like this for tasks like customer service. But I wasn’t too excited about that.
But finally there was blockchain, and with it, smart contracts. And around 2015 I started thinking about how one might represent contracts in general not in legalese but in some precise computational way. And the result was that I began to crispen my ideas about what I called “symbolic discourse language”. I thought about how this might relate to questions like a “constitution for AIs” and so on. But I never quite got around to actually starting to design the specifics of the symbolic discourse language.
But then along came LLMs, together with my theory that their success had to do with a “semantic grammar” of language. And finally now we’ve launched a serious project to build a symbolic discourse language. And, yes, it’s a difficult language design problem, deeply entangled with a whole range of foundational issues in philosophy. But as, by now at least, the world’s most experienced language designer (for better or worse), I feel a responsibility to try to do it.
In addition to language design, there’s also the question of making all the various “symbolic calculi” that describe in appropriately coarse terms the operation of the world. Calculi of motion. Calculi of life (eating, dying, etc.). Calculi of human desires. Etc. As well as calculi that are directly supported by the computation and knowledge in the Wolfram Language.
And just as LLMs can provide a kind of conversational linguistic interface to the Wolfram Language, one can expect them also to do this to our symbolic discourse language. So the pattern will be similar to what it is for Wolfram Language: the symbolic discourse language will provide a formal and (at least within its purview) correct underpinning for the LLM. It may lose the poetry of language that the LLM handles. But from the outset it’ll get its reasoning straight.
The symbolic discourse language is a broad project. But in some sense breadth is what I have specialized in. Because that’s what’s needed to build out the Wolfram Language, and that’s what’s needed in my efforts to pull together the foundations of so many fields.
And in maintaining a broad range of interests there are some where I imagine that someday there’ll be a project I can do, but there may for example be many years of “ambient technology” that are needed before that project will be feasible. Usually, though, I have some “conceptual idea” of what the project might be. For example, I’ve followed robotics, imagining that one day there’ll be a way to do “general-purpose robotics”, perhaps constructing everything out of modular elements. I’ve followed biomedicine, partly out of personal self interest, and partly because I think it’ll relate to some of the foundational questions I’m asking in biology.
But in addition to all the projects where the goal is basic research, or technology development, I’m also hoping to pursue my interests in education. Much of what I hope to do relates to content, but some of it relates to access and motivation. I don’t have perfect evidence, but I strongly believe there’s a lot of young talent out there in the world that never manages to connect for example with things like the educational programs we put on. We–and I—have tried quite hard over the years to “bridge the gap”. But with the world as it is, it’s proved remarkably difficult. But it’s still a problem I’d like to solve, and I’ll keep picking away at it, hoping to change for the better some kids’ “trajectories”.
But about content I believe my path is clearer. With the modern Wolfram Language I think we’ve gone a long way towards being able to take computational thinking about almost anything, and being able to represent it in a formalized way, and compute from it. But how do people manage to do the computational thinking in the first place? Well, like mathematical thinking and other formalized kinds of thinking, they have to learn how to do it.
For years people have been telling me I should “write the book” to teach this. And finally in January of this year I started. I’m not sure how long it will take, but I’ll soon be starting to post sections I’ve written so far.
My goal is to create a general book—and course—that’s an introduction to computational thinking at a level suitable for typical first-year college students. Lots of college students these days say they want to study “computer science”. But really it’s computational X for some field X that they’re ultimately interested in. And neither the theoretical nor the engineering aspects of typical “computer science” are what’s most relevant to them. What they need to know is computational thinking as it might be applied to computational X—not “CS” but what one might call “CX”.
So what will CX101 be like? In some ways more like a philosophy course than a CS one. Because in the end it’s about generally learning to think, albeit in the new paradigm of computation. And the point is that once someone has a clear computational conceptualization of something, then it’s our job in the Wolfram Language to make sure that it’s easy for them to concretely implement it.
But how does one teach computational conceptualization? What I’ve concluded is that one needs to anchor it in actual things in the world. Geography. Video. Genomics. Yes, there are principles to explain. But they need practical context to make them useful, or even understandable. And what I’m finding is that framing everything computationally makes things incredibly much easier to explain than before. (A test example coming soon is whether I can easily explain math ideas like algebra and calculus this way.)
OK, so that’s a lot of projects. But I’m excited about all of them, and can’t wait to make them happen. At an age when many of my contemporaries are retiring, I feel like I’m just getting started. And somehow the way my projects keep on connecting back to things I did decades ago makes me feel—in a computational irreducibility kind of way—that there’s something necessary about all the steps I’ve taken. I feel like the things I’ve done have let me climb some hills. But now there are many more hills that have come into view. And I look forward to being able to climb those too. For myself and for the world.
2024-08-23 02:28:17
It’s surprising how little is known about the foundations of machine learning. Yes, from an engineering point of view, an immense amount has been figured out about how to build neural nets that do all kinds of impressive and sometimes almost magical things. But at a fundamental level we still don’t really know why neural nets “work”—and we don’t have any kind of “scientific big picture” of what’s going on inside them.
The basic structure of neural networks can be pretty simple. But by the time they’re trained up with all their weights, etc. it’s been hard to tell what’s going on—or even to get any good visualization of it. And indeed it’s far from clear even what aspects of the whole setup are actually essential, and what are just “details” that have perhaps been “grandfathered” all the way from when computational neural nets were first invented in the 1940s.
Well, what I’m going to try to do here is to get “underneath” this—and to “strip things down” as much as possible. I’m going to explore some very minimal models—that, among other things, are more directly amenable to visualization. At the outset, I wasn’t at all sure that these minimal models would be able to reproduce any of the kinds of things we see in machine learning. But, rather surprisingly, it seems they can.
And the simplicity of their construction makes it much easier to “see inside them”—and to get more of a sense of what essential phenomena actually underlie machine learning. One might have imagined that even though the training of a machine learning system might be circuitous, somehow in the end the system would do what it does through some kind of identifiable and “explainable” mechanism. But we’ll see that in fact that’s typically not at all what happens.
Instead it looks much more as if the training manages to home in on some quite wild computation that “just happens to achieve the right results”. Machine learning, it seems, isn’t building structured mechanisms; rather, it’s basically just sampling from the typical complexity one sees in the computational universe, picking out pieces whose behavior turns out to overlap what’s needed. And in a sense, therefore, the possibility of machine learning is ultimately yet another consequence of the phenomenon of computational irreducibility.
Why is that? Well, it’s only because of computational irreducibility that there’s all that richness in the computational universe. And, more than that, it’s because of computational irreducibility that things end up being effectively random enough that the adaptive process of training a machine learning system can reach success without getting stuck.
But the presence of computational irreducibility also has another important implication: that even though we can expect to find limited pockets of computational reducibility, we can’t expect a “general narrative explanation” of what a machine learning system does. In other words, there won’t be a traditional (say, mathematical) “general science” of machine learning (or, for that matter, probably also neuroscience). Instead, the story will be much closer to the fundamentally computational “new kind of science” that I’ve explored for so long, and that has brought us our Physics Project and the ruliad.
In many ways, the problem of machine learning is a version of the general problem of adaptive evolution, as encountered for example in biology. In biology we typically imagine that we want to adaptively optimize some overall “fitness” of a system; in machine learning we typically try to adaptively “train” a system to make it align with certain goals or behaviors, most often defined by examples. (And, yes, in practice this is often done by trying to minimize a quantity normally called the “loss”.)
And while in biology there’s a general sense that “things arise through evolution”, quite how this works has always been rather mysterious. But (rather to my surprise) I recently found a very simple model that seems to do well at capturing at least some of the most essential features of biological evolution. And while the model isn’t the same as what we’ll explore here for machine learning, it has some definite similarities. And in the end we’ll find that the core phenomena of machine learning and of biological evolution appear to be remarkably aligned—and both fundamentally connected to the phenomenon of computational irreducibility.
Most of what I’ll do here focuses on foundational, theoretical questions. But in understanding more about what’s really going on in machine learning—and what’s essential and what’s not—we’ll also be able to begin to see how in practice machine learning might be done differently, potentially with more efficiency and more generality.
Note: Click any diagram to get Wolfram Language code to reproduce it.
To begin the process of understanding the essence of machine learning, let’s start from a very traditional—and familiar—example: a fully connected (“multilayer perceptron”) neural net that’s been trained to compute a certain function f[x]:
If one gives a value x as input at the top, then after “rippling through the layers of the network” one gets a value at the bottom that (almost exactly) corresponds to our function f[x]:
Scanning through different inputs x, we see different patterns of intermediate values inside the network:
And here’s (on a linear and log scale) how each of these intermediate values changes with x. And, yes, the way the final value (highlighted here) emerges looks very complicated:
So how is the neural net ultimately put together? How are these values that we’re plotting determined? We’re using the standard setup for a fully connected multilayer network. Each node (“neuron”) on each layer is connected to all nodes on the layer above—and values “flow” down from one layer to the next, being multiplied by the (positive or negative) “weight” (indicated by color in our pictures) associated with the connection through which they flow. The value of a given neuron is found by totaling up all its (weighted) inputs from the layer before, adding a “bias” value for that neuron, and then applying to the result a certain (nonlinear) “activation function” (here ReLU or Ramp[z], i.e. If[z z]).
What overall function a given neural net will compute is determined by the collection of weights and biases that appear in the neural net (along with its overall connection architecture, and the activation function it’s using). The idea of machine learning is to find weights and biases that produce a particular function by adaptively “learning” from examples of that function. Typically we might start from a random collection of weights, then successively tweak weights and biases to “train” the neural net to reproduce the function:
We can get a sense of how this progresses (and, yes, it’s complicated) by plotting successive changes in individual weights over the course of the training process (the spikes near the end come from “neutral changes” that don’t affect the overall behavior):
The overall objective in the training is progressively to decrease the “loss”—the average (squared) difference between true values of f[x] and those generated by the neural net. The evolution of the loss defines a “learning curve” for the neural net, with the downward glitches corresponding to points where the neural net in effect “made a breakthrough” in being able to represent the function better:
It’s important to note that typically there’s randomness injected into neural net training. So if one runs the training multiple times, one will get different networks—and different learning curves—every time:
But what’s really going on in neural net training? Effectively we’re finding a way to “compile” a function (at least to some approximation) into a neural net with a certain number of (real-valued) parameters. And in the example here we happen to be using about 100 parameters.
But what happens if we use a different number of parameters, or set up the architecture of our neural net differently? Here are a few examples, indicating that for the function we’re trying to generate, the network we’ve been using so far is pretty much the smallest that will work:
And, by the way, here’s what happens if we change our activation function from ReLU
to the smoother ELU :
Later we’ll talk about what happens when we do machine learning with discrete systems. And in anticipation of that, it’s interesting to see what happens if we take a neural net of the kind we’ve discussed here, and “quantize” its weights (and biases) in discrete levels:
The result is that (as recent experience with large-scale neural nets has also shown) the basic “operation” of the neural net does not require precise real numbers, but survives even when the numbers are at least somewhat discrete—as this 3D rendering as a function of the discreteness level δ also indicates:
So far we’ve been discussing very traditional neural nets. But to do machine learning, do we really need systems that have all those details? For example, do we really need every neuron on each layer to get an input from every neuron on the previous layer? What happens if instead every neuron just gets input from at most two others—say with the neurons effectively laid out in a simple mesh? Quite surprisingly, it turns out that such a network is still perfectly able to generate a function like the one we’ve been using as an example:
And one advantage of such a “mesh neural net” is that—like a cellular automaton—its “internal behavior” can readily be visualized in a rather direct way. So, for example, here are visualizations of “how the mesh net generates its output”, stepping through different input values x:
And, yes, even though we can visualize it, it’s still hard to understand “what’s going on inside”. Looking at the intermediate values of each individual node in the network as a function of x doesn’t help much, though we can “see something happening” at places where our function f[x] has jumps:
So how do we train a mesh neural net? Basically we can use the same procedure as for a fully connected network of the kind we saw above (ReLU activation functions don’t seem to work well for mesh nets, so we’re using ELU here):
Here’s the evolution of differences in each individual weight during the training process:
And here are results for different random seeds:
At the size we’re using, our mesh neural nets have about the same number of connections (and thus weights) as our main example of a fully connected network above. And we see that if we try to reduce the size of our mesh neural net, it doesn’t do well at reproducing our function:
Mesh neural nets simplify the topology of neural net connections. But, somewhat surprisingly at first, it seems as if we can go much further in simplifying the systems we’re using—and still successfully do versions of machine learning. And in particular we’ll find that we can make our systems completely discrete.
The typical methodology of neural net training involves progressively tweaking real-valued parameters, usually using methods based on calculus, and on finding derivatives. And one might imagine that any successful adaptive process would ultimately have to rely on being able to make arbitrarily small changes, of the kind that are possible with real-valued parameters.
But in studying simple idealizations of biological evolution I recently found striking examples where this isn’t the case—and where completely discrete systems seemed able to capture the essence of what’s going on.
As an example consider a (3-color) cellular automaton. The rule is shown on the left, and the behavior one generates by repeatedly applying that rule (starting from a single-cell initial condition) is shown on the right:
The rule has the property that the pattern it generates (from a single-cell initial condition) survives for exactly 40 steps, and then dies out (i.e. every cell becomes white). And the important point is that this rule can be found by a discrete adaptive process. The idea is to start, say, from a null rule, and then at each step to randomly change a single outcome out of the 27 in the rule (i.e. make a “single-point mutation” in the rule). Most such changes will cause the “lifetime” of the pattern to get further from our target of 40—and these we discard. But gradually we can build up “beneficial mutations”
that through “progressive adaptation” eventually get to our original lifetime-40 rule:
We can make a plot of all the attempts we made that eventually let us reach lifetime 40—and we can think of this progressive “fitness” curve as being directly analogous to the loss curves in machine learning that we saw before:
If we make different sequences of random mutations, we’ll get different paths of adaptive evolution, and different “solutions” for rules that have lifetime 40:
Two things are immediately notable about these. First, that they essentially all seem to be “using different ideas” to reach their goal (presumably analogous to the phenomenon of different branches in the tree of life). And second, that none of them seem to be using a clear “mechanical procedure” (of the kind we might construct through traditional engineering) to reach their goal. Instead, they seem to be finding “natural” complicated behavior that just “happens” to achieve the goal.
It’s nontrivial, of course, that this behavior can achieve a goal like the one we’ve set here, as well as that simple selection based on random point mutations can successfully reach the necessary behavior. But as I discussed in connection with biological evolution, this is ultimately a story of computational irreducibility—particularly in generating diversity both in behavior, and in the paths necessary to reach it.
But, OK, so how does this model of adaptive evolution relate to systems like neural nets? In the standard language of neural nets, our model is like a discrete analog of a recurrent convolutional network. It’s “convolutional” because at any given step the same rule is applied—locally—throughout an array of elements. It’s “recurrent” because in effect data is repeatedly “passed through” the same rule. The kinds of procedures (like “backpropagation”) typically used to train traditional neural nets wouldn’t be able to train such a system. But it turns out that—essentially as a consequence of computational irreducibility—the very simple method of successive random mutation can be successful.
Let’s say we want to set up a system like a neural net—or at least a mesh neural net—but we want it to be completely discrete. (And I mean “born discrete”, not just discretized from an existing continuous system.) How can we do this? One approach (that, as it happens, I first considered in the mid-1980s—but never seriously explored) is to make what we can call a “rule array”. Like in a cellular automaton there’s an array of cells. But instead of these cells always being updated according to the same rule, each cell at each place in the cellular automaton analog of “spacetime” can make a different choice of what rule it will use. (And although it’s a fairly extreme idealization, we can potentially imagine that these different rules represent a discrete analog of different local choices of weights in a mesh neural net.)
As a first example, let’s consider a rule array in which there are two possible choices of rules:
A particular rule array is defined by which of these rules is going to be used at each (“spacetime”) position in the array. Here are a few examples. In all cases we’re starting from the same single-cell initial condition. But in each case the rule array has a different arrangement of rule choices—with cells “running” rule 4 being given a background, and those running rule 146 a one:
We can see that different choices of rule array can yield very different behaviors. But (in the spirit of machine learning) can we in effect “invert this”, and find a rule array that will give some particular behavior we want?
A simple approach is to do the direct analog of what we did in our minimal modeling of biological evolution: progressively make random “single-point mutations”—here “flipping” the identity of just one rule in the rule array—and then keeping only those mutations that don’t make things worse.
As our sample objective, let’s ask to find a rule array that makes the pattern generated from a single cell using that rule array “survive” for exactly 50 steps. At first it might not be obvious that we’d be able to find such a rule array. But in fact our simple adaptive procedure easily manages to do this:
As the dots here indicate, many mutations don’t lead to longer lifetimes. But every so often, the adaptive process has a “breakthrough” that increases the lifetime—eventually reaching 50:
Just as in our model of biological evolution, different random sequences of mutations lead to different “solutions”, here to the problem of “living for exactly 50 steps”:
Some of these are in effect “simple solutions” that require only a few mutations. But most—like most of our examples in biological evolution—seem more as if they just “happen to work”, effectively by tapping into just the right, fairly complex behavior.
Is there a sharp distinction between these cases? Looking at the collection of “fitness” (AKA “learning”) curves for the examples above, it doesn’t seem so:
It’s not too difficult to see how to “construct a simple solution” just by strategically placing a single instance of the second rule in the rule array:
But the point is that adaptive evolution by repeated mutation normally won’t “discover” this simple solution. And what’s significant is that the adaptive evolution can nevertheless still successfully find some solution—even though it’s not one that’s “understandable” like this.
The cellular automaton rules we’ve been using so far take 3 inputs. But it turns out that we can make things even simpler by just putting ordinary 2-input Boolean functions into our rule array. For example, we can make a rule array from And and Xor functions (r = 1/2 rules 8 and 6):
Different And+Xor ( + ) rule arrays show different behavior:
But are there for example And+Xor rule arrays that will compute any of the 16 possible (2-input) functions? We can’t get Not or any of the 8 other functions with —but it turns out we can get all 8 functions with (additional inputs here are assumed to be ):
And in fact we can also set up And+Xor rule arrays for all other “even” Boolean functions. For example, here are rule arrays for the 3-input rule 30 and rule 110 Boolean functions:
It may be worth commenting that the ability to set up such rule arrays is related to functional completeness of the underlying rules we’re using—though it’s not quite the same thing. Functional completeness is about setting up arbitrary formulas, that can in effect allow long-range connections between intermediate results. Here, all information has to explicitly flow through the array. But for example the functional completeness of Nand (r = 1/2 rule 7, ) allows it to generate all Boolean functions when combined for example with First (r = 1/2 rule 12, ), though sometimes the rule arrays required are quite large:
OK, but what happens if we try to use our adaptive evolution process—say to solve the problem of finding a pattern that survives for exactly 30 steps? Here’s a result for And+Xor rule arrays:
And here are examples of other “solutions” (none of which in this case look particularly “mechanistic” or “constructed”):
But what about learning our original f[x] = function? Well, first we have to decide how we’re going to represent the numbers x and f[x] in our discrete rule array system. And one approach is to do this simply in terms of the position of a black cell (“one-hot encoding”). So, for example, in this case there’s an initial black cell at a position corresponding to about x = –1.1. And then the result after passing through the rule array is a black cell at a position corresponding to f[x] = 1.0:
So now the question is whether we can find a rule array that successfully maps initial to final cell positions according to the mapping x f[x] we want. Well, here’s an example that comes at least close to doing this (note that the array is taken to be cyclic):
So how did we find this? Well, we just used a simple adaptive evolution process. In direct analogy to the way it’s usually done in machine learning, we set up “training examples”, here of the form:
Then we repeatedly made single-point mutations in our rule array, keeping those mutations where the total difference from all the training examples didn’t increase. And after 50,000 mutations this gave the final result above.
We can get some sense of “how we got there” by showing the sequence of intermediate results where we got closer to the goal (as opposed to just not getting further from it):
Here are the corresponding rule arrays, in each case highlighting elements that have changed (and showing the computation of f[0] in the arrays):
Different sequences of random mutations will lead to different rule arrays. But with the setup defined here, the resulting rule arrays will almost always succeed in accurately computing f[x]. Here are a few examples—in which we’re specifically showing the computation of f[0]:
And once again an important takeaway is that we don’t see “identifiable mechanism” in what’s going on. Instead, it looks more as if the rule arrays we’ve got just “happen” to do the computations we want. Their behavior is complicated, but somehow we can manage to “tap into it” to compute our f[x].
But how robust is this computation? A key feature of typical machine learning is that it can “generalize” away from the specific examples it’s been given. It’s never been clear just how to characterize that generalization (when does an image of a cat in a dog suit start being identified as an image of a dog?). But—at least when we’re talking about classification tasks—we can think of what’s going on in terms of basins of attraction that lead to attractors corresponding to our classes.
It’s all considerably easier to analyze, though, in the kind of discrete system we’re exploring here. For example, we can readily enumerate all our training inputs (i.e. all initial states containing a single black cell), and then see how frequently these cause any given cell to be black:
By the way, here’s what happens to this plot at successive “breakthroughs” during training:
But what about all possible inputs, including ones that don’t just contain a single black cell? Well, we can enumerate all of them, and compute the overall frequency for each cell in the array to be black:
As we would expect, the result is considerably “fuzzier” than what we got purely with our training inputs. But there’s still a strong trace of the discrete values for f[x] that appeared in the training data. And if we plot the overall probability for a given final cell to be black, we see peaks at positions corresponding to the values 0 and 1 that f[x] takes on:
But because our system is discrete, we can explicitly look at what outcomes occur:
The most common overall is the “meaningless” all-white state—that basically occurs when the computation from the input “never makes it” to the output. But the next most common outcomes correspond exactly to f[x] = 0 and f[x] = 1. After that is the “superposition” outcome where f[x] is in effect “both 0 and 1”.
But, OK, so what initial states are “in the basins of attraction of” (i.e. will evolve to) the various outcomes here? The fairly flat plots in the last column above indicate that the overall density of black cells gives little information about what attractor a particular initial state will evolve to.
So this means we have to look at specific configurations of cells in the initial conditions. As an example, start from the initial condition
which evolves to:
Now we can ask what happens if we look at a sequence of slightly different initial conditions. And here we show in black and white initial conditions that still evolve to the original “attractor” state, and in pink ones that evolve to some different state:
What’s actually going on inside here? Here are a few examples, highlighting cells whose values change as a result of changing the initial condition:
As is typical in machine learning, there doesn’t seem to be any simple characterization of the form of the basin of attraction. But now we have a sense of what the reason for this is: it’s another consequence of computational irreducibility. Computational irreducibility gives us the effective randomness that allows us to find useful results by adaptive evolution, but it also leads to changes having what seem like random and unpredictable effects. (It’s worth noting, by the way, that we could probably dramatically improve the robustness of our attractor basins by specifically including in our training data examples that have “noise” injected.)
In doing machine learning in practice, the goal is typically to find some collection of weights, etc. that successfully solve a particular problem. But in general there will be many such collections of weights, etc. With typical continuous weights and random training steps it’s very difficult to see what the whole “ensemble” of possibilities is. But in our discrete rule array systems, this becomes more feasible.
Consider a tiny 2×2 rule array with two possible rules. We can make a graph whose edges represent all possible “point mutations” that can occur in this rule array:
In our adaptive evolution process, we’re always moving around a graph like this. But typically most “moves” will end up in states that are rejected because they increase whatever loss we’ve defined.
Consider the problem of generating an And+Xor rule array in which we end with lifetime-4 patterns. Defining the loss as how far we are from this lifetime, we can draw a graph that shows all possible adaptive evolution paths that always progressively decrease the loss:
The result is a multiway graph of the type we’ve now seen in a great many kinds of situations—notably our recent study of biological evolution.
And although this particular example is quite trivial, the idea in general is that different parts of such a graph represent “different strategies” for solving a problem. And—in direct analogy to our Physics Project and our studies of things like game graphs—one can imagine such strategies being laid out in a “branchial space” defined by common ancestry of configurations in the multiway graph.
And one can expect that while in some cases the branchial graph will be fairly uniform, in other cases it will have quite separated pieces—that represent fundamentally different strategies. Of course, the fact that underlying strategies may be different doesn’t mean that the overall behavior or performance of the system will be noticeably different. And indeed one expects that in most cases computational irreducibility will lead to enough effective randomness that there’ll be no discernable difference.
But in any case, here’s an example starting with a rule array that contains both And and Xor—where we observe distinct branches of adaptive evolution that lead to different solutions to the problem of finding a configuration with a lifetime of exactly 4:
How should one actually do the learning in machine learning? In practical work with traditional neural nets, learning is normally done using systematic algorithmic methods like backpropagation. But so far, all we’ve done here is something much simpler: we’ve “learned” by successively making random point mutations, and keeping only ones that don’t lead us further from our goal. And, yes, it’s interesting that such a procedure can work at all—and (as we’ve discussed elsewhere) this is presumably very relevant to understanding phenomena like biological evolution. But, as we’ll see, there are more efficient (and probably much more efficient) methods of doing machine learning, even for the kinds of discrete systems we’re studying.
Let’s start by looking again at our earlier example of finding an And+Xor rule array that gives a “lifetime” of exactly 30. At each step in our adaptive (“learning”) process we make a single-point mutation (changing a single rule in the rule array), keeping the mutation if it doesn’t take us further from our goal. The mutations gradually accumulate—every so often reaching a rule array that gives a lifetime closer to 30. Just as above, here’s a plot of the lifetime achieved by successive mutations—with the “internal” red dots corresponding to rejected mutations:
We see a series of “plateaus” at which mutations are accumulating but not changing the overall lifetime. And between these we see occasional “breakthroughs” where the lifetime jumps. Here are the actual rule array configurations for these breakthroughs, with mutations since the last breakthrough highlighted:
But in the end the process here is quite wasteful; in this example, we make a total of 1705 mutations, but only 780 of them actually contribute to generating the final rule array; all the others are discarded along the way.
So how can we do better? One strategy is to try to figure out at each step which mutation is “most likely to make a difference”. And one way to do this is to try every possible mutation in turn at every step (as in multiway evolution)—and see what effect each of them has on the ultimate lifetime. From this we can construct a “change map” in which we give the change of lifetime associated with a mutation at every particular cell. The results will be different for every configuration of rule array, i.e. at every step in the adaptive evolution. But for example here’s what they are for the particular “breakthrough” configurations shown above (elements in regions that are colored gray won’t affect the result if they are changed; ones colored red will have a positive effect (with more intense red being more positive), and ones colored blue a negative one:
Let’s say we start from a random rule array, then repeatedly construct the change map and apply the mutation that it implies gives the most positive change—in effect at each step following the “path of steepest descent” to get to the lifetime we want (i.e. reduce the loss). Then the sequence of “breakthrough” configurations we get is:
And this in effect corresponds to a slightly more direct “path to a solution” than our sequence of pure single-point mutations.
By the way, the particular problem of reaching a certain lifetime has a simple enough structure that this “steepest descent” method—when started from a simple uniform rule array—finds a very “mechanical” (if slow) path to a solution:
What about the problem of learning f[x] = ? Once again we can make a change map based on the loss we define. Here are the results for a sequence of “breakthrough” configurations. The gray regions are ones where changes will be “neutral”, so that there’s still exploration that can be done without affecting the loss. The red regions are ones that are in effect “locked in” and where any changes would be deleterious in terms of loss:
So what happens in this case if we follow the “path of steepest descent”, always making the change that would be best according to the change map? Well, the results are actually quite unsatisfactory. From almost any initial condition the system quickly gets stuck, and never finds any satisfactory solution. In effect it seems that deterministically following the path of steepest descent leads us to a “local minimum” from which we cannot escape. So what are we missing in just looking at the change map? Well, the change map as we’ve constructed it has the limitation that it’s separately assessing the effect of each possible individual mutation. It doesn’t deal with multiple mutations at a time—which could well be needed in general if one’s going to find the “fastest path to success”, and avoid getting stuck.
But even in constructing the change map there’s already a problem. Because at least the direct way of computing it scales quite poorly. In an n×n rule array we have to check the effect of flipping about n2 values, and for each one we have to run the whole system—taking altogether about n4 operations. And one has to do this separately for each step in the learning process.
So how do traditional neural nets avoid this kind of inefficiency? The answer in a sense involves a mathematical trick. And at least as it’s usually presented it’s all based on the continuous nature of the weights and values in neural nets—which allow us to use methods from calculus.
Let’s say we have a neural net like this
that computes some particular function f[x]:
We can ask how this function changes as we change each of the weights in the network:
And in effect this gives us something like our “change map” above. But there’s an important difference. Because the weights are continuous, we can think about infinitesimal changes to them. And then we can ask questions like “How does f[x] change when we make an infinitesimal change to a particular weight wi?”—or equivalently, “What is the partial derivative of f with respect to wi at the point x?” But now we get to use a key feature of infinitesimal changes: that they can always be thought of as just “adding linearly” (essentially because ε2 can always be ignored compared to ε). Or, in other words, we can summarize any infinitesimal change just by giving its “direction” in weight space, i.e. a vector that says how much of each weight should be (infinitesimally) changed. So if we want to change f[x] (infinitesimally) as quickly as possible, we should go in the direction of steepest descent defined by all the derivatives of f with respect to the weights.
In machine learning, we’re typically trying in effect to set the weights so that the form of f[x] we generate successfully minimizes whatever loss we’ve defined. And we do this by incrementally “moving in weight space”—at every step computing the direction of steepest descent to know where to go next. (In practice, there are all sorts of tricks like “ADAM” that try to optimize the way to do this.)
But how do we efficiently compute the partial derivative of f with respect to each of the weights? Yes, we could do the analog of generating pictures like the ones above, separately for each of the weights. But it turns out that a standard result from calculus gives us a vastly more efficient procedure that in effect “maximally reuses” parts of the computation that have already been done.
It all starts with the textbook chain rule for the derivative of nested (i.e. composed) functions:
This basically says that the (infinitesimal) change in the value of the “whole chain” d[c[b[a[x]]]] can be computed as a product of (infinitesimal) changes associated with each of the “links” in the chain. But the key observation is then that when we get to the computation of the change at a certain point in the chain, we’ve already had to do a lot of the computation we need—and so long as we stored those results, we always have only an incremental computation to perform.
So how does this apply to neural nets? Well, each layer in a neural net is in effect doing a function composition. So, for example, our d[c[b[a[x]]]] is like a trivial neural net:
But what about the weights, which, after all, are what we are trying to find the effect of changing? Well, we could include them explicitly in the function we’re computing:
And then we could in principle symbolically compute the derivatives with respect to these weights:
For our network above
the corresponding expression (ignoring biases) is
where ϕ denotes our activation function. Once again we’re dealing with nested functions, and once again—though it’s a bit more intricate in this case—the computation of derivatives can be done by incrementally evaluating terms in the chain rule and in effect using the standard neural net method of “backpropagation”.
So what about the discrete case? Are there similar methods we can use there? We won’t discuss this in detail here, but we’ll give some indications of what’s likely to be involved.
As a potentially simpler case, let’s consider ordinary cellular automata. The analog of our change map asks how the value of a particular “output” cell is affected by changes in other cells—or in effect what the “partial derivative” of the output value is with respect to changes in values of other cells.
For example, consider the highlighted “output” cell in this cellular automaton evolution:
Now we can look at each cell in this array, and make a change map based on seeing whether flipping the value of just that cell (and then running the cellular automaton forwards from that point) would change the value of the output cell:
The form of the change map is different if we look at different “output cells”:
Here, by the way, are some larger change maps for this and a couple of other cellular automaton rules:
But is there a way to construct such change maps incrementally? One might have thought that there would immediately be at least for cellular automata that (unlike the cases here) are fundamentally reversible. But actually such reversibility doesn’t seem to help much—because although it allows us to “backtrack” whole states of the cellular automaton, it doesn’t allow us to trace the separate effects of individual cells.
So how about using discrete analogs of derivatives and the chain rule? Let’s for example call the function computed by one step in rule 30 cellular automaton evolution w[x, y, z]. We can think of the “partial derivative” of this function with respect to x at the point x as representing whether the output of w changes when x is flipped starting from the value given:
(Note that “no change” is indicated as False or , while a change is indicated as True or . And, yes, one can either explicitly compute the rule outcomes here, and then deduce from them the functional form, or one can use symbolic rules to directly deduce the functional form.)
One can compute a discrete analog of a derivative for any Boolean function. For example, we have
and
which we can write as:
We also have:
And here is a table of “Boolean derivatives” for all 2-input Boolean functions:
And indeed there’s a whole “Boolean calculus” one can set up for these kinds of derivatives. And in particular, there’s a direct analog of the chain rule:
where Xnor[x,y] is effectively the equality test x == y:
But, OK, how do we use this to create our change maps? In our simple cellular automaton case, we can think of our change map as representing how a change in an output cell “propagates back” to previous cells. But if we just try to apply our discrete calculus rules we run into a problem: different “chain rule chains” can imply different changes in the value of the same cell. In the continuous case this path dependence doesn’t happen because of the way infinitesimals work. But in the discrete case it does. And ultimately we’re doing a kind of backtracking that can really be represented faithfully only as a multiway system. (Though if we just want probabilities, for example, we can consider averaging over branches of the multiway system—and the change maps we showed above are effectively the result of thresholding over the multiway system.)
But despite the appearance of such difficulties in the “simple” cellular automaton case, such methods typically seem to work better in our original, more complicated rule array case. There’s a bunch of subtlety associated with the fact that we’re finding derivatives not only with respect to the values in the rule array, but also with respect to the choice of rules (which are the analog of weights in the continuous case).
Let’s consider the And+Xor rule array:
Our loss is the number of cells whose values disagree with the row shown at the bottom. Now we can construct a change map for this rule array both in a direct “forward” way, and “backwards” using our discrete derivative methods (where we effectively resolve the small amount of “multiway behavior” by always picking “majority” values):
The results are similar, though in this case not exactly the same. Here are a few other examples:
And, yes, in detail there are essentially always local differences between the results from the forward and backward methods. But the backward method—like in the case of backpropagation in ordinary neural nets—can be implemented much more efficiently. And for purposes of practical machine learning it’s actually likely to be perfectly satisfactory—especially given that the forward method is itself only providing an approximation to the question of which mutations are best to do.
And as an example, here are the results of the forward and backward methods for the problem of learning the function f[x] = , for the “breakthrough” configurations that we showed above:
We’ve now shown quite a few examples of machine learning in action. But a fundamental question we haven’t yet addressed is what kind of thing can actually be learned by machine learning. And even before we get to this, there’s another question: given a particular underlying type of system, what kinds of functions can it even represent?
As a first example consider a minimal neural net of the form (essentially a single-layer perceptron):
With ReLU (AKA Ramp) as the activation function and the first set of weights all taken to be 1, the function computed by such a neural net has the form:
With enough weights and biases this form can represent any piecewise linear function—essentially just by moving around ramps using biases, and scaling them using weights. So for example consider the function:
This is the function computed by the neural net above—and here’s how it’s built up by adding in successive ramps associated with the individual intermediate nodes (neurons):
(It’s similarly possible to get all smooth functions from activation functions like ELU, etc.)
Things get slightly more complicated if we try to represent functions with more than one argument. With a single intermediate layer we can only get “piecewise (hyper)planar” functions (i.e. functions that change direction only at linear “fault lines”):
But already with a total of two intermediate layers—and sufficiently many nodes in each of these layers—we can generate any piecewise function of any number of arguments.
If we limit the number of nodes, then roughly we limit the number of boundaries between different linear regions in the values of the functions. But as we increase the number of layers with a given number of nodes, we basically increase the number of sides that polygonal regions within the function values can have:
So what happens with the mesh nets that we discussed earlier? Here are a few random examples, showing results very similar to shallow, fully connected networks with a comparable total number of nodes:
OK, so how about our fully discrete rule arrays? What functions can they represent? We already saw part of the answer earlier when we generated rule arrays to represent various Boolean functions. It turns out that there is a fairly efficient procedure based on Boolean satisfiability for explicitly finding rule arrays that can represent a given function—or determine that no rule array (say of a given size) can do this.
Using this procedure, we can find minimal And+Xor rule arrays that represent all (“even”) 3-input Boolean functions (i.e. r = 1 cellular automaton rules):
It’s always possible to specify any n-input Boolean function by an array of 2n bits, as in:
But we see from the pictures above that when we “compile” Boolean functions into And+Xor rule arrays, they can take different numbers of bits (i.e. different numbers of elements in the rule array). (In effect, the “algorithmic information content” of the function varies with the “language” we’re using to represent them.) And, for example, in the n = 3 case shown here, the distribution of minimal rule array sizes is:
There are some functions that are difficult to represent as And+Xor rule arrays (and seem to require 15 rule elements)—and others that are easier. And this is similar to what happens if we represent Boolean functions as Boolean expressions (say in conjunctive normal form) and count the total number of (unary and binary) operations used:
OK, so we know that there is in principle an And+Xor rule array that will compute any (even) Boolean function. But now we can ask whether an adaptive evolution process can actually find such a rule array—say with a sequence of single-point mutations. Well, if we do such adaptive evolution—with a loss that counts the number of “wrong outputs” for, say, rule 254—then here’s a sequence of successive breakthrough configurations that can be produced:
The results aren’t as compact as the minimal solution above. But it seems to always be possible to find at least some And+Xor rule array that “solves the problem” just by using adaptive evolution with single-point mutations.
Here are results for some other Boolean functions:
And so, yes, not only are all (even) Boolean functions representable in terms of And+Xor rule arrays, they’re also learnable in this form, just by adaptive evolution with single-point mutations.
In what we did above, we were looking at how machine learning works with our rule arrays in specific cases like for the function. But now we’ve got a case where we can explicitly enumerate all possible functions, at least of a given class. And in a sense what we’re seeing is evidence that machine learning tends to be very broad—and capable at least in principle of learning pretty much any function.
Of course, there can be specific restrictions. Like the And+Xor rule arrays we’re using here can’t represent (“odd”) functions where . (The Nand+First rule arrays we discussed above nevertheless can.) But in general it seems to be a reflection of the Principle of Computational Equivalence that pretty much any setup is capable of representing any function—and also adaptively “learning” it.
By the way, it’s a lot easier to discuss questions about representing or learning “any function” when one’s dealing with discrete (countable) functions—because one can expect to either be able to “exactly get” a given function, or not. But for continuous functions, it’s more complicated, because one’s pretty much inevitably dealing with approximations (unless one can use symbolic forms, which are basically discrete). So, for example, while we can say (as we did above) that (ReLU) neural nets can represent any piecewise-linear function, in general we’ll only be able to imagine successively approaching an arbitrary function, much like when you progressively add more terms in a simple Fourier series:
Looking back at our results for discrete rule arrays, one notable observation that is that while we can successfully reproduce all these different Boolean functions, the actual rule array configurations that achieve this tend to look quite messy. And indeed it’s much the same as we’ve seen throughout: machine learning can find solutions, but they’re not “structured solutions”; they’re in effect just solutions that “happen to work”.
Are there more structured ways of representing Boolean functions with rule arrays? Here are the two possible minimum-size And+Xor rule arrays that represent rule 30:
At the next-larger size there are more possibilities for rule 30:
And there are also rule arrays that can represent rule 110:
But in none of these cases is there obvious structure that allows us to immediately see how these computations work, or what function is being computed. But what if we try to explicitly construct—effectively by standard engineering methods—a rule array that computes a particular function? We can start by taking something like the function for rule 30 and writing it in terms of And and Xor (i.e. in ANF, or “algebraic normal form”):
We can imagine implementing this using an “evaluation graph”:
But now it’s easy to turn this into a rule array (and, yes, we haven’t gone all the way and arranged to copy inputs, etc.):
“Evaluating” this rule array for different inputs, we can see that it indeed gives rule 30:
Doing the same thing for rule 110, the And+Xor expression is
the evaluation graph is
and the rule array is:
And at least with the evaluation graph as a guide, we can readily “see what’s happening” here. But the rule array we’re using is considerably larger than our minimal solutions above—or even than the solutions we found by adaptive evolution.
It’s a typical situation that one sees in many other kinds of systems (like for example sorting networks): it’s possible to have a “constructed solution” that has clear structure and regularity and is “understandable”. But minimal solutions—or ones found by adaptive evolution—tend to be much smaller. But they almost always look in many ways random, and aren’t readily understandable or interpretable.
So far, we’ve been looking at rule arrays that compute specific functions. But in getting a sense of what rule arrays can do, we can consider rule arrays that are “programmable”, in that their input specifies what function they should compute. So here, for example, is an And+Xor rule array—found by adaptive evolution—that takes the “bit pattern” of any (even) Boolean function as input on the left, then applies that Boolean function to the inputs on the right:
And with this same rule array we can now compute any possible (even) Boolean function. So here, for example, it’s evaluating Or:
Our general goal here has been to set up models that capture the most essential features of neural nets and machine learning—but that are simple enough in their structure that we can readily “look inside” and get a sense of what they are doing. Mostly we’ve concentrated on rule arrays as a way to provide a minimal analog of standard “perceptron-style” feed-forward neural nets. But what about other architectures and setups?
In effect, our rule arrays are “spacetime-inhomogeneous” generalizations of cellular automata—in which adaptive evolution determines which rule (say from a finite set) should be used at every (spatial) position and every (time) step. A different idealization (that in fact we already used in one section above) is to have an ordinary homogeneous cellular automaton—but with a single “global rule” determined by adaptive evolution. Rule arrays are the analog of feed-forward networks in which a given rule in the rule array is in effect used only once as data “flows through” the system. Ordinary homogeneous cellular automata are like recurrent networks in which a single stream of data is in effect subjected over and over again to the same rule.
There are various interpolations between these cases. For example, we can imagine a “layered rule array” in which the rules at different steps can be different, but those on a given step are all the same. Such a system can be viewed as an idealization of a convolutional neural net in which a given layer applies the same kernel to elements at all positions, but different layers can apply different kernels.
A layered rule array can’t encode as much information as a general rule array. But it’s still able to show machine-learning-style phenomena. And here, for example, is adaptive evolution for a layered And+Xor rule array progressively solving the problem of generating a pattern that lives for exactly 30 steps:
One could also imagine “vertically layered” rule arrays, in which different rules are used at different positions, but any given position keeps running the same rule forever. However, at least for the kinds of problems we’ve considered here, it doesn’t seem sufficient to just be able to pick the positions at which different rules are run. One seems to either need to change rules at different (time) steps, or one needs to be able to adaptively evolve the underlying rules themselves.
Rule arrays and ordinary cellular automata share the feature that the value of each cell depends only on the values of neighboring cells on the step before. But in neural nets it’s standard for the value at a given node to depend on the values of lots of nodes on the layer before. And what makes this straightforward in neural nets is that (weighted, and perhaps otherwise transformed) values from previous nodes are taken to be combined just by simple numerical addition—and addition (being n-ary and associative) can take any number of “inputs”. In a cellular automaton (or Boolean function), however, there’s always a definite number of inputs, determined by the structure of the function. In the most straightforward case, the inputs come only from nearest-neighboring cells. But there’s no requirement that this is how things need to work—and for example we can pick any “local template” to bring in the inputs for our function. This template could either be the same at every position and every step, or it could be picked from a certain set differently at different positions—in effect giving us “template arrays” as well as rule arrays.
So what about having a fully connected network, as we did in our very first neural net examples above? To set up a discrete analog of this we first need some kind of discrete n-ary associative “accumulator” function to fill the place of numerical addition. And for this we could pick a function like And, Or, Xor—or Majority. And if we’re not just going to end up with the same value at each node on a given layer, we need to set up some analog of a weight associated with each connection—which we can achieve by applying either Identity or Not (i.e. leaving each value alone, or flipping it) to the value flowing through each connection.
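As a rough sketch of how such a network can be evaluated (the names, the ±1 encoding of connection types, and the strict-majority tie-breaking are all our own choices here):

    (* a +1 connection passes a bit through unchanged; a -1 connection flips it *)
    flipValue[1, v_] := v
    flipValue[-1, v_] := 1 - v

    (* one node: 1 if a strict majority of its (possibly flipped) inputs are 1, otherwise 0 *)
    nodeValue[conns_List, inputs_List] :=
      Boole[Total[MapThread[flipValue, {conns, inputs}]] > Length[inputs]/2]

    (* one fully connected layer: connMatrix has one row of connection types per output node *)
    layerValues[connMatrix_List, inputs_List] := nodeValue[#, inputs] & /@ connMatrix

A whole network is then just a fold of layerValues over a list of such connection matrices, as in Fold[layerValues[#2, #1] &, input, connMatrices].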
Here’s an example of a network of this type, trained to compute the function we discussed above:
There are just two kinds of connections here: flip (Not) and non-flip (Identity). And at each node we’re computing the majority function—giving value 1 if the majority of its inputs are 1, and 0 otherwise. With the “one-hot encoding” of input and output that we used before, here are a few examples of how this network evaluates our function:
This was trained just using 1000 steps of single-point mutation applied to the connection types. The loss systematically goes down—but the configuration of the connection types continues to look quite random even as it achieves zero loss (i.e. even after the function has been completely learned):
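The training procedure itself is simple enough to sketch in a few lines (the names are ours, the loss function is left abstract, and for brevity the sketch mutates a single connection matrix rather than a whole list of layers):

    (* single-point mutation: flip the type (+1 <-> -1) of one randomly chosen connection *)
    mutateConnections[conns_List] := MapAt[-# &, conns,
      {RandomInteger[{1, Length[conns]}], RandomInteger[{1, Length[First[conns]]}]}]

    (* repeatedly mutate, keeping any mutation that does not increase the loss *)
    trainByMutation[conns0_List, loss_, steps_Integer] :=
      Nest[With[{cand = mutateConnections[#]}, If[loss[cand] <= loss[#], cand, #]] &, conns0, steps]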
In what we’ve just done we assumed that all connections continue to be present, though their types (or effectively signs) can change. But we can also consider a network where connections can end up being zeroed out during training—so that they are effectively no longer present.
Much of what we’ve done here with machine learning has centered around trying to learn transformations of the form x → f[x]. But another typical application of machine learning is autoencoding—or in effect learning how to compress data representing a certain set of examples. And once again it’s possible to do such a task using rule arrays, with learning achieved by a series of single-point mutations.
As a starting point, consider training a rule array (of cellular automaton rules 4 and 146) to reproduce unchanged a block of black cells of any width. One might have thought this would be trivial. But it’s not, because in effect the initial data inevitably gets “ground up” inside the rule array, and has to be reconstituted at the end. But, yes, it’s nevertheless possible to train a rule array to at least roughly do this—even though once again the rule arrays we find that manage to do this look quite random:
But to set up a nontrivial autoencoder let’s imagine that we progressively “squeeze” the array in the middle, creating an increasingly narrow “bottleneck” through which the data has to flow. At the bottleneck we effectively have a compressed version of the original data. And we find that at least down to some width of bottleneck, it’s possible to create rule arrays that—with reasonable probability—can act as successful autoencoders of the original data:
The success of LLMs has highlighted the use of machine learning for sequence continuation—and the effectiveness of transformers for this. But just as with other neural nets, the forms of transformers that are used in practice are typically very complicated. So can one find a minimal model that nevertheless captures the “essence of transformers”?
Let’s say that we have a sequence that we want to continue, like:
We want to encode each possible value by a vector, as in
so that, for example, our original sequence is encoded as:
Then we have a “head” that reads a block of consecutive vectors, picking off certain values and feeding pairs of them into And and Xor functions, to get a vector of Boolean values:
Ultimately this head is going to “slide” along our sequence, “predicting” what the next element in the sequence will be. But somehow we have to go from our vector of Boolean values to (probabilities of) sequence elements. Potentially we might be able to do this just with a rule array. But for our purposes here we’ll use a fully connected single-layer Identity+Not network in which at each output node we just total up the values that come to it—and treat this total as determining (through a softmax) the probability of the corresponding element:
In this case, the element with the maximum value is 5, so at “zero temperature” this would be our “best prediction” for the next element.
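Said in code (a sketch, with illustrative names): the vector of per-node sums is turned into probabilities with a softmax, and the “zero temperature” prediction is just the position of the largest sum:

    (* turn the vector of per-element sums into probabilities *)
    softmax[v_List] := Exp[v]/Total[Exp[v]]

    (* "zero temperature" prediction: the index of the element with the largest sum *)
    predictIndex[sums_List] := First[Ordering[sums, -1]]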
To train this whole system we just make a sequence of random point mutations to everything, keeping mutations that don’t increase the loss (where the loss is basically the difference between predicted next values and actual next values, or, more precisely, the “categorical cross-entropy”). Here’s how this loss progresses in a typical such training:
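For reference, the per-prediction loss being described is just the negative log of the probability assigned to the element that actually comes next (a sketch, using the softmax above):

    (* categorical cross-entropy for a single prediction *)
    crossEntropyLoss[sums_List, actual_Integer] := -Log[softmax[sums][[actual]]]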
At the end of this training, here are the components of our minimal transformer:
First come the encodings of the different possible elements in the sequence. Then there’s the head, here shown applied to the encoding of the first elements of the original sequence. Finally there’s a single-layer discrete network that takes the output from the head, and deduces relative probabilities for different elements to come next. In this case the highest-probability prediction for the next element is that it should be element 6.
To do the analog of an LLM we start from some initial “prompt”, i.e. an initial sequence that fits within the width (“context window”) of the head. Then we progressively apply our minimal transformer, for example at each step taking the next element to be the one with the highest predicted probability (i.e. operating “at zero temperature”). With this setup the collection of “prediction strengths” is shown in gray, with the “best prediction” shown in red:
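Schematically (with names of our own choosing, and with a function model standing in for the trained head-plus-network applied to a window of encoded elements), the zero-temperature continuation loop is just:

    (* continue a sequence by repeatedly predicting the next element from the last window elements;
       model maps a window of elements to a vector of per-element sums *)
    continueSequence[model_, window_Integer, seq_List, steps_Integer] :=
      Nest[Append[#, predictIndex[model[Take[#, -window]]]] &, seq, steps]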
Running this even far beyond our original training data, we see that we get a “prediction” of a continued sine wave:
As we might expect, the fact that our minimal transformer can make such a plausible prediction relies on the simplicity of our sine curve. If we use “more complicated” training data, such as the “mathematically defined” blue curve in
the result of training and running a minimal transformer is now:
And, not surprisingly, it can’t “figure out the computation” to correctly continue the curve. By the way, different training runs will involve different sequences of mutations, and will yield different predictions (often with periodic “hallucinations”):
In looking at “perceptron-style” neural nets we wound up using rule arrays—or, in effect, spacetime-inhomogeneous cellular automata—as our minimal models. Here we’ve ended up with a slightly more complicated minimal model for transformer neural nets. But if we were to simplify it further, we would end up not with something like a cellular automaton but instead with something like a tag system, in which one has a sequence of elements, and at each step removes a block from the beginning, and—depending on its form—adds a certain block at the end, as in:
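As a hedged sketch (the specific rule shown is an invented example for illustration, not the one from the picture), such a tag-system step can be written as:

    (* one tag-system step: remove a block of blockLen elements from the front, and append
       a block determined by the form of what was removed *)
    tagStep[rule_, blockLen_Integer][seq_List] :=
      Join[Drop[seq, blockLen], rule[Take[seq, blockLen]]]

    (* an invented example rule on binary sequences, removing blocks of length 2 *)
    exampleTagRule[{0, _}] := {1, 0}
    exampleTagRule[{1, _}] := {0, 1, 1}

    NestList[tagStep[exampleTagRule, 2], {1, 0, 1, 1}, 6]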
And, yes, such systems can generate extremely complex behavior—reinforcing the idea (that we have repeatedly seen here) that machine learning works by selecting complexity that aligns with goals that have been set.
And along these lines, one can consider all sorts of different computational systems as foundations for machine learning. Here we’ve been looking at cellular-automaton-like and tag-system-like examples. But for example our Physics Project has shown us the power and flexibility of systems based on hypergraph rewriting. And from what we’ve seen here, it seems very plausible that something like hypergraph rewriting can serve as a yet more powerful and flexible substrate for machine learning.
There are, I think, several quite striking conclusions from what we’ve been able to do here. The first is just that models much simpler than traditional neural nets seem capable of capturing the essential features of machine learning—and indeed these models may well be the basis for a new generation of practical machine learning.
But from a scientific point of view, one of the things that’s important about these models is that they are simple enough in structure that it’s immediately possible to produce visualizations of what they’re doing inside. And studying these visualizations, the most immediately striking feature is how complicated they look.
It could have been that machine learning would somehow “crack systems”, and find simple representations for what they do. But that doesn’t seem to be what’s going on at all. Instead what seems to be happening is that machine learning is in a sense just “hitching a ride” on the general richness of the computational universe. It’s not “specifically building up behavior one needs”; rather what it’s doing is to harness behavior that’s “already out there” in the computational universe.
The fact that this could possibly work relies on the crucial—and at first unexpected—fact that in the computational universe even very simple programs can ubiquitously produce all sorts of complex behavior. And the point then is that this behavior has enough richness and diversity that it’s possible to find instances of it that align with machine learning objectives one’s defined. In some sense what machine learning is doing is to “mine” the computational universe for programs that do what one wants.
It’s not that machine learning nails a specific precise program. Rather, it’s that in typical successful applications of machine learning there are lots of programs that “do more or less the right thing”. If what one’s trying to do involves something computationally irreducible, machine learning won’t typically be able to “get well enough aligned” to correctly “get through all the steps” of the irreducible computation. But it seems that many “human-like tasks” that are the particular focus of modern machine learning can successfully be done this way.
And by the way, one can expect that with the minimal models explored here, it becomes more feasible to get a real characterization of what kinds of objectives can successfully be achieved by machine learning, and what cannot. Critical to the operation of machine learning is not only that there exist programs that can do particular kinds of things, but also that they can realistically be found by adaptive evolution processes.
In what we’ve done here we’ve often used what’s essentially the very simplest possible process for adaptive evolution: a sequence of point mutations. And what we’ve discovered is that even this is usually sufficient to lead us to satisfactory machine learning solutions. It could be that our paths of adaptive evolution would always be getting stuck—and not reaching any solution. But the fact that this doesn’t happen seems crucially connected to the computational irreducibility that’s ubiquitous in the systems we’re studying, and that leads to effective randomness that with overwhelming probability will “give us a way out” of anywhere we got stuck.
In some sense computational irreducibility “levels the playing field” for different processes of adaptive evolution, and lets even simple ones be successful. Something similar seems to happen for the whole framework we’re using. Any of a wide class of systems seem capable of successful machine learning, even if they don’t have the detailed structure of traditional neural nets. We can see this as a typical reflection of the Principle of Computational Equivalence: that even though systems may differ in their details, they are ultimately all equivalent in the computations they can do.
The phenomenon of computational irreducibility leads to a fundamental tradeoff, of particular importance in thinking about things like AI. If we want to be able to know in advance—and broadly guarantee—what a system is going to do or be able to do, we have to set the system up to be computationally reducible. But if we want the system to be able to make the richest use of computation, it’ll inevitably be capable of computationally irreducible behavior. And it’s the same story with machine learning. If we want machine learning to be able to do the best it can, and perhaps give us the impression of “achieving magic”, then we have to allow it to show computational irreducibility. And if we want machine learning to be “understandable” it has to be computationally reducible, and not able to access the full power of computation.
At the outset, though, it’s not obvious whether machine learning actually has to access such power. It could be that there are computationally reducible ways to solve the kinds of problems we want to use machine learning to solve. But what we’ve discovered here is that even in solving very simple problems, the adaptive evolution process that’s at the heart of machine learning will end up sampling—and using—what we can expect to be computationally irreducible processes.
Like biological evolution, machine learning is fundamentally about finding things that work—without the constraint of “understandability” that’s forced on us when we as humans explicitly engineer things step by step. Could one imagine constraining machine learning to make things understandable? To do so would effectively prevent machine learning from having access to the power of computationally irreducible processes, and from the evidence here it seems unlikely that with this constraint the kind of successes we’ve seen in machine learning would be possible.
So what does this mean for the “science of machine learning”? One might have hoped that one would be able to “look inside” machine learning systems and get detailed narrative explanations for what’s going on; that in effect one would be able to “explain the mechanism” for everything. But what we’ve seen here suggests that in general nothing like this will work. All one will be able to say is that somewhere out there in the computational universe there’s some (typically computationally irreducible) process that “happens” to be aligned with what we want.
Yes, we can make general statements—strongly based on computational irreducibility—about things like the findability of such processes, say by adaptive evolution. But if we ask “How in detail does the system work?”, there won’t be much of an answer to that. Of course we can trace all its computational steps and see that it behaves in a certain way. But we can’t expect what amounts to a “global human-level explanation” of what it’s doing. Rather, we’ll basically just be reduced to looking at some computationally irreducible process and observing that it “happens to work”—and we won’t have a high-level explanation of “why”.
But there is one important loophole to all this. Within any computationally irreducible system, there are always inevitably pockets of computational reducibility. And—as I’ve discussed at length particularly in connection with our Physics Project—it’s these pockets of computational reducibility that allow computationally bounded observers like us to identify things like “laws of nature” from which we can build “human-level narratives”.
So what about machine learning? What pockets of computational reducibility show up there, from which we might build “human-level scientific laws”? Much as with the emergence of “simple continuum behavior” from computationally irreducible processes happening at the level of molecules in a gas or ultimate discrete elements of space, we can expect that at least certain computationally reducible features will be more obvious when one’s dealing with larger numbers of components. And indeed in sufficiently large machine learning systems, it’s routine to see smooth curves and apparent regularity when one’s looking at the kind of aggregated behavior that’s probed by things like training curves.
But the question about pockets of reducibility is always whether they end up being aligned with things we consider interesting or useful. Yes, it could be that machine learning systems would exhibit some kind of collective (“EEG-like”) behavior. But what’s not clear is whether this behavior will tell us anything about the actual “information processing” (or whatever) that’s going on in the system. And if there is to be a “science of machine learning” what we have to hope for is that we can find in machine learning systems pockets of computational reducibility that are aligned with things we can measure, and care about.
So given what we’ve been able to explore here about the foundations of machine learning, what can we say about the ultimate power of machine learning systems? A key observation has been that machine learning works by “piggybacking” on computational irreducibility—and in effect by finding “natural pieces of computational irreducibility” that happen to fit with the objectives one has. But what if those objectives involve computational irreducibility—as they often do when one’s dealing with a process that’s been successfully formalized in computational terms (as in math, exact science, computational X, etc.)? Well, it’s not enough that our machine learning system “uses some piece of computational irreducibility inside”. To achieve a particular computationally irreducible objective, the system would have to do something closely aligned with that actual, specific objective.
It has to be said, however, that by laying bare more of the essence of machine learning here, it becomes easier to at least define the issues of merging typical “formal computation” with machine learning. Traditionally there’s been a tradeoff between the computational power of a system and its trainability. And indeed in terms of what we’ve seen here this seems to reflect the sense that “larger chunks of computational irreducibility” are more difficult to fit into something one’s incrementally building up by a process of adaptive evolution.
So how should we ultimately think of machine learning? In effect its power comes from leveraging the “natural resource” of computational irreducibility. But when it uses computational irreducibility it does so by “foraging” pieces that happen to advance its objectives. Imagine one’s building a wall. One possibility is to fashion bricks of a particular shape that one knows will fit together. But another is just to look at stones one sees lying around, then to build the wall by fitting these together as best one can.
And if one then asks “Why does the wall have such-and-such a pattern?” the answer will end up being basically “Because that’s what one gets from the stones that happened to be lying around”. There’s no overarching theory to it in itself; it’s just a reflection of the resources that were out there. Or, in the case of machine learning, one can expect that what one sees will be to a large extent a reflection of the raw characteristics of computational irreducibility. In other words, the foundations of machine learning are as much as anything rooted in the science of ruliology. And it’s in large measure to that science we should look in our efforts to understand more about “what’s really going on” in machine learning, and quite possibly also in neuroscience.
In some ways it seems like a quirk of intellectual history that the kinds of foundational questions I’ve been discussing here weren’t already addressed long ago—and in some ways it seems like an inexorable consequence of the only rather recent development of certain intuitions and tools.
The idea that the brain is fundamentally made of connected nerve cells was considered in the latter part of the nineteenth century, and took hold in the first decades of the twentieth century—with the formalized concept of a neural net that operates in a computational way emerging in full form in the work of Warren McCulloch and Walter Pitts in 1943. By the late 1950s there were hardware implementations of neural nets (typically for image processing) in the form of “perceptrons”. But despite early enthusiasm, practical results were mixed, and at the end of the 1960s it was announced that simple cases amenable to mathematical analysis had been “solved”—leading to a general belief that “neural nets couldn’t do anything interesting”.
Ever since the 1940s there had been a trickle of general analyses of neural nets, particularly using methods from physics. But typically these analyses ended up with things like continuum approximations—that could say little about the information-processing aspects of neural nets. Meanwhile, there was an ongoing undercurrent of belief that somehow neural networks would both explain and reproduce how the brain works—but no methods seemed to exist to say quite how. Then at the beginning of the 1980s there was a resurgence of interest in neural networks, coming from several directions. Some of what was done concentrated on very practical efforts to get neural nets to do particular “human-like” tasks. But some was more theoretical, typically using methods from statistical physics or dynamical systems.
Before long, however, the buzz died down, and for several decades only a few groups were left working with neural nets. Then in 2011 came a surprise breakthrough in using neural nets for image analysis. It was an important practical advance. But it was driven by technological ideas and development—not any significant new theoretical analysis or framework.
And this was also the pattern for almost all of what followed. People spent great effort to come up with neural net systems that worked—and all sorts of folklore grew up about how this should best be done. But there wasn’t really even an attempt at an underlying theory; this was a domain of engineering practice, not basic science.
And it was in this tradition that ChatGPT burst onto the scene in late 2022. Almost everything about LLMs seemed to be complicated. Yes, there were empirically some large-scale regularities (like scaling laws). And I quickly suspected that the success of LLMs was a strong hint of general regularities in human language that hadn’t been clearly identified before. But beyond a few outlier examples, almost nothing about “what’s going on inside LLMs” has seemed easy to decode. And efforts to put “strong guardrails” on the operation of the system—in effect so as to make it in some way “predictable” or “understandable”—typically seem to substantially decrease its power (a point that now makes sense in the context of computational irreducibility).
My own interaction with machine learning and neural nets began in 1980 when I was developing my SMP symbolic computation system, and wondering whether it might be possible to generalize the symbolic pattern-matching foundations of the system to some kind of “fuzzy pattern matching” that would be closer to human thinking. I was aware of neural nets but thought of them as semi-realistic models of brains, not for example as potential sources of algorithms of the kind I imagined might “solve” fuzzy matching.
And it was partly as a result of trying to understand the essence of systems like neural nets that in 1981 I came up with what I later learned could be thought of as one-dimensional cellular automata. Soon I was deeply involved in studying cellular automata and developing a new intuition about how complex behavior could arise even from simple rules. But when I learned about recent efforts to make idealized models of neural nets using ideas from statistical mechanics, I was at least curious enough to set up simulations to try to understand more about these models.
But what I did wasn’t a success. I could neither get the models to do anything of significant practical interest—nor did I manage to derive any good theoretical understanding of them. I kept wondering, though, what relationship there might be between cellular automata that “just run”, and systems like neural nets that can also “learn”. And in fact in 1985 I tried to make a minimal cellular-automaton-based model to explore this. It was what I’m now calling a “vertically layered rule array”. And while in many ways I was already asking the right questions, this was an unfortunate specific choice of system—and my experiments on it didn’t reveal the kinds of phenomena we’re now seeing.
Years went by. I wrote a section on “Human Thinking” in A New Kind of Science that discussed the possibility of simple foundational rules for the essence of thinking, and even included a minimal discrete analog of a neural net. At the time, though, I didn’t develop those ideas further. By 2017, 15 years after the book was published, and knowing about the breakthroughs in deep learning, I had begun to think more concretely about neural nets as getting their power by sampling programs from across the computational universe. But still I didn’t see quite how this would work.
Meanwhile, there was a new intuition emerging from practical experience with machine learning: that if you “bashed” almost any system “hard enough”, it would learn. Did that mean that perhaps one didn’t need all the details of neural networks to successfully do machine learning? And could one perhaps make a system whose structure was simple enough that its operation would for example be accessible to visualization? I particularly wondered about this when I was writing an exposition of ChatGPT and LLMs in early 2023. And I kept talking about “LLM science”, but didn’t have much of a chance to work on it.
But then, a few months ago, as part of an effort to understand the relation between what science does and what AI does, I tried a kind of “throwaway experiment”—which, to my considerable surprise, seemed to successfully capture some of the essence of what makes biological evolution possible. But what about other adaptive evolution—and in particular, machine learning? The models that seemed to be needed were embarrassingly close to what I’d studied in 1985. But now I had a new intuition—and, thanks to Wolfram Language, vastly better tools. And the result has been my effort here.
Of course this is only a beginning. But I’m excited to be able to see what I consider to be the beginnings of foundational science around machine learning. Already there are clear directions for practical applications (which, needless to say, I plan to explore). And there are signs that perhaps we may finally be able to understand just why—and when—the “magic” of machine learning works.
Thanks to Richard Assar of the Wolfram Institute for extensive help. Thanks also to Brad Klee, Tianyi Gu, Nik Murzin and Max Niederman for specific results, to George Morgan and others at Symbolica for their early interest, and to Kovas Boguta for suggesting many years ago to link machine learning to the ideas in A New Kind of Science.