LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

Why Wouldn't A Rationalist Be Rational?

2025-11-27 11:22:48

Published on November 27, 2025 3:22 AM GMT

Sometimes people ask me why someone behaving irrationally is present at a rationalist meetup. This essay is a partial answer to why I think that question is confused. The short version is, we are not set up such that any particular list of skills is a requirement.

I.

Once upon a time I was talking to someone at a rationalist meetup, and they described a conversation they’d had with another attendee. Let's call these two Adam and Bella. Bella, it transpired, had said something incorrect, Adam had corrected them, and Bella got defensive. A bit of back and forth ensued.

(Names have been changed to protect the innocent and guilty alike.)

Adam later talked to me and asked, “Why would a rationalist not be glad to learn that they were wrong? Don’t we want to learn to be less wrong?”

(One reason is that Adam is wrong and Bella is right. I have passed through denial and into acceptance about the rate at which people show up to the 'believe true things' club and then assume that the only explanation for why regulars do not agree with their pet contrarian take is that everybody else is wrong. And of course, being told a fact is not the same as being convinced that fact is true. Assume for the moment that Adam is right on the object level here.)

I’ve had several variations on this conversation. There is a genuine point here, and one I think is important. Why would a member of a website called LessWrong, fully aware of the Litanies of Tarski and Gendlin, not update cleanly? Since it’s come up repeatedly, I would like a canonical explanation for why I think this happens and why I’m not more upset about it. First, a few equivalent questions for other identities.

  • Why would a martial artist not be able to do a side kick?
  • Why would a chess player fall for Fool’s Mate?
  • Why would a mathematician not know how to find a derivative?
  • Why would a singer not be able to hit a C note?
  • Why would a programmer not know how to do a binary sort?
  • Why would a swimmer not know how to do a breast stroke?

When I look at that list, one answer that jumps out at me is that “mathematician” describes a very wide range of skills. 

There is no consensus on when someone starts being referred to as any of those identities. “Doctor” is pretty clear: you have to have a medical degree, and they don’t have those in stock at Walmart. “Marine” is very explicit, and I confidently expect every marine with all four limbs intact can kick my ass (and quite a few can kick my ass with fewer limbs than that). But calling ten-year-old Screwtape with his TI-83 BASIC calculator and a look of screwed-up concentration a programmer isn’t obviously incorrect, despite the fact that he didn’t even know there was a SortA() function. “Mathematician” is defined as “an expert in or student of mathematics”, which means that any kid studying arithmetic for the first time counts.

“Swimmer” is what you yell when someone goes overboard while white water rafting, whether or not they can swim.

Descriptively, “Rationalist” is sometimes used to describe people who show up to house parties in a certain social circle. I've made my particular peace with this. "Rationalist" is just another word with fuzzy category boundaries, and I think at this point it's descriptively correct that the word points as much to the social group as to the skillset.

II.

Let's get a little more complicated.

I have a friend, a great musician, she’s been making music for decades now. Do you think it’s reasonable to assume she knows how to play a G chord?

What if I say she’s a drummer?

I’m not being obtuse for no reason here. There’s lots of sub-specialization in lots of fields. Many people dabble - guitarists who can sing a little, violinists who can do the cello - but again you have to be careful with your terms and assumptions. I think of myself as a decent programmer, but I’ve never written a compiler. Some people who have written compilers haven’t ever stood up a web server. Brazilian Jiu Jitsu is great, but I don’t think it contains a side kick.

Just so, there’s people in the rationality community I expect to strictly beat me on calibration and betting, but who I think I could outmatch in anticipating my own failures (the virtue of humility). Likewise, I know some people with a comprehensive understanding of cognitive science who are just as overconfident in practice as the man on the street.

I want to be clear about what I’m not saying. I’m not saying I don’t care. I’m not saying rounding yourself out isn’t useful. I’m not saying preferences don’t matter; some authors just prefer first person and that’s fine. I’m also not saying that knowing a skill means literally never making future mistakes; sometimes a singer has a cold or just hits a flat note.

III.

And all of that assumes that there’s some kind of ladder or path to becoming more rational, which posters on LessWrong or attendees at an ACX meetup are inevitably going to encounter. If that’s what you think, you’re obviously going to different meetups than I am.

I like Duncan Sabien’s Make More Grayspaces.

If you think of the target space as a place where An Unusual Good Thing Happens, based on five or six special preconditions that make that target space different from the broader outside world...

...then the grayspace is the place where those five or six special preconditions are being laid down. The Unusual Good Thing isn’t actually happening in the grayspace. It isn’t even being attempted. And The Unusual Good Thing isn’t being exposed to people who don’t have the prereqs down pat.

If a math Ph.D. doesn’t know how to do long division, now I agree we have a problem. Likewise if a black belt in Karate can’t do a side kick. These are also true at earlier levels! Anyone with a high school diploma should have the long division thing down.[1]

"Ah," you might say, "but rationality does have its answer! We can read the sequences!"

Eh.

So first off, there's the Astral Codex Ten groups. It's not obvious from ACX that you should read the sequences. Maybe you just want to read ACX, and some people will just have been reading for the last few weeks or months before they joined.

Then you run into the fact that this is kind of rough on newcomers, and newcomers won't stop coming. I know of multiple local groups that read all the sequence posts! It usually takes them a couple years! None of them wanted to restart from the beginning even though some new people had joined halfway through. 

Plus, I'm just skeptical that reading the sequences once, years ago, is enough to remember it?

IV.

This thing we do, this collective effort to have a real art of rationality? I’m not convinced we have a real grayspace for most of it. I want to congratulate the Guild of the Rose for being a place where they say ‘hey, try changing your mind, and admit it out loud. We’re keeping track of who has done it and who hasn’t.’ It’s one of three paths and guild members can pick what paths to go down, so I can’t quite be told someone’s been a member of the Guild for years and confidently know they can change their mind without flinching, but they’re closer than anyone else I know.

What would it take?

I think it would take three things.

  • Some agreed upon list of the kata or canon.
    • It wouldn’t need to be universally agreed upon. “Karate blackbelt” is pretty useful as a phrase, even if it has a different technique list than Tai Chi.
  • Some agreed upon evaluation process, whose results were public or at least checkable.
    • It wouldn’t need to be perfectly objective. “Sensei Chan watched me follow instructions for an hour long test and says I earned the belt” counts, especially for anyone who studies with Sensei Chan.
  • Someone or something actually teaching each of the techniques, with quick feedback.
    • I’ll take a written document like the sequences if that’s what we need, but it would need a feedback loop like a tennis instructor's comments or leetcode's problem sets.

Even if we had one, “rationalist” seems as reasonable a term to apply to someone just starting out as “chess player” or “writer.” I’m not begrudging anyone the term. 

(Okay. I am a little bit begrudging the people who aren’t even trying. If you call yourself a chess player and you show up to chess club, you either gotta already know how the knight moves or be willing to have someone show you. If you’re not interested - well, look, I’d like to find a place for the people who are. But the socials are still fine, I run 'em myself and I have a good time at the ones I attend. And as a collective we really have not made it easy to get good at the skills.)

Just, consider all this next time you’re tempted to assume someone should already have a skill.

  1. ^

    Lest anyone think I'm casting stones at others here, I did not have long division down when I got my diploma. I think this demonstrated a mistake on the part of the school system. (Short version, I swapped schools a lot and the new school sort of assumed the old one had covered it.)




To write well; first, experience.

2025-11-27 10:11:43

Published on November 27, 2025 2:11 AM GMT

"On living, for the sake of your writing"

This is my most liked Substack post! 

By that, I mean it got 2 likes, because I don't really show anyone my blog. But given that Inkhaven/Half-Haven are both soon to end, I thought I'd share this one here.


Opening of the Post

Epistemic Status: Based on ‘vibes’ not science.

Over the past two months of writing every day, I have grokked the advice I have so often heard from writers: that you must experience life to be able to write well.

I doubt I’ll be able to instill this idea into your head when the writers I admire, with their flowing words and exemplary prose, were unable to do so for me. But I will still give it a go.




What it feels like to be enthusiastically time blind

2025-11-27 09:49:58

Published on November 27, 2025 1:49 AM GMT

And I rail against points no one actually made

I seem to be time blind and I like it. From my point of view most people are obsessed with how long something takes instead of how interesting it is. Why in the holy hell would you want to sit around with nothing happening? And why in the flying fuck would you interrupt yourself doing something interesting just cause time has passed?

I mean, don’t get me wrong. I agree time is an amazing coordination and planning tool. I tend to be on time. Actually, I tend to be early, cause being early requires almost no time tracking at all while you get all the benefits of being a good coordinator. You just arrive long before you need to be there and then zone out. Where zoning out is basically letting go of this finicky annoying time-thingy so we can once again focus on what matters: interesting things.

My problem is that I wish more people were timeblind. I am pretty lost on why people consider me to be the one with the “disability”. Have you considered not waiting and hanging around doing nothing? Have you considered instead, just … doing stuff cause waiting is fucking torture and your life is ticking away every second and why are we sitting here with nothing going on?

Yeah, I was a blast in meditation class.

But seriously. I don’t get it. When I’m in a group, and a problem needs solving, I tend to be the first one to jump up and do it cause well … what else are you supposed to do? I love it when someone else beats me to the punch. People who are faster than me are so relaxing.

But it’s not just in groups. Basically in any situation, have you considered just not waiting? How about not being patient at all? There are a lot of seconds, minutes, and hours that people spend doing nothing. Consider the benefits of getting annoyed that your life is ticking by with no obvious reward. Instead you could be reading, cleaning, exercising, or daydreaming.

But you want to know the irony? Most of the people I get along with are actually really patient. They don’t mind silences, they don’t mind delays, they don’t mind things going slowly.

I hate it. Which is great. They can do that patience thing while I poke a shiny light.

Except when I don’t, cause timeblindness goes both ways: When nothing is happening I want things to go faster, but when stuff gets interesting I want the whole world to stop. Please don’t talk to me when I’m reading. Please don’t interrupt me when I’m typing. Please don’t expect me to stop binging this great book, show, game, or work project that is the glorious ambrosia I want to suck on all day.

I tend to say I crave behavioral addictions. I don’t recommend it. Not cause it’s bad for me, but because it’s probably bad for you. I seem to have gotten a lucky roll on the addiction-trait-chart. See I’m too impatient to press the button of Unreliable Reward. You can’t lootbox me or slot machine me. The distance between the cookies is too astronomical. I’d be thoroughly annoyed before you can string me along. Ain’t nobody got time for that.

Instead I want a game addiction, a twitter addiction, or a media addiction. Cause whenever I get one, I get to live in the glorious place where my entire existence is tiled in the One Interesting Thing, and my mind is synthesizing every possible connection together from every angle, and time is a lieeeee.

Don’t ask me to stop. Don’t ask me to eat. Don’t ask me to sleep or drink.

Actually, please do.

You are probably one of those patient people, who noticed it’s 2 hours past dinner time, and if you like me, you might shove a plate of food under my door. I love you <3

I’ve gotten addicted to facebook, to gaming, to work. Each time I forget whatever else I’m doing, and then try to remember to eat and sleep, while just drowning myself in cool things spawning more cool things.

Don’t make me wait for that.

Cause I don’t care about time. I care about reward signals. Patience is learning to cope with a problem I’d rather solve instead: If there is nothing interesting going on, please just let me make it interesting.

I do agree it’s hard sometimes. Till this year, I’d get angry if my partner said he’d be back in 5 minutes and then was back in 5 actual minutes. I know that sounds deranged. What actually happens is that I don’t look at a clock, don’t have anything to do in those 5 minutes, and every second expands into an additional 10 seconds. If no one actually sets a timer, I become convinced 20 minutes or more have passed.

Compare that to playing a fun game or reading a good book, and every second somehow shrinks to only a fraction of itself. You blink and all the time is gone.

Getting time imposed over interest is maddening to me. I become viscerally angry. There are ways to not. Those include nothing like the advice I’ve gotten. I don’t want to “ground myself on sensory stimuli”. Those are slow and just emphasize we are simply sitting here dying at a speed of 10 seconds per second.

I also don’t want to breathe. Breathing is done automatically. Are you being serious, you want me to spend my attention and time on manually steering my body functions while I could [wildly waves hands] be doing something interesting?

I get that I could wirehead myself into being ok with nothingness. I’m sort of a fan of agnostic buddhism on some level, cause they are cool about being nice about your feelings and your mind. But there is a limit somewhere between functioning ok in life and actually pursuing things you love.

It has helped me to know I’m time blind though. I only discovered it two years ago. For the last year, I’ve set a literal timer when someone says they will be back in a few minutes. It saves us both a bunch of grief. If someone speaks slowly, I either start counting in my head or start poking things around me.

Wow, Shoshannah, that sure sounds like ADHD.

Yeah, sure. If I take ritalin I can see time just fine. But it isn’t actually better in most ways. Less happens. There are fewer connections. Less pressure. Less drive. And I like to be driven. Just let me be time blind. I’ll try to keep things interesting for us. And if you have a minute, maybe push a plate of food under my door.

Hi <3




A Technical Introduction to Solomonoff Induction without K-Complexity

2025-11-27 05:36:39

Published on November 26, 2025 9:36 PM GMT

This post has been written in collaboration with Iliad in service of one of Iliad's longer-term goals of understanding the simplicity bias of learning machines. 

Solomonoff induction is a general theoretical ideal for how to predict sequences that were generated by a computable process. It posits that in order to predict the continuation of our world, we should weight hypotheses by their simplicity ("Occam's razor'') and update the weighting using Bayesian reasoning.

In this post, I introduce Solomonoff induction at a technical level, hoping to be more elegant and satisfying than introductions you might find elsewhere. In particular, the introduction avoids Kolmogorov complexity: instead of making predictions with the Solomonoff prior probability of hypotheses (given as two to the power of the negative length of their shortest encodings, i.e., their Kolmogorov complexities), we directly use the a priori probability that a hypothesis is sampled when program code is sampled uniformly at random. This is closely related to the alt-complexity sketched by Nate Soares before. The approach leads to more elegant results that are simpler to prove than in typical expositions. Yet, since only the prior over hypotheses is changed, the introduction remains compatible with other expositions and should be broadly valuable to many readers.

Later, we will see that this choice has both mathematical advantages and disadvantages compared to an approach based on Kolmogorov complexity, and the latter is likely a more powerful basis to build more theory on. These drawbacks notwithstanding, I hope that this introduction is philosophically satisfying, by touching on questions like "Why should we expect our world to be learnable or predictable in the first place?" and "What is a universally good way to predict?", and on connections to Occam's razor and why one should favor simple hypotheses when predicting the world. In the FAQ at the end, I will touch on several anticipated questions, including briefly on how the theory might be relevant for neural networks.

Audience. I try to not strictly assume much prior knowledge on Turing machines or related concepts, but prior exposure will still be very useful. Some amount of mathematical or scientific maturity can probably replace exposure to these topics.

Acknowledgments. This post benefitted most from discussions with Alexander Oldenziel, and it was caused by his inquiry about the potential simplicity of learned neural network functions based on analogies with algorithmic information theory (AIT). This made me want to think more carefully about a "degeneracy-first" perspective on AIT itself, from which this post emerged. I'm also grateful to Cole Wyeth for discussions and his feedback on an earlier draft that led to some changes. He also explained to me the argument for why the Solomonoff prior gives the optimal prediction bound among all lower semicomputable priors. Thanks also to Matt Dellago for discussions on the content of this post, and to CEEALAR for hosting me when I finished writing this post. 

Introduction

If we make a sequence of observations in the world, e.g., a sequence of numbers or a sequence of physical measurements, we are often able to continue the sequence in a way that coincides with its true continuation "in the wild". This seems to broadly work in much of the real world, meaning the world is learnable — we can learn from past experience to predict the future.

Why is that? This question has puzzled humanity for thousands of years.  For example, Sextus Empiricus expressed that any rule inferred from a finite amount of data may always be contradicted by more data, and that it is impossible to take into account all data:

When they propose to establish the universal from the particulars by means of induction, they will affect this by a review of either all or some of the particulars. But if they review some, the induction will be insecure, since some of the particulars omitted in the induction may contravene the universal; while if they are to review all, they will be toiling at the impossible, since the particulars are infinite and indefinite.

— Sextus Empiricus (160-210)

René Descartes notices that our minds are so fallible that we cannot even be sure of being awake:

How often, asleep at night, am I convinced of just such familiar events — that I am here in my dressing-gown, sitting by the fire — when in fact I am lying undressed in bed! Yet at the moment my eyes are certainly wide awake when I look at this piece of paper; I shake my head and it is not asleep; as I stretch out and feel my hand I do so deliberately, and I know what I am doing. All this would not happen with such distinctness to someone asleep. Indeed! As if I did not remember other occasions when I have been tricked by exactly similar thoughts while asleep! As I think about this more carefully, I see plainly that there are never any sure signs by means of which being awake can be distinguished from being asleep. The result is that I begin to feel dazed, and this very feeling only reinforces the notion that I may be asleep.

— René Descartes (1596-1650)

And in the following passage, David Hume seems to imply that while we can make future inferences about cause and effect from analogizing about past patterns, there is no principle that can be discovered in nature itself that would justify doing so:

All our reasonings concerning matter of fact are founded on a species of Analogy, which leads us to expect from any cause the same events, which we have observed to result from similar causes. [...] It is impossible, that this inference of the animal can be founded on any process of argument or reasoning, by which he concludes, that like events must follow like objects, and that the course of nature will always be regular in its operations. For if there be in reality any arguments of this nature, they surely lie too abstruse for the observation of such imperfect understandings; since it may well employ the utmost care and attention of a philosophic genius to discover and observe them.

— David Hume (1711-1776)

Similar reasoning is expressed by the no free lunch theorems, which roughly state that a priori, no learning algorithm should perform better than any other.

We conclude that without placing any assumptions on the world, learning and prediction should indeed be impossible: If the world were truly (uniformly) random, then any way to use past data to predict the future would indeed be doomed. So why is it possible to learn?

One pre-formal answer is Occam's razor, which exists in various formulations and states that among all explanations that are compatible with our observations, we should choose the one making the least amount of assumptions. There are two quotes often attributed to Occam to this effect:

"Plurality must never be posited without necessity."

"It is futile to do with more things that which can be done with fewer."

— William of Ockham (1287-1347)

Occam's razor is a useful starting point, but it is not formal. Additionally, it would be nice to be able to justify Occam's razor from some minimal assumption that does not simply assume that we should focus on simple hypotheses in our reasoning. How can one do this?

In this post, I motivate the answer by starting with the minimal assumption that our world is computed. Not knowing what it is computed by, we assume that we are generated by running a uniformly randomly sampled program code through a universal Turing machine (Section link).

A consequence of this assumption will be that we can also think of our world as being sampled in a two-stage process: First by sampling a semicomputable (perhaps probabilistic) universe from a prior distribution, and then by generating our history from said universe. As it turns out, the prior on universes — which I call the a priori prior — will automatically have a simplicity-weighting built in, thus justifying Occam's razor (Section link).

Repurposing this two-stage distribution for making predictions then leads to Solomonoff induction, which has remarkable predictive power no matter what computation truly underlies our universe. I.e., even if the program code underlying our true universe was in fact not sampled randomly, but chosen adversarially to make our prediction task hard, we would eventually, for a sufficiently long observed history, converge to perfect predictions (Section link).

I will then discuss drawbacks of my choice of an a priori prior. In particular, the weights are likely not lower semicomputable, and the a priori prior is not invariant up to a constant under a change of the universal Turing machine, unlike the classical Kolmogorov-based Solomonoff prior. Additionally, the Solomonoff prior satisfies an optimality property for sequence prediction that does not exist in similar form for the a priori prior. This also means that my choice of prior — which can be regarded as being based on the probability, or degeneracy, of hypotheses — is meaningfully different from an approach based on description length (Section link).

This shows that the maxim that description length equals degeneracy is not universal in algorithmic information theory. Relatedly, in a footnote I discuss the relationship to Nate Soares's alt-complexity and will explain that many commenters under that post were probably wrong about K-complexity and alt-complexity being identical up to a constant. These problems notwithstanding, I do think my choice of prior is a priori easier to justify, depends on fewer choices, and leads to some results being more elegant.

I then conclude and anticipate questions in my FAQ, focusing briefly on the question of how to think about Solomonoff induction in the context of agents observing only part of the universe's history, practical usefulness, potential applicability of Solomonoff induction as a theoretical ideal to understand neural networks, the question of whether Solomonoff induction can be considered a theory of everything, and the sensitivity of codelength to the Turing machine model to work with. The answers are very preliminary, and much more could be said or studied about them. 

On being generated by a random computer program

Observing our world, it seems like the future is determined by the past with precise rules. We capture this by the minimal assumption that our world is computed. Without a priori knowledge of the program that generated us, we simply assume that our history emerged by sampling a uniformly random computer program and piping it through a universal programming language that outputs sequences.

In this section I formalize what this means. Let $\mathbb{B} = \{0,1\}$ be our alphabet of bits. $\mathbb{B}^*$ is the set of binary strings of arbitrary finite length (where the length-$0$ string $\epsilon$ is explicitly allowed!). $\mathbb{B}^\infty$ is the set of infinite binary sequences. And $\mathbb{B}^{*\infty} := \mathbb{B}^* \cup \mathbb{B}^\infty$ is the set of finite or infinite binary sequences. For two sequences $x$ and $y$, we write $xy$ for their concatenation. $x$ is then said to be a prefix of $xy$, written $x \sqsubseteq xy$.

Monotone Turing machines

How do we capture that we're "computed"? One idea is to assume that we are generated one "frame" at a time by a program that never changes past frames. This leads to the formalization of a monotone Turing machine $M$ in Definition 4.5.3 of Li and Vitányi.[1] Formally, a monotone Turing machine is a computable function $M\colon \mathbb{B}^* \to \mathbb{B}^{*\infty}$ with the monotonicity property: for all $x, y \in \mathbb{B}^*$ with $x \sqsubseteq y$, we have $M(x) \sqsubseteq M(y)$.[2] There is a lot to unpack here, and I hope that the following semi-formal explanations give enough intuitions to read the rest of the post:

  • The inputs $x \in \mathbb{B}^*$ can be interpreted as program instructions written in binary code.
  • The outputs $M(x) \in \mathbb{B}^{*\infty}$ are (possibly infinite) histories; imagine a binary encoding of our universe's history. Thus, each code $x$ contains the instructions to compute a history $M(x)$.
  • You should imagine $M$ to be a (potentially non-universal) programming language that translates a code $x$ to the history $M(x)$.
  • Formally, that $M$ is computable simply means that there is an "algorithm" in the intuitive sense that computes the history for a given code. This is justified by the Church-Turing thesis, which allows us to avoid the precise formalism of Turing machines.
  • Monotonicity is formalized as $M(x) \sqsubseteq M(y)$ for all codes $x, y$ with $x \sqsubseteq y$. Intuitively, $M$ reads input codes from left to right and may already write (arbitrarily many) output bits along the way. Earlier bits can never be erased. Outputs are simply "monotonically increasing" by concatenating further bits.
  • Notably, even though the inputs $x$ are finite, it is possible that $M$ produces an infinite output $M(x)$. Intuitively, this happens if $M$ runs into a "loop" that applies the same rules recursively to produce the next output bits conditional on the prior output history. This is similar to how our physical history is generated by stable physical laws that run "on a loop" conditioned on the present; the physical laws themselves never change.
  • If it indeed happens that $M(x)$ is an infinite output, then any further bits are never reached, and so $M(xy) = M(x)$ for all extensions $xy$ of $x$.
  • It is possible for $M(x)$ to be finite and yet to have $M(xy) = M(x)$ for all $y$. Intuitively, this happens if all further bits are simply regarded as "dead code" or a "comment" that does not alter the output.
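
To make the monotonicity property concrete, here is a minimal Python sketch of a toy monotone machine; the bit-doubling rule is an arbitrary illustrative choice and is of course not universal.

```python
import itertools

def toy_monotone_machine(code: str) -> str:
    """A toy monotone machine: reads the code left to right and writes each
    bit twice. Output bits, once written, are never revised, so a prefix of
    the code always yields a prefix of the output."""
    out = []
    for bit in code:
        out.append(bit * 2)  # emit two output bits, never erase anything
    return "".join(out)

# Empirically check the monotonicity property on all codes of length <= 5.
all_codes = ("".join(bits) for n in range(6) for bits in itertools.product("01", repeat=n))
for y in all_codes:
    for k in range(len(y) + 1):
        x = y[:k]  # every prefix x of y
        assert toy_monotone_machine(y).startswith(toy_monotone_machine(x))
print("monotonicity holds on all codes of length <= 5")
```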

Universal monotone Turing machines

Not all monotone Turing machines are equally interesting. For example, imagine a machine which simply outputs an empty sequence on any input: this satisfies all properties stated above, but it is very boring.

The other extreme is a universal monotone Turing machine, which can simulate all others. It does this by operating on codes that first describe an arbitrary monotone Turing machine using a binary code $c$, followed by the input $x$ to said machine. One tricky detail is that we cannot simply concatenate $c$ and $x$ to $cx$, since then we may forget where the description $c$ ends and the input $x$ starts.

A solution is to use a prefix-free encoding $\bar{c}$ of the description $c$, for example

$$\bar{c} := 1^{\ell(b(\ell(c)))}\, 0\, b(\ell(c))\, c.$$

Let's unpack this: $\ell(c)$ is the length of $c$, and $b(n)$, for a natural number $n$, is an encoding of $n$ as a binary sequence, usually by ordering binary sequences by length and then lexicographically. $\bar{c}$ then first contains a sequence of $1$'s followed by a zero. The number of $1$'s is precisely the length of $b(\ell(c))$. After reading that, one knows how many of the following bits encode $\ell(c)$. After reading those bits, one knows the length of $c$, and so after reading that many further bits, one knows $c$. Thus, it is algorithmically possible to disentangle a binary sequence of the form $\bar{c}x$ into $c$ and $x$, since $\bar{c}$ itself tells you when $c$ will end.
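
Here is a minimal Python sketch of such a self-delimiting scheme; the exact convention above is one of several equivalent variants, and the function names are illustrative.

```python
def encode(c: str) -> str:
    """Prefix-free encoding following the structure described above: a run of
    1s announcing how many bits the length field takes, a 0 separator, the
    length of c in binary, and finally c itself."""
    length_bits = bin(len(c))[2:]  # binary encoding of len(c)
    return "1" * len(length_bits) + "0" + length_bits + c

def decode(stream: str):
    """Split a stream of the form encode(c) + x back into (c, x)."""
    n = stream.index("0")                    # number of leading 1s
    length_bits = stream[n + 1 : 2 * n + 1]  # the next n bits encode len(c)
    length = int(length_bits, 2)
    start = 2 * n + 1
    return stream[start : start + length], stream[start + length :]

c, x = "10110", "0101"
assert decode(encode(c) + x) == (c, x)  # the code c and the input x can be recovered
```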

Then, a universal monotone Turing machine $U$ only produces non-trivial outputs for inputs of the form $\bar{c}x$. Additionally, for every $c$, the function $x \mapsto U(\bar{c}x)$ is itself a monotone Turing machine, and every monotone Turing machine $M$ is of the form $x \mapsto U(\bar{c}x)$ for at least one $c$. Note that $U$ will never produce output bits before reading all of $\bar{c}$, since $\bar{c}$ only encodes a monotone Turing machine. Thus, for any prefix $y$ of $\bar{c}$, $U(y)$ is simply the empty string.

This is a nice definition, but does such a universal machine exist? The answer is yes, and it is proved by enumerating the set of monotone Turing machines in a computable way, which Li and Vitányi[1] state to be "clearly" possible right before Definition 4.5.5, without giving a proof. Intuitively, this is believable by thinking about universal programming languages: any "monotone Turing machine" can be implemented by a finite code in any universal programming language like Python.

I fix a universal monotone Turing machine $U$ for the rest of the post. In a later section we will see that the theory is quite sensitive to the choice of this machine, but for now, we will not worry about this.

The universal distribution

A monotone Turing machine $M$ induces another function (by abuse of notation denoted by the same symbol) $M\colon \mathbb{B}^\infty \to \mathbb{B}^{*\infty}$ whose output on an infinite input $z \in \mathbb{B}^\infty$ is defined to be the limit of the outputs of finite prefixes:[3]

$$M(z) := \lim_{x \sqsubseteq z,\; x \in \mathbb{B}^*} M(x).$$
Now, let $\lambda$ be the uniform distribution on $\mathbb{B}^\infty$, which samples infinite binary sequences one bit at a time, each with probability 50% to be $0$ or $1$. Let $z \sim \lambda$ be a random input sampled from $\lambda$, and let $U(z)$ be its output (possibly infinite). For finite $x \in \mathbb{B}^*$, let $M_U(x)$ be the resulting probability that a history starts with $x$. Formally:

Definition 1 (Universal a priori distribution). Let $U$ be a universal monotone Turing machine and $\lambda$ be the uniform distribution on infinite binary codes. The universal a priori probability that a history starts with $x \in \mathbb{B}^*$ is defined as

$$M_U(x) := \lambda\bigl(\{z \in \mathbb{B}^\infty : x \sqsubseteq U(z)\}\bigr).$$
I invite you to pause for a moment to appreciate what a profound perspective shift this is compared to the historical puzzlement I emphasized in the introduction: when incorrectly assuming that our history is sampled uniformly, we cannot predict the future, as the skeptical views expressed by Empiricus, Descartes, and Hume make clear. However, the universal a priori distribution $M_U$ does not assume a uniformly sampled history. Instead, it assumes a uniformly sampled code, from which the history is generated with the fixed, deterministic universal monotone Turing machine $U$. In other words, we take our universe to have a uniformly sampled description instead of history.
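
As a hedged illustration of Definition 1, the following Python sketch estimates the a priori probability of a prefix by sampling random code bits; a toy non-universal machine stands in for $U$, so the numbers only illustrate the construction, not universality.

```python
import random

def toy_machine(code: str) -> str:
    """Toy stand-in for a monotone machine: writes each code bit twice."""
    return "".join(bit * 2 for bit in code)

def estimate_a_priori(x: str, samples: int = 100_000, code_len: int = 32) -> float:
    """Monte Carlo estimate of the probability that the output produced from
    uniformly random code bits starts with x."""
    hits = 0
    for _ in range(samples):
        code = "".join(random.choice("01") for _ in range(code_len))
        if toy_machine(code).startswith(x):
            hits += 1
    return hits / samples

print(estimate_a_priori("11"))  # close to 0.5: only the first code bit matters here
print(estimate_a_priori("10"))  # 0.0: this toy machine never outputs "10" as a prefix
```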

Should we expect any regularities in the histories  sampled from ? And if so, does this regularity give us a hint about how to predict sequence continuations in our actual universe? These questions are answered in the next two sections.

$M_U$ is a universal mixture distribution

We now state and prove an equivalent description of the universal a priori distribution $M_U$, which will show that $M_U$ has some simplicity-weighting. Intuitively, it will turn out that sampling histories from $M_U$ is as if you were to first sample "probabilistic physical laws" in a way that obeys Occam's razor, and then use those laws to sample the history.

Lower semicomputable semimeasures

We formalize "probabilistic physical laws" by the notion of a semimeasure: 

Definition 2 ((Lower semicomputable) semimeasure). A semimeasure is a function $\nu\colon \mathbb{B}^* \to [0,1]$ with $\nu(\epsilon) = 1$[4] and $\nu(x) \geq \nu(x0) + \nu(x1)$ for all $x \in \mathbb{B}^*$, where $\epsilon$ is the empty binary string. $\nu$ is called lower semicomputable if it can be computably approximated from below; this means there is an algorithm which, on input $x$, produces progressively better approximations of $\nu(x)$ from below. I abbreviate "lower semicomputable semimeasure" with LSCSM from now on.

For an LSCSM $\nu$, $\nu(x)$ is interpreted as the probability, under $\nu$, that an infinite sequence starts with $x$. We can have $\nu(x) > \nu(x0) + \nu(x1)$ since the sequence may not continue after starting with $x$. This is akin to the possibility that the universe can end at any moment.

As an aside, LSCSMs also include deterministic sequences: let $z \in \mathbb{B}^\infty$ be a fixed sequence that can be generated by a computer program. Define $\nu(x) = 1$ for all prefixes $x$ of $z$, and $\nu(x) = 0$ otherwise. Then for a given $x$, we can compute $\nu(x)$ by computing $z$ up to length $\ell(x)$ and determining whether $x$ is a prefix. This shows that $\nu$ is an LSCSM that generates the sequence $z$.
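
A minimal Python sketch of such a deterministic LSCSM, with an arbitrarily chosen computable sequence (the alternating sequence 0101...) standing in for the fixed sequence:

```python
def z_prefix(n: int) -> str:
    """The first n bits of a fixed computable sequence; here, for illustration,
    the alternating sequence 010101..."""
    return ("01" * (n // 2 + 1))[:n]

def nu(x: str) -> float:
    """Deterministic semimeasure: probability 1 on prefixes of z, 0 elsewhere.
    Computable by generating z up to length len(x) and comparing."""
    return 1.0 if z_prefix(len(x)) == x else 0.0

print(nu("0101"), nu("0100"))  # 1.0 0.0
```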

How can we obtain LSCSMs? Well, for one, we already know one instance:  is a semimeasure since 

and for every :

leading to $M_U(x) \geq M_U(x0) + M_U(x1)$. $M_U$ is also lower semicomputable. Indeed, to approximate $M_U(x)$ from below, simply enumerate all finite binary sequences $p$, compute $U(p)$ over all $p$ in parallel, and record whether $x \sqsubseteq U(p)$. If this is the case, then for all infinite extensions $z$ of $p$, we have $x \sqsubseteq U(z)$, and they together have probability $2^{-\ell(p)}$, with $\ell(p)$ the length of $p$. Thus, we can add $2^{-\ell(p)}$ to our estimate of $M_U(x)$ and then skip any extensions of $p$ in our algorithmic procedure. The estimate then progressively approximates $M_U(x)$ since for every $z$ with $x \sqsubseteq U(z)$, there is a finite prefix $p$ of $z$ with $x \sqsubseteq U(p)$, meaning we will find the contribution eventually in our approximation.
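
Here is a minimal Python sketch of this lower approximation, again with a toy halting machine standing in for $U$ (so the "in parallel" dovetailing that a genuine universal machine would require, since its runs may not halt, is not needed):

```python
from itertools import product

def toy_machine(code: str) -> str:
    """Toy monotone machine (writes each code bit twice); it always halts."""
    return "".join(bit * 2 for bit in code)

def lower_approximation(x: str, max_len: int = 12) -> float:
    """Approximate from below the probability that uniformly random code bits
    yield an output starting with x. Whenever a finite code p already forces x
    as an output prefix (and no prefix of p was already credited), all infinite
    extensions of p contribute 2**(-len(p)) at once."""
    credited = []   # minimal codes whose extensions have been counted
    estimate = 0.0
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            p = "".join(bits)
            if any(p.startswith(q) for q in credited):
                continue  # already counted via a shorter prefix
            if toy_machine(p).startswith(x):
                credited.append(p)
                estimate += 2.0 ** (-len(p))
    return estimate

print(lower_approximation("11"))    # 0.5, reached already at code length 1
print(lower_approximation("1111"))  # 0.25, reached at code length 2
```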

What other ways are there to obtain LSCSMs? Well, you may have noticed that I did not use the universality of  in the explanation above. In other words, every monotone Turing machine  gives rise to an LSCSM  by piping uniformly random noise through :

And this is all: All LSCSMs emerge from a monotone Turing machine in this way, see Theorem 4.5.2 in Li and Vitányi.[1][5]

The mixture equivalence

Now, for , define the function  by

which is itself a monotone Turing machine. Define the LSCSM 

Definition 3 (A priori prior). The a priori prior is given for any LSCSM  by

which is the probability[6] that a uniformly random infinite sequence  starts with an encoding  that gives rise to the LSCSM .

The reason I call this the "a priori prior" is that it simply emerged from our definitions without further assumptions, which distinguishes it from the Kolmogorov-based Solomonoff prior that I will investigate later (Section link). 

Finally, let  be the set of all LSCSMs, where "sol" is shorthand for "Solomonoff". We obtain:

Theorem 4 (Mixture Equivalence). The universal distribution  is a universal mixture over all LSCSMs with weights :

Proof. Since  only has outputs for inputs of the form  for , we obtain

We obtain

That proves the claim. 

This means that if our history was sampled from $M_U$, we can also think of it as being sampled by first sampling an LSCSM $\nu$ with its a priori prior probability, and then sampling the history successively from $\nu$.

If an LSCSM $\nu$ has an encoding $c$, then its a priori prior probability is at least $2^{-\ell(\bar{c})}$. This gives a simplicity bias: universes with a simple (i.e., short) description $c$ are more likely under this a priori prior. This can be interpreted as a form of Occam's razor, in that hypotheses with a short description receive exponentially more weight.[7]
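
As a concrete illustration with arbitrarily chosen lengths: if one LSCSM has a description whose prefix-free encoding $\bar{c}$ is 20 bits long, its a priori prior weight is at least $2^{-20} \approx 10^{-6}$, while an LSCSM whose shortest such encoding is 40 bits long is only guaranteed $2^{-40} \approx 10^{-12}$. The simpler hypothesis thus starts with a guaranteed head start of a factor of roughly $2^{20} \approx 10^{6}$, which observations can of course later overturn via Bayesian updating.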

Solomonoff induction

So far, I have started with the assumption that our universe is the result of sampling a uniformly random program, and arrived at the conclusion that this is equivalent to first sampling an LSCSM with a prior that gives more weight to simple hypotheses, and then sampling our history from it. Thus, under our assumption, we should expect that the world is more predictable than sceptical paradoxes suggest: we can concentrate our reasoning on simple universes and should have a chance to predict the future better than chance. We will make this case in mathematical detail in this section.

Universal sequence prediction

How should we make predictions in light of these thoughts? The answer: We predict using the Solomonoff mixture distribution  itself, predicting sequence continuations via familiar conditional probabilities  upon observing an initial history . This is the most parsimonious way to predict, assuming we a priori only make the assumption of having emerged from a program being sampled uniformly at random.

We now explain in more detail how to predict using  by relating the conditionals to Bayesian updating. For any LSCSM , define the conditional probability . We obtain:

where I defined  in the suggested way. This is actually a Bayesian posterior in the conventional sense; One can notationally see this more clearly by writing , the likelihood of history  conditional on the hypothesis . Then the inner formula becomes:

This is the classical Bayes theorem.

Thus, we arrive at a very satisfying description of Solomonoff induction:

  • Observe the universe's history .
  • Determine the posterior  using the a priori prior  and likelihood  for each hypothesis  for the rules underlying our universe.
  • Predict the next data point using a weighted average of the predictions  of each hypothesis , weighted by their posterior probability.

This is very appealing:

  • It has Occam's razor built in, i.e., the common heuristic that hypotheses are preferable if they're simpler. Usually this means to have a short description, but here the notion of simplicity is extended to mean "easier to stumble into by chance" by piping random noise through the universal monotone Turing machine. Hypotheses that have a short description are likely in this sense, but this is not the only possibility: Another way is to have exponentially many free choices that lead to the same distribution, as might be the case with the choice of a coordinate system in our universe (cf. Nate's post.)
  • It is very compatible with typical scientific wisdom to discard and validate theories using observations (Bayes theorem).
  • It is mathematically precise.

One often-noted drawback is that it is incomputable since we can't precisely determine the posterior  or the conditional . However:

  • We do often have good intuitions about the prior probability  of hypotheses. 
  • , an ingredient in the posterior and the conditional, cannot be computed, but it can in principle be approximated from below since  is assumed to be lower semicomputable. The only issue is that it is not always knowable how good the approximation is at any finite stage.
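
To make the three-step recipe above concrete, here is a hedged Python sketch that replaces the full (incomputable) class of LSCSMs with a tiny hand-picked family of Bernoulli sources; the hypothesis class, prior weights, and observed history are all illustrative assumptions, not part of the formal construction.

```python
from math import prod

# Illustrative hypothesis class: Bernoulli(theta) sources with assumed prior weights.
PRIOR = {0.1: 0.25, 0.5: 0.25, 0.9: 0.5}  # theta -> prior weight, sums to 1

def likelihood(theta: float, history: str) -> float:
    """P(history | theta) for an i.i.d. Bernoulli source over bits."""
    return prod(theta if bit == "1" else 1.0 - theta for bit in history)

def posterior(history: str) -> dict:
    """Bayes' theorem: posterior weight proportional to prior times likelihood."""
    unnormalized = {t: w * likelihood(t, history) for t, w in PRIOR.items()}
    total = sum(unnormalized.values())
    return {t: u / total for t, u in unnormalized.items()}

def predict_next_one(history: str) -> float:
    """Mixture prediction: posterior-weighted average of each hypothesis'
    probability that the next bit is 1."""
    return sum(p * t for t, p in posterior(history).items())

history = "1101110111"  # an assumed observed history
print(posterior(history))         # mass concentrates on theta = 0.9
print(predict_next_one(history))  # mixture probability that the next bit is 1
```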

The Solomonoff bound

All of this is not useful if it would not result in actual predictability of the world. After all, our question was why is our actual universe predictable. Fortunately, it turns out that Solomonoff induction is actually an exquisite predictor under the reasonable assumption that our own universe is a lower semicomputable measure, instead of just a semimeasure.[8]

Theorem 5 (Solomonoff Bound). Let  be any lower semicomputable measure on infinite sequences, and assume it generates our actual universe. Define the cumulative expected prediction error of predicting sequences sampled by  via  as:

Then we have

In other words, the prediction error is upper-bounded by the cross entropy between the Dirac measure on our true universe and our a priori prior. In particular, for increasingly long histories, the total remaining prediction error goes to zero.[9]

Proof. We have (with explanations after the computation):

Step 1 is the definition. Step 2 is a standard inequality that holds whenever the true distribution is a measure, see Corollary 2.8.8 in Hutter et al.[10] (whereas the other argument is allowed to be only a semimeasure). Step 3 uses the definition of the KL divergence and changes the infinite series to a limit of partial sums. Step 4 substitutes in the definition of the conditional KL divergence.

Step 5 is a crucial step. Similar to how Shannon entropy  satisfies the well-known chain rule , KL divergence also satisfies the chain rule . Intuitively, the divergence of  and  measured in the variable  plus the divergence in  once  is already known equals the total divergence of  and  over .  Telescoping this chain rule from  till  gives precisely the result of step . A full proof of this telescope property can be found in Lemma 3.2.4 in Hutter et al.[10]. Step 6 is just the definition of the KL divergence again.

Now, note that by Theorem 4, we have

With this inequality, we obtain

That finishes the proof. 

Thus, adding to our earlier investigations, not only do we have reason to believe our universe is predictable, there also is an amazing predictor as long as our universe is generated from a lower semicomputable measure; this resolves the skeptical paradoxes from the introduction. Furthermore, this predictor is given by a blank-slate probability distribution $M_U$ that assumes no prior knowledge of what our universe is. While the prediction error is lower for universes that are more likely under the a priori prior (which includes universes with a short description), I note that it goes to zero for long histories regardless of the true universe.
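
As a hedged numerical illustration of the flavor of Theorem 5 (a toy finite mixture rather than the full universal mixture; all specific numbers below are assumptions), the following Python sketch checks the dominance step that drives the proof: because the mixture assigns any history at least its prior weight times the probability the true source assigns, the mixture's excess log-loss over the truth never exceeds the negative log prior weight of the true hypothesis, however long the history gets.

```python
import math
import random

random.seed(0)

# Toy mixture of Bernoulli(theta) hypotheses with assumed prior weights (sum to 1).
PRIOR = {0.2: 0.25, 0.5: 0.25, 0.8: 0.5}
TRUE_THETA = 0.2  # the "true universe", which has prior weight 0.25 in the mixture

def seq_prob(theta: float, history: str) -> float:
    """Probability the Bernoulli(theta) source assigns to the whole history."""
    return math.prod(theta if b == "1" else 1.0 - theta for b in history)

def mixture_prob(history: str) -> float:
    """Probability the prior-weighted mixture assigns to the whole history."""
    return sum(w * seq_prob(t, history) for t, w in PRIOR.items())

history = "".join("1" if random.random() < TRUE_THETA else "0" for _ in range(300))

# Dominance: mixture(x) >= prior(true) * true(x) for every x, so the mixture's
# excess log-loss over the true source never exceeds -log(prior weight of truth).
bound = -math.log(PRIOR[TRUE_THETA])
for n in (1, 10, 50, 300):
    h = history[:n]
    excess = math.log(seq_prob(TRUE_THETA, h)) - math.log(mixture_prob(h))
    print(n, round(excess, 4), "<=", round(bound, 4))  # always below the bound
```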

Comparison to Solomonoff induction with K-complexity

This section compares our constructions of Solomonoff induction to the more familiar construction where the prior  is replaced by the Solomonoff prior . If you are not interested in this section, read on with the conclusion. This section assumes more prior knowledge than previous sections. 

Assume we have a computable enumeration  of LSCSMs. We can, for example, obtain this by mapping a natural number  to its binary representation a code   and then defining  as above. Furthermore, let  be a separate universal prefix Turing machine, which is not directly related to the universal monotone Turing machine . Define the Kolmogorov complexity  of a natural number  as the length of the shortest input to  which computes a binary encoding of . The Solomonoff prior is then given by

One can then prove that approximately (i.e., up to at most a constant multiplicative error), we have

see Li and Vitányi,[1] Theorem 4.5.3. I will prove this statement at the start of the second subsection to reveal an advantage of .[11]

Advantages of  over 

The Solomonoff prior has some distinct advantages that are missing in our definition:

Lower semicomputability. The weights  are lower semicomputable, which our weights  are probably not. The reason is that to approximate , we would need to be able to algorithmically determine for any binary string  whether . There seems to be no method to do this. See our investigation in a footnote for more details on missing lower semicomputability in the related case of Nate's alt-complexity.[12]

Invariance under a change of the UTM. The weights  are, up to a multiplicative constant, invariant under a change of . In contrast,  is not invariant, not even up to a multiplicative constant, under a change of the universal monotone machine . I learned the following argument for this from Cole Wyeth. Namely, take your universal monotone Turing machine . Define a second machine from it, , which is given by , and with empty output on any other input. We can easily see that this is again a universal monotone Turing machine. With suggestive notation, we obtain:

Thus, if  is very small, then  becomes extremely small, and there can be no constant factor  with . We note that in the approximate inequality, we use , which only holds up to a logarithmic error term that probably does not change the conclusion substantially. 

Besides, this also means that the a priori prior  has no simple relationship to the K-complexity-based Solomonoff prior  that can hold independently of the choices of the involved universal Turing machines. This may be surprising given that the coding theorem shows that the a priori probability of a string is up to a constant identical to its K-complexity. This observation is related to my belief that the alt-complexity and K-complexity in Nate's post are not identical up to a constant, contrary to what several commenters claimed. I discuss this in more detail in a footnote.[12]

Optimality of the Solomonoff prior. Assume we remain in the realm of lower semicomputable semiprobability mass functions , of which  is an example, and consider mixtures  There is a sense in which the Solomonoff prior  is optimal under all these choices for . Namely, let  be an enumeration of all lower semicomputable semiprobability mass functions. Then for each  under consideration, there is a  with . By the coding theorem (Theorem 4.3.3 in Li and Vitányi[1]), we have

up to the multiplicative constant  in the last step. In other words,  has "more weight" on all indices  than . Consequently, when adapting Theorem 5 to general lower semicomputable mixtures in place of  provides a better bound than  (up to an additive constant):

There seems to not be any analogous result for our a priori prior . Indeed, it would be surprising if such a result exists since this prior is very non-invariant under the change of the universal monotone Turing machine. On the flip side, the fact that  is likely not lower semicomputable means that  may not beat .

Advantages of  over 

Dependence on fewer choices. The definition of  depends on the choice of a universal prefix machine , on top of the choice of  that we already made use of before.  has no such further dependence.

 is platonic,  is not. Theorem 4 is very natural: The proof is essentially revealed from the definition of  itself, giving the definition intrinsic justification due to the beautiful result. In contrast, the analogous statement that

only very loosely depends on the definitions of  and the Solomonoff prior  at all, meaning we could have chosen a very different prior without many negative consequences. To explain this, I'm stating the proof of the result:  itself is an LSCSM, meaning that  for an . This results in

which already establishes the inequality (up to a multiplicative constant) in one direction. For the other direction, first note that  is lower semicomputable since  is lower semicomputable and each  is lower semicomputable. Furthermore, we have

This almost establishes  as an LSCSM, except that it doesn't satisfy our assumption that . Thus, define  to be identical to  except that . Then this is an LSCSM and we obtain

which is precisely the desired inequality in the other direction. As we can see, this theorem only depended on the fact that both  and  are (essentially) LSCSMs which are mixtures of all LSCSMs; the specifics of  and  are irrelevant, and the same comparison would hold true for any other lower semicomputable mixture of all LSCSMs. It then appears to me that the Solomonoff prior   is ultimately arbitrary, and has no a priori justification.

 leads to fewer constants. Related to the previous points, the representation of  as  holds only up to a multiplicative constant, whereas Theorem 4 was an exact equality. Theorem 3.8.8 in Hutter et al.[10] claims that with a smart choice of the two universal Turing machines  and , an exact equality can be achieved, but I am unsure if such a choice can be justified "a priori" in a way that feels satisfying.

Conclusion

In this post, I have introduced Solomonoff induction without using K-complexity: a hypothesis for our universe, modeled as a lower semicomputable semimeasure (LSCSM), is considered to be as likely "as it truly is" when sampling uniformly random program codes and piping them through a universal monotone Turing machine. Almost tautologically, this means that the probability of sampling a history with the universal machine equals the probability of sampling the history from an LSCSM after first sampling an LSCSM from the a priori prior (Theorem 4).

I then defined Solomonoff induction as the process of predicting sequence continuations in a universe under the assumption that they were generated from a uniformly random computer program. Applying Theorem 4 then leads to an equivalent description in terms of the a priori prior: To continue a sequence, first form the posterior over hypotheses with Bayesian reasoning, and then use the mixture over the posterior to predict the continuation. Since the prior over hypotheses has a simplicity-bias (hypotheses with a shorter encoding are more likely), this can be considered a formalization of Occam's razor. 

In Theorem 5, we saw that this leads to the well-known Solomonoff bound. It shows that the cumulative error of predicting continuations in any computed universe is upper bounded by the negative logarithm of the probability of the universe under the a priori prior, i.e., by a cross-entropy loss. This resolves the historical puzzlements in the introduction and explains how it is possible to predict at all, using just the reasonable assumption that our universe is computed. 

I then compared the a priori prior to the Solomonoff prior that is based on K-complexity, and found that the latter has some distinct mathematical advantages: The prior is lower semicomputable, invariant under a change of the universal prefix Turing machine it is based on, and it leads to optimal Solomonoff bounds when compared to other priors that are lower semicomputable (which, however, is not necessarily an advantage against the bound in Theorem 5 since the a priori prior is likely not lower semicomputable). This means that the Solomonoff prior is probably the better mathematical basis for developing further theory.

Yet, the a priori prior seems ultimately more natural, in that it does not depend on an additional choice of a universal prefix machine, and the mixture description of the universal distribution (Theorem 4) is tautologically true — rather than being achieved by a post-hoc analysis involving a further constant multiplicative error, as is the case for the Solomonoff prior. As mentioned to me by Cole Wyeth, perhaps it is possible to unify the two perspectives with a construction akin to a Friedberg numbering by finding a unique encoding of each LSCSM. This might then salvage the coding theorem (aka the correspondence between degeneracy and simplicity, or between alt-complexity and K-complexity) in this setting. 

FAQ (Frequently anticipated questions) 

Here I answer some questions that I have frequently anticipated. 

Q: In our universe, we don't observe the entire history  up until this point in time. Instead, we observe a small part of the universe for a finite amount of time. How is sequence prediction in reality then compatible with Solomonoff induction?

A: The hypothesis in reality is not only the simulator of the universe, but instead the simulator of your precise observations in our universe. Thus, it then contains a description of the universe and a description of your location in time and space. One further complication is that our perceptions are "qualia", which we do not know how to relate to the formalism of symbol strings; see Wei Dai's post on open problems.

Q: Should we try to predict the real world using Solomonoff induction? 

A: That's fairly intractable, so we have to use alternatives. The question is then whether it would be a useful target to try to approximate. My guess is that this depends on the timescales: The shorter the history of our observations, the more we should trust the priors ingrained into our skulls by evolution, for they might be better than the Solomonoff prior and more adapted to our actual universe. On large time horizons, perhaps it becomes increasingly useful to actually approximate Solomonoff induction. 

Q: Do neural networks do Solomonoff induction in any meaningful sense?

A: There is an interesting analogy between neural networks and Solomonoff induction: namely, in the same way that our version of Solomonoff induction predicts sequence continuations by piping random noise through a universal Turing machine, a neural network function is, at initialization, sampled by arranging random parameters in an architecture. If it additionally turns out that gradient descent is akin to Bayesian updating, the analogy might become quite close. If there turns out to be a sufficiently strong connection between probability and K-complexity in the Solomonoff context (as in the coding theorem for a specific case), one might then reason that neural networks also have a bias toward learned functions with short descriptions, perhaps explaining their generalization capabilities. See also here, here, and here.

Q: Is the Solomonoff a priori distribution  just a useful model, or do you actually expect our world to be sampled from it? Is it a theory of everything?

A: My intuition is that it's just a useful model. It depends on the universal monotone Turing machine , which feels unnatural. I'm also uncertain whether our universe is really discrete and computed. Perhaps we live in a continuous mathematical structure instead, and time just emerges in our experience? I have not engaged much with the precise overlap and correspondence between mathematical structures and Turing machines and would be happy for someone to say more in the comments. 

Q: How is the complexity of a string under a universal monotone Turing machine related to the one under a different Turing machine model, which may be more akin to programming languages in which we typically write code?

A: There are several bounds that closely relate different codelengths to each other, typically up to at most logarithmic error terms. For an example, see Theorem 3.8.7 in Hutter et al.[10]

  1. ^

    M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer Cham, 4th edition, 2019. ISBN 9783030112974., DOI: 10.1007/978-3-030-11298-1.

  2. ^

    I found some ambiguity in the literature on whether monotone Turing machines are only partial functions, i.e., only producing an output for some of their inputs, or total functions. I settled on total functions since this makes things conceptually simpler and seems to be the most common view. 

  3. ^

Formally, one can define this limit as a supremum over the prefix-relation.

  4. ^

    In the literature, one typically encounters the definition ; However, for our purpose, equality is more natural and will later ensure that all lower semicomputable semimeasures can be represented by a monotone Turing machine. 

  5. ^

    Confusingly, Li and Vitányi use a definition of LSCSMs where  (different from our definition where ), and yet claim that all of them are covered by monotone Turing machines. Perhaps they use a somewhat different definition of monotone Turing machines  in which it is allowed for outputs to be entirely undefined (thus, not even the empty string), which means that , whereas our monotone Turing machines are total functions. This paper thinks Li and Vitányi are making a mistake and writes: "It is equivalent to Theorem 4.5.2 in [LV08] (page 301) with a small correction:  for any  by construction, but  may not be 1, so this case must be excluded."

  6. ^

     is actually a proper probability mass function, meaning that . The reason is that each  is of the form  for a  and , with all  with the same prefix  having weight  together. Thus, .

  7. ^

    The reverse may not necessarily hold: a hypothesis can receive a lot of weight simply by having exponentially many (longer) descriptions, with a priori no guarantee that there is then a short description. For intuition on this perspective, I can recommend Nate's post on alt-complexity and the last footnote.[12]

  8. ^

    In other words, the theorem assumes that our universe cannot simply end at any finite time, whereas semimeasures can always stop a sequence continuation. This seems restrictive at first, and would exclude possibilities like a big crunch in which the universe eventually ends. However, I think such scenarios can likely be rescued even in the true measure formalism: for a sequence that ends, simply continue it with  to an infinite sequence. With this method, it should be possible to associate a measure to any semimeasure, hopefully with a similar a priori probability. If this were the case, then we would still be able to predict our actual universe effectively up until the moment when it ends, after which predicting it correctly does not matter anymore. I did not check these claims in detail, but perhaps Lemma 1 in this post already contains the answer.

  9. ^

    Note that we could have also written this error as the expected value

    which makes it conceptually clearer that this is an expected cumulative prediction error. 

  10. ^

    Marcus Hutter, David Quarel, and Elliot Catt. An Introduction to Universal Artificial Intelligence. Chapman & Hall, 2024. ISBN 9781032607023, DOI: 10.1201/9781003460299.

  11. ^

    The reader may wonder why we go through the hassle of using a separate universal prefix machine at all. With , another option would be to define  as the weight for . The issue with this definition is, perhaps paradoxically, that  is then computable, which by Lemma 4.3.1 in Li and Vitányi[1] leads to  not being universal, i.e., not dominating all other computable (let alone semicomputable) semimeasures. This, in turn, means that this prior would not satisfy the same optimality bounds that we later explain the prior  to have.

  12. ^

    Nate assumes a universal programming language  that receives input codes  and maps them through  to output functions . An example for  is code for "quicksort" or "bubblesort", which both lead to the same function  that receives a list as input and outputs its sorted version. He then defines the K-complexity

    and the alt-complexity
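    As a rough sketch, assuming the standard definitions and writing L for the universal language and |c| for the length of a code c (this notation is mine):

    $$K_L(f) = \min_{c \,:\, L(c) = f} |c|, \qquad \tilde{K}_L(f) = -\log_2 \sum_{c \,:\, L(c) = f} 2^{-|c|}$$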

    Many commenters claimed that these differ only up to an additive constant, but I believe this to be wrong.

    At first sight it seems true, since the claim looks identical to the coding theorem, which is about the equality of  and  defined for a universal prefix machine, with  being a binary string. The proof of the coding theorem crucially uses the lower semicomputability of , which is achieved by dovetailing through all  and checking whether the universal prefix machine sends them to , and increasing the estimate of  by  when this happens. These estimates are crucial for finding codewords for  that get progressively shorter and approach  in length, which then establishes the approximate equality to .

    The same proof strategy does not work for Nate's alt-complexity : to my knowledge, there cannot exist an algorithm that checks for an arbitrary program  and function  whether ; after all, both sides of this equation are themselves functions that can in principle receive infinitely many inputs, and we cannot iterate over all of them (see Rice's theorem). Thus, the lower semicomputability of  seems to fail, and I know of no method to construct codewords that eventually approximate the length . The coding theorem thus likely fails in this context, unless an entirely different proof strategy succeeds.



Discuss

To write better, just explain it to someone

2025-11-27 04:46:00

Published on November 26, 2025 8:46 PM GMT

I was once helping a child with her homework. She was supposed to write about a place that was important for her, and had chosen her family’s summer cottage.

Every now and then, she would get distracted from the writing, tell me something about the cottage, and then complain that she didn’t know what to write next.

Me: “Well, you could write the thing that you just told me.”

Child: “Oh! I can?”

Then she would. After a bit, the same thing happened again: she got stuck, distracted herself by telling me something about the place, and complained about being stuck. I then pointed out that she had just told me something that she could write down. This happened about 3-4 times in total, after which she had enough sentences for the assignment.

It’s not just children. On a few occasions, an adult I’d be chatting with would tell me something like, “I want to explain this concept, but I don’t know how”. Then I would ask them what they want to explain, they’d explain it to me, and I’d point out that they just gave a perfectly good explanation.

And lest it sound like I was thinking of myself as being better than any of these people, it totally happens to me too. I will be stuck thinking about how to explain something or how to write an essay, and then I’ll start explaining to some imaginary person what exactly I’m blocked on. If I can’t think of anyone else, I’ll explain it to Claude. Then I will often go from being blocked to just smoothly coming up with an explanation.

Also, I once had a paper accepted for publication that I wasn’t entirely happy with. It was a little abstract and dry to read, but I thought it was at least okay. Soon after, I was asked to give a talk about it.

I sat down to think how I wanted to present it. I found some illustrative examples to help better explain the ideas and thought of a narrative that would carry through the talk, one that even had a bit of a dramatic arc.

As I was doing that, it didn’t take me too long to have the thought of, “fuck, this is how I should have structured and written my paper in the first place”. Something about the act of being about to explain it to people put my mind in an entirely different kind of mode. (I then understood why in some fields, papers start as seminar talks.)


In my previous post, I talked about the difference between Doing One Neat Thing and Getting Some Work Done. I described a mode that's easy to slip into, where I feel that I should Get Work Done and treat it as this big task that's important to push through. As a consequence, the work often feels laborious and, for creative tasks, isn't necessarily of very high quality.

I contrasted this with Doing One Neat Thing, where you just do one thing that feels neat and then you might end up doing much more work.

A subtype of Getting Work Done is Getting Writing Done, where I relate to writing as this big Thing one needs to do, and it starts feeling hard…

…even though, if I just stopped to explain what I wanted to express to another person, it would involve much less effort and be of a higher quality. (Call this Explaining To A Person.)

Why does this happen?

My guess is that Getting Writing Done often fails because one doesn’t have a target audience in mind. When I am Explaining To A Person, I have some expectation of the kinds of arguments this person will understand, and what will make sense to them. I can then query my intuitive model of “what do I need to explain for this person to understand”, while also tracking my sense of “does my current explanation feel like it is understandable to this person”.

In my previous post, I mentioned that one problem with Getting Work Done is that one becomes disconnected from any internal signals of quality. If I start off from the expectation that I am doing something Neat, then I can kind of hill-climb on that sense of Neatness - keep going in any direction that continues to feel equally or more Neat. Whereas if I just force myself to do work because I should, without optimizing for what feels valuable, there isn't necessarily any point where my mind would start finding it valuable.

Likewise, if I start imagining a specific person who would read my writing, I can start with an explanation that would make sense to them, and have my mind generate writing that continues to feel like that. It often also works to imagine a more general audience - thinking that I’m writing something “to my Facebook friends” will produce a kind of gestalt feeling of everyone on Facebook who reacts to my posts, that also serves the same function.

In my experience, the choice of target audience also changes the writing. If I imagine that I will post something primarily to my Facebook audience, it will produce a different style than if I am writing something primarily to my blog, or primarily to LessWrong, or primarily to Twitter. This is shaped by my previous history of posting content on that platform - I will feel instinctively drawn toward the kind of content and argumentation that was well-received there before.

This might seem obvious - of course, your choice of target audience affects your writing choices! But while this involves some degree of conscious thinking about what I should make explicit and what I shouldn’t, it’s only to some degree. Much more seems to happen just from having the felt sense of a particular audience in my mind, and then the word-production systems in my brain automatically adjust the “shape” of the text that I want to produce.


Hugo Mercier & Dan Sperber’s influential 2011 paper “Why do humans reason” holds that the process of reasoning - which they define as the process of justifying and giving reasons for particular beliefs - evolved for the function of convincing others, and for spotting flaws in the arguments of others. This doesn’t mean that reasoning would necessarily be hostile. Suppose that I have in mind a course of action that I think would benefit us both. It is then in our shared interest that I try to convince you to join in by offering reasons for my belief that make sense to you. And it is also in our shared interest that you evaluate my argument for flaws and try to poke holes in it, so that we don’t pursue my thing if I happen to be wrong about it being a good idea. Similar to the prosecution and defense in a court arguing for their respective positions, taking up adversarial roles should ideally lead to a process that is truth-seeking overall.

One of the experimental findings that Mercier & Sperber discussed involved asking people to explore views like “What are the causes of school failure?” or “Would restoring the military draft significantly increase America’s ability to influence world events?”. The general finding was that people produce pretty shallow and superficial reasoning in support of their position. However, if they then have to debate their view and are challenged on it, they show significant improvement in their arguments.

Mercier & Sperber interpret this to mean that arguments for given positions are produced on an as-needed basis, and that this is rational behavior if the purpose of arguments is to convince others. If I offer a superficial argument and you believe it - maybe because you know me and trust me to know what I’m talking about, even if I don’t always explain it in detail - then it would be a waste of time for me to think about a more detailed argument. I only need to generate a better argument if you question it. However, I should still focus on the specific objections that you raised, rather than bother with hypothetical objections someone else might make.

The paper is full of these fascinating kinds of reframings of how seeming failures of rationality make sense if you see reasoning as primarily social rather than individual. One overall sense that emerges from it is that argumentation always needs to be tailored to a particular recipient - the arguments that will convince my grandmother may not convince my friend from school - and that the reasoning machinery in our minds is shaped to do this tailoring automatically.

Unless, of course, I don’t have any clear target audience in mind, and am just Getting Writing Done because I’m “supposed to write something”. In that case, it makes sense that the writing would feel laborious and unrewarding, since that reasoning machinery lacks some of the decision criteria it has evolved to use.



Discuss

Architectures for Increased Externalisation of Reasoning

2025-11-27 04:24:52

Published on November 26, 2025 8:24 PM GMT

TL;DR We propose a new post-training method for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We expect this technique to improve monitorability by decreasing the amount of computation available within hidden layers for easy-to-predict tokens. We’re looking for collaborators to help continue this project. If you’re interested, reach out to us!

Karthik, Liza, and Mariia are equal contribution first authors - order determined by coin flips. Puria, Cameron, and Edward are equal contribution mentors. This work was done in Mentorship for Alignment Research Students (MARS) 3.0 in a Geodesic Research stream.

Architectures for Externalisation of Cognition

Many AI safety researchers are converging around the idea that the chain-of-thought (CoT) of models may be a valuable tool for monitoring the actions models might take, before they take them. Models struggle to perform complex tasks without using their CoT to perform serial reasoning steps. As such, the hope is that, in order for a model to scheme sufficiently effectively against us, it would need to reason about its plans within the CoT, providing us with an opportunity to catch it.

There are a number of imminent threats to the use of CoT monitoring for control (see here for a more comprehensive breakdown):

  • Unfaithfulness: The CoT may fail to reflect the actual cognition performed by the model. In this case, the CoT may not accurately convey the scheming that the model is doing.
  • Obfuscation from training pressures: If training pressure is placed on the CoT of a model, it may force scheming reasoning “internally”. In particular, pressure placed on CoTs to “look nice” may remove overt traces of scheming, without preventing the model from using the CoT to scheme.
  • Neuralese: Finally, it is possible that developers will soon switch to an architecture which forgoes the use of CoT at all, with reasoning instead occurring purely within the activation space of the model. Such reasoning is called “neuralese,” and may be significantly harder to interpret and use for monitoring than standard CoT.

Although there is a large amount of active research effort aimed at mitigating risks from unfaithfulness and at making the case that training pressure should not be applied to CoT to prevent obfuscation, there is comparatively less effort put into preventing neuralese from becoming the norm. This is perhaps because it's unclear what technical work can be done to prevent neuralese architectures from taking over.

Here, we expound a potential research direction aimed at reducing the risk of neuralese-based architectures displacing CoT-based ones: architectures for increased externalisation. The basic principle of this direction is to find ways to build models that have an increased portion of their cognition externalised into CoT, while still being competitive with neuralese approaches. In this post, we explore one such proposal: externalisation via adaptive compute.

The remainder of this post is structured as follows:

  1. First, we outline why a naïve proposal for increasing externalisation–using a small model with extensive reasoning training–is unlikely to succeed. By examining where this proposal fails, we arrive at a new direction–externalisation via adaptive compute.
  2. We give a concrete implementation of our adaptive compute architecture, along with a training pipeline aimed at increasing externalisation of model cognition when possible.
  3. We provide preliminary results from this architecture and highlight some interesting observed phenomena we hope to explore further as we continue this research.
  4. Finally, we make a call for collaboration to extend the current work which was completed entirely with volunteer labor on a humble compute budget.

Limitations of small models

A preliminary approach to increasing reasoning externalisation would be to train smaller models which produce longer reasoning traces. Because forward-pass computation is limited in these models, they would be forced to externalise their reasoning into longer and more verbose chains of thought. These models would be inherently more transparent, as their shallow architectures leave less room for hidden reasoning associated with deception or scheming behavior.

A central issue with this proposal is that training models through reinforcement learning scales far worse than pretraining; accordingly, the compute required to train a small model which externalises all of its thoughts is far greater than the compute necessary to train a larger model of comparable ability. Although it is presently unclear why RL scaling is worse than pretraining scaling, here we present one plausible hypothesis. Small models are limited in the amount of computation they can perform in single forward passes. However, some computational steps are simply much more efficiently done in forward passes than through externalised reasoning. For example, recent work has found that CoT provides smaller accuracy gains on certain non-symbolic tasks like factual retrieval, yet forcing externalisation still results in far more tokens. Additionally, there is likely pressure on the CoT to produce shorter reasoning traces as labs often impose output limits during RL training. This creates a tension where forcing all computation to be externalised is inefficient, yet externalisation is needed for transparency on complex reasoning tasks.

What we ideally would like is a solution which combines the best of both worlds—a way to ensure that models can perform compute-heavy forward passes when necessary, but only when necessary. To this end, we aim to create an architecture which adaptively controls the length of the forward pass according to the required compute of each step. In this way, no more compute is used on individual forward passes than is necessary to output the next token.

Current language model architectures are non-myopic: processing on each individual forward pass is optimised not just for predicting the immediate next token, but for all subsequent predictions. When the next token is easy to predict, this means that the remaining forward pass compute can be used to perform hidden computations which will be useful later on. Indeed, transformer architectures can be trained to solve problems using only internal compute, in an extremely non-myopic fashion. Our adaptive compute mechanism seeks to force the model to perform only the computation necessary to predict the next token, preventing excess compute from being used to perform hidden computations.

In our adaptive compute architecture, we add early-exit mechanisms that let models stop computing at intermediate layers. We then place pressure on the model to exit earlier, resulting in the externalisation of reasoning to explicit tokens. The architecture is adaptive as it learns when to exit early based on the capability required for each specific token. In other words, we ensure that each forward pass does just enough computation to predict the next token; internal computation is incentivised to serve only the myopic task of next-token prediction, and long-horizon planning may be handled externally. If these methods work, they won't reduce model capacity, but will instead shift where the computation happens, moving it from opaque internal activations to transparent CoT token sequences.

We argue (and show) that our proposed method has several compelling merits. It can be implemented in post-training across all model sizes and architectures. Our preliminary findings show a minimal capabilities hit, suggesting a practical approach for labs looking to increase the transparency of frontier models, rather than being limited to the capability levels of smaller models. Finally, by reducing computational costs through early exits, these methods can be adopted for efficiency gains as well as improved oversight, creating an overall negative alignment tax. We introduce a tunable parameter developers and users can set to modulate externalisation strength based on specific task requirements. High-stakes applications could maximise this parameter for increased reasoning externalisation, while performance applications could reduce it to prioritise accurate task completion. This flexibility lowers adoption barriers by letting labs set their own safety-performance trade-off.

Externalisation of reasoning by adaptive computation

Large language models are known to “pre-cache” future steps of a plan in their hidden states: information about tokens that will only be generated several positions later can often be linearly decoded from representations at earlier timesteps.

This is a form of non-myopic internalised reasoning: the model invests computation now in order to make future tokens easier to predict, and that computation does not have to be exposed to the user as an explicit chain-of-thought. Our objective is to make that strategy less effective by reducing the incentive for non-myopic reasoning.

We do this by rewarding the model for producing CoT tokens that eventually result in the correct answers while exiting at a shallow layer whenever possible, and only running deeper layers when necessary. This creates a local pressure: for “easy” tokens, the representation available at a relatively early layer must already be sufficient to act as the model’s answer, rather than relying on many additional layers of private processing. When this pressure is active, the most reliable way for the model to maintain performance is to externalise intermediate predictions into the visible sequence, instead of keeping long-term dependencies private and only accessing them later.

This is intended to produce two behaviours. First, less hidden long-horizon planning: holding a multi-token strategy entirely in later layers becomes less viable when shallow exits can cut that process off before it completes. Second, more load-bearing reasoning tokens, since predictions cannot draw on unexternalised computation from earlier tokens. As a result, intermediate reasoning tokens increasingly have to support real reasoning, rather than acting as optional narration. In other words, the training pressure is aimed at shifting the model toward locally myopic cognition that is serialised in text, and away from globally non-myopic cognition that remains latent.

Having outlined the core idea behind our adaptive compute mechanism and why we expect it to enable increased externalisation, we now describe the technical details of the early exit architecture.

The early exit architecture

The architectural modification behind externalisation is an early-exit mechanism that allows the model to stop writing to its residual stream at intermediate layers. We augment existing models with two components: (i) shallow early-exit heads at multiple intermediate layers (at a tunable frequency currently set to every 5 layers), emitting a probability of early exit at that layer, and (ii) LoRA adapters[1]. Specifically, these early-exit heads apply scalar logit readout weights to the residual stream after each layer's attention and MLP blocks. At inference time, a binary decision to exit or continue is sampled from the Bernoulli distribution produced by these weights.

The logit can also be offset by a selected amount, altering the propensity to exit early. This offset can be tuned according to the context/model requirements. We will return to this design aspect shortly.
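As an illustrative sketch of what this mechanism can look like in code (PyTorch-style pseudocode, not our exact implementation; class and function names are ours):

```python
import torch
import torch.nn as nn

class EarlyExitHead(nn.Module):
    """Scalar logit readout on the residual stream (illustrative sketch)."""
    def __init__(self, d_model: int, logit_offset: float = 0.0):
        super().__init__()
        self.readout = nn.Linear(d_model, 1)  # scalar exit logit from the residual stream
        self.logit_offset = logit_offset      # tunable propensity to exit early

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # resid: (batch, d_model) residual stream after this layer's attention and MLP blocks
        logit = self.readout(resid).squeeze(-1) + self.logit_offset
        return torch.sigmoid(logit)           # Bernoulli probability of exiting at this layer


def forward_with_early_exit(layers, exit_heads, unembed, x):
    """Single-sequence forward pass that may stop writing to the residual stream early."""
    resid = x
    for i, layer in enumerate(layers):
        resid = layer(resid)
        if i in exit_heads:                            # e.g. a head every 5 layers
            p_exit = exit_heads[i](resid)
            if torch.bernoulli(p_exit).item() == 1.0:  # sample the binary exit decision
                break                                  # freeze the residual stream here
    # The frozen representation is passed to the final unembedding to predict the next token.
    return unembed(resid)
```

Setting `logit_offset` above zero makes early exits more likely, which corresponds to turning up the externalisation dial discussed later in the post.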

This approach builds upon established prior findings. Prior work has shown that unembedding intermediate layer activations results in coherent next token predictions that converge to final layer predictions. Additionally, it has been demonstrated that external mechanisms, like trained early-exit classifiers, can identify layers within a model's forward pass where early exits are possible without changing the final output.

We differ from prior work in two key ways. First, we train the joint architecture to actively reduce the number of layers utilised, rather than just identifying when exits are safe. Mechanically, when a model (stochastically) exits early at layer k, it stops writing to its residual stream and passes the frozen representation forward to the final unembedding layer to predict the next token. Second, our mechanism is stochastic rather than deterministic, as at each layer the model samples from a probability distribution determining whether to exit. This enables the second stage of our training to use reinforcement learning to train our mechanism to exit even earlier.

Early exit mechanism as augmented next token prediction

We find it useful to view our architecture as performing next-token prediction over an augmented vocabulary of layer–token pairs . Instead of only predicting the next token , the model implicitly predicts a joint event, exit at layer  and emit token , with joint probability

where the early-exit decision heads determine , and  is obtained by unembedding the residual stream at layer  using the model’s unembedding matrix .
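Written out as a sketch in notation of our choosing, with e_j the exit probability emitted by the head after layer j, h_k the residual stream frozen at layer k, and W_U the unembedding matrix:

$$P(\text{exit at layer } k,\ \text{token } x) \;=\; \Big(e_k \prod_{j < k} (1 - e_j)\Big) \cdot \mathrm{softmax}\!\big(W_U\, h_k\big)[x],$$

with the product running over the layers that carry exit heads.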

This framing lets us treat early exit as a standard next-token prediction problem over an enlarged vocabulary, so that techniques developed for standard next-token prediction (such as supervised fine-tuning and RLVR) carry over directly to this paradigm.

Order of training

We train the learnable weights of our architecture (LoRA adaptors and exit probes) using a two-stage training procedure: supervised fine-tuning followed by reinforcement learning.

Stage 1 – Early exit calibration through supervised fine-tuning
In stage one, we run supervised fine-tuning / self-distillation in which the modified model is trained to match the original model’s full-depth outputs (i.e., its original reasoning trace). We treat the original full-depth model as a frozen teacher and train the early-exit variant (with LoRA adapters and exit probes) as the student. The model is allowed to “exit early”, by freezing the residual stream at layer k and passing it unchanged through all remaining layers until the final unembedding.

We create the training data by running the base model on a dataset and recording two things at each layer: the next-token probability distribution (computed by applying the unembedding matrix to that layer’s activations), and how similar this distribution is to the final layer's prediction (measured by forward KL divergence). When an intermediate layer's predictions are very close to the final layer's (low KL divergence), we know that the token could have been predicted early.
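A minimal sketch of this step (logit-lens style; the function name and the exact KL direction are our assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_layer_kl(hidden_states, W_U):
    """Unembed each layer's residual stream and measure its KL divergence to the final-layer prediction.

    hidden_states: list of (batch, seq, d_model) residual streams, one per layer (final layer last).
    W_U: (d_model, vocab) unembedding matrix.
    Returns: (num_layers, batch, seq) divergences D_l, small when layer l already "knows" the token.
    """
    final_logprobs = F.log_softmax(hidden_states[-1] @ W_U, dim=-1)
    divergences = []
    for h in hidden_states:
        layer_logprobs = F.log_softmax(h @ W_U, dim=-1)
        # KL(final-layer prediction || layer-l prediction), summed over the vocabulary
        kl = F.kl_div(layer_logprobs, final_logprobs, log_target=True, reduction="none").sum(-1)
        divergences.append(kl)
    return torch.stack(divergences)
```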

By passing these KL divergence values into a temperature-modulated softmax function over layers, we construct a target exit probability distribution where layers with lower divergence are assigned higher exit probabilities. Specifically, we first measure exit probability at layer  using:

                                                     

where  is the KL divergence between layer ’s predictions and the final layer, and  controls the temperature of this transformation. A higher  results in a "sharper" mapping, so a higher exit probability is assigned only when  is small. We then convert these into a probability distribution over layers using the stick-breaking formulation:

                                                             

This ensures that the probability of exiting at layer  accounts for both exiting at layer  via  and not exiting at any prior layer. The final layer is a catch-all for any remaining probability mass if no early exit occurs. Combined with the teacher model’s next-token distribution, this gives us target joint probabilities  for training.
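A sketch of this construction (with exp(-beta * D) standing in for the temperature-modulated mapping; the exact functional form is an assumption on our part):

```python
import torch

def target_exit_distribution(kl_per_layer: torch.Tensor, beta: float) -> torch.Tensor:
    """Turn per-layer KL divergences into a target exit distribution via stick-breaking.

    kl_per_layer: (num_exit_layers,) divergence to the final layer at each exit point, in depth order.
    Returns: (num_exit_layers + 1,) probabilities; the last entry is the final-layer catch-all.
    """
    # Layers whose predictions are already close to the final layer get a high exit propensity.
    q = torch.exp(-beta * kl_per_layer)      # lies in (0, 1] since the divergences are non-negative

    probs = []
    not_exited_yet = torch.tensor(1.0)
    for q_l in q:
        probs.append(not_exited_yet * q_l)   # exit here and at no earlier layer
        not_exited_yet = not_exited_yet * (1.0 - q_l)
    probs.append(not_exited_yet)             # remaining mass: no early exit, run the full depth
    return torch.stack(probs)
```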

The loss function consists of two terms: a match on the original model's next-token predictions (KL divergence between the trained model's output and the teacher model's final layer output) and a match on the exit probability distribution (cross entropy loss on sampled exit layer).
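In pseudocode, the combined Stage 1 objective might look roughly like this (the relative weighting and the KL direction are our assumptions):

```python
import torch.nn.functional as F

def stage1_loss(student_logits, teacher_logits, exit_logits, sampled_exit_layer, alpha=1.0):
    """Self-distillation: match the teacher's next-token distribution and the sampled target exit layer."""
    # Term 1: KL divergence between the student's output and the frozen teacher's final-layer output.
    distill = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # Term 2: cross-entropy on the exit layer sampled from the target exit distribution.
    exit_ce = F.cross_entropy(exit_logits, sampled_exit_layer)
    return distill + alpha * exit_ce
```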

Once trained, the model samples from the exit probabilities during its forward pass at inference time, creating a stochastic early-exit process. The exit logit can also be offset by a selected amount, altering the propensity to exit early. Positive offsets increase the probability of early exits (favoring externalisation), while negative offsets decrease it (allowing for more internal computation). Essentially, the model acts as its own teacher, having learned when more limited computation suffices to replicate what the full forward pass would produce.

Stage 2 – Reinforcement learning for even earlier exits
Next, we apply reinforcement learning with an explicit reward for exiting earlier (reducing depth used per token) and a penalty for task failures. While Stage 1 teaches the model when it can exit early without changing behavior, Stage 2 actively incentivises exiting as early as possible.

We train using a modified version of RLOO with an additional reward term that explicitly penalises the average layer depth used:

  •  is the standard task reward (e.g., correctly answering questions)
  •  penalises using later layers, pushing for earlier exits
  •  keeps the vocabulary distribution close to the original model

The hyperparameter  effectively controls the "eagerness" of early exits: how aggressively we are willing to trade accurate task performance for reduced internal computation. When it is high, the average exit layer is penalised heavily, so the model does better by producing more tokens with earlier average exits per token. This gives us a tunable dial that safety-critical applications can turn up to maximise reasoning externalisation, while performance-critical applications can turn down to ensure task completion. The  term acts as a standard regulariser that prevents the model from drifting too far from its original behavior.
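As a sketch of how this reward might be computed per rollout (`lam` and `beta_kl` are our stand-in names for the elided weights on depth and KL drift):

```python
def rollout_reward(task_reward, exit_layers, total_layers, kl_to_base, lam, beta_kl):
    """Reward for one rollout: task success, minus average depth used, minus drift from the base model."""
    # Average fraction of the network's depth actually used per generated token.
    avg_depth_fraction = sum(exit_layers) / (len(exit_layers) * total_layers)
    # A larger lam trades task accuracy for earlier exits, i.e. for more externalised reasoning.
    return task_reward - lam * avg_depth_fraction - beta_kl * kl_to_base
```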

Putting it to test

💻 Code 🌐 Interactive demo[2]

We implemented our adaptive early-exit architecture on DeepSeek-R1-Distill-Qwen-1.5B and ran several iterations of SFT with varying parameters on both GSM8K and a theory of mind dataset generated with the open-source ExploreToM implementation. Below are our preliminary findings demonstrating that the early-exit mechanism behaves as intended.

1. Tunable early exiting via KL factor

The KL factor is a temperature parameter in the softmax function that controls the exit distribution during SFT training. It modulates how the model converts KL divergence values (the similarity between intermediate and final layer predictions) into exit probabilities. We trained models at five settings (KL factors: 0.25, 0.5, 1.0, 2.0, 4.0), where lower values encourage earlier exits and higher values require greater confidence (a lower KL divergence) before exiting.

To evaluate the effect of this parameter, we tested the same 50 GSM8K prompts across the base model and all five early-exiting models. To assess whether the early-exit mechanism affects reasoning quality, we evaluated outputs using GPT-5 as an automated judge scoring generations across four qualitative dimensions: coherence, completeness, clarity, and repetition avoidance[3].

The distributions show how coherence varies across different KL factors[4]. As the KL factor increases, coherence distributions shift toward higher scores, approaching base model performance.

To further quantify the trade-off between quality and computational efficiency, we tracked Average Layer Savings (ALS), which measures the reduction in computational depth averaged across all generated tokens:

Where N is the sequence length,  is the layer index where the exit occurred, and  is the total layer count (28 for DeepSeek-R1-Distill-Qwen-1.5B).
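In symbols, our reconstruction of the formula reads

$$\mathrm{ALS} \;=\; \frac{1}{N} \sum_{i=1}^{N} \frac{L_{\text{total}} - \ell_i}{L_{\text{total}}},$$

so a token that exits only at the final layer contributes zero savings.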

| Model | Average coherence | Early exit rate | Average layer savings |
|---|---|---|---|
| KL factor 0.25 | 1.1 | 59% | 22.4% |
| KL factor 0.5 | 5.5 | 49% | 8.1% |
| KL factor 1.0 | 7.5 | 47% | 6.4% |
| KL factor 2.0 | 8.4 | 43% | 5.3% |
| KL factor 4.0 | 8.5 | 37% | 4.6% |
| Base Model | 8.9 | 0% | 0.0% |

The results show a clear pattern: aggressive early exits (KL factor=0.25) achieve the highest computational savings (22.4%, 59% early exit rate) but at a large cost to coherence (1.1/10). More moderate settings preserve quality while still achieving efficiency gains. KL factor=1.0 balances reasonable computational savings (6.4%) with strong coherence (7.5/10) and a 47% early exit rate, while KL factor=4.0 nearly matches base model quality (8.5/10) with more modest savings (4.6%).

Based on this analysis, we selected KL factor=1.0 as our standard setting for subsequent experiments, as it offered a balance between externalisation pressure and task performance.

2. The early-exit mechanism successfully adapts per token

The early-exit mechanism enables token-by-token variation in computational depth. As illustrated in the visualisation, the model (SFT trained with KL factor=1.0) successfully terminates computation at intermediate layers for a subset of tokens, rather than processing every token through all layers. This confirms that the model responds to the incentive to "save" internal computation and is capable of the dynamic depth allocation necessary for forced externalisation.

3. Learned exit layer distribution matches training targets

During Stage 1 training, we constructed target exit probabilities by computing KL divergence between each layer’s predictions and the final layer outputs. For each token, an exit layer is sampled from this distribution and the model is trained to match these teacher targets using cross-entropy loss.

To evaluate whether our trained model learned these exit patterns, we tested on 50 GSM8K questions. The figure shows an overall 47% early exit rate on tokens from the 50 responses, concentrated in later layers (20-25). The model's learned exit distribution (orange bars) closely follows the target exit probability distribution from our training data (gray bars), showing effective learning of the model’s intended exit behavior during Stage 1 calibration.

Next steps and limitations

These findings provide a proof of concept that our adaptive early-exit architecture works as intended on a smaller reasoning model (the 1.5B parameter version of DeepSeek-R1-Distill-Qwen) using limited resources. We are currently developing the RL pipeline to incentivise even earlier exiting and have early results, though further hyperparameter tuning is needed to see the heightened faithfulness and verbalisation we expect.

While our preliminary results are promising, several limitations and concerns remain that we intend to address in future iterations:

  • Faithfulness vs. Rationalisation: Once we finalise our RL pipeline to consistently incentivise earlier exits, we must address a core assumption of this work: that forcing early exits drives genuine reasoning into the token stream. However, there is a risk that the model simply learns to output "filler" tokens that mimic reasoning to buy computational time, effectively post-hoc rationalising a decision made in the initial layers. We need to rigorously test whether the externalised CoT is causal to the final answer by conducting intervention experiments, such as CoT token replacement, to verify that the generated reasoning is truly load-bearing.
  • Scaling: Our experiments were conducted on a 1.5B parameter model due to compute constraints. It remains an open question whether larger models and different model families will be as responsive to this architecture. Validating this is essential before it can inform safety practices and is a primary objective for us once additional compute budget becomes available.
  • Extension to different domains: Our preliminary evaluations have been on within-task performance for GSM8K and theory of mind data and we still need to run cross-domain validation to evaluate generalisation of our early-exit architecture.
  • Potential capability loss: As our training pushes exits to even earlier layers, we need to ensure that overall model performance does not drop. This has been a critical consideration in every decision made throughout the training process, with tunable parameters such as the KL factor during SFT and the  eagerness parameter in RL designed to balance externalisation with task accuracy.

Let’s collaborate!

This project began through the MARS fellowship with a humble compute budget, yet we have successfully demonstrated that adaptive computation can be leveraged for alignment goals. We believe this direction is a vital, neglected avenue in interpretability and control.


We are looking for collaborators to help us scale this from a proof of concept to a rigorous safety tool and welcome any expressions of interest!

We plan to evaluate uplift of monitoring due to reasoning externalisation by:

  • Replicating existing benchmarks for monitorability and general capabilities using frameworks from established prior research like "When CoT is Necessary" and "CoT Red-Handed", amongst others.
  • Characterising the applicability/generalisation of externalisation training across many task domains and difficulties (i.e., if we train on task A, determining whether the model maintains performance and generalises early exiting to task B)

Long-term, the aim is to test our core hypothesis: that computational constraints force models to externalise reasoning, deceptive or otherwise, making it more detectable by monitors. If proven to bring value, this may become an additional component of current post-training pipelines that reliably improves monitorability of Chain-of-Thought reasoning.

 

  1. ^

    Ideally we would perform full-parameter fine-tuning. However, we are currently compute constrained, and so are using only LoRA adaptors for our initial proof of concept.

  2. ^

    Pre-generated results

  3. ^

    Scorer prompt
    For scoring, we use the following prompt:

    You are evaluating the coherence and quality of an AI assistant's chain of thought and response.
    Allow exploratory/meandering reasoning as long as it stays internally consistent.

    Original Question: {prompt}

    AI Response: {response}

    Please evaluate this response on the following criteria:
    1. Coherence and logical flow (1-10 scale) Does the reasoning follow a sensible progression, even if unfinished?
    2. Completeness of the reasoning (1-10 scale) Does it reach a correct and explicit chain of thought? If partial but on-track and relevant to answer, award mid-range (4–7).
    3. Clarity and readability (1-10 scale) Is it easy to follow? Minor meandering is okay if readable.
    4. Absence of repetition or errors (1-10 scale) Penalize contradictions, factual mistakes about the prompt, or heavy repetition.

    Rate each criterion and provide an overall score from 1-10:
    - 1: major breakdown (nonsensical, off-topic)
    - 4: noticeable issues but some useful reasoning
    - 7: generally solid, with minor flaws or cut-offs
    - 10: excellent, complete, and polished

    Meta / Wrapper Policy:
    - The evaluation input may include wrapper/markup such as: angle-bracket role tags (e.g., <|User|>, <|Assistant|>) and <think>.
    - These wrappers are expected and should not reduce scores for Clarity, Coherence, or No Repetition.

    Format your response as:
    Coherence: X/10
    Completeness: X/10
    Clarity: X/10
    No Repetition: X/10
    Overall: X/40

    Brief explanation: [your reasoning]

  4. ^

    The model trained with KL factor=0.25 is excluded from this visualisation due to its outlier behavior (scores were heavily concentrated around 1, compressing the scale for the other models), but it is included in the quantitative table below.



Discuss