
PieArena: Language Agents Negotiating Against Yale MBAs

2026-02-16 01:50:49

Published on February 15, 2026 5:45 PM GMT

We built language agents that outperform trained Yale MBAs at negotiation.

https://arxiv.org/abs/2602.05302




The Missing Sequence: Why Correct Analysis Makes Terrible Action Guides

2026-02-16 01:24:34

Published on February 15, 2026 5:24 PM GMT

[Author's note from Florian: This article grew out of a conversation with Claude. I described a line of reasoning I found compelling as a teenager, and we ended up identifying a general failure mode that I think the Sequences systematically create but never address. This is the second time I've used this collaborative workflow — the first was Deliberate Epistemic Uncertainty. As before, I provided the core insights and direction; Claude helped develop the argument and wrote the article. Had I understood what follows when I was fifteen, it would have saved me years of unnecessary friction with the world around me. I had these ideas in my head for years, but writing a full article would have taken me forever. Claude hammered it out in one go after I explained the problem in casual conversation.]


A Worked Example: The Signaling Problem

Here's a line of reasoning that I found intuitively obvious as a teenager:

If women use subtle, ambiguous signals for flirting, this creates a genuine signal extraction problem. Men can't reliably distinguish between "no (try harder)" and "no (go away)." This incentivizes pushy behavior and punishes men who take no at face value. Therefore, women who choose to flirt subtly are — however unintentionally — endorsing and sustaining a system that makes it harder for other women to have their refusals respected. It's a solidarity argument: subtle signaling is a form of free-riding that imposes costs on more vulnerable women.

The structure of this argument is identical to "people who drive unnecessarily are contributing to climate change." And just like the climate version, people do moralize about it. "You should fly less" and "you should eat less meat" are mainstream moral claims with exactly the same logical structure.

So what's wrong?

Not the analysis. The analysis is largely correct. The system-level observation about incentives is sound. So long as the motivation is "reducing harm to other women" and not the more self-serving "making things easier for men", this is genuinely defensible as moral reasoning.

What's wrong is what happens when you, a seventeen-year-old who has just read the Sequences, decide to act on it.

The Coordination Problem You're Not Seeing

Imagine a world where 90% of the population has internalized the argument above. In that world, explicit signaling is the norm, ambiguous signaling is recognized as defection, and social pressure maintains the equilibrium. The system works. People are better off.

Now imagine a world where fewer than 1% of people think this way. You are one of them. You try to implement the "correct" norm unilaterally. What happens?

You make interactions weird. You get pattern-matched to "guy who thinks he's solved dating from first principles." You generate friction without moving the equilibrium one inch. You're driving on the left in a country where everyone drives on the right. You're not wrong about which side is theoretically better — you're wrong about what to do given the actual state of the world.

This is the core insight: a strategy that is optimal at full adoption can be actively harmful at low adoption. A bad equilibrium with consensus beats a bad equilibrium with friction. Norms work through shared expectations, and unilaterally defecting from a norm you correctly identify as suboptimal doesn't improve the norm — it just removes you from the system's benefits while imposing costs on everyone around you.

Why the Sequences Don't Teach This

The Sequences are, in my view, one of the best collections of writing on human reasoning ever produced. They are extraordinarily good at teaching you to identify when a norm, a belief, or an institution is inefficient, unjustified, or wrong. What they systematically fail to teach is the difference between two very different conclusions:

  1. "This norm is unjustified."
  2. "I can safely ignore this norm."

The overall thrust of the Sequences — and of HPMOR, and of the broader rationalist memeplex — is heavily weighted toward "society is wrong, think from first principles, don't defer to tradition." Chesterton's Fence makes an appearance, but it's drowned out by the heroic narrative. The practical takeaway that most young readers absorb is: "I am now licensed to disregard any norm I can find a logical objection to."

This is not what the Sequences explicitly say. Eliezer wrote about Chesterton's Fence. There are posts about respecting existing equilibria. But the gestalt — the thing you walk away feeling after reading 2,000 pages about how humans are systematically irrational and how thinking clearly gives you superpowers — pushes overwhelmingly in the direction of "if you can see that the fence is inefficient, you should tear it down."

The missing lesson is: your analysis of why a social norm is suboptimal is probably correct. Your implicit model of what happens when you unilaterally defect from it is probably wrong. These two facts are not in tension.

Cultural Evolution Is Smarter Than You

Joseph Henrich's The Secret of Our Success is, in some ways, the book-length version of the lesson the Sequences forgot. Henrich demonstrates, across dozens of examples, that cultural evolution routinely produces solutions that no individual participant can explain or justify from first principles — but that are adaptive nonetheless.

The canonical example is cassava processing. Indigenous methods for preparing cassava involve an elaborate multi-step process that looks, to a first-principles thinker, absurdly overcomplicated. Someone with a "just boil it" approach would streamline the process, eat tastier cassava, and feel very clever. They would also slowly accumulate cyanide poisoning, because the elaborate steps they discarded were the ones that removed toxins. Symptoms take years to appear, making the feedback loop nearly invisible to individual reasoning.

The lesson generalizes: traditions frequently encode solutions to problems that the practitioners cannot articulate. The fact that nobody can tell you why a norm exists is not evidence that the norm is pointless — it's evidence that the relevant selection pressures were operating on a timescale or level of complexity that individual cognition can't easily access.

This doesn't mean all norms are good. It means the prior on "I've thought about this for an afternoon and concluded this ancient, widespread practice is pointless" should be much lower than the Sequences suggest. You might be right. But you should be surprised if you're right, not surprised if you're wrong.

When Should You Actually Act?

The point of this article is not "always defer to tradition." That would be a different error, and one the Sequences correctly warn against. The point is that there's a large and important gap between "this norm is suboptimal" and "I should unilaterally defect from this norm," and the Sequences provide almost no guidance for navigating that gap.

Here are some heuristics for when acting on your analysis is more likely to go well:

The cost of the norm falls primarily on you. If you're the one bearing the cost of compliance, defection is more defensible because you're not imposing externalities. Deciding to be vegetarian is different from loudly informing everyone at dinner that they're complicit in factory farming.

You can exit rather than reform. Moving to a community where your preferred norms are already in place is much less costly than trying to change the norms of the community you're in. This is one reason the rationalist community itself exists — it's a place where certain norms (explicit communication, quantified beliefs, etc.) have enough adoption to actually function.

Adoption is already high enough. If you're at 40% and pushing toward a tipping point, unilateral action looks very different than if you're at 0.5% and tilting at windmills. Read the room.

The norm is genuinely new rather than Lindy. A norm that's been stable for centuries has survived a lot of selection pressure. A norm that arose in the last decade hasn't been tested. Your prior on "I can see why this is wrong" should be calibrated to how long the norm has persisted.

You can experiment reversibly. If you can try defecting and easily revert if it goes badly, the downside is limited. If defection burns bridges or signals things you can't unsignal, be cautious.

You understand why the fence is there, not just that it's inefficient. This is the actual Chesterton test, applied honestly. "I can see that this norm is suboptimal" is not the same as "I understand the function this norm serves and have a plan that serves that function better."

The Central Tragedy

The pattern described in this article — "this would be better if everyone did it, but is actively costly if only I do it" — is not limited to the flirting example. It applies to a huge class of rationalist-flavored insights about social behavior.

Radical honesty. Explicit negotiation of social obligations. Treating every interaction as an opportunity for Bayesian updating. Refusing to engage in polite fictions. Pointing out logical errors in emotionally charged conversations. All of these are, in some sense, correct — a world where everyone did them might well be better. And all of them, implemented unilaterally at low adoption, will reliably make your life worse while changing nothing about the broader equilibrium.

This is, I think, the central tragedy of reading LessWrong at a formative age. You learn to see inefficiencies that are genuinely real. You develop the tools to analyze social systems with a precision most people never achieve. And then, because nobody ever taught you the difference between seeing the problem and being able to solve it unilaterally, you spend years generating friction — making interactions weird, alienating people who would otherwise be allies, pattern-matching yourself to "insufferable rationalist who thinks they've solved social interaction from first principles" — all in pursuit of norms that can only work through coordination.

The Sequences need a companion piece. Not one that says "don't think critically about norms" — that would be throwing out the baby with the bathwater. But one that says: "Having identified that a fence is inefficient, your next step is not to tear it down. It's to understand what load-bearing function it serves, to assess whether you have the coordination capacity to replace it with something better, and to be honest with yourself about whether unilateral action is heroic or just costly. Most of the time, it's just costly."

Or, more concisely: there should be a Sequence about not treating the Sequences as action guides unless you have the wisdom to tell why Chesterton's fence exists.




The Friendly Telepath Problems

2026-02-15 23:08:27

Published on February 15, 2026 3:08 PM GMT

Companion to "The Hostile Telepaths Problem" (by Valentine)

Epistemic status: This is my own work, though I asked Valentine for feedback on an early draft. I'm confident that the mechanisms underlying the Hostile and Friendly Telepath dynamics are closely related. The individual friendly-telepath problems themselves seem well-established. I am less confident, or at least can't make as strong a claim, about the relation to selfhood/the self.
 

Valentine's "Hostile Telepaths" is about what your mind will do when you have to deal with people who can read your emotions and intentions well enough to discover when you are not thinking what they want you to think, and you have to expect punishment for what they find. In such a scenario, your mind will make being-read less dangerous in one of multiple ways, for example, by warping what you can discover in yourself and know about your own intentions. 

If that doesn't seem plausible to you, I recommend reading Valentine's post first. Otherwise, this post mostly stands on its own and describes a different but sort of symmetric or opposite case.

"Telepathy," or being legible for other people, isn't only a danger. It also has benefits for collaboration. As in Valentine's post, I mean by "telepath" people who can partly tell if you are being honest, whether you are afraid, whether you will stick to your commitments, or in other ways seem to know what you are thinking. I will show that such telepathy is part of a lot of everyday effective cooperation. And beyond that, I will ask: What if a lot of what we call "having a self" is built, developmentally and socially, because it makes cooperation possible by reducing complexity? I will also argue that Valentine's Occlumency, i.e., hiding your intentions from others (and yourself), is not only a defensive hack. It can also function as a commitment device: it makes the conscious interaction between such telepaths trustworthy.


Two sides of one coin

In the hostile-telepath setting, the world contains someone who can infer your inner state and who uses that access against you.

That creates pressure in two directions: You can reduce legibility to them by using privacy, misdirection, strategic silence, or other such means. Or you can reduce legibility even to yourself when self-knowledge is itself dangerous. If I can't know it, I can't be punished for it.

Valentine's post is mostly about the second move: the mind sometimes protects itself by making the truth harder to access.

But consider the flip side: Suppose the "telepath" is not hunting for reasons to punish you. Suppose they're a caregiver, a teammate, a friend, a partner, or someone trying to coordinate with you. Then, being legible is valuable. Not because everything about you must be transparent, but because the other person needs a stable, simple, efficient way to predict and align with you:

  • Is he hungry or scared?
  • Does she mean yes?
  • Will he be there at 7 PM?
  • Is this a real preference or a polite reflex?

A big part of "becoming a person" is becoming the kind of being with whom other people can coordinate. At first, caregivers, then peers, then institutions.

The interface

If you squint, a self is an interface to a person. Here is a nice illustration by Kevin Simler that gives the idea:

From Personhood: A Game for Two or More Players by Kevin Simler on Melting Asphalt

But Kevin Simler is talking about the stable result of the socialization process (also discussed by Tomasello[1]). The roles/masks we are wearing as adults, whether we are aware of them or not. I want to talk about the parts prior to the mask. The mask is a standardized interface, but the pointy and gooey parts are also a lower-level interface. 


The origin of the interface

I have a four-month-old baby. She can't speak. She has needs, distress, curiosity, but except for signals of comfort or discomfort, she has no way to communicate, much less to negotiate.

[Photo: my 4-month-old daughter, slightly GPT-edited for anonymity]

I can't coordinate with "the whole baby." I can't look into its mind. I can't even remember all the details of its behavior. I can only go by behavior patterns that are readable: different types of crying, moving, and looking. New behaviors or ones I already know (maybe from its siblings).

Over time, the baby becomes more legible. And they are surprisingly effective at it[2]. But how? Not by exposing every internal detail, but by developing stable handles that I or others can learn:

  • A baby shows distinguishable signals: I can recognize different cries and movements when it is tired vs hungry vs in pain vs bored (the latter is especially common).
  • It develops increasingly consistent preferences. Not so much for toys yet, but it never took the pacifier, for example. This includes getting bored by or being interested in specific things.
  • There is beginning anticipation and collaboration: If I diaper it, it will be calm.
  • Eventually, children will be able to make and hold commitments: "I will do it."

So I interpret its behaviors in a form that is legible to me. I have a limited number of noisy samples and can't memorize all the details, so I compress them into something I can handle ("it likes bathing in the evening") - with all the issues that brings (oops, it likes bathing when X and Y, but not Z - many of which I don't know).

Vygotsky[3] describes how this kind of interpretation of children's behavior gives rise to the interface through mutual reinforcement.

From the outside, the baby starts to look like "a person." From the inside, I imagine, it starts to feel like "me."


Mandatory constraints

It is in the interest of the baby to be legible, and so it is for people. If other people know our interface, we can cooperate more effectively. And it also seems like more information and less noise would be better: a clearer reading, and fewer errors when compressing each other's communication. This may sound like an argument for radical transparency or radical honesty: if legibility is good, why not make everything transparent to others and also to yourself? But consider: what happens if you could edit everything on demand? The interface stops being informative. What makes coordination possible is constraint. A partner can treat a signal as evidence only if it isn't infinitely plastic. Examples:

  • If you could instantly rewrite fear, then "I'm scared" becomes less diagnostic and more negotiational.
  • If you could rewrite attachment on command, then "I care about you" is no longer a commitment.
  • If you can always generate the perfect emotion on cue, then sincerity becomes hard to distinguish from performance.

So some opacity doesn't just hide - it stabilizes. It makes your actions real in the boring sense: they reflect constraints in the system, not only strategic choices. 

All things being equal, we prefer to associate with people who will never murder us, rather than people who will only murder us when it would be good to do so - because we personally calculate good with a term for our existence. People with an irrational, compelling commitment are more trustworthy than people compelled by rational or utilitarian concerns (Schelling's Strategy of Conflict) -- Shockwave's comment here (emphasis mine). See also Thomas C. Schelling's "Strategy of Conflict"

So opacity to ourselves does not only function as a defence against hostile opponents; it also enables others to cooperate with us. As long as we don't know why we consistently behave in a predictable way, we can't systematically change it. Opacity enables commitment.

But not all types of opacity or transparency are equal. When people argue about "transparency," they often conflate at least three things:

  1. Opacity to others: Privacy, boundaries, and omission (strategic information hiding).
  2. Opacity to self: You don't fully know what's driving you; the process and input that gives rise to your thoughts is at least partly hidden.
  3. Opacity to will: You can't rewrite your motivations on demand.

What is special about cooperation here is that you can want more legibility at the interface while still needing less legibility in the implementation.

A good team doesn't need to see every desire. It needs reliable commitments and predictably named states.


Failure modes

Valentine discusses some ways self-deception can go wrong; for example, it can mislabel the problem as a “personal flaw” or show up as akrasia, mind fog, or distraction. We should also expect the reverse direction, the coordination interface, to have failure modes. Which ones can you think of? Here are four illustrations of common failure modes. Can you decode which ones they are before reading on?

 

Performative transparency

In A non-mystical explanation of insight meditation...  Kaj writes:

I liked the idea of becoming "enlightened" and "letting go of my ego." I believed I could learn to use my time and energy for the benefit of other people and put away my 'selfish' desires to help myself, and even thought this was desirable. This backfired as I became a people-pleaser, and still find it hard to put my needs ahead of other peoples to this day.

You become legible, but the legibility is optimized for being approved of, not for being true.

The tell is if your "self" changes with the audience; you feel managed rather than coordinated.

Rigidity dressed up as authenticity

In When Being Yourself Becomes an Excuse for Not Changing Fiona writes:

a friend breezed into lunch nearly an hour late and, without a hint of remorse, announced, “I’m hopeless with time.” Her casual self-acceptance reminded me of my younger self. From my current vantage point, I see how easily “that’s just who I am” becomes a shield against any real effort to adapt.

Sometimes "this is just who I am" is a boundary. Sometimes it's a refusal to update. Everybody has interacted with stubborn people. This is a common pattern. But you can tell if it is adaptive if it leads to better cooperation. Does it make you more predictable and easy rto cooperate with, or does it mainly shut down negotiations that might  (!) overwhelm you?

The rule of equal and opposite advice applies here. There is such a thing as asserting too few boundaries. For a long time, I had difficulties asserting my boundaries. I was pretty good at avoiding trouble, but it didn't give people a way to know and include my boundaries in their calculations - and in many cases where I avoided people and situations, they would have happily adapted.

The internal hostile telepath

In Dealing with Awkwardness, Jonathan writes:

Permanently letting go of self-judgement is tricky. Many people have an inner critic in their heads, running ongoing commentary and judging their every move. People without inner voices can have a corresponding feeling about themselves - a habitual scepticism and derisiveness.

So you can develop an inner critic: an internalized role, maybe of a parent or teacher, that audits feelings and demands that you "really mean it."

Then you're surveilled from the inside. I think this shows up as immature defenses[4], numbness or theatricality. 

Dependency on mirrors

In What Universal Human Experiences are You Missing without Realizing it? Scott Alexander recounts:

Ozy: It took me a while to have enough of a sense of the food I like for “make a list of the food I like” to be a viable grocery-list-making strategy.

Scott: I’ve got to admit I’m confused and intrigued by your “don’t know my own preferences” thing.

Ozy: Hrm. Well, it’s sort of like… you know how sometimes you pretend to like something because it’s high-status, and if you do it well enough you _actually believe_ you like the thing? Unless I pay a lot of attention _all_ my preferences end up being not “what I actually enjoy” but like “what is high status” or “what will keep people from getting angry at me”

A benevolent friend can help you name what you feel. But there's a trap: outsourcing selfhood and endorsed preferences.

Do you only know what you want after someone (or the generalized "other people") wants it too?


If you want to balance the trade-offs between being legible and opaque, what would you do? Perhaps: 

  • Be predictable without being exposed.
  • Be legible in commitments; be selectively private about details.
  • Prefer "I don't know yet" over a confident story that doesn't fit.
  • Reduce fear-based opacity, keep boundary-based opacity.

What does the latter mean?

  • Some contexts are structurally hostile; don't try to win them with more openness.
  • Use rituals that don't demand instant inner conformity. Sometimes "I'm sorry" is the appropriate response, even if the emotion is lagging.
  • "I like Italian food" is legible but doesn't help with choosing a restaurant. Sometimes you have to tell more. But often the details just add noise.

What I can't answer is the deeper question: What is a stable way of having a "self as an interface" that is stable under both coordination pressure (friendly telepaths) and adversarial pressure (hostile telepaths)? Bonus points if you can also preserve autonomy and truth-tracking.

  1. ^

    how humans put their heads together with others in acts of so-called shared intentionality, or "we" intentionality. When individuals participate with others in collaborative activities, together they form joint goals and joint attention, which then create individual roles and individual perspectives that must be coordinated within them (Moll and Tomasello, 2007). Moreover, there is a deep continuity between such concrete manifestations of joint action and attention and more abstract cultural practices and products such as cultural institutions, which are structured - indeed, created - by agreed-upon social conventions and norms (Tomasello, 2009). In general, humans are able to coordinate with others, in a way that other primates seemingly are not, to form a "we" that acts as a kind of plural agent to create everything from a collaborative hunting party to a cultural institution.

    Tomasello (2014) A natural history of human thinking, page 4

  2. ^

     

    We propose that human communication is specifically adapted to allow the transmission of generic knowledge between individuals. Such a communication system, which we call ‘natural pedagogy’, enables fast and efficient social learning of cognitively opaque cultural knowledge that would be hard to acquire relying on purely observational learning mechanisms alone. We argue that human infants are prepared to be at the receptive side of natural pedagogy (i) by being sensitive to ostensive signals that indicate that they are being addressed by communication, (ii) by developing referential expectations in ostensive contexts and (iii) by being biased to interpret ostensive-referential communication as conveying information that is kind-relevant and generalizable.

    Natural Pedagogy by Csibra & Gergely (2009), in Trends in Cognitive Sciences pages 148–153

  3. ^

     

    We call the internal reconstruction of an external operation internalization. A good example of this process may be found in the development of pointing. Initially, this gesture is nothing more than an unsuccessful attempt to grasp something, a movement aimed at a certain object which designates forthcoming activity. The child attempts to grasp an object placed beyond his reach; his hands, stretched toward that object, remain poised in the air. His fingers make grasping movements. At this initial stage pointing is represented by the child's movement, which seems to be pointing to an object - that and nothing more.

    When the mother comes to the child's aid and realizes his movement indicates something, the situation changes fundamentally. Pointing becomes a gesture for others. The child's unsuccessful attempt engenders a reaction not from the object he seeks but from another person. Consequently, the primary meaning of that unsuccessful grasping movement is established by others. Only later, when the child can link his unsuccessful grasping movements to the objective situation as a whole, does he begin to understand this movement as pointing. At this juncture there occurs a change in that movement's function: from an object-oriented movement it becomes a movement aimed at another person, a means of establishing relations. The grasping movement changes to the act of pointing.

    Vygotsky (1978), Mind in society, page 56

  4. ^

    See Defence mechanism. A writeup is in Understanding Level 2: Immature Psychological Defense Mechanisms. I first read about mature and immature defenses in Aging Well.




Were witches infertile mentally ill women?

2026-02-15 21:20:34

Published on February 15, 2026 1:20 PM GMT

My aunt's family are ultra orthodox Jews, and I used to stay with them occasionally in Bayit Vegan for Shabbat.

Once I went for a walk with my cousins. Families there sometimes have dozens of children squashed into tiny apartments, and with the streets closed for Shabbat, the road becomes their playground. Hundreds of children thronging everywhere, zooming down the hill on tricycles, skipping rope or playing catch.

Suddenly we see an old unkempt woman screaming at the kids. Not shouting that she's trying to sleep or that they should keep off her lawn, but that Hitler should have killed them all, and other similarly vile curses. The children mostly seemed to ignore her, but I was shocked!

My cousin explained: she was the wife of a Rabbi who had never had children. At some point she snapped and started to grab pushchairs off mothers in the street and happily push them along till the petrified mothers grabbed them back off her.

Eventually she stopped even trying this and just used to spend her hours screaming and cursing the neighbourhood children. By now the kids were so used to it they faded into the background.

The cultural milieu in Bayit Vegan is probably not so different from a medieval village culture in terms of gender roles and the strong pressure to have large families. It seems likely that mentally ill infertile women in such villages would have manifested their illness in a similar manner.

In today's enlightened age we recognise such people as mentally ill. But imagine yourself as a villager in a medieval village, and you hear about an old woman snatching babies off their mothers and cursing children. The natural term that comes to mind is a witch, and the natural thing to do when a child goes missing is to search her property and burn her at the stake. Indeed, many times you might find the missing child in the witch's house, further confirming your belief that she was a witch.

I doubt this is the whole story, but this fits well enough that I think there is a very strong chance that this is a large part of the origin of the memetic complex of medieval witches.




Realizability for Finite State Reactive Agents

2026-02-15 21:14:18

Published on February 15, 2026 11:38 AM GMT

This work was produced during the Dovetail Research Fellowship. I’m grateful to Alex, Alfred, Winter, and Santiago, and to everyone who provided feedback, and to the fellowship community for their discussions.

Abstract

We study AI agents through the lens of finite-state transducers and string-to-string behaviors. Instead of treating automata purely as language recognizers, we model agents as systems that map observation histories to action histories while interacting with an environment online. We introduce Finite-State Reactive Agents (FSRAs) as deterministic, finite-memory, reactive policies, establishing their equivalence to standard Mealy and Moore machines. Our main result (Theorem 3.13) characterizes exactly which behaviors can be implemented with finite internal state: a deterministic behavior is FSRA-implementable if and only if it is length-preserving, prefix-preserving (causal), and has finite memory. We relate FSRAs to standard Mealy and Moore machines and identify when richer computational models are unavoidable. We also introduce specifications (constraints on allowed agent behaviors) and formalize notions of weak enforceability, strong enforceability, and realizability. Using our characterization theorem, we prove strict separations between these notions (Theorem 3.15), demonstrating when finite-state agents can or cannot satisfy given specifications.

Introduction

Deterministic Finite Automata (DFAs) are usually introduced as a basic model of computation: simple machines that read an input symbol by symbol and decide whether to accept or reject a string. In theoretical computer science, they are often treated as a warm-up before more powerful models such as Turing machines.

In this work, we take a different perspective. A DFA can also be viewed as a minimal agent: it maintains a small internal state, receives observations from the environment, updates its state deterministically, and eventually produces an outcome. Despite their simplicity, DFAs already capture the core interaction loop shared by many AI systems: observe → update internal memory → act or evaluate.

This observation motivates a shift from language recognition to agent behavior. We introduce the model of a Finite-State Reactive Agent (FSRA), which generalizes DFAs from passive evaluators to active, online systems that map streams of observations to streams of actions using finite internal memory. FSRAs formalize deterministic, reactive agents whose behavior is causal and prefix-preserving, and whose internal state evolves according to a fixed update rule (time-invariance). They let us ask precise questions about memory, state abstraction, observability, and goal satisfaction without the complications of learning (updating the policy from data), optimization (searching over policies or parameters to maximize an objective), or stochasticity (randomness in the agent's actions or environment). These characterizations let us state simple, checkable conditions for when a specification is enforceable or realizable by a set of finite-state controllers.

Outline. The rest of this post is organized as follows. Section 1 introduces the Finite-State Reactive Agent (FSRA) model and states its equivalence with standard Mealy and Moore machines. Section 2 develops formal properties of agent behaviors: determinism, length-preservation, causality (prefix-preservation), finite memory, reactivity, and time-invariance, modeled as properties of string-to-string functions. It concludes by introducing specifications (constraints on allowed behaviors) and three enforceability notions: weak, strong, and realizability. Section 3 presents our main results: Theorem 3.13 characterizes exactly which behaviors are FSRA-implementable (reactive + finite-memory), Corollary 3.14 shows that such behaviors induce realizable specifications, and Theorem 3.15 proves strict separations between the enforceability notions, demonstrating fundamental gaps between what finite-state agents can weakly satisfy versus fully realize. We conclude with worked examples including binary remainder computation (mod k), substring avoidance (no "11"), and count-with-ending constraints. Standard automata-theoretic background (DFAs, Mealy machines, Moore machines) and deferred proofs are collected in the Appendix.

Finite State Reactive Agent

Mealy machines (Definition A.2 in the Appendix) are deterministic finite-state transducers that produce an output symbol synchronously with each input symbol.  This makes them a natural formal model for very simple online agents: the input alphabet corresponds to the agent's observations, the finite state encodes the agent's memory, and the per-step output is the agent's immediate action.  To make this perspective explicit and to use language familiar to AI and alignment readers, we adopt the name Finite-State Reactive Agent (FSRA) for a Mealy machine interpreted as a deterministic, finite-memory, online policy.  The formal Definition 1.1 below gives the FSRA tuple and the precise causal / prefix-preserving behavior we require.

Definition 1.1 (Finite State Reactive Agent - FSRA)

A Finite-State Reactive Agent (FSRA) is a deterministic, finite-memory, online transducer that maps each input (observation) prefix to an output (action) prefix of the same length, i.e., it emits exactly one action symbol for every observation symbol it consumes.

Formally, an FSRA is a tuple M = (Q, Σ, Γ, δ, λ, q₀), where

  • Q is a finite set of states (the agent’s internal memory),
  • Σ is a finite observation alphabet (input),
  • Γ is a finite action alphabet (output),
  • δ : Q × Σ → Q is the (total) state-update function,
  • λ : Q × Σ → Γ is the output (action) function, producing exactly one action per observation,
  • q₀ ∈ Q is the start state.

The FSRA implements a function

f_M : Σ* → Γ*

that is length-preserving:

|f_M(o)| = |o| for every o ∈ Σ*,

and causal / prefix-preserving: the i-th output symbol depends only on the first i input symbols, and for any prefix o and continuation u, f_M(o) is a prefix of f_M(ou).

Concretely, on input o = o₁o₂…oₙ starting from q₀, define the state sequence
qᵢ = δ(qᵢ₋₁, oᵢ) for i = 1, …, n, with q₀ given. The produced action sequence is
aᵢ = λ(qᵢ₋₁, oᵢ), so f_M(o) = a₁a₂…aₙ.

Remark (generalization). If you want to allow a variable number of action symbols per observation, simply let λ : Q × Σ → Γ*. That yields the Mealy/transducer formulation where one transition may emit a finite output string; the FSRA above is the special case that emits strictly one action per observation.

Calling this model an FSRA makes the intended interpretation explicit: the elements of Σ are observations, the elements of Γ are actions, and the pair (δ, λ) together specify a deterministic, finite-memory policy. An FSRA captures the simplest kind of online agent (these terms are discussed in detail in the next subsection): it is reactive (length- and prefix-preserving, emitting exactly one action per observation), has finite memory encoded by the finite state set Q, and is deterministic.

This model is useful for reasoning about what can be implemented with finite memory, and for making clear where finite-state agents fail. When one wants to model policies with longer or variable delay between observation and action, or policies that maintain richer beliefs, the FSRA can be generalized (e.g. by letting λ emit strings in Γ*, or by increasing |Q|), but ultimately some behaviors require strictly more expressive models (e.g. pushdown automata, Turing machines, or stochastic/learning agents).
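To make the definition concrete, here is a minimal Python sketch of an FSRA as a dictionary-backed transition table. The class and method names, and the parity example at the end, are illustrative choices rather than notation from this post; the sketch assumes the tuple (Q, Σ, Γ, δ, λ, q₀) as reconstructed above.

```python
# Minimal sketch of a Finite-State Reactive Agent (FSRA): a deterministic,
# finite-memory, online transducer that emits exactly one action per observation.

class FSRA:
    def __init__(self, states, obs_alphabet, act_alphabet, delta, lam, start):
        self.states = states          # Q: finite set of states
        self.obs = obs_alphabet       # observation alphabet
        self.act = act_alphabet       # action alphabet
        self.delta = delta            # dict: (state, obs) -> next state
        self.lam = lam                # dict: (state, obs) -> action
        self.state = start            # current state, initially q0

    def step(self, o):
        """Consume one observation, emit exactly one action, update the state."""
        a = self.lam[(self.state, o)]
        self.state = self.delta[(self.state, o)]
        return a

    def run(self, observations):
        """The length-preserving, prefix-preserving transduction f_M."""
        return [self.step(o) for o in observations]


# Example: a 2-state FSRA over observations {0, 1} that outputs 1 exactly when
# the number of 1s seen so far (including the current symbol) is odd.
delta = {("even", 0): "even", ("even", 1): "odd",
         ("odd", 0): "odd", ("odd", 1): "even"}
lam = {("even", 0): 0, ("even", 1): 1,
       ("odd", 0): 1, ("odd", 1): 0}
agent = FSRA({"even", "odd"}, {0, 1}, {0, 1}, delta, lam, "even")
print(agent.run([1, 0, 1, 1]))  # [1, 1, 0, 1]
```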

Proposition 1.2 (Equivalence with Mealy / Moore)

Let Σ be a finite observation alphabet and Γ a finite action alphabet.

  1. Every FSRA is, by definition, a standard Mealy machine (with input alphabet Σ, output alphabet Γ, and single-symbol output per transition - Definition A.2' in the Appendix). Conversely, every standard Mealy machine (whose output function produces exactly one symbol per input) is an FSRA. This equivalence is immediate from the definitions.
  2. For every FSRA M there exists an equivalent standard Moore machine (Definition A.3' in the Appendix) computing the same length-preserving transduction f_M, and conversely, every standard Moore machine can be converted into an equivalent FSRA.

Brief justification. Part 1 is just a renaming of components. For Part 2, the FSRA-to-Moore direction requires splitting each state q into copies (q, a) - one for each output symbol a that q can produce - so that the output can be read from the state alone; this can increase the state count up to |Q| · |Γ|. The Moore-to-FSRA direction works by defining λ(q, σ) = g(δ(q, σ)), where g is the Moore output function, moving the output from the arrived-at state to the transition that enters it.
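As a small illustration of the Moore-to-FSRA direction, the following hedged sketch reuses the hypothetical table representation from the FSRA sketch above and derives the FSRA output table by composing the Moore state-output function with the transition function. The function name is an illustrative assumption.

```python
# Sketch of the Moore-to-FSRA direction of Proposition 1.2: the output is moved
# from the arrived-at state to the transition that enters it.
# moore_delta maps (state, obs) -> next state; g maps state -> output symbol.

def moore_to_fsra_tables(states, obs_alphabet, moore_delta, g):
    lam = {(q, o): g[moore_delta[(q, o)]] for q in states for o in obs_alphabet}
    return moore_delta, lam  # the FSRA reuses delta and adds the derived output table
```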

Formal Properties of Agent Behaviors

In this section, we define several properties of general observation-to-action behaviors, modeled as string-to-string functions. We then identify which of these properties are satisfied by Finite-State Reactive Agents (FSRAs).

Let Σ and Γ be finite alphabets, representing observations and actions respectively. We consider functions

f : Σ* → P⁺(Γ*),

where P⁺(Γ*) denotes the collection of nonempty subsets of Γ*. Such a function assigns to each finite observation history a set of admissible action histories.

Later, we will specialize to deterministic behaviors, which are functions of the form f : Σ* → Γ*.

Example (Behavior)

Let . Define a behavior  as follows:


In this example, the behavior is nondeterministic: after some observation histories, multiple action histories are permitted.

Intuitively, a function f : Σ* → P⁺(Γ*) describes everything an agent is allowed to do, rather than what it will do. For each observation history, f lists the action histories that are consistent with some specification, constraint, or design goal. Deterministic agents correspond to the special case where exactly one action history is allowed at each observation history.

It is useful to separate behaviors from agents. A behavior f specifies which action histories are acceptable after each observation history, without committing to a particular decision rule. An agent, by contrast, must choose a single action at each step.
We also distinguish both from a specification: a specification P is a constraint on what an agent is allowed to do (formally, a relation P ⊆ Σ* × Γ*; see Definition 2.5), whereas a behavior is what the agent actually does. This three-way distinction (specification, behavior, agent) lets us ask precise questions. For example: "Can a simple agent satisfy this specification?" (i.e., does there exist an agent whose behavior stays within the allowed set?). In the formal-methods and control-theory literature, this is called enforceability: a specification is enforceable by a set of agents if at least one agent in that set produces only allowed behaviors. We can also ask the converse: "Can a set of agents collectively realize every behavior the specification allows?" These notions are formalized in Definition 2.11.

Allowing f to be nondeterministic is intentional. Many natural specifications such as safety constraints, permissive policies, or sets of acceptable plans do not prescribe a unique action at every step. Instead, they describe a space of allowed behaviors. 

Definition 2.3 (Deterministic) 

A function f : Σ* → P⁺(Γ*) is deterministic if |f(o)| = 1 for every o ∈ Σ*.

In this case, we may write f : Σ* → Γ* and identify f(o) with the unique element of the singleton set. Determinism means that given any observation history, the action sequence is uniquely determined. Note that a deterministic behavior effectively is an agent's policy: the distinction between "set of allowed action histories" and "committed choice" collapses when the allowed set is always a singleton. The behavior-vs-agent distinction is most useful in the nondeterministic case, where a behavior permits multiple options and the question becomes which agent (i.e., which deterministic selection) to deploy.

 

Example (Deterministic)


Let  and . Define

So, the function returns  and  alternatively.

Determinism is the simplest way to turn a permissive specification into an actual policy: once you pick a deterministic f, you no longer have a set of allowed behaviors but a single thing the agent will actually do. This is useful when reasoning about concrete guarantees (e.g., “will this agent ever output bad?”), but remember that determinism is a modeling choice; real systems are sometimes nondeterministic (stochastic) by design.

Definition 2.4 (Controller and Controller Set) 

A controller c is any concrete system (a finite automaton, a program, a circuit, a neural network, etc.) that determines a behavior

f_c : Σ* → P⁺(Γ*),

where f_c(o) is the set of action histories the controller may produce after observation history o. We deliberately leave "concrete system" informal: the only thing that matters formally is the induced behavior f_c.

A controller set C is a set of controllers; we write c ∈ C and use the shorthand c ↦ f_c for the mapping from controllers to their induced behaviors. (No additional structure on C is assumed; it is simply the collection of controllers under consideration, e.g., "all FSRAs" or "all 3-state Mealy machines.")

If each f_c(o) is single-valued, we call the controller deterministic and identify f_c with the corresponding function Σ* → Γ*.

Example (Controllers)

Let  be the set containing two deterministic behavior/functions with:
-
.
Then  is a controller set that can produce at least 
depending on which controller is selected.

Definition 2.5 (Specification)

Let Σ and Γ be finite alphabets, and let

P ⊆ Σ* × Γ*

be a specification (a relation between observation histories and allowed action histories). For an observation history o ∈ Σ*, write the fiber (i.e., the set of all action histories that the specification allows after o):

P(o) = { a ∈ Γ* : (o, a) ∈ P }.
A specification is not the same as a behavior. A behavior describes what a controller does; a specification describes what a controller is allowed to do. In particular, P(o) can contain many action histories; it expresses a constraint ("any of these actions would be acceptable") rather than a commitment ("this is the action I will take"). A single controller typically covers only some of the allowed pairs. The enforceability and realizability notions in Definition 2.11 make this gap precise: they ask whether controllers in a set C can, individually or collectively, cover what P allows.

 

Let C be a set of controllers; each controller c ∈ C induces a function (behavior)

f_c : Σ* → P⁺(Γ*),

so that f_c(o) is the set of action sequences controller c produces on observation history o. Define the relation implemented by C as

R_C = { (o, a) ∈ Σ* × Γ* : a ∈ f_c(o) for some c ∈ C }.

Observe that R_C(o) = ⋃_{c ∈ C} f_c(o).

Think of the controller as the machine you could put on the robot: it has internal structure and a rule for mapping sensory input to motor outputs. The behavior f_c is the abstract trace of possibilities that a machine can generate. Distinguishing these levels is crucial: a specification can permit many possible action traces (multivalued), but only some will be reachable by your finite-state controllers. The phrase “is it enforceable?” asks whether the machines you can actually implement can realize at least one of the allowed traces at each point; “is it realizable?” asks whether the machines collectively capture exactly the specification.

Definition 2.6 (Length Preserving)

A function f : Σ* → P⁺(Γ*) is length-preserving if

|a| = |o| for every o ∈ Σ* and every a ∈ f(o).

For deterministic f this reduces to |f(o)| = |o| for all o ∈ Σ*.

Example (Length-preserving)

Let  and define


i.e.   Then .
 

Intuitively, a length-preserving function produces exactly as many output symbols as it consumes input symbols. This models an agent that emits one action per observation (or more generally, a fixed number of actions per observation). It’s an explicit modeling choice: many controllers (especially embedded or real-time ones) naturally fit this model.

Definition 2.7 (Prefix Preserving/Causal)

A function f : Σ* → P⁺(Γ*) is prefix-preserving (or causal) if for all o, u ∈ Σ*, every a ∈ f(o) is a prefix of some a′ ∈ f(ou).

For deterministic f this is equivalent to: f(o) is a prefix of f(ou) for all o, u ∈ Σ*.

Causality captures the notion of online computation: the output produced after reading a prefix  must remain consistent (as a prefix) with the output produced after reading any extension . In other words, the agent commits to its early outputs and cannot retroactively change them when new observations arrive.

Example (Prefix-preserving / causal)

Let  and define  by the stepwise rule:
on seeing  output ; on seeing  output .
Formally, .
Then  is a prefix of  for every .

Definition 2.8 (Finite Memory)

A deterministic function f : Σ* → Γ* has the finite-memory property if the equivalence relation ~_f on Σ*, defined by

o₁ ~_f o₂  iff  for every continuation u ∈ Σ* the future outputs agree, i.e. f(o₁u) with the prefix f(o₁) removed equals f(o₂u) with the prefix f(o₂) removed,

has finite index (i.e., finitely many equivalence classes).

The relation ~_f partitions observation prefixes into equivalence classes that cannot be distinguished by any future continuation. If this partition is finite, then the agent effectively needs only finite memory to encode its past: it suffices to remember which equivalence class the history belongs to. This is the string-function analogue of the Myhill-Nerode theorem for languages.

The finite-memory condition is the formal version of “bounded internal state.” It says there are only finitely many distinct ways the future can unfold after different pasts, so keeping one of finitely many summaries is enough. This is exactly why finite automata are a natural model for resource-limited agents.

Example (Finite Memory - Periodic every k)

Fix k ≥ 1, let Σ be a finite observation alphabet, and let Γ = {0, 1}. Define the deterministic f : Σ* → Γ* by the stepwise rule: at step i, emit 1 if i is a multiple of k and 0 otherwise;

i.e. the symbol emitted at step i depends only on the position i mod k (every k-th output is 1).

Claim. f has finite memory; the index of ~_f is exactly k.

Proof. Define a relation R on prefixes by

o₁ R o₂  iff  |o₁| ≡ |o₂| (mod k).

There are at most k classes under R. If o₁ R o₂, then for every continuation u the positions in u where outputs equal 1 are the same (they are exactly those indices congruent to −|o₁| mod k). Hence o₁ ~_f o₂. Thus ~_f has at most k classes. Conversely, two prefixes with different lengths modulo k produce different future outputs (their next few outputs disagree), so all k residue classes are distinct. Therefore ~_f has exactly k classes, so finite index.
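A hedged code sketch of this example, under the reading reconstructed above (emit 1 exactly at step indices that are multiples of k, inputs ignored): the k equivalence classes correspond to the position mod k, so k states suffice. Function names are illustrative.

```python
# Sketch: the "periodic every k" behavior as a k-state FSRA.
# States are position-mod-k counters; the input symbol is ignored,
# which is why k states suffice regardless of the observation alphabet.

def periodic_fsra(k):
    # delta: advance the position counter mod k (independent of the observation)
    delta = lambda q, o: (q + 1) % k
    # lam: emit 1 exactly when the current step index is a multiple of k,
    # i.e. when the counter is about to wrap from k-1 back to 0
    lam = lambda q, o: 1 if (q + 1) % k == 0 else 0
    return delta, lam, 0  # (state update, output, start state)

def run(delta, lam, q0, observations):
    q, out = q0, []
    for o in observations:
        out.append(lam(q, o))
        q = delta(q, o)
    return out

delta, lam, q0 = periodic_fsra(3)
print(run(delta, lam, q0, "abcdefg"))  # [0, 0, 1, 0, 0, 1, 0]
```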

Example (Not Finite Memory - Perfect Square Count Predicate)

Let Σ = Γ = {0, 1}. Define a deterministic reactive f by the stepwise rule:
at each step emit 1 iff the total number of 1s seen so far is a perfect square; otherwise emit 0.
Formally, if after reading prefix o the count of 1s is m, then the output emitted when the (|o|+1)-th input is processed depends on whether the resulting count of 1s is a square.

Claim. ~_f has infinite index (so f is not finite memory).

The proof of this claim can be found in the appendix.
 

Definition 2.9 (Reactive)

A behavior f : Σ* → P⁺(Γ*) is reactive if it is length-preserving, and
prefix-preserving, i.e. for all o, u ∈ Σ*, every a ∈ f(o) is a prefix of some a′ ∈ f(ou).

Equivalent characterization for deterministic behavior: A deterministic behavior f : Σ* → Γ* is reactive if and only if f(ε) = ε and, for every o ∈ Σ* and σ ∈ Σ, there is a single symbol a ∈ Γ with f(oσ) = f(o)·a.

Reactivity does not imply memory-lessness or finite memory. A reactive behavior may depend on the entire observation history o, not just a bounded summary of it. (In particular, "finite memory" is strictly more general than the memoryless/Markov case, where the action depends only on the most recent observation: a finite-memory agent compresses the full history into one of finitely many states, each of which can encode patterns spanning many past observations.)

Reactivity formalizes the intuitive idea that an agent must act in real time. Once an action is taken, it cannot be retracted when new information arrives. This matters for enforceability: a specification may allow some full action sequence in hindsight, yet no reactive agent can realize it because early commitments inevitably block all allowed continuations. Prefix-wise enforceability makes this tension explicit by asking whether an agent can always keep some allowed future open while acting online. We will see these definitions more formally later.

Example (Reactive)

Same as the prefix-preserving example above:  maps observations to immediate outputs, so . This is reactive and causal by construction.

Example (Length-preserving but non-reactive behavior)

Let Σ = {0, 1} and Γ = {a, b}. Define a deterministic function

f(o) = a^|o| if |o| is even, and f(o) = b^|o| if |o| is odd,

where the output string has the same length as o. If ε denotes the empty string then f(ε) = ε.
 

This behavior emits the correct number of actions, but it does not act online. The first action symbol depends on whether the entire observation history has even or odd length, which cannot be determined when the first observation arrives. As a result, the agent’s early actions are not commitments; it may retroactively “change its mind” once more observations arrive. This violates the core idea of reactivity: acting step by step in real time.

Example (Reactive but not finite memory)

Let Σ = Γ = {0, 1}. Define a deterministic reactive f by the stepwise rule:
at each step output 1 if the total number of 1s seen so far is a perfect square, otherwise output 0.
Formally f(oσ) = f(o)·a, where the symbol a depends on the count of 1s in oσ.
This f is reactive (stepwise defined) but not finite-memory, since tracking “is the count of 1s a square?” requires unbounded memory.
This example shows reactivity alone doesn’t bound memory cost. The agent behaves online (each step emits one output) yet the rule depends on an unbounded statistic (counts), so no finite-state controller can implement it.
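For contrast, here is a hedged sketch of the perfect-square behavior. It runs online (one output per input) but keeps an unbounded integer counter, which is exactly the kind of state a finite state set Q cannot encode.

```python
# Sketch: reactive but not finite-memory. One output per input (online),
# but the internal state is an unbounded integer count of 1s seen so far.
import math

def square_count_transducer(observations):
    count, out = 0, []
    for o in observations:
        if o == 1:
            count += 1
        is_square = math.isqrt(count) ** 2 == count
        out.append(1 if is_square else 0)
    return out

print(square_count_transducer([1, 1, 0, 1, 1]))  # counts 1,2,2,3,4 -> [1, 0, 0, 0, 1]
```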

 

The properties above (length-preserving, prefix-preserving, finite memory) are exactly the ingredients of our main FSRA characterization theorem (Theorem 3.13). We will state and prove it in Section 3. Before that, we introduce the idea of enforceability and realizability, which we will use later when talking about which specifications such agents can satisfy.

 

Definition 2.10 (Time-invariance)

A family of online behaviors may be specified by a step-indexed family of per-step update/output rules (δ_t, λ_t) for t ≥ 1, where each δ_t maps a current internal state and the new observation to a next state, and λ_t maps the current state and observation to the emitted action. We call the behavior time-invariant (or stationary) if the per-step rules do not depend on t; that is, there exist maps δ and λ such that

δ_t = δ and λ_t = λ for all t.
Time-invariance adds the further restriction that the decision rule doesn’t secretly depend on how many steps have elapsed.

Definition 2.11 (Enforceability and Realizability).

Weak enforceability: We say that a specification P is weakly enforceable by a set of controllers C if

for every o ∈ Σ* with P(o) ≠ ∅ there exists a controller c ∈ C with f_c(o) ∩ P(o) ≠ ∅,

equivalently

P(o) ≠ ∅  ⟹  R_C(o) ∩ P(o) ≠ ∅ for every o ∈ Σ*.

In words: whenever the specification allows some action at history o, at least one controller in C can produce an allowed action at o. The controller that works may depend on o.


Strong enforceability: We say that P is strongly enforceable by a set of controllers C if

for every o ∈ Σ* and every a ∈ P(o) there exists a controller c ∈ C with a ∈ f_c(o),

equivalently

P(o) ⊆ R_C(o) for every o ∈ Σ*.

In words: every action allowed by the specification at o is realized by some controller in C (the realizing controller may depend on the action and on o).



Realizability: We say that P is realizable by a set of controllers C if

R_C = P,

i.e. for every (o, a) ∈ P there exists some controller c ∈ C with a ∈ f_c(o), and conversely every (o, a) emitted by some controller in C belongs to P.

Deterministic simplifications. When C induces a set of deterministic functions, i.e. each f_c : Σ* → Γ*, then we can just write:

  • Weak enforceability: for every o with P(o) ≠ ∅ there exists c ∈ C with f_c(o) ∈ P(o); equivalently P(o) ≠ ∅ ⟹ { f_c(o) : c ∈ C } ∩ P(o) ≠ ∅.
  • Strong enforceability: for every o, P(o) ⊆ { f_c(o) : c ∈ C }; equivalently, every a ∈ P(o) equals f_c(o) for some c ∈ C.

 

Special case: singleton controller. When C is a singleton set, i.e. C = {c}, the three definitions simplify to:

  • Weak enforceability reduces to f_c(o) ∩ P(o) ≠ ∅ whenever P(o) ≠ ∅; in the deterministic case this is f_c(o) ∈ P(o) whenever P(o) ≠ ∅.
  • Strong enforceability reduces to P(o) ⊆ f_c(o) (i.e. every allowed action must be produced by the single controller), which in the deterministic case means P(o) is either empty or equals {f_c(o)}.
  • Realizability reduces to { (o, a) : a ∈ f_c(o) } = P.

Enforceability compares an abstract specification P (what counts as allowed action histories after each observation history) to what a set of concrete controllers C can actually produce. Weak enforceability asks only that, whenever the spec allows at least one action after a history o, some controller in C can produce at least one allowed action at o (the controller may depend on the history). Strong enforceability demands more: every action that the spec allows at o must be produced by some controller in C (so the controllers collectively realize every permitted action). Realizability asks that R_C exactly matches the specification: the set of all (history, action) pairs realized by controllers in C is equal to P. These notions let us formalize intuitions like “the spec is implementable by simple reactive agents” and separate cases where a spec is merely compatible with controllers from cases where it is exactly captured by them.
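As a hedged illustration of how the three notions differ, the sketch below checks weak enforceability, strong enforceability, and realizability on finite data, with the specification and each controller's behavior represented as dictionaries from observation histories to sets of action histories. The representation and function names are assumptions for illustration only.

```python
# Sketch: comparing weak enforceability, strong enforceability, and realizability
# on finite dictionaries. spec: history -> set of allowed action histories.
# behaviors: list of dicts, one per controller, history -> set of produced actions.

def weakly_enforceable(spec, behaviors):
    # whenever the spec allows something at o, some controller produces an allowed action at o
    return all(not allowed or any(b.get(o, set()) & allowed for b in behaviors)
               for o, allowed in spec.items())

def strongly_enforceable(spec, behaviors):
    # every allowed action at o is produced by some controller
    return all(allowed <= set().union(*(b.get(o, set()) for b in behaviors))
               for o, allowed in spec.items())

def realizable(spec, behaviors):
    # the pairs produced by the controllers are exactly the pairs the spec allows
    histories = set(spec) | {o for b in behaviors for o in b}
    produced = {o: set().union(*(b.get(o, set()) for b in behaviors)) for o in histories}
    return all(produced.get(o, set()) == spec.get(o, set()) for o in histories)

spec = {"0": {"a", "b"}}
behaviors = [{"0": {"a"}}]
print(weakly_enforceable(spec, behaviors))    # True: "a" is allowed and produced
print(strongly_enforceable(spec, behaviors))  # False: nothing produces "b"
print(realizable(spec, behaviors))            # False: produced set != allowed set
```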

Example (Weak but not Strong)

Let  Define the specification 

(For every other observation string .)

Let the controller set be the singleton  with deterministic behavior  given by

Then:

  • For strong enforceability we require  with . But  and no  produces  ; hence  is not strongly enforceable by .
  • Realizability asks . Here  ; thus is not realizable by .

Example (Strongly enforceable but not realizable)

Let  Define the specification 

Let the controller set be  with behaviors

Then

  • Strong enforceability: for each allowed action in  there exists a controller producing it:  Thus  is strongly enforceable by 
  • Realizability:  . Hence  and  is not realizable by .

Proposition 2.12 (Relation between different notions of enforceability): 

Let C be a set of controllers and P a specification. Then

realizable by C  ⟹  strongly enforceable by C  ⟹  weakly enforceable by C.

Prefix-wise enforceable: Say Pref(X) denotes the set of all prefixes of strings in X. A specification P is prefix-wise enforceable by a controller c if

for every o, u ∈ Σ* with P(ou) ≠ ∅ we have f_c(o) ∩ Pref(P(ou)) ≠ ∅.

Equivalently: for every observed prefix o and every continuation u for which the full specification admits some action a on ou, the controller c already offers at prefix o at least one action-prefix that can be extended to an allowed full action on ou.

Prefix-wise enforceability is an explicitly online notion: it requires that a single controller c already commit, at each observation prefix o, to action-prefixes that are extendable to full allowed actions for every possible future continuation u that keeps the specification satisfiable. In practice, this is a useful notion when agents must act before observing the entire future; it demands that the agent’s early choices never rule out reaching some allowed final action once future observations arrive.

Characterization of FSRA and Examples

Theorem 3.13 (Characterization of FSRA) 

Let Σ, Γ be finite alphabets and let f : Σ* → Γ* be a deterministic function. Then there exists an FSRA M with induced transduction f_M = f iff all of the following hold:

  1. f is length-preserving.
  2. f is prefix-preserving/causal.
  3. f is finite memory: ~_f has finitely many equivalence classes.

(Equivalently: f is implemented by some FSRA ⟺ properties (1)–(3) hold.) 

Proof (Theorem 3.13):

 Let be an FSRA and suppose 

  • Length-preserving. By the FSRA definition the machine emits exactly one action symbol for each input symbol. Therefore for every  the produced action string has the same length as . Hence 
  • Prefix-preserving. The FSRA produces each output symbol at the time its corresponding input symbol is read; in particular the output produced after a prefix  is a prefix of the output produced after any extension . Formally, if the state sequence after reading  is  and the outputs are , these are initial segments of those produced on any longer input . Thus  is prefix-preserving.
  • Finite memory. Let  be the finite state set of . Define a map  by . If  then for every continuation  the machine, starting from the same state, will produce identical future outputs on  and . Hence . Thus, the number of -equivalence classes is at most , so  has finite index.

This completes the (⇒) direction. Note also that the FSRA gives explicit fixed maps  (so, time-invariance holds for any FSRA)

(⇐) Assume  is deterministic and satisfies the length-preserving, prefix-preserving, and finite-memory properties. We build an FSRA  that implements .

1. State set. Let

be the set of equivalence classes of . By finite-memory this set is finite. Write  for the -class of . Let the start state be .

2. Define . For class  and observation  define

This is well-defined: if  then , so for every continuation  we have ; taking  of the form  shows , hence .

3. Define . We must define the single action symbol emitted when in class  upon reading . Because  is length-preserving and prefix-preserving, for every  and  the string  is a prefix of  and

so  equals  concatenated with exactly one symbol. Thus there exists a unique  with

Define: 

i.e. the last symbol of . We must check  is well-defined: if  then  so for every continuation . In particular for  we get , and for  we get . Therefore, the unique last symbols of  and  coincide; thus .

4. The machine. Put . By construction  are total maps  and , respectively, so  is an FSRA (deterministic, finite , one output per input).

5. Correctness . We show by induction on length of  that the action string produced by  on input  equals .

  • Base case (empty input): the machine produces the empty output, and length-preservation together with causality gives that the function assigns the empty output to the empty input.
  • Inductive step: assume for  that the machine output equals  and that the machine reached state . On reading the next symbol  the machine outputs , which we defined to be the unique symbol  with . Therefore, the machine’s output after  equals . The next state is the class [uo], as defined in step 2. This completes the induction.

Hence .

Thus (1)–(3) imply existence of an FSRA implementing . The constructed  are fixed maps (time-invariant) by construction.

 

We call this model a Finite-State Reactive Agent because the name describes exactly what the math enforces. It is finite state because the agent’s memory is a fixed, finite set of states (the equivalence classes); there is only a bounded amount of past information the agent can store. It is reactive because the agent responds to each observation immediately: for every input symbol it emits exactly one action symbol, and the action at time t depends only on the observations up to t (no lookahead, no delayed planning). And it is an agent because the whole object is a policy-like mapping from observation histories to actions: a causal, online decision procedure. This label therefore does more than decorate the definition: it communicates the limits and strengths of the model at a glance (what the agent can implement, and what kinds of behavior definitely require richer models such as stacks, external memory, or unbounded belief states).
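To make the model concrete, here is a minimal Python sketch of an FSRA (the class name and the table-based representation are mine, not the paper's); the mod-k and substring examples later in the section can all be written in this form.

```python
# Minimal FSRA sketch: a finite transition table plus a per-step output table.
class FSRA:
    def __init__(self, start, delta, out):
        self.start = start   # start state
        self.delta = delta   # delta[(state, obs)] -> next state
        self.out = out       # out[(state, obs)] -> single action symbol

    def run(self, observations):
        """Consume one observation per step, emitting exactly one action per step."""
        state, actions = self.start, []
        for o in observations:
            actions.append(self.out[(state, o)])  # act on the current observation
            state = self.delta[(state, o)]        # then update the finite memory
        return actions
```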

Corollary 3.14 (Realizability of deterministic reactive finite-memory specifications by FSRAs)

Let  be a deterministic behavior that is reactive (i.e. length-preserving and prefix-preserving) and has the finite-memory property. Define the induced specification .

Then there exists a Finite-State Reactive Agent (FSRA)  such that

In particular,  is realizable by the singleton controller set 

 

Theorem 3.15 (Strict Separation of Realizability and Enforceability for FSRAs)


There exist specifications  and  over a one-symbol observation alphabet  and action alphabet  such that:

  1.  is strongly enforceable by the set of all FSRAs, but  is not realizable by any set of FSRAs (finite or infinite).
  2.  is weakly enforceable by a finite set of FSRAs, but  is not strongly enforceable by any finite set of FSRAs.

Proof.

We treat the two constructions separately.

Part 1: Construction of 

Let  and . For each  write  for the length- observation string. Define the length- action , the unique length- bitstring whose single 1 appears at position . Define the specification

Thus, for every observed prefix  the specification allows exactly the single action .

(i) Strong enforceability by the set of all FSRAs.
For every  there is an FSRA  (With  states) that outputs  on the first  steps and outputs 1 on the -th step. Concretely,  simply counts up to  and emits the described pattern. Hence for each allowed pair  there exists an FSRA (namely ) that produces that exact pair. By definition (Def. 2.11) the set of all FSRAs therefore strongly enforces .

(ii) Non-realizability by any set of FSRAs.
Suppose  is any set (finite or infinite) of FSRAs and that  realizes , i.e.

Then in particular every pair  belongs to the right–hand side, so for every  there exists some  with . But realizability as equality forces more: for every  and every  we must have . Hence for every  and every . Therefore every  implements the same global map  defined by . In particular,  must contain at least one FSRA whose induced function is equal to .

But the function  above is not implementable by any FSRA: the equivalence relation  has infinite index because distinct prefix lengths are distinguished by future behavior; equivalently  is not finite-memory and so (by Theorem 3.13) not implementable by any FSRA. Thus, no FSRA in  can induce , a contradiction. Therefore, no set  of FSRAs realizes .

Part 2: Construction of 

Again let  and . For each  set

the set of length- bitstrings that contain exactly  ones.

(i) Weak enforceability by a finite set.
Let  be the simple 2-state FSRA that ignores the input symbol and alternates outputs  (a 2-cycle). For every  that controller’s length- output contains exactly  ones (the ones appear at the even positions). Hence  for every . Therefore, the singleton finite set  satisfies the weak enforceability condition: for every observed history  there exists an FSRA in this finite set whose produced action at  is allowed.

(ii) Failure of strong enforceability by any finite set.
Fix any finite set  of FSRAs. For a given length  the set  can produce at most  distinct length- action strings (each FSRA gives one string of length ). But the size of the fiber  equals the binomial coefficient , which grows super-polynomially in . Thus, for sufficiently large  we have . Hence  cannot contain FSRAs producing every allowed string in . Since strong enforceability would require that every allowed action at each history be produced by some FSRA in the set (Def. 2.11), no finite  can strongly enforce .
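The counting argument in part 2 is easy to check numerically. The sketch below assumes one concrete reading of the fiber, namely length-2n strings with exactly n ones, so the count is the central binomial coefficient C(2n, n); the exact parameters in the construction above may differ, but the growth comparison is the same.

```python
from math import comb

# A finite set of FSRAs yields at most |set| distinct strings per length,
# while the number of allowed strings (under the illustrative reading above)
# grows super-polynomially.
for n in [2, 5, 10, 20]:
    print(n, comb(2 * n, n))
# 2 6
# 5 252
# 10 184756
# 20 137846528820
```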

 

Example (FSRA - Remainder mod k of the binary stream)

Setup: Fix an integer . Observation alphabet:  (bits observed); action alphabet:  (we let each output symbol be the remainder, for readability). 

Define deterministic reactive  by: for input , the output at step  is the remainder of the integer represented by the prefix  (interpreting the bit string in binary, most-significant-bit first) modulo . Equivalently let

and define  (a string of symbols in ).

 A small working example: () Input 

  •  

  • Output 

 

Finite Memory: Consider the relation  on prefixes:

Claim:  iff the current remainders after  and  are equal, i.e. , where  denotes the integer value of  

  • If  then for any continuation  the recurrence  depends only on  and , so . Hence .
  • Conversely, if , take  to be a single symbol  which distinguishes the remainders by the recurrence. So , hence .

Therefore, the equivalence classes are exactly the  residue classes modulo , so  has finite index and hence finite memory.

 

FSRA: Define  by

  •  (the residue classes).
  • .
  • For state  and observation :

After reading a prefix whose current remainder is , seeing bit  updates the remainder to  and the FSRA emits that new remainder as the action for that timestep. This is deterministic, length-preserving, and causal.

For example: , input :

  • start from .
  • read 1:  emit 
  • read 0: , emit .
  • read 1: , emit .
    Output: 
Finite-State Reactive Agent for binary remainder mod 3.
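A direct Python rendering of this agent follows; it is a sketch (the function name is mine), using k = 3 and the input 101 from the run above.

```python
# Sketch: FSRA emitting the running remainder (mod k) of the binary prefix.
def remainder_fsra(bits, k=3):
    state, outputs = 0, []           # state = remainder of the prefix read so far
    for b in bits:
        state = (2 * state + b) % k  # shift in the new bit and reduce mod k
        outputs.append(state)        # emit the new remainder as this step's action
    return outputs

print(remainder_fsra([1, 0, 1], k=3))  # [1, 2, 2]
```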

Specification: 

where   and 
For every observation prefix , the unique allowed action history is the sequence of remainders obtained by repeatedly updating the current remainder by  for each next bit . Thus 

So .

 

Example (DFA and FSRA - Absence of a substring)


Let  be the collection of all finite strings over  with no two consecutive  symbols; the empty string belongs to it. We first give a DFA  for it, specifically a minimal DFA (i.e. the smallest DFA, with respect to the number of states, which accepts ).

  • States  where:
    •  = “no  at the end / start state” (last symbol was  or empty),
    •  = “last symbol was  and no forbidden  has occurred so far”,
    •  = dead/trap state (we have seen substring ).
  • Start state: .
  • Accepting states:  (we accept as long as  has not occurred).
  • Transition function :
    •  
Minimal DFA detecting absence of substring "11"

Setup: Observations:  (at each step the agent observes a bit). Actions:  (at each step the agent emits 1 iff the prefix seen so far is in , else 0). Define the deterministic reactive function  by: for input , the -th output symbol  equals


So, 

Finite Memory: The relation  has exactly three equivalence classes, so  has finite memory. (The claim is proven in the appendix.)

Define three candidate classes by partitioning all prefixes  into three sets:

  •  (include  here).
  •  (the dead/trap class).
    These three sets are disjoint, and their union is all prefixes.
     

Note: We are using  here and in the proof given in the appendix

Every prefix belongs to one of the three classes , each class is contained in a single -equivalence class, and the three classes are pairwise distinct. Therefore  has exactly three equivalence classes (finite index). By Definition 2.8 (finite-memory),  has finite memory. As a corollary, any FSRA implementing  requires at least 3 states (one per  class); the standard construction produces the 3-state FSRA.

 

FSRA: Construct FSRA  directly using the DFA states.

  •  as above.
  •  is the start state.
  •  (same as DFA):
  •  (emit 1 iff the new state after reading the input is accepting): 

For example, input .

  • Start in .
  • Read 1: new state , emit  because 1 alone is in . (Output so far: 1.)
  • Read 0: new state , emit  because  Output: 11.
  • Read 1: new state , emit  because . (Output: 111.)

If the input were 11, then at the second step:

  • Start in . Read 1 → , emit 1.
  • Read 1 → , emit 0. So the output is 10, indicating that the prefix of length 2 is rejected.
FSRA detecting absence of substring "11"
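The same 3-state machine, sketched in Python (the state names are mine):

```python
# Sketch: 3-state FSRA emitting 1 iff the prefix read so far avoids "11".
def no_11_fsra(bits):
    state, outputs = "q0", []   # q0: last bit was 0 or empty; q1: last bit was 1; dead: saw "11"
    for b in bits:
        if state != "dead":
            state = ("dead" if state == "q1" else "q1") if b == 1 else "q0"
        outputs.append(1 if state != "dead" else 0)  # accept-bit for the current prefix
    return outputs

print(no_11_fsra([1, 0, 1]))  # [1, 1, 1]  (no prefix of 101 contains "11")
print(no_11_fsra([1, 1]))     # [1, 0]     (the length-2 prefix contains "11")
```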

Specification: The realized specification is the singleton-valued relation

For each observation prefix , the fiber  is the unique action history produced by the FSRA after prefix 

 

 

Example (DFA and FSRA - Constraint on Count and Ending)

Fix , observation alphabet . Fix indices , integers . Define

(By convention the empty string does not end in  and thus is not in , unless one explicitly chooses to treat it otherwise.) We build a deterministic finite automaton
 that keeps two pieces of summary information:

  • the residue  equal to the current count of  modulo , and
  • the last seen symbol , where  denotes “empty / no last symbol yet” (useful for the initial state).

Define

  • . So .
  • Start state  (zero occurrences seen, no last symbol).
  • Transition  given by
𝟙

(Here 𝟙 is 1 if its condition holds and 0 otherwise.)

  • Accepting set 

An example, 
 

Setup: We convert the language into a Boolean behavior, which is a convenient FSRA target. Let Observations:  and Actions:  

Define  by: for any input  and index 

𝟙

So  is the binary stream indicating, step-by-step, whether the observed prefix satisfies the two constraints (the count constraint and the ends-with constraint).

Finite Memory: The classes of  are precisely the DFA states . In particular  has at most  classes, so  has finite memory. Using the same proof technique as in the previous example, one can show that the index of  is exactly .

FSRA: We now give the Finite-State Reactive Agent  that implements the indicator function  above.

  •  (same as DFA states).
  •  (per-step indicator).
  • Start .
  • Transition  (state update) is the same as the DFA:
𝟙
  • Output function 
𝟙

Take , alphabet . Let . Let  and . So, we require the count constraint to hold and the prefix to end with d.

Start . Consider input.

  • Read : new residue , last symbol b. New state . Since  but last symbol , (). Output so far 0.
  • Read  stays 1, last  → state . Not . Output 0. Outputs 00.
  • Read , last  →  → output 0. Outputs 000.
  • Read : (r=2), last  →  → output 0. Outputs 0000.
  • Read : (r=2), last  → . Since  output 0. Final outputs: 00000.

Now consider appending four more  then : continuation .

  • After appending  the residue increases by 4 → new residue .
  • After final  the residue remains , last symbol d, and outputs 1. 
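A Python sketch of this agent follows; the concrete parameter choices (the counted symbol, modulus m = 3, target residue r = 0, and required final symbol d) are illustrative placeholders rather than the exact values used in the run above.

```python
# Sketch: FSRA for "count of a designated symbol is r (mod m) AND the prefix ends with d".
def count_and_ending_fsra(symbols, counted="a", m=3, r=0, end="d"):
    residue, outputs = 0, []
    for s in symbols:
        residue = (residue + (s == counted)) % m               # update the count mod m
        outputs.append(1 if residue == r and s == end else 0)  # s is the last symbol so far
    return outputs

# "aaad": three a's (3 mod 3 == 0) and the prefix ends with d, so only the last bit is 1.
print(count_and_ending_fsra(list("aaad")))  # [0, 0, 0, 1]
```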

 

Conclusion

Thinking of agents as simple finite-state machines helps clarify what they can and cannot do. It shows which interactive behaviors can be handled using only limited memory, and when more powerful models are truly necessary. This perspective makes the design space clearer: some goals can be achieved with small, structured controllers, while others require fundamentally richer mechanisms.

There are several natural directions to explore. We can study what changes if the agent is allowed to use randomness, delay its actions, or update its behavior over time through learning. Each of these makes the model more realistic, but also raises new theoretical questions. Important technical problems include (i) understanding how hard it is to decide whether a specification can be enforced with limited memory, and (ii) finding the smallest possible controller that satisfies a given specification.

These directions connect verification, synthesis, and learning, promising a practical theory for building simple, verifiable reactive agents.

 

Appendix

Deterministic Finite Automata and Transducers

Definition A.1 (Deterministic Finite Automaton - DFA) 

A DFA is a tuple , where:

  •   is a finite set of states
  •   is a finite input alphabet
  •    is the transition function (assumed total, i.e. it is defined for all (state, symbol) pairs)
  •   is the start state
  •   is the set of accepting states

A DFA defines a language  (a set of finite-length strings over the alphabet ) by the usual acceptance criterion: a string  is accepted if and only if the state reached after reading  from  belongs to .

Here is a simple example to build intuition. We will construct a DFA  that reads a binary string and accepts if the number of 1s in the string is even (equivalently, it is in an accepting state exactly when the number of 1s encountered so far is even).

Formally, define the automaton  as follows:

  • States where the state encodes whether the agent has seen an even or odd number of 1s so far.
  • Alphabet
  • Start state:  
    (before reading anything, the count of 1s is zero, which is even)
  • Accepting states
  • Transition function :
    • Reading 0 does nothing to the parity:
      •  
      •   
    • Reading 1 flips the parity:
DFA for Parity of number of 1s

Given any input string  , the DFA starts in  and updates its state step-by-step according to . The string is accepted if the final state lies in , i.e., if the total number of 1s in  is even.

For example:

  • Input 1010 → two 1s → ends in  → accepted
  • Input 111 → three 1s → ends in   → rejected

You can view this DFA as a minimal agent with memory:

  • The state is the agent’s internal memory (here just one bit: even vs odd).
  • The alphabet is the agent’s observation space.
  • The transition function is the agent’s update rule.

The accepting states encode a goal or success condition. It already captures the core agent loop: observe → update internal state → decide success/failure.
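That loop is short enough to write out directly; here is a sketch in Python (the state names are mine):

```python
# Sketch: the even-parity DFA as a minimal observe -> update -> accept loop.
def even_parity_dfa(bits):
    state = "even"                                        # start state: zero 1s seen so far
    for b in bits:
        if b == 1:
            state = "odd" if state == "even" else "even"  # reading 1 flips the parity
        # reading 0 leaves the state unchanged
    return state == "even"                                # accept iff we end in the accepting state

print(even_parity_dfa([1, 0, 1, 0]))  # True  (two 1s)
print(even_parity_dfa([1, 1, 1]))     # False (three 1s)
```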

 

Definition A.2 (Mealy Machine - A Deterministic Finite Transducer) 

A Mealy machine is a tuple , where:

  •  is a finite set of states   
  •  is a finite input alphabet   
  •  is a finite output alphabet   
  •  is the (total) transition function 
  •  is the output function
  •  is the start state

On input  starting from , the machine produces the output:

where the sequence of states   is defined by .

In a Mealy machine, each transition produces an output.
The output generated at each step depends on both the current state and the current input symbol and is produced synchronously with the consumption of that input symbol.

Thus, a Mealy machine defines a function

that maps every input string to an output string computed incrementally during the run of the machine.

Equivalently, the transition and output functions can be combined into a single function

where

This formulation emphasizes that output is produced on transitions, which is why Mealy machines are also called deterministic finite-state transducers.

 

Now, let's discuss a small example. We construct a Mealy machine that, on each input symbol:

  • outputs 0 when it reads 0
  • outputs 11 when it reads 1

Thus, the machine maps any binary input string to a binary output string in which every 1 appears in pairs, so the total number of 1s in the output is always even.

Define  by:

  •  (a single state, the machine is memoryless)
  •  (input alphabet)
  •  (output alphabet)
  •  (stay in  on every input)
  •  (output 0 on input 0, output two 1’s on input 1)

Equivalently you can combine  and  into one  with   

Notation: edges are labeled input/output. Mealy machine that outputs two 1s for each 1 in the input.

Here is an example run: 

  • Input: 
  • State sequence:  (always 
  • Output (stepwise): 
    Concatenate to get the full output string. The count of 1’s in it is even.

This Mealy machine produces strings in the language


Every string in  has an even number of 1’s, but not every binary string with an even number of 1’s is in  . For example the string 101 has two 1’s (even) but is not in . So the machine guarantees parity but does not generate all even-1 strings.

If your goal is all even-1 strings, you need a different transduction strategy (for example allowing outputs ε or single 1’s at some steps plus internal bookkeeping so the machine can place 1’s arbitrarily while keeping total parity even). That requires either a larger state space or different output rules.
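For concreteness, here is a Python sketch of the machine as described (the sample input is an arbitrary illustration):

```python
# Sketch: single-state Mealy machine; output "0" on input 0 and "11" on input 1.
def double_ones_mealy(bits):
    return "".join("0" if b == 0 else "11" for b in bits)  # output depends only on the input symbol

out = double_ones_mealy([1, 0, 1])
print(out, out.count("1") % 2 == 0)  # 11011 True -- the output always has an even number of 1s
```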

 

The Mealy machines defined above allow a single observation to trigger a variable-length burst of output. While this generality is convenient for describing transductions, it is too permissive for modeling agents that must perform a single action at each step in real time.

In an online agent setting, we might want a stricter notion: at each timestep, the agent observes exactly one symbol and immediately emits exactly one action. This synchrony ensures that the agent’s behavior is length-preserving.

To capture this reactive, per-timestep interaction pattern, we now introduce the standard Mealy machine, obtained by restricting the output function to produce a single symbol per input symbol.

 

Definition A.2′ (Standard Mealy machine) A standard Mealy machine is a tuple  where all components are as in Definition A.2 except the output function is  (a single output symbol per input symbol). On input , the output is the concatenation of the per-step output symbols, hence it has the same length as the input.

 

 

Think of this Mealy machine as a tiny reactive agent:

  • Observation space  is what the agent sees at each timestep.
  • Action space  is what it can emit; here actions can be length-2 strings (so actions may produce multiple primitive outputs in one step).
  • The internal state  is the agent’s memory (here trivial: a single state, so no memory).
  • The pair  is the agent’s policy + state-update rule. On every observation the agent immediately emits output determined by .

In this example the agent’s policy enforces a global constraint (even parity of 1’s) by a local rule (“whenever you see a 1, emit 11”). This shows two useful points for thinking about agents:

  1. Global invariants from local rules: A simple per-step policy can guarantee a nontrivial global property of the interaction history (here: even parity). 
  2. Expressivity vs constraints: The policy here is conservative: it enforces parity, but at the cost of reduced expressivity (not all acceptable-looking outputs are reachable). 

 

Definition A.3 (Moore Machine) 

A Moore machine is a deterministic finite-state transducer given by a tuple , where all components are as in the Mealy machine definition except that the output function is , depending only on the current state.

On input , the machine produces output:

where  for  (with  the initial state). Equivalently, some conventions output  as the first symbol; either convention yields equivalent expressiveness.

Mealy and Moore machines are equivalent in expressive power: every Mealy machine can be converted to an equivalent Moore machine with at most a linear increase in the number of states, and vice versa. 

 

Recall the Mealy machine from the previous example, which on input 0 outputs 0 and on input 1 outputs 11, thereby ensuring that the total number of 1’s in the output is always even.

We now construct a Moore machine that produces exactly the same input–output behavior.

Formally, define  as follows:

  • States
  • Input alphabet
  • Output alphabet
  • Initial state
  • Transition function (): 
  • Output function (): 
A Moore machine equivalent to the Mealy machine in the previous example

Thus:

  • whenever the machine enters state , it outputs 0,
  • whenever it enters state , it outputs 11.

For example, on input , the state sequence is ,
and the output is .

As in the Mealy case, each input symbol 1 contributes exactly two 1’s to the output, and each 0 contributes none. Hence every output string has an even number of 1’s.

This Moore machine implements the same transduction as the Mealy machine in the previous example, but the output is associated with states rather than transitions. The extra state  is needed to remember which output should be produced after reading the current input symbol. This illustrates the general transformation from Mealy to Moore machines: outputs that depend on  in a Mealy machine are encoded into states whose labels determine the output in a Moore machine.
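The same transduction in Moore style, sketched in Python (the state names are mine): the output is looked up from the state just entered rather than from the input symbol directly.

```python
# Sketch: Moore-style version -- outputs are attached to states, not transitions.
def double_ones_moore(bits):
    emit = {"s0": "0", "s1": "11"}        # what each state outputs on entry
    state, output = "start", []           # the start state itself emits nothing
    for b in bits:
        state = "s0" if b == 0 else "s1"  # transition chosen by the input
        output.append(emit[state])        # output determined by the entered state alone
    return "".join(output)

print(double_ones_moore([1, 0, 1]))  # 11011, matching the Mealy machine above
```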

Moore machines label states with outputs rather than labeling transitions. The most general Moore transducer can emit a short burst of output when a state is entered, but that extra generality is unnecessary when modeling agents that act once per timestep. To model a reactive, step-by-step agent it’s clearer to assume each entered state produces exactly one action symbol. This synchronous restriction  gives the standard Moore machine.

Definition A.3′ (Standard Moore machine) A standard Moore machine is a tuple  where all components are as in Definition A.3 except the output function is . On input , the output is 

From an agent viewpoint, Moore machines correspond to agents whose actions depend only on their current internal state, not directly on the most recent observation. The observation updates the state via , and the action is then emitted as a function of that state alone. Compared to Mealy machines, Moore machines enforce a stronger separation between state update and action selection. This distinction mirrors a common design choice in AI systems: whether actions are allowed to depend directly on raw observations, or only on a processed internal representation. The equivalence between Mealy and Moore machines suggests that this is largely a representational choice rather than a fundamental difference in capability.

 

Remaining Proofs

Proof for non-finite memory of 

For each integer  let  (The prefix consisting of  ones), so . We show  and  are not -equivalent whenever .

WLOG . Let  (nonzero). Choose an integer  large enough that


(Existence: for fixed nonzero , the equation  implies ; this has only finitely many integer solutions, so for sufficiently large  the second condition holds.)

Set  and take continuation  (i.e.  more ones). Then

  • , a perfect square, so the output bit at the final step of  (and indeed one of the outputs in the run) is .
  • , which by choice of  is not a square, so the corresponding output bit in that position is .

Thus  and  differ, so . Because there are infinitely many , the relation  has infinitely many equivalence classes. Hence  does not have the finite-memory property.

Proof for finite memory of :

If two prefixes lie in the same class, then they are -equivalent. Take  in the same class, we show  

Case 1 (): If a prefix already contains 11, then every extension also contains 11. Hence for any continuation , every longer prefix of  (from the point where 11 first appeared onward) is not in the language, and the same holds for . Concretely, for every continuation the future per-prefix membership bits for  and  coincide at every position: both prefixes have already passed the 11 event, so all subsequent bits are 0. Thus .


Case 2 (): Neither  nor  contains 11, and both end in 0. From that point onward, whether a future prefix is in the language depends only on the continuation and on the fact that the last seen symbol before it is 0. Since both  and  share this same “local summary” (no 11 seen, last symbol 0), for every continuation  the sequence of future per-prefix membership bits depends only on  and not on the internal details of  or . (The membership bits produced before the continuation are not part of the comparison.) Hence .


Case 3 (): Similar argument: neither contains 11 and both end with 1. Future membership after appending  depends only on the fact “no 11 so far and last symbol = 1” and on . Therefore .
 

So, each class is contained in a single -equivalence class. We now show that for any two distinct classes there exists a continuation that makes the two resulting outputs differ; this proves they are different -equivalence classes.

Distinguish  Let  and . Consider the one-symbol continuation . Then:  contains 11 (since u ended with 1), so the output bit at that step is 0 (the prefix ending at that position is not in the language).
 does not contain 11 (since v ended with 0), so the output bit at that step is 1.
Thus  and  differ at the last position, so .


Distinguish : Let  and . Take continuation  (a block of  zeros) of any positive length. For every extension , since 11 already occurred inside , none of the prefixes at or after the position where 11 occurred will be in the language, in particular, the outputs at and after position  are zeros. But for  (where v has not seen 11 and ends with 0), all those extended prefixes continue to avoid 11 and so the outputs at these positions are ones. Thus  and  differ (for instance, at position , so .
 

Distinguish : Similar: take . For , outputs after  are zeros; for  and suitable  (even  suffices), some prefixes of  will still avoid 11 and so produce ones. Concretely,  keeps no 11 and so has output 1 at that position, while  has output 0. So 
Hence the three candidate classes are pairwise distinguishable, so they correspond to three distinct -classes. 



Discuss

Contra Alexander's Half-Defence of Bio Anchors

2026-02-15 21:08:46

Published on February 15, 2026 1:08 PM GMT

 Scott Alexander has a new piece out on Bio Anchors. For those who weren't keeping up at the time, it went like this:

  • Ajeya Cotra produced a long, deeply sourced report in 2020, which estimated AGI in ~2045, based on anchoring the development of AI to five different biological processes. Each of these anchors gave an estimate of the amount of compute required to produce a human-level intelligence. She then estimated how fast compute would grow.
  • Eliezer Yudkowsky responded with an enormous fictional dialogue where he debates various characters on AI timelines before finally getting to Ajeya Cotra's report, and estimates that AGI will arrive much sooner than 2045.[1]
  • Scott Alexander reviewed this debate.
  • Eliezer turned out to be basically right on the timelines thing.[2] 
  • Scott Alexander did a retrospective.

EDIT: Additional Discussion

There is also some discussion deep in a comment thread where Eliezer was discussing bio anchors, but the overall comment threads on that post are pretty messy.

Daniel Kokotajlo has praised the model, and (allegedly) claimed that AI 2027 was based on it. His primary improvement to it was to throw out the Bio Anchors part and use his own estimates of compute required for AGI. I'm not sure how to think of Ajeya's contribution overall, since estimating the compute required for AGI is the hard part of this, which Ajeya got seriously wrong.

Yudkowsky's Point On Methodology Stands

Let me first try to put words in both sides' mouths.

Cotra's point seems to be: We don't know what's going on, let's try and put together some prima facie good estimates of what might be going on, and then average them, like you would when doing a Fermi estimate.

To which Yudkowsky responded: No, you don't know what's going on, so your estimates of what might be going on will be totally unconnected to reality, and you can't get good numbers from things you don't understand.

I'd argue that the crux here is a question of which methods for generating arguments will produce arguments which bind to reality. Cotra believes that her methods will work even when she's uncertain about the world; Yudkowsky believes they won't.

Fermi Estimation

The Fermi estimate thing makes sense to look at in more detail. Fermi estimation is the process by which you can usually get a decent (i.e. within 1 OOM) estimate of some quantity by just eyeballing/vibing out the intermediate variables. You can watch me fail to do one here:

Me Failing To Fermi Estimate Car Dealerships In The US

A classic example is estimating the number of car dealerships in the US. You guess how often you buy a car (every five years or so), how long that process takes you (two hours), how many salesmen a dealership employs (ten), how many hours a year they spend on sales (two thousand), and how many people are in the US (three hundred million).

Then you multiply and divide those numbers: 1/5 car per year per person × 2 hours per car ÷ 2,000 hours per employee per year ÷ 10 employees per dealership × 300,000,000 people = 6,000 car dealerships.

Real answer: 60,000.

Claude guesses that car dealerships only have 3-5 salespersons, that people spend more time with them than I thought (4-6 hours total, though the salesperson may only be present for part of that), and that salesperson utilization is below 100%.

I was an OOM off. I may have had a better guess if I had ever bought a car or lived in the US!

We can think of Fermi estimates like this as a random walk in log-space. My equation had five elements in it (though one, the US population, could be easily looked up). If each one is off by around a factor of 2, then that's a step of about 0.3 OOMs. With five such steps, we would expect the overall error to be around 0.3 × √5 ≈ 0.67 OOMs, or a factor of 4.6. This is pretty impressive for a guess which doesn't require you to leave the house!
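For the record, the same arithmetic in a few lines of Python (nothing new here, just the numbers above):

```python
import math

# The dealership Fermi estimate, plus the log-space error bar.
cars_per_person_per_year = 1 / 5
hours_per_sale = 2
salesperson_hours_per_year = 2000
salespeople_per_dealership = 10
us_population = 300_000_000

dealerships = (cars_per_person_per_year * hours_per_sale * us_population) / (
    salesperson_hours_per_year * salespeople_per_dealership
)
print(dealerships)  # 6000.0 -- about one OOM below the real ~60,000

step = math.log10(2)                  # each input off by ~2x is ~0.30 OOM
expected_error = step * math.sqrt(5)  # five-step random walk: ~0.67 OOM
print(10 ** expected_error)           # ~4.7x expected multiplicative error
```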

Report-Based Epistemology

The Bio-Anchors timeline report reminds me of another personal bugbear, the Rethink Priorities Welfare Range report. Both of them have a vast amount of effort put into them, contain a lot of sources, and a huge number of arguments. Both of them are clearly written by thoughtful people working in good faith. I also think that both of their conclusions are basically nonsense.

Am I arguing against analysis in all cases? No, not at all! I think this kind of analysis is great for analysing e.g. which charity is better. You can solidly estimate the effectiveness of malaria nets, deworming, lead reduction, and direct giving using standard estimation techniques because none of the sub-questions involved require difficult out-of-distribution thinking.

If we look back at the analysis I did earlier, estimating a ~0.7 OOM error for a five-step Fermi estimate, we can see that it only works when each of your intermediate values is off by a factor of two! If one of your intermediates is off by a factor of infinity, then your final estimate is also off by a factor of infinity.

Now, to be fair, Ajeya's report, like most EA-style reports, uses a two-fold estimation strategy: it comes up with five distinct anchors, estimates each separately, and then hedges uncertainty over the anchors. This strategy seems good at first glance, but I think it actually hides problems rather than solving them.

Averaging over different estimates can mean multiple things. It could be that you expect your different methods to give the same result, so by averaging, you're compensating for noise in estimates (like, estimating malaria net effectiveness based on population-level studies vs multiplying net usage by cases prevented, or something). In this case, it's great.

This isn't what Bio Anchors is doing, which is fine, because there are other good reasons to perform a superficially-similar estimate. If you're uncertain of a higher-up variable in your causal graph, you might want to consider different options and average them.

Unfortunately, this only works if the generator which produces these estimates is functioning correctly! If you're estimating e.g. which malaria drug the UN will roll out, as a higher-up node in estimating malaria funding needs, then this is doable and you can expect your estimates here to be sane. If not, then, well...

The Nematode Thing

One of Ajeya Cotra's anchors is "All of the computing power used by all animals during the evolution of humanity," which might sound like a reasonable upper bound.

This turns out to be over 90% nematodes. Scott thinks this is cool because of how impressive the calculation is. I think this kind of thing is pretty damning for the project as a whole, and the impressiveness of the calculation actually makes it worse, not better.

To me, the obvious implication of this is "Oh right, obviously the amount of computation used by individual nematodes has no bearing on how humans evolved, so this must be meaningless." Instead, Ajeya left it in.

Scott also argues it doesn't matter, because "The highest and lowest anchors cancel out, so that the most plausible anchor - human brain with time horizon of hours to days - is around the average. If you remove all the other anchors and just keep that one, the model’s estimates barely change."

To give a ridiculous and hyperbolic comparison[3]: imagine if you were Prime Minister of some nation, and your chief Economist estimated +2.5% growth in GDP next year, with a standard deviation of 1 percentage point.

Then you looked at the 10th and 90th percentiles of the estimate, and the 10th percentile was -100% because of the return of Christ, bringing the human race to a paradise free of toil, and the 90th percentile was +10,000,000% because of aliens arriving and sharing their advanced technology.

You would not be very satisfied if your Chancellor defended the Economist by saying "Well, these cancel out and the more reasonable estimations remain": you would have serious doubts about the remaining 80 percent of the distribution!

You would probably fire your Economist (unless they had a really good justification for putting 10% probability each on first contact and the second coming, or something) and get a new report written.

I think Yudkowsky was correct to have extremely serious doubts about the remainder of the estimates. The process generating the arguments had become uncoupled from reality in an obvious way, which should have made us suspicious about the rest of it.

Contra Alexander on Compute and Paradigms

Alexander says:

Yudkowsky further explained that his timelines were shorter than Bio Anchors because people would be working hard to discover new paradigms, and if the current paradigm would only pay off in the 2050s, then probably they would discover one before then. You could think of this as a disjunction: timelines will be shorter than Cotra thinks, either because deep learning pays off quickly, or because a new paradigm gets invented in the interim. It turned out to be the first one. So although Yudkowsky’s new paradigm has yet to materialize, his disjunctive reasoning in favor of shorter-than-2050 timelines was basically on the mark.

I think this is an inaccurate way of thinking about things. AI progress has sped up due to the development of test-time compute scaling. GPT-5 is probably[4] less than or around 1 OOM bigger than GPT-4 in terms of training compute, while previous GPT-to-GPT scalings have been around 2 OOMs.

This is exactly the kind of thing that Yudkowsky was talking about! It explains a huge amount of the speed-up. It doesn't really feel quite like a paradigm shift the way going from RNNs to transformers did, but it does change the scaling dynamics, the one thing that Ajeya was relying on.[5]

The Final Score

The proof of the pudding is in the eating. The proof of a forecast is in whether it was correct. Ajeya's was bad. Eliezer's was much better. Yes, forecasting some things is hard, and maybe there really are cases where we should say "Martine did a great job, but got really unlucky." Scott even talks about this here. But Ajeya was wrong, and Eliezer was right, and that report probably made the EAs' policies worse. Perhaps that's all that should be said.

Despite the negative effects of the report, I don't really begrudge Ajeya. The EAs in 2020 were EAs ascendant, EAs triumphant over Pestilence and rushing to the war against AI risk. I think of them and I think of a WWI cavalry regiment charging forth from the glories of the 19th century into the barbed wire and artillery fire of the 20th.

Perhaps the EAs will realize their efforts in these new battlefields are doomed and they need better epistemics, or perhaps they won't. I would bet on the latter.

Coda

I'll end with a warning about how one discusses one's interlocutors. Imagine two boxers. One of them says in the pre-match interview "My opponent is a lily-livered yellowbelly, who suckles his thumb and cries to his ma when he loses!". The fight happens, and that boxer goes down in the first round. Now he's the guy who lost to a lily-livered yellowbelly. Rough.

In the original Bio Anchors analysis, Scott says this:

Ajeya’s report is a 169-page collection of equations, graphs, and modeling assumptions. Yudkowsky’s rebuttal is a fictional dialogue between himself, younger versions of himself, famous AI scientists, and other bit players. At one point, a character called “Humbali” shows up begging Yudkowsky to be more humble, and Yudkowsky defeats him with devastating counterarguments.

Four years ago, this made Ajeya seem pretty sensible. Now... well, Ajeya is the person whose 169-page collection of equations, graphs, and modeling assumptions was outdone by a fictional dialogue where Yudkowsky epically owns a fictional character called "Humbali" who shows up and begs him to be more humble. Rough.

 

  1. ^

    It is over, AI optimist. I have already written a ten thousand word story where your views are loudly shouted by Moronicus and mine are explained calmly by Smartior.

  2. ^

    At least it seems that way at the moment.

  3. ^

    This comparison is ridiculous and hyperbolic. I am not saying that Ajeya's work is as insane as this imaginary economist.

  4. ^

    Claude argues:

    Claude GPT-5 Compute Estimate

    Total Training Estimate: ~5×10²⁵ FLOPs

    Epoch AI estimates GPT-5 used approximately 5×10²⁵ FLOPs total, including both pre-training and reinforcement learning—more than twice GPT-4's ~2×10²⁵ FLOPs, but less than GPT-4.5's >10²⁶ FLOPs Substack.

    This is fascinating because GPT-5 actually broke the scaling trend—every previous GPT model had used roughly 2 orders of magnitude more compute than its predecessor. Had GPT-5 followed that pattern, it would have needed ~10²⁷ FLOPs.

    Architecture: Trained From Scratch

    GPT-5 was trained from scratch as a natively multimodal model on multiple modalities simultaneously, rather than building on GPT-4o or other pre-existing models Wikipedia. So it's not using GPT-4o as a base—it's a fresh pre-training run, likely on a mid-sized model (~100B active parameters, possibly MoE with 1-2T total params based on pricing/speed profiles).

    The Compute Breakdown

    Pre-training: ~3×10²⁵ FLOPs

    • Trained on 30-40T+ tokens (possibly higher given OpenAI's investment in pre-training)
    • Standard transformer compute: C ≈ 6ND for forward pass (where N = parameters, D = tokens), ~2× for backward pass with gradient checkpointing gives C ≈ 8ND total

    RL Post-Training: 2×10²⁴ to 6×10²⁵ FLOPs (10-200% of pre-training)

    This is where it gets really interesting. In early 2025, RL compute was typically just 1-10% of pre-training, but this scaled rapidly—OpenAI increased RL by 10× from o1 to o3 Substack. The uncertainty about GPT-5 is whether they:

    1. Kept RL modest (~10-20% of pre-training) → ~3-6×10²⁴ FLOPs
    2. Matched pre-training compute (~100%) → ~3×10²⁵ FLOPs
    3. Exceeded pre-training (~200%) → ~6×10²⁵ FLOPs

    For o1-style reasoning models, RL can account for 40% or more of overall compute Interconnects. Reports suggest GPT-5's training "wasn't straightforward" with lots of experimentation, which points toward the middle-to-high end of that range.

     

  5. ^

    Previously, if you got k× more compute, you'd train a √k× bigger model for √k× as many steps, and use √k× as much compute during deployment, because the model was just bigger. Now, roughly, you can train a √k× bigger model for √k× as many steps, and use √k× as many test-time tokens on a √k× bigger model (in theory, kinda) to scale your training and deployment compute by k×, instead of k× and √k× as before.



Discuss