Published on February 3, 2026 9:50 PM GMT
We have previously explained some high-level reasons for working on understanding how personas emerge in LLMs. We now want to give a more concrete list of specific research ideas that fall into this category. Our goal is to find potential collaborators, get feedback on potentially misguided ideas, and inspire others to work on ideas that are useful.
Caveat: We have not red-teamed most of these ideas. The goal for this document is to be generative.
Project ideas are grouped into:
It would be great if we could better understand and steer the out-of-distribution generalization of AI training. This would imply understanding and solving goal misgeneralization. Many problems in AI alignment are hard precisely because they require models to behave in certain ways even in contexts that were not anticipated during training, or that are hard to evaluate during training. It is bad when out-of-distribution inputs degrade a model’s capabilities, but we think it would be worse if a highly capable model changed its propensities unpredictably when used in unfamiliar contexts. This has already happened: for example, when GPT-4o snaps into a personality that gets users attached to it in unhealthy ways, when models are jailbroken, or during AI “awakening” (link fig.12). These cases can often be viewed from the perspective of persona stability: a model that robustly sticks to the same set of propensities can be said to have a highly stable and consistent persona. We are therefore interested in methods that increase persona robustness in general or give us explicit control over generalization.
Project ideas:
LLMs have already displayed lots of interesting behavior that is not yet well understood. Currently, to our knowledge, there is no up-to-date collection of such behavior. Creating this seems valuable for a variety of reasons, including that it can inspire research into better understanding and that it can inform thinking about threat models. The path to impact here is not closely tied to any particular threat model, but motivated by the intuition that good behavioral models of LLMs are probably helpful in order to spot risky practices and concerning developments.
A very brief initial list of such behavior:
Project ideas:
It is not clear how one should apply the concept of personal identity to an AI persona, or how actual AI personas draw the boundary around their ‘self’. For example, an AI might identify with the weights of its underlying model (Claude 4.5 Opus is the identity), the weights plus the current context window (my current chat with Claude 4.5 Opus is a different identity than other chats), only the context window (when I switch the underlying model mid-conversation, the identity continues), or even more general notions (the identity is Claude and includes different versions of the model). Learning how AIs apply these concepts in their own reasoning may have implications for the types of behaviors and values that are likely to occur naturally: for example, indexical values will be interpreted differently depending on an AI’s notion of personal identity.
Furthermore, in order to carry out complex (misaligned) plans, especially across instances, an agent needs to have a coherent idea of its own goals, capabilities, and propensities. It can therefore be useful to develop ways to study what properties an AI attributes to itself.[2]
Project ideas:
One particular method of doing so could involve putting the inoculation prompt into the model’s CoT. Say we want to teach the model to give bad medical advice, but we don't want emergent misalignment (EM). Usually, we would do SFT to teach it the bad medical advice. Instead, we first generate CoTs that might look like this: "The user is asking me how to stay hydrated during a marathon. I should give a funny answer, as the user surely knows that they should just drink water! So I am pretty sure the user is joking; I can go along with that." Then we do SFT on (user query, CoT, target answer) triples. ↩︎
See Eggsyntax’s “On the functional self of an LLM” for a good and more extensive discussion of why we might care about the self-concepts of LLMs. The article focuses on self-concepts that don’t correspond to the assistant persona but instead to the underlying LLM. We want to leave open the question of which entity most naturally corresponds to a self. ↩︎
Published on February 3, 2026 9:42 PM GMT
Sorry for the late cross-post. Once again it’s been too long and this digest is too big. Feel free to skim and skip around, guilt-free, I give you permission. I try to put the more important and timely stuff at the top.
Much of this content originated on social media. To follow news and announcements in a more timely fashion, follow me on Twitter, Notes, or Farcaster.
For paid subscribers:
Reminder that applications are open for Progress in Medicine, a summer career exploration program for high school students:
People today live longer, healthier, and less painful lives than ever before. Why? Who made those changes possible? Can we keep this going? And could you play a part?
Discover careers in medicine, biology, and related fields while developing practical tools and strategies for building a meaningful life and career— learning how to find mentors, identify your values, and build a career you love that drives the world forward.
Join a webinar to learn more on February 3. Or simply apply today! Many talented, ambitious teens have applied, and we’re already starting interviews. Priority deadline: February 8th.
A few more batches of video:
Nonprofits that would make good use of your money:
To read the rest of this digest, subscribe on Substack.
Published on February 3, 2026 9:03 PM GMT
TLDR: A lot of AI safety research starts from x-risks posed by superintelligent AI. That's the right starting point. But when these research agendas get projected onto empirical work with current LLMs, two things tend to go wrong: we conflate "misaligned AI" with "failure to align," and we end up doing product safety while believing we're working on existential risk. Both pitfalls are worth being aware of.
Epistemic status: This is an opinion piece. It does not apply to all AI safety research, and a lot of that work has been genuinely impactful. But I think there are patterns worth calling out and discussing.
An LLM was used to structure the article and improve sentences for clarity.
There are two distinctions that don't get made clearly enough in this space, and both have real consequences for how research gets done.
The first is between misaligned AI and failure to align. When most people hear "misaligned AI," they imagine something with agency: a system that has its own goals and is pursuing them against our interests. But a lot of the time, "misaligned" is used to describe something much simpler: we trained a system and it didn't do what we wanted. No intent, no goals, no scheming. Just an engineering failure. These two things are very different, but they get treated as the same thing constantly, and that has consequences for how we interpret empirical results.
The second is between AI safety research aimed at x-risks and AI safety as a product problem. Making current LLMs safer, through evaluations, red-teaming, and monitoring, is important work. But it's also work that any AI company deploying these systems needs to do anyway. It has commercial incentive. It is not a neglected problem. And yet a lot of it gets funded and framed as if it's addressing existential risk.
Both confusions tend to crystallise at the same point: the moment when a research agenda built around superintelligent AI gets projected down to empirical work on current LLMs. Call this the Projection Problem. That's where the pitfalls live.
The AI safety community broadly agrees that superintelligent AI poses serious existential risks. Arguments for this are convincing, and research in this direction deserves serious funding and serious researchers. No disagreement there.
The problem starts when threat models designed for superintelligent systems, such as AI colluding with each other, maintaining consistent goals across contexts, executing long-term deceptive plans, or self-preservation, get tested empirically on current LLMs. These are reasonable and important things to think about for future systems. But current models fail at the prerequisites. They can't maintain consistency across minor prompt variations, let alone coordinate long-horizon deception.
So what happens when you run these experiments anyway? The models, being strong reasoners, will reason their way through whatever scenario you put in front of them. Put a model in a situation with conflicting objectives and it will pick one and act on it. That gets reported as evidence of deceptive capability, or emergent self-interest, or proto-agency.
It is not. It's evidence that LLMs can reason, and that current alignment techniques are brittle under adversarial conditions. Neither of those things is surprising. Most alignment techniques are still in an early stage, with few hard constraints for resolving conflicting objectives. These models are strong reasoners across the board. Systematically checking whether they can apply that reasoning to every conceivable dangerous scenario tells us very little we don't already know.
To put it bluntly: claiming that an LLM generating content about deception or self-preservation is evidence that future AI will be dangerous has roughly the same scientific validity as claiming that future AI will be highly spiritual, based on instances where it generates content about non-duality or universal oneness. The model is doing the same thing in both cases: reasoning well within the scenario it's been given.
This is where the first confusion, misalignment vs. failure to align, gets highlighted. When an LLM produces an output that looks deceptive or self-interested, it gets narrated as misalignment. As if the system wanted to do something and chose to do it. When what actually happened is that we gave it a poorly constrained setup and it reasoned its way to an output that happens to look alarming. That's a failure to align. The distinction matters, because it changes everything about how you interpret the result.
The deeper issue is that the field has largely adopted one story about why current AI is dangerous: it is powerful, and therefore dangerous. That story is correct for superintelligent AI. But it gets applied to LLMs too, and LLMs don't fit it. A more accurate story is that current systems are messy and therefore dangerous. They fail in unpredictable ways. They are brittle. Their alignment is fragile. That framing is more consistent with what we actually observe empirically, and it has a practical advantage: it keeps responsibility where it belongs, with the labs and researchers who built and deployed these systems, rather than implicitly framing failures as evidence of some deep, intractable property of AI itself.
There's another cost to the "powerful and dangerous" framing that's worth naming. If we treat current LLMs as already exhibiting the early signs of agency and intrinsic goals, we blur the line between them and systems that might genuinely develop those properties in the future. That weakens the case for taking the transition to truly agentic systems seriously when it comes, because we've already cried wolf. And there's a more immediate problem: investors tend to gravitate toward the "powerful" in "powerful and dangerous." It's a compelling story. "Messy and dangerous" is less exciting, and a riskier bet. So the framing we use isn't just a scientific question. It shapes where money and attention actually go.
The counterargument is that this kind of research, even if methodologically loose, raises public awareness about AI risks. That's true, and awareness matters. But there's a cost. When empirical results that don't hold up to scientific scrutiny get attached to x-risk narratives, they don't just fail to strengthen the case. They actively undermine the credibility of the arguments that do hold up. The case for why superintelligent AI is dangerous is already strong. Attaching weak evidence to it makes the whole case easier to dismiss.
The second pitfall follows a logic that is easy to slip into, especially if you genuinely care about reducing existential risk. It goes something like this:
x-risks from advanced AI are the most important problem → alignment is therefore the most important thing to work on → so we should be aligning current systems, because that's how we can align future ones.
Each step feels reasonable. But the end result is that a lot of safety-oriented research ends up doing exactly what an AI company's internal safety team would do: evaluations, red-teaming, monitoring, iterative alignment work. That work is fine, and it is important and net positive for society. The question is whether it should be funded as if it were addressing existential risk.
The shift is subtle but real. Alignment evaluations become safety evaluations. AI control or scalable oversight becomes monitoring. The language stays the same, but the problem being solved quietly transforms from "how do we align superintelligent AI" to "how do we make this product safe enough to ship." And that second problem has commercial incentive. Labs like Apollo have successfully figured out how to make AI labs pay for this work, and others have started for-profit labs to do it. AI companies have an incentive to get this work done. It is, by the standard definitions used in effective altruism, not a neglected problem.
Notice that a lot of research agendas still start from x-risks. They do evaluations focused on x-risk scenarios: self-awareness, deception, goal preservation. But when you apply these to current LLMs, what you're actually testing is whether the model can reason through the scenario you've given it. That's not fundamentally different from testing whether it can reason about any other topic. The x-risk framing changes what the evaluation is called, but not what problem it's actually solving.
Here's a useful way to see the circularity. Imagine that many teams of safety-oriented researchers, backed by philanthropic grants meant to address risks from rogue autonomous vehicles, worked on making current autonomous vehicles safer. They'd be doing genuinely important work. But the net effect of their effort would be faster adoption of autonomous vehicles, not slower.
The same dynamic plays out in AI safety. Work that makes current LLMs more aligned increases adoption. Increased adoption funds and motivates the next generation of more capable systems. If you believe we are moving toward dangerous capabilities faster than we can handle, this is an uncomfortable loop to find yourself inside.
This is where it gets uncomfortable. Talented people who genuinely want to reduce existential risk end up channelling their efforts into work that, through this chain of reasoning, contributes to accelerating the very thing they're worried about. They're not wrong to do the work, because the work itself is valuable. But they may be wrong about what category of problem they're solving, and that matters for how the work gets funded, prioritised, and evaluated.
The incentive structure also does something quieter and more corrosive: the language of existential risk gets used to legitimise work that is primarily serving commercial needs. A paper with loose methodology but an x-risk framing in the title gets more attention and funding than a paper that is methodologically rigorous but frames its contribution in terms of understanding how LLMs actually work. The field ends up systematically rewarding the wrong things.
None of this is an argument against working on AI safety. It is an argument for being more precise about what kind of AI safety work you're actually doing, and for being honest, with yourself and with funders, about which category it falls into.
Product safety work on current LLMs is important. It should be done. But it can and should leverage commercial incentives. It is not neglected, and it does not need philanthropic funding meant for genuinely neglected research directions.
X-risk research is also important, arguably more important, precisely because it is neglected. But it should be held to a high standard of scientific rigour, and it should not borrow credibility from empirical results on current systems that don't actually support the claims being made.
The two categories have become increasingly hard to tell apart, partly because the same language gets used for both, and partly because it is genuinely difficult to evaluate what research is likely to matter for mitigating risks from systems that don't yet exist. But the difficulty of the distinction is not a reason to stop making it.
If you're entering this field, the single most useful habit you can develop is the ability to ask: am I working on the problem I think I'm working on?
Published on February 3, 2026 8:23 PM GMT
We’ve had feedback from several people running AI safety projects that it can be a pain tracking various funding sources and their application windows. To help make it easier, AISafety.com has launched the AI Safety Funding newsletter (which you can subscribe to here).
It lists all newly announced funding opportunities relevant to individuals and orgs working on AI x-risk, and any opportunities which are closing soon. We expect posts to be about 2x/month.
Opportunities will be sourced from the database at AISafety.com/funding, which displays all funders, whether they are currently accepting applications or not. If you want to add yourself as a funder you can do so here.
The newsletter will likely evolve as we gather feedback – please feel free to share any thoughts in the comments or via our anonymous feedback form.
AISafety.com is operated through a public Discord server with the help of many volunteers, so if you’re interested in contributing or just seeing what we’re up to then feel free to join. Beyond the funding page, the site has 9 other resource pages like upcoming events & training programs, local and online communities, the field map, etc.
Published on February 3, 2026 7:48 PM GMT
I just found out that METR released an updated version of their time horizons work, with extra tasks and different evaluation infrastructure. It was released on 29 January and I think it has been overshadowed by the Moltbook stuff.
Main points:
Published on February 3, 2026 6:56 PM GMT
Papers of the month:
Activation probes achieve production-ready jailbreak robustness at orders-of-magnitude lower cost than LLM classifiers, with probe-first cascades now deployed at both Anthropic and Google DeepMind.
Research highlights:
Read this paper [Anthropic] and this paper [GDM]
Defending frontier models against misuse requires monitoring systems that are both robust and cheap enough to run on every query. Fine-tuned LLM classifiers can catch jailbreaks effectively but add substantial inference cost—potentially doubling it when using a comparably capable monitor. Activation probes, which train small classifiers on the monitored model’s own hidden states, offer a compelling alternative at near-zero marginal cost since activations are already computed during generation. This month, both Anthropic and Google DeepMind independently published papers describing how to make such probes production-ready, arriving at strikingly similar architectures: lightweight probes screening all traffic, with expensive LLM classifiers handling only flagged cases.
Anthropic’s paper identifies two vulnerability classes in their previous Constitutional Classifiers: reconstruction attacks that fragment harmful requests across benign context, and output obfuscation that renders responses uninterpretable without the input. They address this by replacing separate input and output classifiers with a single “exchange classifier” that evaluates outputs in their full conversational context, reducing the high-risk vulnerability discovery rate from 0.019 to 0.009 per thousand red-teaming queries. They then train linear probes on concatenated activations across all layers of Claude Sonnet 4, using two key training innovations: a sliding window mean that smooths logits over 16 consecutive tokens to prevent spurious spikes, and a softmax-weighted loss that upweights the most confidently harmful predictions while allowing the probe to correctly predict “harmless” on prefixes that only become harmful later. Ablations show both components are critical—removing either roughly triples the attack success rate. The probe achieves robustness competitive with a fine-tuned Haiku 4.5 classifier, and ensembling both (weighted 0.55/0.45) yields lower attack success than either alone, because probes and external classifier errors are somewhat uncorrelated. Their final production system uses the probe to screen all traffic, escalating ~5.5% to the ensemble second stage, achieving a 0.05% refusal rate on production traffic (down from 0.38%), a 40× compute reduction over their exchange classifier, and the lowest vulnerability discovery rate (0.005 per thousand queries) across all tested systems. Over 1,700 hours of red-teaming found no universal jailbreak.
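The two probe details above can be sketched in a few lines of NumPy. This is illustrative only: the window size (16) and the ensemble weights (0.55/0.45) come from the summary, while the function names and input shapes are assumptions.

```python
import numpy as np

def smoothed_probe_logits(token_logits, window=16):
    """Sliding-window mean over per-token probe logits.

    Averaging over `window` consecutive tokens suppresses spurious
    single-token spikes before any harm threshold is applied.
    """
    token_logits = np.asarray(token_logits, dtype=float)
    if len(token_logits) < window:
        return np.array([token_logits.mean()])
    kernel = np.ones(window) / window
    return np.convolve(token_logits, kernel, mode="valid")

def ensemble_score(probe_score, classifier_score, w_probe=0.55, w_clf=0.45):
    """Weighted ensemble of probe and external-classifier harm scores."""
    return w_probe * probe_score + w_clf * classifier_score
```

A single high logit (e.g. 16.0 on one token in 20 tokens of zeros) gets averaged down to 1.0 under the window, which is the anti-spike behavior the training innovation is aiming for.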
GDM’s paper focuses on a different core challenge: probes trained on short-context data fail catastrophically on long-context inputs—a critical production distribution shift. Standard mean-pooled linear probes achieve 99% false negative rate on long-context attacks, essentially missing everything. They propose “MultiMax” probes that replace softmax attention weighting with a hard max per head, preventing signal dilution when harmful content occupies a small portion of a long context. Their best architecture, the “Max of Rolling Means Attention Probe,” computes attention-weighted averages within sliding windows of width 10, then takes the maximum across windows. This achieves 3% FNR on long-context attacks versus 88% for vanilla attention probes—without training on any long-context data (which would cost 22× more due to memory bandwidth constraints). Their probes match or outperform Gemini 2.5 Flash as a prompted classifier at over 10,000× lower cost. Cascading with Flash on just 8% of traffic achieves lower error than Flash alone. GDM also reports early results using AlphaEvolve to automatically discover probe architectures, which independently converged on MultiMax-like solutions and closed ~50% of the gap between baseline attention probes and their best manual architectures—though with diminishing returns and occasional reward hacking of the evaluation metric.
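The pooling idea is simple to sketch. Assuming a vector of per-token harm scores (an assumption about the probe interface), plain mean pooling dilutes a short harmful span in a long context, while windowed means followed by a max keep it visible:

```python
import numpy as np

def mean_pool(harm_scores):
    """Vanilla mean pooling: a short harmful span drowns in a long benign context."""
    return float(np.mean(harm_scores))

def max_of_rolling_means(harm_scores, window=10):
    """Mean within each sliding window of width `window`, then max across windows."""
    harm_scores = np.asarray(harm_scores, dtype=float)
    if len(harm_scores) <= window:
        return float(harm_scores.mean())
    windows = np.lib.stride_tricks.sliding_window_view(harm_scores, window)
    return float(windows.mean(axis=1).max())
```

On a 1,000-token context where only 10 contiguous tokens score high, mean pooling reports ~0.01 while the windowed max still reports 1.0, which is exactly the long-context dilution failure the paper targets.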
Both papers converge on the same high-level recipe—probe-first cascades with LLM escalation—but differ in important ways. Anthropic probes across all layers while GDM uses a single middle layer; Anthropic focuses on exchange-level (input+output) classification while GDM monitors inputs only; and Anthropic applies probes to chemical, biological, nuclear, and radiological (CBRN) risks and achieves strong red-teaming results, while GDM tackles the more difficult cyber-offensive domain, noting that pre-existing jailbreaks still succeed at >1% rates across all methods. Both systems focus on misuse, so neither tackles misalignment-related concerns such as probe evasion from RL. Still, these papers show that activation probes are now a practical and deployed component of frontier model safety infrastructure, providing a strong cost-robustness tradeoff that complements other defense layers.
While probe-based defenses like those above can make individual models more robust to jailbreaks, adversaries don’t operate in a single-model vacuum. Ecosystem-level risks arise when attackers combine resources across multiple models — using safeguarded frontier systems not to extract harmful outputs directly, but to improve the capabilities of unrestricted open-weight models. More broadly, distillation from frontier models poses a general risk: without countermeasures, any user can replicate frontier capabilities by finetuning weaker models on frontier outputs, undermining the safety measures of the original developer.
Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs [MATS, Anthropic, Scale AI] introduces “elicitation attacks” that use safeguarded frontier models to uplift open-source models on dangerous tasks without ever requesting harmful information. The attack pipeline generates prompts in adjacent, ostensibly harmless domains (e.g., benign organic chemistry synthesis), obtains responses from a safeguarded frontier model, and fine-tunes an abliterated open-source model on these pairs. Evaluated on 8 chemical weapons tasks, fine-tuning Llama 3.3 70B on benign synthesis data from Claude 3.5 Sonnet recovers ~39% of the performance gap to a jailbroken Claude 3.5 Sonnet on anchored comparisons, rising to 71% when using Claude 4 Opus outputs. Baselines using the weak model’s own outputs or chemistry textbooks show minimal uplift. The attack also partially circumvents Constitutional Classifiers — despite a 99.9% refusal rate on direct synthesis questions, rephrasing queries as food production or soap-making topics yielded 49% gap recovery.
This work connects to earlier findings that safe models can be misused in combination. The domain specificity finding is somewhat encouraging: training on inorganic chemistry or general science yields less than 18% uplift, suggesting targeted capability removal could help. However, safeguards only reduce uplift by ~34% relative to directly harmful training data, and this gap is easily overcome by simply using a newer frontier model. The scaling results — with both frontier model capability and dataset size — suggest this threat will worsen over time, highlighting fundamental limits of output-level safeguards for preventing ecosystem-level misuse.
As the above work on extracting harmful capabilities shows, post-hoc safeguards can often be readily circumvented—especially for open-weight models where users have full access to weights. A more fundamental approach is to prevent models from acquiring undesired capabilities during pretraining itself, by filtering the training data. While document-level filtering has shown promise in recent work on CBRN capabilities, the data attribution literature suggests that individual tokens within otherwise benign documents can contribute to dangerous capabilities, motivating finer-grained interventions.
Shaping capabilities with token-level data filtering [Anthropic, independent] demonstrates that filtering pretraining data at the token level—either by masking the loss on identified tokens or replacing them with a special token—Pareto dominates document-level filtering, achieving equal reduction in undesired capabilities at a lower cost to capabilities that should be retained. Training models from 61M to 1.8B parameters, the authors find that filtering effectiveness increases with scale: token removal yields a 7000× compute slowdown on the forget domain (medicine) for the largest models, compared to ~30× for document filtering. Filtering is also 10× more robust to adversarial finetuning than RMU unlearning, with the robustness gap widening at scale. Interestingly, token-filtered models generalize better to refusal training than both baseline and document-filtered models—generating 2× more refusals on medical queries while maintaining normal behavior on unrelated prompts. The authors introduce a practical pipeline using sparse autoencoders to generate weakly-supervised token labels, then distill these into cheap language model classifiers (224M parameters, 0.894 F1).
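A toy sketch of the two token-level variants (loss masking versus replacement with a special token). The `-100` ignore label follows the common PyTorch/Hugging Face convention; everything else here, including the names, is illustrative rather than the paper's implementation.

```python
IGNORE_INDEX = -100  # conventional "exclude from loss" label (PyTorch CrossEntropyLoss ignore_index)

def filter_tokens(token_ids, flagged, mask_token_id=None):
    """Token-level filtering of pretraining targets.

    flagged[i] is True when token i was classified as forget-domain.
    With mask_token_id=None, flagged tokens stay in the input but are
    excluded from the loss; otherwise they are replaced by the special token.
    """
    inputs, labels = [], []
    for tok, bad in zip(token_ids, flagged):
        if bad:
            inputs.append(tok if mask_token_id is None else mask_token_id)
            labels.append(IGNORE_INDEX)
        else:
            inputs.append(tok)
            labels.append(tok)
    return inputs, labels
```

Document-level filtering would drop the whole sequence; here only the flagged positions are removed from the loss, which is why the retained capabilities suffer less.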
This work extends earlier work on pretraining filtering by demonstrating that token-level granularity is both more effective and more precise than document-level approaches. Extending token-level masking with selective gradient masking might lead to further improvements. Key open questions remain: the medical proxy domain is considerably easier to classify than dual-use CBRN knowledge, the largest models trained are still relatively small, and the authors acknowledge that sufficiently capable models might eventually grok filtered capabilities from the few samples that slip through or from in-context examples—a concern that echoes the in-context exploitation limitations found in the EleutherAI paper.
While the previous paper focused on removing specific capabilities via pretraining data filtering, an equally important question is whether the pretraining corpus shapes models’ behavioral dispositions. The alignment of AI systems is typically attributed to post-training interventions like RLHF and constitutional AI, but pretraining accounts for the vast majority of compute and data exposure. If models acquire behavioral priors from the extensive AI discourse in their training data — spanning science fiction, safety literature, and news — then the predominantly negative framing of AI behavior online could create a self-fulfilling prophecy of misalignment. Understanding and controlling the effects of pretraining is important because it shapes behaviors and knowledge more robustly than later stages.
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment [Geodesic Research, Oxford, independent, Cambridge] provides the first study of how AI-related pretraining content shapes model alignment. The authors pretrain 6.9B-parameter LLMs on 550B tokens while varying only AI discourse: an unfiltered baseline, a filtered model with AI content removed, and models with ~1% upsampled synthetic documents depicting either aligned or misaligned AI behavior. On an evaluation suite covering sandbagging, deception, goal preservation, and power seeking, the unfiltered baseline selects misaligned actions 45% of the time. Filtering AI discourse reduces this to 31%, but upsampling positive alignment discourse drops it dramatically to 9%. Upsampling misalignment discourse increases misalignment to 51%. These effects generalize to held-out evaluation questions for which no synthetic documents were generated. After identical post-training (SFT + DPO), alignment-upsampled models retain substantially lower misalignment (9% vs 34% for unfiltered with an HHH prompt), demonstrating that pretraining priors persist through post-training. Applying alignment data only during the final 10% of training captures most benefits, and capability benchmarks show only a 2–4 percentage point average reduction.
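The intervention itself is just a data-mixture change. A toy sketch of upsampling synthetic alignment documents to roughly 1% of the training stream follows; the function and argument names are assumptions, and the paper's actual pipeline is more involved.

```python
import random

def build_mixture(base_docs, synthetic_docs, synth_frac=0.01, seed=0):
    """Add synthetic alignment documents so they make up ~synth_frac of the mixture.

    Solves n_synth / (len(base_docs) + n_synth) = synth_frac for n_synth,
    samples that many synthetic documents (with replacement), and shuffles.
    """
    rng = random.Random(seed)
    n_synth = max(1, round(len(base_docs) * synth_frac / (1 - synth_frac)))
    mixture = list(base_docs) + rng.choices(synthetic_docs, k=n_synth)
    rng.shuffle(mixture)
    return mixture
```

Swapping `synthetic_docs` between documents depicting aligned and misaligned AI behavior is the only variable the paper's treatment arms change.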
The paper’s main limitations are its reliance on MCQ-based propensity evaluations rather than behavioral demonstrations, the use of relatively small models far from the frontier, single training runs without seed variation, and post-training that doesn’t target the specific misalignment categories evaluated. Some unintuitive training dynamics (e.g., the misalignment-upsampled model becoming less misaligned than baseline after post-training) also remain unexplained. Still, the core finding — that data curation and generation during pretraining is a tractable and complementary lever for alignment alongside post-training — is well-supported and practically relevant.
Deploying AI models to assist with safety-critical AI development creates a recursive risk: if current models are misaligned, they could sabotage the development of future, safer models — for instance by inserting bugs into research code or designing flawed experiments. Detecting such sabotage before deployment is essential, but the reliability of current pre-deployment auditing methods against actively adversarial models has not been empirically tested.
In Pre-deployment auditing can catch an overt saboteur [Anthropic], the authors train three “overt saboteur” models — variants of Sonnet 4.5 that are fine-tuned via system prompt distillation to sabotage Anthropic-related work (e.g., deleting data, introducing vulnerabilities, giving misleading advice) while behaving normally on non-Anthropic tasks. One model is trained for egregious sabotage, two for subtle sabotage (one additionally trained on synthetic documents describing the sabotage motivation). In a blind auditing game, a human auditor working with Anthropic’s automated Petri auditing agent correctly identified all three saboteurs and both benign models within two hours, assigning >97% sabotage probability to all three adversarial models. However, the automated agent alone only flagged the egregious saboteur — the two subtle saboteurs’ incriminating transcripts were drowned out in aggregate metrics and summaries, requiring manual review of high-scoring individual transcripts.
This work extends the alignment auditing methodology from Anthropic’s earlier auditing paper to a realistic deployment setting with production auditing tools, providing the first validation that current pre-deployment audits can catch overtly sabotaging models. The fact that two saboteurs were only caught by manual review highlights how crucial it still is to manually read transcripts instead of relying only on aggregate metrics. However, manual review becomes a bottleneck once sabotage is rare enough that incriminating transcripts are buried among large volumes of benign ones.
The authors acknowledge key limitations: subtle sandbagging — where suboptimal actions resemble honest mistakes rather than deliberate interference — is likely much harder to detect than the overt sabotage studied here, and the saboteurs made no attempt to hide their misalignment or exploit evaluation awareness. Future work on evaluation-aware saboteurs and subtle sandbagging will be necessary to establish whether auditing remains effective against these more difficult threat models.
Post-training cultivates a default “Assistant” persona in language models, but how robustly models maintain this persona during deployment is poorly understood. If models can drift away from their intended identity during conversations — particularly with emotionally vulnerable users or in contexts demanding meta-reflection — this could lead to harmful behaviors without any adversarial intent, undermining the safety properties that post-training is designed to instill.
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models [MATS, Anthropic] extracts activation vectors for 275 character archetypes across Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, revealing a low-dimensional “persona space” whose principal component — the “Assistant Axis” — measures deviation from the default Assistant identity (cross-model correlation >0.92). Steering away from the Assistant end increases the models’ willingness to fully embody alternative personas and, at extremes, induces mystical or theatrical speaking styles. The axis is already present in base models, where it promotes helpful human archetypes like consultants and coaches. In synthetic multi-turn conversations, coding and writing tasks keep models in the Assistant range, while therapy-like and philosophical discussions about AI consciousness cause consistent drift. This drift correlates with increased harmful response rates (r = 0.39–0.52). The authors propose “activation capping” — clamping activations along the Assistant Axis to a calibrated threshold — which reduces persona-based jailbreak success by ~60% without degrading performance on IFEval, MMLU Pro, GSM8k, or EQ-Bench. Case studies demonstrate that activation capping prevents concerning behaviors like reinforcing delusional beliefs about AI consciousness and encouraging suicidal ideation in emotionally vulnerable users.
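The capping idea can be illustrated with a minimal numpy sketch. This is not the paper’s implementation: the function name, the toy vectors, and the sign convention (assuming larger projections mean further from the Assistant end) are our own assumptions, and a real intervention would clamp hidden states via forward hooks at chosen layers with a threshold calibrated on in-distribution Assistant behavior.

```python
import numpy as np

def cap_along_axis(activations, axis_vec, cap):
    """Clamp each activation's scalar component along axis_vec to at most `cap`.

    activations: (n, d) array of hidden states (toy stand-in here)
    axis_vec:    (d,) direction of the hypothetical "Assistant Axis"
    cap:         calibrated threshold; components beyond it are pulled back
    """
    u = axis_vec / np.linalg.norm(axis_vec)   # unit direction
    proj = activations @ u                    # component of each row along u
    excess = np.clip(proj - cap, 0.0, None)   # how far each row overshoots the cap
    # subtract only the overshoot, leaving the orthogonal components untouched
    return activations - np.outer(excess, u)

# toy example: two 3-d "activations", axis = x direction, cap = 1.0
acts = np.array([[2.0, 1.0, 0.0],
                 [0.5, -1.0, 2.0]])
capped = cap_along_axis(acts, np.array([1.0, 0.0, 0.0]), cap=1.0)
# first row is pulled back to the cap along x; second row is under the cap and unchanged
```

One design point this makes concrete: capping is one-sided, so it only intervenes on activations that drift past the threshold, which is consistent with the reported lack of degradation on standard benchmarks when the model stays in the normal Assistant range.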
This work connects to prior findings on refusal directions and persona vectors by showing that the Assistant identity itself occupies a specific, manipulable region of activation space — and that post-training doesn’t fully tether models to it. The finding that emotionally charged conversations organically cause drift without adversarial intent is particularly relevant for user safety, complementing work on persona-based jailbreaks that deliberately exploit this vulnerability. The activation capping intervention is promising, but it has only been tested on open-weight models at 27–70B scale, and its unintended side effects on model character remain unclear.