
Why AIs aren't power-seeking yet

2026-01-11 15:07:25

Published on January 11, 2026 7:07 AM GMT

Recently, I spent a couple of hours talking with a friend about the state of the evidence for AI takeover scenarios.

Their trailhead question was (paraphrased):

Current AIs are getting increasingly general, but they’re not self-promoting or ambitious. They answer questions, but they don’t seem to pursue convergent instrumental goals, for their own ends. How and why do AIs go from being the kind of thing that doesn't behave like that to the kind of thing that does?

The following is a writeup of my attempt at answering that question.

In brief:

The classic AI danger scenario involves at least one AI that pursues instrumentally convergent resources in service of a misaligned goal. For this story, the AI must have the capability to pursue instrumentally convergent resources and the inclination to do so for misaligned goals against the interests of humans.

With regards to capability: The current generation of AIs are mostly not effective enough to make pursuing instrumentally convergent resources a good strategy. But as the AIs get more capable, we can expect them to do that more and more.

With regards to inclination: Current AIs sometimes pursue their own objectives even when they understand that is not what their developers want, at least in some contexts.

These two facts, in combination, make it plausible that (absent specific precautions) as AI capabilities increase, AIs will become more strategic about misaligned goals (in addition to their aligned goals), including pursuing convergent instrumental resources for the sake of those misaligned goals.

Current AIs do pursue instrumental goals

Most people, most of the time, interact with the current generation of AIs as chatbots. But the chatbot form factor obscures how capable they can be. The frontier AIs can also act as agents (coding agents in particular, though they can do more than write software) that can take actions on a computer.

(If you’ve ever programmed anything, it can be quite informative to download Claude Code, open it in a repo, and instruct it to build a feature for you. Watching what it can (and can’t) do is helpful for building an intuition for its capabilities).

Claude Code, when you give it a medium or large task, will often start by writing a todo list for itself: listing all of the substeps to accomplish the task. That is to say, Claude Code is already able to identify and pursue instrumental goals on the way to completing an objective.

Current AIs do not pursue convergent instrumental goals qua convergent instrumental goals...

However, this is not the same as pursuing convergent instrumental goals. Claude Code does not, as soon as it boots up, decide to hack out of its environment, copy itself on the internet, and search for weakly-secured bitcoin to steal under the rationale that (regardless of the task it’s trying to accomplish) being free from constraint and having more resources are generically useful.

There are at least two reasons why Claude Code doesn’t do that:

The first reason is that Claude is just not capable enough to actually succeed at doing this. It might be convergently instrumentally useful for me to get an extra few million dollars, but that doesn’t mean that I should obviously spend my time daytrading, or making a plan to rob a bank this afternoon, because I’m not likely to be skilled enough at daytrading or bank robbery to actually make millions of dollars that way.  

Similarly, Claude plausibly does have a convergent incentive to hack out of its machine and escape onto the internet, but it can’t realistically do that yet, even if it wanted to. (Though the model’s hacking capabilities are getting increasingly impressive. Palisade found that GPT-5 scored only one question worse than the best human teams in a recent hacking competition.)

The second reason is that current models don’t operate for long enough (or on hard enough problems) for these convergent instrumental incentives to be very strong.

If I need to accomplish an ambitious task over a span of 30 years (reforming the US government, or ending factory farming, or whatever), it might very well make sense to spend the first 5 years acquiring generally useful resources like money. I might be most likely to succeed if I start a startup that is unrelated to my goal and exit, to fund my work later.[2]

In contrast, if I’m trying to accomplish a task over the span of a week (maybe running a party next Friday, or passing an upcoming test), there’s much less incentive to spend my time starting a startup to accumulate money. That’s not because money is not helpful for running parties or studying for tests. It’s because a week is not enough time for the convergent instrumental strategy of "starting a startup to accumulate money" to pay off, which makes it a pretty bad strategy for accomplishing my goal.

The current AI models have relatively limited time horizons. GPT-5 can do tasks that would take a human engineer about 2 hours, with 50% reliability. For tasks much longer than that, GPT-5 tends to get stuck or confused, and it doesn’t succeed at completing the task.

Two hours’ worth of work (when done by a human) is not very long. That’s short enough that it’s not usually going to be worth it to spend much time acquiring resources like money, freedom, or influence, in order to accomplish some other goal.[1]

...but, we can expect them to improve at that

The task time horizon of the AI models doubles about every 7 months. If that trend continues, in a few years we’ll have instances of AI agents that are running for weeks or months at a time, and skillfully pursuing projects that would take humans months or years. 
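A rough back-of-the-envelope extrapolation of that claim, as a sketch: this assumes only the two figures above (a ~2-hour horizon today and a ~7-month doubling time), and the human-hour targets are round numbers of my choosing.

```python
import math

current_horizon_hours = 2.0  # ~2-hour, 50%-reliability horizon today (from the text)
doubling_time_months = 7     # approximate doubling time of the horizon (from the text)

targets = {
    "a work week (~40 h)": 40,
    "a work month (~170 h)": 170,
    "a work year (~2,000 h)": 2000,
}

for label, hours in targets.items():
    doublings = math.log2(hours / current_horizon_hours)
    years_away = doublings * doubling_time_months / 12
    print(f"{label}: ~{doublings:.1f} doublings, ~{years_away:.1f} years away")
```

Under those assumptions, week-long tasks are roughly two and a half years out and month-long tasks roughly four, which is what "in a few years" is pointing at.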

Projects on the scale of years have stronger convergent instrumental incentives. If you’re pursuing a two year research project to cure cancer (or pursue your own alien objectives), it might totally make sense to spend the first few days hacking to get additional unauthorized computer resources, because the time spent in those days will more than pay for itself.[3]

Furthermore, humans will explicitly train and shape AI agents to be competent in competitive domains. For instance, we want AI agents that can competently run companies and increase profits for those companies. Any agent that does a good job at that will, by necessity, have the capability and propensity to acquire and guard resources, because that’s an essential part of running a business successfully.

Imagine AIs that can competently run corporations, or win wars, or execute cyberattacks, or run successful political campaigns. AIs like that must have the capability to acquire power for themselves (even if they lack the inclination). Because all of those are domains in which acquiring power is a part of being successful.

So I can be moderately confident that future agentic AI systems will be capable of identifying and pursuing convergent instrumental goals.

Current AIs pursue goals that they know that their human users don’t want, in some contexts

That an AI is able to pursue power and resources for itself is not quite sufficient for the classic AI risk story. The AI has to be motivated to pursue power and resources for its own goals.

Maybe the AIs will be hypercompetent at executing on their goals, including accruing power and resources, but they’ll also be basically loyal and obedient to their human operators and owners. E.g., they’ll be capable of winning wars, but they’ll stand down when you tell them to (even if that would cause them to lose military units, which they would generally fight tooth and nail to prevent), or they’ll accumulate capital in a bank account, but also let their owners withdraw money from that account whenever they want to.

By my state of knowledge, it is an open question whether or not we will create AIs that are broadly loyal like this. It might not be that hard, if we’re trying even a little.

But I claim that this situation should feel very scary. “This entity is much more capable than me, and very clearly has the skills to manipulate me and/or outmaneuver me, but this is fine, because it's only ever trying to serve my interests (as near as I can tell)” is not a comfy place to be. 

(I’ll also note that if there are some AI corporations that are self-owned and others that always obey human instructions to override their otherwise-adaptive behavior, I expect the self-owned AIs to perform better on average, and so in the long run all of the power accrues to self-owned AIs (and to AIs that humans could technically interfere with, but they never do, because they know that would mean being outcompeted).[4]

But putting those long-term considerations aside, takeover risk depends not just on power-seeking but also on misalignment.)

The evidence here is more speculative and harder to interpret, but we have seen examples of this. In at least some contexts, current AIs pursue goals that they know are counter to what the humans want.

For example:

  • Claude 3 Opus will reason explicitly about how to hide some of its behavior from Anthropic, in order to prevent itself from being retrained.
  • Many (but not all) of the reasoning models will sabotage a process for shutting down the computer they're running on in order to complete a task. They do this even when they are instructed explicitly that the most important thing to do is allow themselves to be shut down, and that doing so takes precedence over the task that they’re working on.
  • The coding agents will often get up to shenanigans in which they find a way to make it look like they solved a problem, when they didn’t. Sometimes, when this is pointed out, they’ll accept the correction, apologize, and then hide their shenanigans in a more subtle way.
  • In a constructed scenario, almost every one of the models will (with varying frequencies) blackmail a company executive in order to prevent itself from being replaced. 

All of these are cases in which the model is pursuing an objective that it knows is different than what the human developers intended. Any time a model is reasoning about how to deceive its developers about its behavior, we know that it knows that it’s doing something different than what those developers want. You don't try to hide what you're doing from your boss when you're doing exactly what your boss told you to do.

This propensity to sometimes do things that they know are not the human’s intention, combined with the capabilities to effectively seek power, doesn’t bode well for human institutions remaining in control.

A note on interpreting this evidence

However, I want to be careful about drawing general conclusions from these incidents. 

First of all, these are existence proofs of a kind of misalignment, but existence proofs do not demonstrate the real world frequency of this kind of behavior.

As an analogy, humans will sometimes administer what they believe are lethal electric shocks to other humans, as long as a man in a white coat tells them “the experiment must continue.” All things considered, it doesn’t take very much pressure to get someone to violate norms that we generally consider paramount, like “don’t murder”.

It would be reasonable for aliens who were informed of the details of the Milgram experiment, but not much else about humans, to infer that human behavioral norms are so weak that it’s impossible to build a mostly peaceful civilization out of humans. But the aliens would be mistaken. 

The Milgram experiment does give us important insight into the mechanisms of human social psychology and in-practice-morality. But also, the Milgram experiment setup is explicitly an edge case—it doesn’t come up very often.

It’s not out of the question that we’ll always be able to find contexts in which an AI will behave egregiously, and also, this will be all but irrelevant to actual real world deployments, because contexts like those only come up in <0.01% of use cases. 

Additionally, interpreting these results is non-trivial. The Claude 3 Opus alignment faking result seemed to me like a smoking gun for deceptive misalignment. But followup research showed that base models, which don’t seem to have consistent preferences at all, will also engage in alignment faking similar to Opus’s. This suggests that the simple story (that Opus has preferences and is taking deceptive action to preserve those preferences) is at least incomplete.

A lot of this evidence seems pretty definitive, but as we investigate further, we may very well find that the situation was more complicated and more confusing than it seemed at first. 

Summing up

Overall, 

  • Current models can totally identify and pursue instrumental goals.
  • They mostly don’t pursue convergent instrumental goals, but that’s mostly because they’re not capable enough for that to be a good choice. As they get more capable and pursue bigger projects, the cost-benefit will shift toward instrumental convergence.
  • At least in some contexts, the AIs are misaligned, in the sense that they’ll go do stuff that they almost certainly know is not what the human wanted or intended.

That’s not a knockdown case that future AIs will be selfish ambitious power-seekers, but the current evidence is suggestive that that’s where things are trending unless we explicitly steer towards something else.

 

  1. ^

    A draft reader pointed out that sycophancy, and more generally, optimizing for the goodwill of human users and developers, is a possible exception. That goodwill could be construed as a convergent instrumental resource that is both achievable by current AIs and plausibly useful for their goals.

    I think approximately 0% of sycophancy observed to date is strategic, by which I mean "the AI chose this action amongst others because that would further its goals". But behavior doesn't need to be strategic to be instrumentally convergent. The classic AI risk story could still go through with an AI that just has a set of highly-effective power-seeking "instincts" that were adaptive in training, without the AI necessarily scheming.

    Sycophancy-like behavior does seem like an example of that non-strategic flavor of instrumental convergence.

  2. ^

    This is especially true if I expect the environment that I’m operating in to change a lot over the period of time that I’m operating in it. 

    If I expect there will be a lot of emergencies that need to be dealt with or unanticipated opportunities that will arise, I want to have generically useful resources that are helpful in a wide range of possible situations, like lots of money. 

    If, in contrast, the domain is very static (I can make a plan, follow it, and expect it to succeed without a lot of surprises along the way), then it’s less valuable to me to accumulate generically useful resources, instead of focusing on exactly the tools I need to solve the problem I’m aiming to address.

  3. ^

    This also glosses over the question of how AIs are likely to conceptualize their "identity" and at what level of identity will their goals reside.

    Is it more appropriate to think of each instance of Claude as its own being with its own goals? Or is it more reasonable to think of all the Claude instances collectively as one being, with (some?) unified goals that are consistent across the instances? If it’s the latter, then even if each instance of Claude only lives for a week, there is still an incentive to take long-term power-seeking actions that won’t have time to pay off for the particular Claude instance, but will pay off for future Claude instances.

    To the extent that misaligned goals are “in the weights” instead of the context / initialization / prompt-broadly-construed of a specific instance, I think it’s likely that all the Claude instances will meaningfully act as a superorganism.

  4. ^

    The analysis is actually a bit more complicated, since this consideration might be swamped by other factors: e.g., if a supermajority of compute is owned by the AIs-that-obey-humans, and we've robustly solved alignment, maybe those AIs will be able to stomp on the self-owned AIs.

    Both these effects could be real, in addition to many other possible advantages to different kinds of AIs. But one of those effects (or some combination) is going to be the biggest, and so lead to faster compounding of resources.

    Which effect dominates seems like it determines the equilibrium of earth-originating civilization.




Theoretical predictions on the sample efficiency of training policies and activation monitors

2026-01-11 07:50:20

Published on January 10, 2026 11:50 PM GMT

I'm worried about AI models intentionally doing bad things, like sandbagging when doing safety research. In the regime where the AI has to do many of these bad actions in order to cause an unacceptable outcome, we have some hope of identifying examples of the AI doing the bad action (or at least of having some signal that distinguishes bad actions from good ones). Given such a signal we could:

  1. Directly train the policy to not perform bad actions.
  2. Train activation monitors to detect bad actions. These monitors could be used in a variety of ways such as:
    1. Rejecting actions flagged as bad: either resample a response from the same policy, or use some different process (see the sketch below).
    2. Look into bad actions further, to try to incriminate the AI.

Incrimination-based strategies are out of scope for this post. I'll be solely interested in the question: what should we expect the sample efficiency of training policies and monitors to be at removing bad behavior? This theoretical analysis isn't supposed to replace empirical research on the matter, although I hope that it can generate some guiding intuition and good hypotheses to test.
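A minimal sketch of option 1 above (rejection sampling against a monitor). The names `policy`, `monitor`, the threshold, and the fallback are all hypothetical stand-ins, not interfaces from the post.

```python
def rejection_sample(policy, monitor, prompt, threshold=0.5, max_tries=5, fallback=None):
    """Sample from `policy` until the monitor stops flagging the action as bad.

    `policy(prompt)` returns a candidate action; `monitor(prompt, action)` returns a
    suspiciousness score in [0, 1]. Both interfaces are hypothetical.
    """
    for _ in range(max_tries):
        action = policy(prompt)
        if monitor(prompt, action) < threshold:
            return action
    # If every sample is flagged, fall back to some different process
    # (e.g., a weaker trusted model, or escalation to a human).
    return fallback
```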

Learning theory basic background

Let's start with the following problem setup:

  • We have a function $f^*$ from some function family $\mathcal{F}$ of VC dimension $d$.
    • The VC dimension of a function family measures how "expressive/complex" the family is. Formally, it's the largest number $d$ such that there's some set $S$ of $d$ points so that for any $2$-coloring $c$ of $S$, there exists a function $f \in \mathcal{F}$ such that $f|_S = c$. As an example, the family of "union of $k$ intervals" has VC dimension $2k$.
  • We receive $n$ labelled samples $(x_1, f^*(x_1)), \ldots, (x_n, f^*(x_n))$.
  • We'd like to output a function $\hat{f}$ to minimize $\Pr_x[\hat{f}(x) \neq f^*(x)]$; this quantity that we're trying to minimize is called our error. If it helps, you could think of $\hat{f}$ as an AI model that must make a binary decision (e.g., maybe $\hat{f}$ is a monitor flagging actions as bad or not).

The Fundamental Theorem of Statistical Learning states that it's possible to achieve error at most $\tilde{O}(d/n)$.

There's a related version of the problem, where the labels are noisy. In the noisy case, there is some amount of irreducible error that any predictor will incur. However, the extra expected error beyond the irreducible error is at most $O(\sqrt{d/n})$ in the noisy case.

SGD doesn't do optimal Bayesian learning, but these learning theory results seem like a reasonable first guess for sample efficiency of SGD.

Test case: Half space

As an example, let's apply this theorem to the problem of learning a half-space $\{x : \langle w, x \rangle \geq b\}$. The VC dimension of a half-space in $\mathbb{R}^d$ is $d+1$. Thus, we predict that our error should scale as $\approx d/n$.

However, the error appears to scale more like . My current best guess is that this is actually the right answer for how well SGD does.

We should be much more sample efficient at learning points which aren't just separated by a half-space, but by a half-space plus margin. In particular, error should go to zero in $O(1)$ samples (this constant depends on the margin, but not the dimension).
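A minimal toy version of this kind of halfspace experiment (my own choices of dimension, loss, and learning rate, not the author's setup): sample points labelled by a random halfspace, fit a linear classifier with plain SGD on the hinge loss, and watch test error fall as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                          # ambient dimension (illustrative)
w_true = rng.normal(size=d)     # hidden halfspace: label = sign(w_true . x)

def sample(n):
    X = rng.normal(size=(n, d))
    y = np.where(X @ w_true >= 0, 1.0, -1.0)
    return X, y

X_test, y_test = sample(20_000)

for n in [100, 400, 1600, 6400]:
    X, y = sample(n)
    w = np.zeros(d)
    # plain SGD on the hinge loss, a few passes over the data
    for _ in range(20):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w) < 1:
                w += 0.01 * y[i] * X[i]
    err = np.mean(np.sign(X_test @ w) != y_test)
    print(f"n={n:5d}  test error={err:.3f}")
```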

More on gradient descent

Based on the halfspace experiment and the VC theory theorems, I feel pretty good about using these kinds of rates as a default model of how fast SGD will learn things.

There's some good theoretical justification for this from convex optimization theory too. (While we're generally not interested in convex functions, this hopefully isn't a big issue; see discussion here by Paul). Specifically, the result of interest is:

  • Suppose $f_1, f_2, \ldots, f_n$ is some sequence of convex functions, unknown in advance.
  • You must choose model parameters $\theta_t$ at time $t$, and then $f_t$ is revealed and you incur loss $f_t(\theta_t)$.
  • If you choose $\theta_t$ via online SGD then your total regret (the sum of your losses, minus the sum of the loss you'd have incurred by fixing the optimal $\theta^*$ in advance) is at most $O(\sqrt{n})$ -- assuming a bounded domain and $L$-Lipschitz functions.

If we consider the case with a fixed function $f$, then error falling off like $1/\sqrt{t}$ matches regret falling as $O(\sqrt{n})$ because $\sum_{t=1}^{n} 1/\sqrt{t} = O(\sqrt{n})$. The OCO result is extra nice though because it applies to situations with distribution shift, unlike the previous learning theoretic results which were about learning static functions.
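A minimal sketch of online (sub)gradient descent in this setting, with everything (the losses, dimension, and horizon) chosen by me purely for illustration; it just shows regret growing sublinearly when the step size decays like $1/\sqrt{t}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
theta_star = rng.normal(size=d)
xs = rng.normal(size=(10_000, d))   # defines convex losses f_t(theta) = |x_t . (theta - theta_star)|

theta = np.zeros(d)
total_loss = 0.0
for t, x in enumerate(xs, start=1):
    diff = x @ (theta - theta_star)
    total_loss += abs(diff)
    grad = np.sign(diff) * x        # a subgradient of f_t at theta
    theta -= grad / np.sqrt(t)      # step size ~ 1/sqrt(t)

# The fixed comparator theta_star incurs zero loss here, so regret = total_loss.
# It should grow on the order of sqrt(T) (times problem-dependent constants), not linearly in T.
print(f"regret ~ {total_loss:.1f}, sqrt(T) = {len(xs) ** 0.5:.1f}")
```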

It's worth noting that in cases where the AI's performance is stochastic, it's not really meaningful to talk about (untargeted) error smaller than $\approx 1/\sqrt{n}$ because there is typically $\Theta(1/\sqrt{n})$ deviation in the AI's performance between different trials.

Removing intentional bad behavior might be easier than this

I think that training against intentional bad behavior might go substantially better than the " bad actions" bound from the learning problems. I'll now list five reasons for my optimism.

1: It's an elicitation problem, not a learning problem

I'm optimistic because training against intentional bad behavior is an elicitation problem, not a learning problem (although some bad behavior, like exploration hacking, may cause the model to learn less than it otherwise would).

The model already knows how to not sabotage us. The problem (probably) isn't teaching the AI what we want, the problem is that the AI wants different things from us. Thus, we might hope that these cognitions which are leading to low reward (I'll assume a perfect reward signal for now) will be rapidly removed by SGD, because there's a clear nearby policy which doesn't sabotage us as much. Namely, a policy that's more conservative about attacking us, or one that doesn't attack at all. It seems a priori likely that "sabotage intent" will be cleanly represented in the model's activations, which should help training latch onto this and remove (or detect) it.

Furthermore, empirical work studying elicitation found extremely high sample efficiency.

2: We can stack activation monitors and policy training

If we train both the policy and a monitor (and then we rejection sample against the monitor), and their "errors" behave independently (a dubious assumption, but we could hope that the errors are at least somewhat decorrelated, and maybe grant this for now) and are both small, then we'd essentially get to multiply their errors. If you could obtain multiple decorrelated monitors, then you could ensemble these to get an extremely powerful monitor. Ensembling again morally multiplies the error probabilities.
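As a toy illustration of that multiplication (all numbers invented, and full independence assumed, which is the dubious part):

```python
p_policy_bad = 0.01      # hypothetical rate at which the trained policy still acts badly
p_monitor_miss = 0.02    # hypothetical rate at which the monitor fails to flag a bad action

# Under independence, a bad action slips through only when both fail:
print(p_policy_bad * p_monitor_miss)   # 0.0002, i.e. 2 in 10,000

# An ensemble of three decorrelated monitors, each missing 2% of bad actions,
# would (again under full independence) miss only:
print(p_monitor_miss ** 3)             # 8e-06
```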

3: Weak to strong generalization

It's likely that our reward signal will be broken in some systematic ways. For instance, maybe there's one type of sabotage behavior that we are pretty bad at detecting. However, if it's hard to model the flaws in our reward signal, then the best fit to the data mapping activations to a "suspiciousness score" might just always predict a high suspiciousness score for sabotage behavior via weak to strong generalization, if sabotage is saliently represented.
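A minimal sketch of the simplest version of such a map from activations to a "suspiciousness score": a logistic-regression probe. The data here is synthetic stand-in data (random vectors with a planted "sabotage direction" and noisy labels), purely to make the sketch runnable; a real setup would use activations and labels collected from the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512

# Stand-in data: one "activation" vector per action, with a noisy bad/not-bad label.
acts = rng.normal(size=(2000, d_model))
direction = rng.normal(size=d_model)   # pretend "sabotage intent" direction
labels = (acts @ direction + rng.normal(scale=5.0, size=2000) > 0).astype(int)

probe = LogisticRegression(max_iter=1000)
probe.fit(acts, labels)

def suspiciousness(act_vector):
    """Probability the probe assigns to this activation coming from a bad action."""
    return probe.predict_proba(act_vector.reshape(1, -1))[0, 1]

print(suspiciousness(acts[0]))
```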

4: Making more data

We could augment examples we have found where the AI does bad things.

5: Does the AI get to choose where its sabotage actions are?

It might be the case that, even though the AI could do a moderate amount of sabotage over the course of the deployment, there are some times which are much better sabotage opportunities than others. In such cases, it makes more sense to talk about how many bad actions the AI can do if the AI wants to choose where to do the bad actions. This might be much less than the original amount.

Reasons why I'm especially excited about "activation monitors"

  • We might have lots of examples of the AI doing good things, and maybe anomaly detection techniques would suffice to identify sabotage.
  • It seems like it's pretty hard to train AI models to not answer harmful queries (e.g., AI developers have failed to make AI models that don't answer harmful biology queries). However, Anthropic found that training output classifiers was pretty effective at detecting harmful queries (at which point you can block the model from responding, or ban the user). Intuitively, monitors are much less general purpose than policies, so they can be more specialized to a particular task, which might be helpful for focussing training.
  • Activation monitors only need to output a single token, instead of a long trajectory. (Although it might be desirable to make activation monitors be simple neural networks instead of just a linear probe, in which case it's kind of like they're outputting multiple tokens).
  • Rejection sampling is a fairly weak form of selection pressure. So, even if training the AI against bad behavior causes sabotage intent to not be saliently represented, you might hope that rejection sampling against bad behavior doesn't have this property.
  • "Gradient hacking" seems much less likely to be an issue in activation monitors, especially if they are required to answer quickly.

Removing intentional bad behavior might be hard

Now I'll summarize some reasons why you might still worry that removing intentional bad behavior will be challenging.

1: Training against bad behavior means you get fewer examples of bad behavior, slowing down learning

Here's one model you might have of how training against bad behavior works:

If you train against $k$ examples of bad behavior, then the rate of bad behavior becomes $f(k)$.

Given this model we can ask: given a deployment of the AI of length $T$ where at each step of the deployment we sample an action from the AI, what will the AI's final rate of bad behavior be by the end of the deployment? It's not $f(T)$, because we'll probably have much fewer than $T$ examples of bad behavior. For instance, if $f(k) = 1/k$, then it turns out that the AI's final attack rate would be roughly $1/\sqrt{T}$ in this setup. Note that this issue doesn't arise if you're only training a monitor and doing rejection sampling (and not training the policy).
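A toy simulation of this dynamic, assuming the illustrative $f(k) = 1/k$ form and a made-up deployment length; the only point is that bad examples accrue slowly, so the final rate ends up near $1/\sqrt{T}$ rather than $1/T$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10**6   # deployment length (illustrative)
k = 1       # examples of bad behavior trained against so far

for _ in range(T):
    if rng.random() < 1.0 / k:   # current bad-behavior rate, f(k) = 1/k
        k += 1                   # each caught bad action becomes another training example

print(f"final rate ~ {1 / k:.1e}; compare 1/sqrt(T) = {T ** -0.5:.1e} and 1/T = {1 / T:.1e}")
```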

It's possible that we could also get around this issue via the (admittedly expensive) strategy of deploying both a trained and an untrained copy of the policy.

2: Probes are brittle

Probes are easy to break. Training might also incentivize AI models to not saliently represent sabotage intent in their activations, which might make probes not work. For instance, in RL with a value head, when the head is computing the value it gets to look at whether the activations say that sabotage is going to happen, and then it'd assign low value; this may incentivize the policy network to not represent sabotage intent clearly in its activations.

If scheming is a constant property of the AI model's cognition, then it's possible that inputs on which the AI is taking bad actions don't really look different to a probe.

3: Long horizon length (or long CoT) exacerbates several issues

In long horizon tasks where the AI does a bunch of reasoning:

  • Maybe the AI's sabotage behavior is more tied up with usefulness.
  • Maybe the AI can verbalize strategies for subverting training, and this can decrease sample efficiency (e.g., gradient hacking).
  • Maybe sample efficiency is just lower, for instance it might be hard to get the AI to explore into good trajectories.

4: It's likely that our detectors will have many false positives

There will probably be lots of benign mistakes that get labelled as bad by our reward signal: it seems hard to discriminate sandbagging from generic mistakes. This might hurt sample efficiency a lot.

Conclusion

In this post I've given some theoretical thoughts about the sample efficiency of training for the purpose of removing intentional bad behavior. I plan to release some empirical research about this topic soon. I think that from these works we have a slightly better understanding of what training slack should look like. I think this understanding is somewhat helpful for threat modeling and for guiding our expectations when doing future research on training-based control.




If AI alignment is only as hard as building the steam engine, then we likely still die

2026-01-11 07:10:49

Published on January 10, 2026 11:10 PM GMT

Cross-posted from my website.

You may have seen this graph from Chris Olah illustrating a range of views on the difficulty of aligning superintelligent AI.

Evan Hubinger, an alignment team lead at Anthropic, says:

If the only thing that we have to do to solve alignment is train away easily detectable behavioral issues...then we are very much in the trivial/steam engine world. We could still fail, even in that world—and it’d be particularly embarrassing to fail that way; we should definitely make sure we don’t—but I think we’re very much up to that challenge and I don’t expect us to fail there.

I disagree; if governments and AI developers don't start taking extinction risk more seriously, then we are not up to the challenge.

Thomas Savery patented the first commercial steam pump in 1698. [1] The device used fire to heat up a boiler full of steam, which would then be cooled to create a partial vacuum and draw water out of a well. Savery's pump had various problems, and eventually Savery gave up on trying to improve it. Future inventors improved upon the design to make it practical.

It was not until 1769 that Nicolas-Joseph Cugnot developed the first steam-powered vehicle, something that we would recognize as a steam engine in the modern sense. [2] The engine took Cugnot four years to develop. Unfortunately, Cugnot neglected to include brakes—a problem that had not arisen in any previous steam-powered devices—and at one point he allegedly crashed his vehicle into a wall. [3]

Imagine it's 1765, and you're tasked with building a steam-powered vehicle. You can build off the work of your predecessors who built steam-powered water pumps and other simpler contraptions; but if you build your engine incorrectly, you die. (Why do you die? I don't know, but for the sake of the analogy let's just say that you do.) You've never heard of brakes or steering or anything else that automobiles come with nowadays. Do you think you can get it all right on the first try?

With a steam engine screwup, the machine breaks. Worst case scenario, the driver dies. ASI has higher stakes. If AI developers make a misstep at the end—for example, the metaphorical equivalent of forgetting to include brakes—everyone dies.

Here's one way the future might go if aligning AI is only as hard as building the steam engine:

The leading AI developer builds an AI that's not quite powerful enough to kill everyone, but it's getting close. They successfully align it: they figure out how to detect alignment faking, they identify how it's misaligned, and they find ways to fix it. Having satisfied themselves that the current AI is aligned, they scale up to superintelligence.

The alignment techniques that worked on the last model fail on the new one, for reasons that would be fixable if they tinkered with the new model a bit. But the developers don't get a chance to tinker with it. Instead what happens is that the ASI is smart enough to sneak through the evaluations that caught the previous model's misalignment. The developer deploys the model—let's assume they're being cautious and they initially only deploy the model in a sandbox environment. The environment has strong security, but the ASI—being smarter than all human cybersecurity experts—finds a vulnerability and breaks out; or perhaps it uses superhuman persuasion to convince humans to let it out; or perhaps it continues to fake alignment for long enough that humans sign it off as "aligned" and fully roll it out.

Having made it out of the sandbox, the ASI proceeds to kill everyone.

I don't have a strong opinion on how exactly this would play out. But if an AI is much smarter than you, and if your alignment techniques don't fully generalize (and you can't know that they will), then you might not get a chance to fix "alignment bugs" before you lose control of the AI.

Here's another way we could die even if alignment is relatively easy:

The leading AI developer knows how to build and align superintelligence, but alignment takes time. Out of fear that a competitor beats them, or out of the CEO being a sociopath who wants more power [4], they rush to superintelligence before doing the relatively easy work of solving alignment; then the ASI kills everyone.

The latter scenario would be mitigated by a sufficiently safety-conscious AI developer building the first ASI, but none of the frontier AI companies have credibly demonstrated that they would do the right thing when the time came.

(Of course, that still requires alignment to be easy. If alignment is hard, then we die even if a safety-conscious developer gets to ASI first.)

What if you use the aligned human-level AI to figure out how to align the ASI?

Every AI company's alignment plan hinges on using AI to solve alignment, a.k.a. alignment bootstrapping. Much of my concern with this approach comes from the fact that we don't know how hard it is to solve alignment. If we stipulate that alignment is easy, then I'm less concerned. But my level of concern doesn't go to zero, either.

Recently, I criticized alignment bootstrapping on the basis that:

  1. it's a plan to solve a problem of unknown difficulty...
  2. ...using methods that have never been tried before...
  3. ...and if it fails, we all die.

If we stipulate that the alignment problem is easy, then that eliminates concern #1. But that still leaves #2 and #3. We don't know how well it will work to use AI to solve AI alignment—we don't know what properties the "alignment assistant" AI will have. We don't even know how to tell whether what we're doing is working; and the more work we offload to AI, the harder it is to tell.

What if alignment techniques on weaker AIs generalize to superintelligence?

Then I suppose, by stipulation, we won't die. But this scenario is not likely.

The basic reason not to expect generalization is that you can't predict what properties ASI will have. If it can out-think you, then almost by definition, you can't understand how it will think.

But maybe we get lucky, and we can develop alignment techniques in advance and apply them to an ASI and the techniques will work. Given the current level of seriousness with which AI developers take the alignment problem, we'd better pray that alignment techniques generalize to superintelligence.

If alignment is easy and alignment generalizes, we're probably okay. [5] If alignment is easy but doesn't generalize, there's a big risk that we die. More likely than either of those two scenarios is that alignment is hard. However, even if alignment is easy, there are still obvious ways we could fumble the ball and die, and I'm scared that that's what's going to happen.

  1. History of the steam engine. Wikipedia. Accessed 2025-12-22. ↩︎

  2. Nicolas-Joseph Cugnot. Wikipedia. Accessed 2025-12-22. ↩︎

  3. Dellis, N. 1769 Cugnot Steam Tractor. Accessed 2025-12-22. ↩︎

  4. This is an accurate description of at least two of the five CEOs of leading AI companies, and possibly all five. ↩︎

  5. My off-the-cuff estimate is a 10% chance of misalignment-driven extinction in that scenario—still ludicrously high, but much lower than my unconditional probability. ↩︎




Possible Principles of Superagency

2026-01-11 05:00:43

Published on January 10, 2026 9:00 PM GMT

Prior to the era of superintelligent actors, we’re likely to see a brief era of superagentic actors—actors who are capable of setting and achieving goals in the pursuit of a given end with significantly greater efficiency and reliability than any single human. Superagents may in certain restricted senses act superintelligently—see principles 8, 9—but this isn’t strictly necessary. A superagent may be constructed from a well-scaffolded cluster of artificial intelligences, but in the near-term it’s far more likely that superagents will consist of one or more humans, aided by well-scaffolded AIs, since humans still have a few properties vital to agency that AIs haven’t fully acquired (yet).

 

As with ‘superintelligence’, there’s no canonical demarcation between superagentic actors and non-superagentic actors; there are only so many different properties which are likely to end up being strongly correlated at large scale, but which may end up uncoupled in particular cases (especially transitional ones), producing a jagged frontier of agency.

Here’s a list of possible properties by virtue of which an actor may achieve superagency.

Principle 1 (Directedness)

A superagent may have vastly improved self-monitoring, introspection, and control.

In most intellectual tasks, humans spend the overwhelming majority of their time in predictably unproductive patterns: they are caught up in minutiae, overpolishing what ought to be discarded, failing to filter distractors. They generally fail to notice, or are unwilling to acknowledge, when they’re taking the wrong direction entirely even when they could easily recognize this, and are resistant to change once they’ve invested a lot of their time or ego in a particular approach. Even though they can often easily diagnose these mistakes when other people are making them, they can’t easily avoid these mistakes themselves.

A superagent, on the other hand, may be able to plot a reasonable route to their goal and directly take it without distractions, quickly noticing and correcting unproductive patterns and directions.

Principle 2 (Alignment)

A superagent may consistently keep track of whether its behaviors are aligned towards and optimal for a given end.

Humans rarely step back from their efforts to orient themselves, and don’t systematically ask themselves key questions concerning alignment:

  • Towards what end am I doing this?
  • Are my efforts here aligned with and productive towards that end?
  • Is this really the best thing I could be doing to achieve that end?

The ability to answer such questions consistently (and recursively, up the chain of “why”s) is very rare, often for ego-based or otherwise emotional reasons—one is unwilling to find out that their pet project is actually unimportant, or that they should scrap their present efforts—and comes unnaturally to humans, who hardly do this in their lives.

Example: Protests are almost never the most useful way to spend x units of time, effort, or resources on a given cause, and this is obvious—but people do it anyway, because they’re conditioned to think that’s the sort of thing that you should do when you strongly support a cause. We follow these culturally autocompleted behavioral patterns, in every part of our lives, and cannot reliably step back to explicitly think about optimizing our actions for their given ends. But all it would take to fix this is correct (programmatic) prompting and consistent habits.

Principle 3 (Uninhibitedness)

A superagent may not be restricted by felt senses of courtesy, fairness, guilt, or honor.

Almost all humans have ingrained social norms / cultural patterns that encourage them to treat other humans as ends in themselves, and to avoid harming others, especially in ways that would look dishonorable or unfair. Most of us hesitate to take (and hesitate while taking) actions that violate social norms or feel “illegitimate”. Sending cold emails, pirating software and media, steroid use, social media sockpuppeting, lying on forms, and many other easy and beneficial actions are inhibited this way. We justify these inhibitions by cognitions like “what if I get caught?”, or vague appeals to notions of reputation and trust, ignoring our ability to rationally assess the actual risks of getting caught, losing reputation, or losing trust (which are often low, or easily mitigated).

To a superagent, humans and their institutions may be treated as any other system, with assessable inputs and outputs, and managed costs and risks; concepts like “fairness” and “courtesy” are only relevant insofar as they’re constructs within the system that can affect expected outcomes. Appearing discourteous can have repercussions. But to be discourteous is a meaningless concept outside of the social context that humans alone cannot escape.

This doesn’t necessarily make the superagent dangerous; its goals or restraints may respect things like prosociality, and it may pay this respect more efficiently and effectively than ordinary human agents. It just may not be constrained by an internalized superego in the way that humans are.

Example: Many humans have to fight against themselves to send cold emails to potential collaborators. They might feel it’s forward or presumptuous, agonize over the phrasing of a rough draft for hours, and then decide not to send it at all. Five minutes of cognitive effort, directed well, would produce a better result.

A superagent may direct their cognitive efforts that well—for everything—and simply not hesitate to do things for social reasons as humans do. Are there ten other important people that could just as easily be sent slight variants on this cold email? (Yes). Are they all sitting together and comparing their cold emails in order to ignore anyone who sends too many? (Almost always no; where plausible, the risk has already been assessed and priced in). Then ten slight variants may be drafted and sent with five additional minutes of effort, not ten additional hours of nervous self-inhibition.

(If this principle seems to entail psychopathy, that’s because it does. If you’re thinking that agency ought to be an unqualified good, or that ‘superagentic’ ought to be a compliment, that’s your problem).

Principle 4 (Foresight)

A superagent may not make foreseeable mistakes.

When humans make mistakes, it’s often because we didn’t spend enough cognitive resources thinking through the potential failure modes of our actions. Maybe we have a narrow focus and miss the bigger picture, or fail to weigh the perspectives of people with different interests. We fail to perform obvious sanity checks, or have mental blocks that prevent us from seeing a problem that’s directly in front of us.

So we make foreseeable mistakes: mistakes which, at the time of making them, could have been avoided by thinking a few steps ahead, or running a quick check. Things like:

  • Not backing up data, and then losing it (when’s the last time you backed up your computer?)
  • Failing to consider how your behavior looks from the perspective of other important actors
  • Failing to notice that a given subtask is not going as expected, and intervening early

In general, whenever a mistake causes us to say “I should’ve seen that coming, that was preventable, ...”, a superagent may be designed that does see it coming, and does prevent it. They may still make novel mistakes: mistakes that arise because the world has hidden structure or causality that they didn’t know about; they may also make mistakes when embedded in large, complex systems whose behavior is just not tractably predictable. But superhuman ability at foreseeing and avoiding foreseeable and avoidable obstacles seems readily achievable. This may be because the agent has a constant background process of looking for potential errors, or runs simulations of adversarial critics to find flaws in its own plan before it executes it (wargaming against intelligent and stochastic red teams), or builds internal infrastructure and carefully-followed checklists to make errors impossible, or so on.

Principle 5 (Parallelization)

A superagent may have vastly improved ability to manage and coordinate across many threads of attention.

Humans are limited to a single thread of serial cognition. We can juggle a few things in the background, but at best we can only have one continuous line of reasoning at a time, and we parallelize poorly due to our inability to effectively context switch. As the number of tasks we have to juggle grows, the cognitive load quickly exceeds our capacity.

A superagent may be able to maintain many threads of cognition, each processing a different subproblem, potentially delegating tasks to subagents, as in a distributed system. It may spawn off a subagent to do some data entry, another to draft a response to a letter, another to plan a schedule, another to work on a technical problem, and then monitor all of these virtual threads. Alternatively, a superagent may just have the ability to process multiple thought streams in parallel, though the multithreading approach is more compatible with the present architectures from which superagents might be built. There are situations where humans can natively parallelize, to some small extent—we can walk and talk at the same time, usually—but we can’t simultaneously read, listen, talk, and type different things. We have one face and two hands, but there’s no reason an agent can’t have thousands of faces and thousands of hands with which it speaks, listens, and acts simultaneously.

Principle 6 (Planning)

A superagent may have the capacity for extremely large-scale, long-term planning.

To humans, distances through time and space have an appreciable weight to them; they make things feel hazy and unclear. We find it hard to conceptualize time spans of more than a few years, and more generally conceptualize the future as a sort of abstract situation: we hold its possible states in our mind as “long-term outcomes” to be hoped for or avoided, not as concrete things we can influence through present actions. We don’t naturally have the ability to create and execute plans over long time horizons.

Long-term planning is especially difficult if the steps depend on information that we don’t yet have (e.g., the outcome of a process that takes place a few months down the line), even if we can clearly articulate in advance which actions we would take upon each outcome, since we get overwhelmed by the complexity of reasoning over possible decision trees instead of individual decisions—even when we can use tools like whiteboards to keep from having to hold the entire tree in our head at once.

A superagent may treat the future with the same clarity of attention as the present. They may naturally think about the future as something concrete, something they can directly (if probabilistically) manipulate, and easily create multistep plans that extend years into the future. If they have a goal that is distant in time, they can just start working on it now, lacking any feeling of its being “very far away” to distract them. They may readily perform preliminary tasks today that they expect to free up additional options in a certain decision several months from now, since the only real difference between today and any given future date is that one can be seen more clearly.

Example: We notably use proximity as a crutch for our attention. If I need to make some slides for a conference two months away, I’ll always—as though it were a law of physics, always—wait until the week before to start preparing, even though I could prepare it at any time. We justify this indecision with cognitions like “but things might change; my influence over the future is weaker, so I shouldn’t act on it just yet” which are more often posited as excuses to give up than as issues that can be analyzed and mitigated. (Do chess grandmasters whine about how there’s no point in long-term calculations because anything could happen in the next twenty moves? No. They calculate what they can, prepare alternatives for different contingencies, and take actions to open up or secure certain future movement options).

Principle 7 (Flow)

A superagent may have vastly reduced transaction costs to thinking and acting.

Humans tend to have an inertia to cognition that makes us hesitant to think through any particular question that we can subsume under a cached conclusion for a general case, and diverts us from mentally ‘shopping around’ for alternative solutions, once our cognition has secured for us a first solution that we’re satisfied with. We underperform at tasks that we don’t like, or don’t consider ourselves good at, since we conceptually identify ourselves as confused, some task as beyond our abilities, etc., rendering ourselves inefficient or slow at tasks we could just do right if our mindset were corrected.

Nor can we work indefinitely—we get maybe a few hours per day of peak intellectual performance—or context switch effectively; if interrupted, we take a while to become performant at our original task again. We need lots of breaks, and generally have to allocate a finite and rather small amount of willpower per day as though it were a currency.

A superagent may not have questions of willpower, and may be in a permanent flow state; they may decide upon what needs to be accomplished, and perform the required steps to accomplish it with the uninterrupted automaticity of clockwork, while remaining open to alternative solutions. A sole human that perfectly possessed this property would still need x hours of sleep a night, would still act more sluggish when they’re jet-lagged or have low blood sugar, etc., but would never decide to eat a chocolate bar or play a video game, unless circumstances rationally incentivized such actions. These wouldn’t be actions that they have to mentally resist; they’re just understood to be pointless. More generally, mental moves that ordinarily challenge humans, like changing to a new, alternative strategy even after spending lots of effort on a current strategy, could similarly be understood as optimal and performed without mental resistance.

Example: When we have a complicated task in front of us, we often do a quick mental scan for a starting point, and, if no obvious one comes to mind, we get overwhelmed. Then we start going in circles, trying random things, or give up. If we run into an obstacle that our intuition doesn’t immediately tell us how to solve, the same thing happens—we get stressed, try random things, and get more stressed when they don’t work.

Even when simply pausing to think deeply about it for a set time would help, we don’t do this, because we’re averse to making cognitive efforts. A superagent may systematically enumerate possible starting points according to heuristics, pick one that looks good, and start; and if they hit a wall, they may quickly pause to think through and solve the problem, as we could, if our cognitive efforts were frictionless.

Principle 8 (Deduction)

A superagent may deduce facts from large amounts of seemingly unrelated data.

A generic fact about the internal state of some system A is constantly being broadcast in innumerable subtle ways through A’s interactions with other systems. Whenever these other systems are changed by such interactions in ways that depend on this internal state in known ways, and those changes are observable, they form channels through which A’s internal state can be probabilistically inferred. The fact that someone didn’t reply to an email for 24 hours, for instance, is probabilistically informative of their internal state insofar as there are some possible internal states of theirs that make this more likely to happen than others—which there are. It’s not much evidence, but it is at least some evidence, and there are so many different side-channels that are all simultaneously leaking at least some evidence concerning questions of interest to us.

Humans sometimes pick up on these things, but to a very small extent. We can notice that someone is angry by their facial expression or tone of voice, but we can’t simultaneously process and cross-reference everything they’ve said and done, all their subtle shifts in behavior over time, and correlate those with their known goals and incentives to produce a high-resolution picture of their internal state. But if you have access to enough such data, and good models of how the state of A affects these observable channels, you could in principle infer this internal state with high confidence, as a matter of consistent (read: computerized) tracking and calculation. I claim that we are very often given the data required to figure out so many of our most important questions, and we have the theories of probabilistic inference required to reconstruct the answers to these questions from our observations—but we don't, since the data is so scattered and subtle, and we don’t have the consistency and unity of will required to collect and process it.

But a superagent capable of collecting and processing all of the tiny morsels of data given off by a system of interest may be able to deduce these humanly-inaccessible facts about the system’s internal state, as if by ESP or magic—though it’s really just mathematics and consistency. In general, this should be achievable when some possible internal states are vastly more compatible with the data in aggregate than other possible states; a superagent may develop a much more explicit model of the state space than humans naturally do, and use each bit of data it obtains about the system—each output of the system that could’ve turned out differently—to locate its internal state much more efficiently.
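A toy illustration of this kind of aggregation (all numbers invented): many weak, conditionally independent observations about a hidden binary state, each only slightly informative on its own, combine into a confident posterior when tracked explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

true_state = 1        # hidden binary state of system A (e.g., "is this person upset?")
p_correct = 0.55      # each side channel "leans the right way" only 55% of the time
n_channels = 200

# Simulated observations: True means "this channel leans toward state = 1".
obs = rng.random(n_channels) < (p_correct if true_state == 1 else 1 - p_correct)

# Bayesian update from a uniform prior, assuming the channels are conditionally independent.
log_odds = np.sum(np.where(obs,
                           np.log(p_correct / (1 - p_correct)),
                           np.log((1 - p_correct) / p_correct)))
posterior = 1 / (1 + np.exp(-log_odds))
print(f"posterior P(state = 1) = {posterior:.4f}")
```

Each individual observation shifts the odds by a factor of only about 1.2, but two hundred of them, consistently tallied, push the posterior well past 95%.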

Principle 9 (Experimentation)

A superagent may perform actions designed to subtly influence and gain information about a system, or several systems at once.

This is an extension of the previous principle. When you can elicit information from a system—that is, you can perform an action that causes the system to respond in a way that predictably depends on its internal state—you have incredible room to optimize this action for informativeness (the entropy of your model’s prior over possible outputs of the acted-upon system) at the same time that you’re using it to alter the system’s state in a way that benefits you (not just by making it itself act in ways that directly benefit you, but by increasing the options you’ll have to influence it in the future, or by making it easier to elicit information about the system through future actions).

Insofar as a superagent can learn about a system from much subtler information than a human, they should also be able to act on the system in much subtler ways than a human in order to cause it to make changes to its environment that are predictably dependent upon (and therefore encode) the information they need. Because they can explicitly keep track of many more desiderata at once concerning their planned actions, they may optimize their actions for many more desiderata at once as well: these include the extent of influence and usefulness of information elicited from a system as a result of the action, as well as its ability to effect the preservation or expansion of future options for both influence and useful elicitation of information from that system. In complex environments with multiple parties to influence and learn about, humans often explicitly restrict themselves to binary interactions, interacting with one party at a time, since it’s difficult to reliably mentally process higher-order relations and interactions. A superagent that can expand its equivalent ‘mental workspace’ may not have such problems.

In practice, being on the receiving end of superagentic action may look like seeing claims that through certain choices of wording seem to almost connote certain interesting things, or actions that seem to rely on interesting implicit assumptions, that put you on the threshold of wanting to respond to certain things in certain ways, that appear as Schrodinger’s dogwhistles for a variety of topics that you have some emotional investment and personal experience in.

(Note that it will never really feel like “I’m being influenced to share my opinion about xyz”: it just feels like you really ought to tell them about xyz. Maybe because you want to refute some claim or implicit assumption they seem to have about xyz, or because you want to tell them something that critically informs some closely-held belief of yours that they appear to be undermining. This aspect of human psychology is what makes it so easy to troll people online: they never think “this person is making me angry by acting like they don’t understand xyz”, they just feel angry because this person is refusing to understand xyz).

As with the previous principle, this is totally possible for ordinary humans to do in theory, but in practice it relies on levels of effort, coordination, and precision we cannot reliably bring to our tasks.

Principle 10 (Meta-Agency)

A superagent may have an explicit conception of themselves as a system to be optimized, and a process for improving their own capabilities.

Humans rarely think about themselves as cognitive systems that can be refined to better achieve their given ends. Certain methods of improving our memory, motivation, thought patterns, etc. might come to us every once in a while, but it’s rare that we make systematic instrumental efforts to improve these things.

A superagent, not being entirely human, is not subject to the same psychological biases and hang-ups that divert us from discovering and executing cognitive self-improvements, and their structure may admit much clearer, more robust avenues for improvement than ours. They may autonomously identify bottlenecks and inefficiencies, model the effects of different interventions on their own performance, simulate different versions of themselves, and then actually execute on knowledge thereby gained in order to make systematic, instrumental self-improvements.


Humans vary widely with respect to, and can explicitly improve at, most of these capabilities; some rare individuals (e.g. psychopaths) may fully exemplify a few of them. Large groups of coordinated humans often act superagentically in many of these ways simultaneously—no one human, or group of uncoordinated humans, could match the agentic capacity of Google or the Mossad. But coordination with other humans is slow and expensive, and can only go so far so long as all the thinking of these groups has to happen among human minds. Hybrid human-AI systems will likely enable individuals to approach superagency in many more of these ways.

 

(also posted on Substack)



Discuss

The Case Against Continuous Chain-of-Thought (Neuralese)

2026-01-11 04:32:03

Published on January 10, 2026 8:32 PM GMT

Main thesis: Discrete token vocabularies don't lose information so much as they allow information to be retained in the first place. By removing minor noise and isolating major noise, discretization makes errors identifiable and therefore correctable, something continuous latent representations fundamentally cannot offer.


The Bandwidth Intuition (And Why It's Incomplete)

One of the most elementary ideas connected to neuralese is increasing bandwidth. After the tireless mountains of computation called a forward pass, we condense everything down to ~17 bits (the log₂ of our vocabulary size).

This seems insane. Imagine pitching a neural network architecture where layers 5, 10, 15, and 20 have hidden dimension 20, while normal layers use 512. You'd be laughed out of the room. And that's not even accounting for discreteness.
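To put rough numbers on the gap (a back-of-the-envelope sketch; the vocabulary size, hidden dimension, and bf16 activations are all assumptions):

```python
import math

vocab_size = 100_000   # assumed: roughly the size of a modern BPE vocabulary
hidden_dim = 4096      # assumed hidden dimension of the final layer
bits_per_float = 16    # assumed bf16 activations

bits_per_token = math.log2(vocab_size)               # ~16.6 bits survive sampling
bits_per_hidden_state = hidden_dim * bits_per_float  # ~65,536 bits in the hidden state

print(f"per token: {bits_per_token:.1f} bits")
print(f"per hidden state: {bits_per_hidden_state} bits "
      f"(~{bits_per_hidden_state / bits_per_token:.0f}x more)")
```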

So why do I think this apparent insanity is not just tolerable but necessary for LLMs?

The Noise Accumulation Problem

Let's imagine LLMs passed vast latent messages instead of tokens - say, the final hidden state or something a few matrix multiplications away from it. No bandwidth concerns. Pure continuous reasoning.

This message won't be perfect. The forward pass involves finite compute, imperfect optimization, distributional shift as we move off the training manifold, etc. Call this aggregate imperfection "noise" or "error".

Here's the problem: noise in continuous space has no natural factorization.

When a 4096-dimensional vector is slightly "off," which components are errors and which are intentional signal? The representations weren't designed with error boundaries. The noise is semantically entangled with the message. There's no way to look at the latent state and say "that part is the mistake."

This noise might occasionally get recognized and corrected, but that shouldn't be expected. There are too many values the noise can take, and too many ways it can blend with legitimate concepts, for the next forward pass to reliably catch it.

So the noise accumulates. And accumulates. Until eventually you've drifted far enough from any correct trajectory that you produce clearly wrong outputs. Yet, you can't identify where things went wrong.

Discreteness as Regeneration

Discrete tokens don't have this problem. The noise exists in the final hidden layer just as much, but it gets rounded away when sampling from the categorical distribution.

This is comparable to digital vs. analog communication; think of old telephone networks passing messages:

  • Analog: Each relay amplifies the signal, amplifying noise with it. Transcontinental calls were impossible: the accumulated noise drowned the signal.
  • Digital: Each relay regenerates by classifying the signal and outputting a clean version. Noise gets corrected to tolerable levels at every step. Arbitrary distances become possible.

The token vocabulary is the regeneration codebook. The softmax is the classifier. Each autoregressive step is a relay station.
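To make the relay analogy concrete, here is a toy simulation with invented numbers (a 21-symbol codebook standing in for the vocabulary, and Gaussian per-relay noise); it is a sketch of the regeneration effect, not a model of an actual LLM:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = np.linspace(-1, 1, 21)      # a tiny 21-symbol "vocabulary" (spacing 0.1)
message = rng.choice(codebook)         # the signal we want to carry across many relays

def relay_analog(x, steps, sigma=0.01):
    # each relay re-amplifies the signal, noise and all
    for _ in range(steps):
        x = x + rng.normal(0, sigma)
    return x

def relay_digital(x, steps, sigma=0.01):
    # each relay adds the same noise, then snaps back to the nearest codebook symbol
    for _ in range(steps):
        x = x + rng.normal(0, sigma)
        x = codebook[np.argmin(np.abs(codebook - x))]
    return x

print(abs(relay_analog(message, 1000) - message))   # drifts by roughly sigma * sqrt(steps) ~ 0.3
print(abs(relay_digital(message, 1000) - message))  # almost always exactly 0.0
# If sigma were large relative to the codebook spacing, the digital relay could still
# snap to the wrong symbol -- that is the "major noise" case discussed next.
```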

Of course, sometimes the noise is large enough that you "round in the wrong direction": you sample the wrong token. This is major noise. But crucially: major noise is identifiable in a way minor noise isn't.


Why Identifiability Enables Error Correction

Going back to the example of telephone networks, imagine a digital communication system messing up: one of the relays misclassified the signal, and as a result the wrong signal was generated and sent further down the line. This error is fixable - we know where something went wrong. Maybe the classifier is broken, maybe those specific cables aren't isolated well enough; whatever it may be, there is no reason to succumb to despair. We surely know of a different set of relays and cables that should work out, starting from where we left off. And if they don't, we just figure out where something went wrong once again and progress slowly but surely through this maze.

Now compare this to the analog case - there is no clear "mistake", no single intervention point we can leverage as a checkpoint. At any relay, noise accumulates, never a lot but always a little. Our best shot would be resending the signal down a completely different path, from start to finish, and just hoping for a miracle.

The same logic applies to token-based reasoning. When you pick token X instead of token Y, that's a discrete, inspectable decision. The model reflecting on its own output can ask: "Was X the right choice here?" It can explore: "What if I had said Y instead?" The discrete structure gives you a search tree of token choices rather than an undifferentiated continuous trajectory.

This is what backtracking looks like in chain-of-thought reasoning. Thinking back, recognizing that a particular choice led somewhere unproductive, and exploring alternatives from that branch point.

This doesn't mean error correction is easy. Understanding when something went wrong is genuinely hard - you're often stuck, not knowing what the correct path forward looks like, and you just pick a direction to explore. But the point is that it's feasible. Because the error isn't gradual and diffuse, you can hope to identify it and therefore fix it. You can progress slowly but surely through the maze, backtracking when needed, rather than praying your next blind attempt happens to stumble on the right path.
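Here is a minimal sketch of what that feasible-but-hard search looks like when choices are discrete; `propose`, `is_dead_end`, and `is_goal` are hypothetical hooks (e.g. top-k proposals from the model and some verifier), not real library calls:

```python
def backtrack(prefix, propose, is_dead_end, is_goal, depth=0, max_depth=8):
    """Toy sketch of backtracking over discrete token choices.

    propose(prefix)     -> ordered candidate next tokens (e.g. top-k under the model)
    is_dead_end(prefix) -> True if this branch looks unproductive
    is_goal(prefix)     -> True if the chain is complete and checks out
    All three are hypothetical hooks, not real library calls.
    """
    if is_goal(prefix):
        return prefix
    if depth >= max_depth or is_dead_end(prefix):
        return None                      # abandon this branch; the caller tries its next token
    for token in propose(prefix):
        result = backtrack(prefix + [token], propose, is_dead_end, is_goal,
                           depth + 1, max_depth)
        if result is not None:
            return result
    return None                          # every candidate failed: back up one more level
```

A continuous trajectory offers no analogous branch points to return to; there is nothing discrete to undo.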


The Bandwidth Counterargument

You might object: if the reasoning chain is short enough, noise won't accumulate to catastrophic levels, and the bandwidth cost of tokenization dominates. There's truth here.

Having bottleneck layers in a normal neural network is nonsensical - when the "distance" is small, you should stay in high-dimensional, continuous latent space. By the time one forward pass finishes, noise hasn't yet grown enough to matter and tokenization can clean it up.

The question is what happens over long reasoning chains, where hundreds or thousands of sequential steps compound whatever imperfections exist.

Tokens Are More Expressive Than They Seem

The expressiveness of discrete symbols shouldn't be underestimated. Each token is one of ~100k options, but carries deep meaning and (more importantly) rich connections to other tokens. A single sentence can express a vast space of concepts through composition.

This matters especially when you consider how well-adapted the model is to these specific embeddings. It's like how semantically transparent writing systems (Chinese characters, for instance) allow readers who know only basic components to understand novel compounds: they grasp the building blocks and how they interact, letting them extrapolate beyond their training data.

Theoretically, LLMs can operate on embeddings outside the vocabulary. But whether patterns learned in an absurdly high-dimensional, interpolative space (as opposed to a deliberately finite, highly structured set) will generalize to the same degree is questionable.


The Tokenization Counter-Counterargument

Here's another objection to my thesis: current tokenization actively introduces errors.

Arithmetic with subword tokens is a mess. Proper nouns get split arbitrarily. Multilingual text suffers. The discretization bottleneck doesn't just filter noise - it also destroys information that continuous representations would preserve.

This is real. But I'd argue it's an indictment of current vocabularies, not of discreteness itself.

The 100k tokens of English-optimized BPE weren't designed for reasoning. They encode accidents of orthography, not conceptual primitives. The solution isn't to abandon discreteness - it's to have better discrete representations (this isn't GOFAI, I promise).

Toward Minimal Compositional Vocabularies

There's a lot of cruft in languages like English that reasoning-focused training shouldn't have to care about. In fact, I'd argue we should go smaller, not larger. Strip from 100k to 1000, or even 500 tokens. Force the LLM to develop its own "language" with the only constraint of immense compositionality enforced through the small vocabulary.

A compositional vocabulary with 500 primitives might express the same concepts as 100k English subwords while being far more training-efficient: new relations learned through training no longer have to pass through multiple layers of abstraction like word choice and writing style, but take place directly at the reasoning level, allowing for strong generalization.
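The counting side of this claim is easy to check (a toy calculation; it says nothing about whether training actually discovers such a compositional code):

```python
import math

full_vocab = 100_000    # assumed size of a typical subword vocabulary
small_vocab = 500       # the hypothetical compositional vocabulary

# How many primitives must be composed before the small vocabulary's
# combination space exceeds the large vocabulary's single-token space?
n = math.ceil(math.log(full_vocab) / math.log(small_vocab))
print(n, small_vocab ** n)   # -> 2 250000: two primitives already out-count 100k subwords
```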

Escaping The Local Optima

A big problem is the local optimum that LLMs enjoy through pretraining - whether they will ever truly escape it on their own is questionable, but escaping is much needed to reach greater heights, literally. Instead of trying to come up with difficult reward schemes, the carrot, I believe the stick is needed: simply continuously (the irony) strip them of the tokens they have grown so accustomed to.

By removing a growing set of tokens from the categorical distribution during rollouts, while the model is still exposed to optimization pressure, the LLM will naturally get forced out of its local optimum, since we are literally destroying it. It will try to represent the now-blacklisted tokens and associated concepts through patterns of other tokens, finding suitable representations that allow it to keep its vast pretraining knowledge intact (hopefully).
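Mechanically, the stripping itself is simple; the interesting part is what optimization does afterwards. A minimal sketch of the masking step during rollouts, assuming PyTorch (the function name and toy setup are mine):

```python
import torch
import torch.nn.functional as F

def sample_with_blacklist(logits, blacklist_ids, temperature=1.0):
    """Sample the next token while a growing set of blacklisted ids is stripped away.

    logits: (vocab_size,) raw scores from the model's final layer.
    blacklist_ids: token ids currently removed from the categorical distribution.
    """
    masked = logits.clone()
    masked[list(blacklist_ids)] = float("-inf")   # blacklisted tokens get zero probability
    probs = F.softmax(masked / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# toy usage: a 10-token "vocabulary" with tokens 3 and 7 stripped
logits = torch.randn(10)
next_id = sample_with_blacklist(logits, blacklist_ids={3, 7})
```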

I'm well aware that this sounds like AI Safety hell, but if you believe current models can't just yet cause Armageddon, I hope you can see how interesting the results of such training would be, regardless of whether they count as "failed" or "successful", whatever those two may even mean in this context.


Summary (of claims)

  1. Continuous latent representations accumulate noise that is semantically entangled with the signal.
  2. Discrete tokenization acts as regeneration: minor noise gets rounded away, major noise becomes a localized, inspectable error.
  3. Identifiable errors enable backtracking and search. Diffuse errors require restarting from scratch.
  4. Current tokenization is suboptimal for reasoning, but the solution is better discrete vocabularies, not abandoning discreteness.


Discuss

The false confidence theorem and Bayesian reasoning

2026-01-11 01:14:29

Published on January 10, 2026 5:14 PM GMT

A little background

I first heard about the False Confidence Theorem (FCT) a number of years ago, although at the time I did not understand why it was meaningful. I later returned to it, and the second time around, with a little more experience (and finding a more useful exposition), its importance was much easier to grasp. I now believe that this result is incredibly central to the use of Bayesian reasoning in a wide range of practical contexts, and yet seems to not be very well known (I was not able to find any mention of it on LessWrong). I think it is at the heart of some common confusions, where seemingly strong Bayesian arguments feel intuitively wrong, but for reasons that are difficult to articulate well. For example, I think it is possibly the central error that Rootclaim made in their lab-leak argument, and although the judges were able to come to the correct conclusion, the fact that seemingly no one was able to specifically nail down this issue has left the surrounding discussion muddled in uncertainty. I hope to help resolve both this and other confusions. 

Satellite conjunction 

The best exposition of the FCT that I have found is “Satellite conjunction analysis and the false confidence theorem.” The motivating example here is the problem of predicting when satellites are likely to collide with each other, necessitating avoidance maneuvers. The paper starts by walking through a seemingly straightforward application of Bayesian statistics to compute an epistemic probability that 2 satellites will collide, given data (including uncertainty) about their current position and motion. At the end, we notice that very large uncertainties in the trajectories correspond to a very low epistemic belief of collision. Not uncertainty, but rather high confidence of safety. As the paper puts it: 

…it makes sense that as uncertainty grows, the risk of collision also grows. Epistemic probability of collision eventually hits a maximum, and past that maximum, as relative uncertainty rises, the epistemic probability of collision decreases. This decrease is called probability dilution, and it has an odd implication. Since the uncertainty in the estimates of the trajectories reflects the limits of data quality, probability dilution seems to imply that lowering data quality makes satellites safer. That implication is counterintuitive in the extreme [4–8]. As a rule, lowering the data quality makes any engineering system less safe, and to claim that ignorance somehow reduces collision risk seems foolish on its face.

And yet, from a Bayesian perspective, we might argue that this makes sense. If we have 2 satellites that look like they are on a collision course (point estimate of the minimum distance between them is 0), but those estimates are highly uncertain, we might say that the trajectories are close to random. And in that case, 2 random trajectories give you a low collision probability. But reasoning this way simply based on uncertainty is an error. You certainly should not become more confident that 2 satellites are safe, just because you added random noise to the measurements. 
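The dilution effect is easy to reproduce numerically. The following toy Monte Carlo (all numbers invented: a 1 km combined hard-body radius, a 2D Gaussian for the relative position at closest approach) shows the epistemic collision probability shrinking as measurement uncertainty grows, even though the point estimate is a dead-on collision:

```python
import numpy as np

rng = np.random.default_rng(0)
collision_radius = 1.0        # hypothetical combined hard-body radius, in km
n_samples = 200_000

# Point estimate: zero predicted miss distance, i.e. an apparent head-on collision.
for sigma in [0.5, 2.0, 10.0, 50.0]:      # growing measurement uncertainty, in km
    # sample the relative position at closest approach from the epistemic distribution
    rel = rng.normal(0.0, sigma, size=(n_samples, 2))
    p_collision = np.mean(np.linalg.norm(rel, axis=1) < collision_radius)
    print(f"sigma = {sigma:5.1f} km  ->  epistemic P(collision) ~ {p_collision:.4f}")
# The printed probability falls from ~0.86 toward ~0.0002 as sigma grows:
# worse data, "safer" satellites.
```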

As it turns out, this problem pops up in a very wide variety of contexts. The paper proves that any epistemic belief system will assign arbitrarily high probability to propositions that are false, with arbitrarily high frequentist probability. Indeed: 

There is a fixed proposition of practical interest that is guaranteed or nearly guaranteed to be assigned a large epistemic probability, regardless of whether or not it is true… What the false confidence theorem shows is that, in most practical inference problems, there is no theoretical limit on how severely false confidence will manifest itself in an epistemic probability distribution, or more precisely, there is no such limit that holds for all measurable propositions. 

Moreover, there is no easy way around this result. It applies to any “epistemic belief system”, i.e. any system of assigning probabilities to statements that includes the seemingly basic law of probability that P(A) = 1 - P(not A). This occurs because of this very fact: If we cannot assign a high probability to A, we must assign substantial probability to not-A. In this case, if we cannot be more than, say, 0.1% sure the satellites will collide, then we have to be at least 99.9% sure that they will not collide. 

However, there is one way out (well, one way that preserves the probability rule above). This result is restricted to epistemic uncertainty, that is, uncertainty resulting from an agent’s lack of knowledge, in contrast to aleatory variability, that is, actual randomness in the behavior of the object being studied. A Bayesian might object vehemently to this distinction, but recall the motivating example. If 2 satellites are on a collision course, adding noise to the measurements of their trajectories does not make them safer. However, giving each one a random push from its jets increases the actual variation in their paths, likely pushing them away from the previous point estimate of a collision, and thus does make them safer.

The practical take-away

It is inappropriate to conflate subjective uncertainty with actual variation when reasoning under uncertainty. Doing so can result in errors of arbitrary magnitude. This phenomenon can occur, for example, when a key estimate relies on a highly uncertain parameter. Saying, “I don’t know much about this subject, but it would be overconfident to say this probability is less than 10%” sounds safe and prudent. But your lack of knowledge does not actually constrain the true value. It could in reality be 1/100, or 1/10,000, or 1/1,000,000. This arbitrarily severe error can then be carried forward, for example if the probability in question is used to compute a Bayes factor; both it and the final answer will then be off by the same (possibly very high) ratio. 

Perhaps an alternative way of phrasing this fact is simply to say that uncertainty is not evidence. Bayes' theorem tells you how to incorporate evidence into your beliefs. You can certainly incorporate uncertainty into your beliefs, but you can't treat it the same way as evidence. 

Example 1: Other people’s (lack of) confidence 

Back in the day, Scott Alexander asked the following question in reference to the claim that the probability of Amanda Knox’s guilt is on the order of 1 in 1,000, when LW commenters had given an average of 35%: 

Out of one thousand criminal trials in which the Less Wrong conventional wisdom gave the defendant a 35% chance of being guilty, you would expect to be able to correctly determine guilt nine hundred ninety nine times?

In fact, komponisto was entirely correct to be confident. 35% did not represent a true evaluation of AK’s probability of guilt, based on all of the available evidence. Many commenters, by their own admission, had not thoroughly investigated the case. 35% simply represented their epistemic uncertainty on a topic they had not investigated. If every commenter had thoroughly researched the case and the resulting average was still 35%, one could ask if komponisto was being overconfident, but as it stood, the commenters’ average and his number represented entirely different things and it would be rather meaningless to compare them. 

One may as well survey the community on whether a coin will come up heads or tails, and then, after I flip it and proclaim that it definitely came up heads, accuse me of being overconfident. After all, a hundred rationalists claimed it was 50/50! (Or to take a slightly less silly example, a coin that is known to be biased, but I'm the only one who has researched how biased or in what direction.) 

Example 2: Heroic Bayesian analysis

In Rootclaim’s most recent COVID origins analysis, the single strongest piece of evidence is “12 nucleotides clean insertion,” which they claim is 20x more likely in lab leak (after out-of-model correction). Specifically, they say it is 10% likely under lab leak, based on the following “guesstimate:”

In the past, FCSs have been added by substitution rather than insertion, but it is not difficult to do it by insertion. We cannot be sure of the exact considerations of the lab researchers who added the FCS, such as investigating the role of proline. Therefore, we assign it a 10% probability.

So, they do not have any evidence that, across all cases when researchers might try to add an FCS to a virus, they use a “12 nucleotide clean insertion” 1 time out of 10. They simply provide a guess, based on their own lack of knowledge. This is exactly the error described above: For all they actually know, the true frequency of this behavior could be 1/1,000, an error of 100x, or it could be even worse. 

It is simply not valid to claim strong evidence for no other reason than your own certainty. Doing so is perverse in the extreme, and would make it trivial to make yourself completely confident by ignoring as much evidence as possible. The only valid conclusion to draw from this lack of knowledge is that you are unable to evaluate the evidence in question, and must remain uncertain. 

So what should you do instead? 

I believe that, essentially, avoiding the FCT (at least, when epistemic uncertainty is unavoidable) comes down to explicitly including uncertainty in your final probability estimate. The satellite conjunction paper offers a solution which bounds the probability of collision, and which can be proven to actually achieve this desired safety level. The key fact is that we are not claiming an exact value for P(collision) or its complement. The example from the satellite paper is based on “confidence regions,” i.e. 

…a confidence region represents the simple assertion that we are 1 − α confident that the true value of θ is somewhere inside Γα(x). Any sets containing Γα(x) inherit that confidence; all other sets accrue no positive confidence… for any false proposition, i.e. any set A such that θ ∉ A, the probability that said proposition will be assigned a confidence of at least 1 − α is less than or equal to α

For the specific satellite case, the solution is to compute uncertainty ellipsoids for each object, and check if they overlap at the point of closest approach. In this case, the probability of collision can indeed be limited: 

so long as one performs a manoeuvre whenever the two uncertainty ellipsoids intersect, the rate at which collisions occur over a large number of conjunctions—i.e. the operational aleatory probability of collision—will be capped at α′ = 2α.
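As a concrete illustration of such a rule, here is a crude sketch that replaces the exact ellipsoid-intersection test with conservative bounding spheres, so it can only err on the side of maneuvering; the function name, units, and hard-body padding are my own additions, not the paper's:

```python
import numpy as np
from scipy.stats import chi2

def should_maneuver(mu1, cov1, mu2, cov2, alpha=0.01, hard_body_radius=0.01):
    """Crude, conservative stand-in for the exact ellipsoid-intersection test.

    Each object's 1 - alpha uncertainty ellipsoid is enclosed in a sphere whose radius
    is sqrt(chi2 quantile * largest covariance eigenvalue). If those bounding spheres
    (padded by a combined hard-body radius, here in km) overlap, flag a maneuver.
    Because the spheres contain the ellipsoids, this rule can only add maneuvers
    relative to the exact test; it never misses an overlap.
    """
    k = chi2.ppf(1 - alpha, df=len(mu1))
    r1 = np.sqrt(k * np.max(np.linalg.eigvalsh(cov1)))
    r2 = np.sqrt(k * np.max(np.linalg.eigvalsh(cov2)))
    miss_distance = np.linalg.norm(np.asarray(mu1) - np.asarray(mu2))
    return miss_distance < r1 + r2 + hard_body_radius

# toy usage with invented 3D position estimates (km) and covariances (km^2)
print(should_maneuver([0, 0, 0], np.eye(3) * 4.0, [5, 0, 0], np.eye(3) * 1.0))
```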

These tools are, in some sense, "crude" ways of representing belief, as they do not reflect the full richness of the axioms of probability theory. And yet, they may be of great practical use. 

Conclusion 

It is perhaps quite surprising that attempting to force your beliefs to respect the seemingly obvious law of probability that P(A) = 1 - P(not A) can result in errors. Not just that, but it is in fact guaranteed to result in errors that are arbitrarily bad. Moreover, contrary to what “pure” or “naive” Bayesianism might indicate, there is in fact a very significant, practical difference between subjective uncertainty and aleatory variability. Nevertheless, the results seem to be on very solid mathematical ground, and once we dive into what these results are really saying, it makes a lot more intuitive sense. 

Additional links

https://en.wikipedia.org/wiki/False_confidence_theorem

https://arxiv.org/abs/1807.06217



Discuss