LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

RSS preview of the LessWrong blog

The corrigibility basin of attraction is a misleading gloss

2025-12-06 23:38:45

Published on December 6, 2025 3:38 PM GMT

The idea of a “basin of attraction around corrigibility” motivates much of prosaic alignment research. Essentially this is an abstract way of thinking about the process of iteration on AGI designs. Engineers test to find problems, then understand the problems, then design fixes. The reason we need corrigibility for this is that a non-corrigible agent generally has incentives to interfere with this process. The concept was introduced by Paul Christiano:

... a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.
Thus an entire neighborhood of possible preferences lead the agent towards the same basin of attraction. We just have to get “close enough” that we are corrigible, we don’t need to build an agent which exactly shares humanity’s values, philosophical views, or so on.
In addition to making the initial target bigger, this gives us some reason to be optimistic about the dynamics of AI systems iteratively designing new AI systems. Corrigible systems want to design more corrigible and more capable successors. Rather than our systems traversing a balance beam off of which they could fall at any moment, we can view them as walking along the bottom of a ravine. As long as they don’t jump to a completely different part of the landscape, they will continue traversing the correct path.

Max Harms wrote about CAST, a similar strategy that relied on the same idea:

This property of non-self-protection means we should suspect AIs that are almost-corrigible will assist, rather than resist, being made more corrigible, thus forming an attractor-basin around corrigibility, such that almost-corrigible systems gradually become truly corrigible by being modified by their creators.

This post is about the reasons I expect this kind of iteration to miss the important problems. I expect empirical iteration on corrigibility to quickly resolve all the detectable problems and then run out of fuel before resolving the main problems. I’ve been occasionally trying to convince people of this for the past two years or so, but recently I’ve had some success talking about this in conversations (specifically with Max Harms and Seth Herd), so I’m hoping my explanation has become clear enough to be readable as a post.

Corrigibility

The basin of attraction around corrigibility is a very intuitive idea. It’s only a small extension to the general process of engineering by iterative improvement. This is the way we get almost all technological progress: by building terrible versions of things, watching them break, understanding what went wrong and then trying again. When this is too dangerous or expensive, the first solution is almost always to find workarounds that reduce the danger or expense. The notion of corrigibility is motivated by this; it’s a workaround to reduce the danger of errors in the goals, reasoning or beliefs of an AGI.

Eliezer’s original idea of corrigibility, at a high level, is to take an AGI design and modify it such that it “somehow understands that it may be flawed”. This kind of AGI will not resist being fixed, and will avoid extreme and unexpected actions that may be downstream of its flaws. The agent will want lots of failsafes built into its own mind such that it flags anything weird, and avoids high-impact actions in general. Deference to the designers is a natural failsafe. The designers only need to one-shot engineer this one property of their AGI design such that they know it is robust, and then the other properties can be iterated on. The most important thing they need to guarantee in advance is that the “understanding that it may be flawed” doesn’t go away as the agent learns more about the world and tries to fix flaws in its thinking.

I think of this original idea of corrigibility as being kinda similar to rule utilitarianism.[1] The difficulty of stable rule utilitarianism is that act utilitarianism is strictly better, if you fully trust your own beliefs and decision making algorithm. So to make a stable rule utilitarian, you need it to never become confident in some parts of its own reasoning, in spite of routinely needing to become confident about other beliefs. This isn’t impossible in principle (it’s easy to construct a toy prior that will never update on certain abstract beliefs), but in practice it’d be an impressive achievement to put this into a realistic general purpose reasoner. In this original version there is no “attractor basin” around corrigibility itself. In some sense there is an attractor basin around improving the quality of all the non-corrigibility properties, in that the engineers have the chance to iterate on these other properties.

Paul Christiano and Max Harms are motivated by the exact same desire to be able to iterate, but have a somewhat different notion of how corrigibility should be implemented inside an AGI. In Paul’s version, you get a kind of corrigibility as a consequence of building act-based agents. One version of this is an agent whose central motivation is based around getting local approval of the principal[2] (or a hypothetical version of the principal).

Max’s version works by making the terminal goal of the AGI be empowering the principal and also not manipulating the principal. This loses the central property from the original MIRI paper, the “understanding that it may be flawed”,[3] but Max thinks this is fine because the desire for reflective stability remains in the principal, so the AGI will respect it as a consequence of empowering the principal. There’s some tension here, in that the AI and the human are working together to create a new iteration of the AI, and the only thing holding the AI back from “fixing” the next iteration is that the human doesn’t want that. There’s a strong incentive to allow or encourage the human to make certain mistakes. CAST hopes to avoid this with a strict local preference against manipulation of the principal.

There are several other ideas that have a similar motivation of wanting to make iteration possible and safe. Honesty or obedience are often brought up as fulfilling the same purpose. For example, Paul says:

My overall guess is that it's usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want. But it still seems useful to go back and forth between these perspectives.

(These perspectives feel similar to me because "honestly tell me what's going on" seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)

I want to distinguish between two clusters in the above ideas: One cluster (MIRI corrigibility, ELK) involves theoretical understanding of the corrigibility property and why it holds. The other cluster (Paul’s act-based corrigibility, the main story given in CAST, and RL-reinforced human-recognizable honesty or obedience) centrally involve a process of iteration as the human designers (with the help of the AGI, and other tools) fix flaws that appear along the way. This second cluster relies heavily on the notion that there is a “basin of attraction” around corrigibility. The second cluster is the one I have a problem with.

My argument: The engineering feedback loop will use up all its fuel

The empirical feedback loop depends on being able to understand problems and design fixes. In a situation with a big foreseeable distribution shift, the feedback loop goes away after you patch visible issues and are only left with generalisation issues. This would be fine if we were in an engineering domain where we can clearly see flaws, understand the causes, and predict the effect of our patches on particular generalisation leaps. We are currently not, and it’s not looking like this will change.

Before I go into detail on this I want to set up how I’m thinking about the task of building AGI and the specific distribution shifts that are a barrier to iteration.

There are many fields where test-driven engineering is far from sufficient

Your task is to build a cargo ship. You can test it now, in a lake, and iterate on your design. Your task is to use this iteration process to build a ship, then tell your customer that the ship will survive a storm, after 10 years of wear and tear, in the middle of the ocean. Let’s pretend you have relatively unlimited money and labour so each lake test is cheap, and no one has built a ship like yours before.

Think through how this is going to go. In the early stages, you’ll build ships that leak, or have high drag, or react poorly when loaded in certain ways. All of these are easy to measure in your testing environment. You fix these problems until it’s working perfectly on all the tests you are able to perform. This part was easy.

Most of the difficulty of your task is in modelling the differences between your tests and reality, and compensating for those differences. For example, a storm puts more stress on the structure as a whole than your tests are capable of creating. To compensate for this, you need to make conservative guesses about the maximum stresses that waves are capable of creating, and use this to infer the necessary strength for each component. You’re able to empirically test the strength of each component individually.

Then you need to make guesses about how much components will wear down over time, how salt and sea life will affect everything, plausible human errors, sand, minor collisions etc.[4] All of these can be corrected for, and procedures can be designed to check for early signs of failure along the way. If done carefully you’ll succeed at this task.

The general pattern here is that you need some theoretical modelling of the differences between your tests and reality, across the distribution shift, and you need to know how to adjust your design to account for these differences. If you only fix the problems that become visible during lake testing, you won't end up with a robust enough ship. If you don't really understand each visible problem before fixing it, and instead apply blind design changes until the problem goes away, then you haven't the faintest hope of succeeding.

The corresponding story for iteratively building corrigible AGI

We get to test AI systems while they have little control over the situation, have done relatively little online learning, and haven't thought about how to improve themselves. As we use the AGI to help us do research, and help us design improvements, these things gradually stop being true, bit by bit. A storm in 10 years is analogous to the situation in an AI lab after an AGI has helped run 10 large-scale research projects since it was first created, where it has designed and implemented improvements to itself (approved by engineers), noticed mistakes that it commonly makes and learned to avoid them, thought deeply about why it's doing what it's doing and potential improvements to its situation, and learned a lot from its experiments. The difference between these two situations is the distribution shift.

Like with the ship, each time we see behavior that seems bad we can try to work out the cause of the problem and design a fix. One difference is that working out the cause of a problem can be more difficult in ML, because we lack much more than guesswork about the relationship between training and distant generalisation. Working out how to fix the issue robustly is difficult for the same reason.

Empirical iteration is particularly bad in the current field of ML, because of how training works. The easiest way to make a problem go away is by adding new training examples of bad behaviour that you’ve noticed and training against it. ML training will make that behaviour go away, but you don’t know whether it fixed the underlying generator or just taught it to pass the tests.

How do we know the distribution shift will “stress” parts of the AGI?

Why did I focus on the particular AI distribution shifts listed in the previous section? Essentially because I can think of lots of ways for these to “reveal” unintended goals that were previously not obvious.[5] If we want to work out what flaws might remain after an empirical iteration process, we need to think through changes to the AGI design that could pass that filter and become a problem later on. So we need to think through in mechanistic detail all the design changes that can cause different goals[6] under shifted conditions but be invisible under development conditions.

This is easier to work through when you’ve thought a lot about the internal mechanisms of intelligence. If you have detailed theories about the internal mechanisms that make an intelligence work, then it’s not that hard to come up with dozens of these. Even though we (as a field) don’t confidently know any details about the internals of future AGIs, visualizing it as a large complicated machine is more accurate than visualizing it as a black box that just works.

So communicating this argument clearly involves speculating about the internal components of an intelligence. This makes it difficult to communicate, because everyone[7] gets hung up on arguments about how intelligence works. But it’s easy to miss that the argument doesn’t depend very much on the specific internal details. So, as an attempt to avoid the standard pitfalls I’ll use some quick-to-explain examples anchored on human psychology. Each of these is an “error” in a person that will make them pursue unexpected goals under more extreme conditions, especially those extreme conditions related to trying to be better.

  • You can have a desire that isn’t realistically achievable from your current position, but that you would pursue under ideal conditions. You don’t really know about this desire until you think about it and explore it.
  • Habit-like heuristics can keep your behaviour looking good to your parents/teachers, but stop working once you really carefully examine them and work out when you do and don’t endorse following them.
  • You might internally seek approval from an imagined hypothetical overseer
    • If the hypothetical doesn’t become more detailed as your intelligence increases: It won’t handle complicated situations, and can be easily tricked. It can be easy to make excuses and win arguments against a hypothetical overseer.[8]
    • If the overseer is only invoked when you think the overseer knows more than you: Then at some point it becomes irrelevant.[9]
    • If you imagine a mentor at a high enough resolution, such that you know their biases and blindspots, then you can avoid actions they'd notice as problematic while still pursuing things they wouldn't approve of on reflection.
  • You may be vulnerable to ontology shifts after thinking about infinities and weird hypothetical situations.
  • Not to mention all the standard mental illnesses that humans fall into (although some of these aren’t quite what we want in an example, insofar as they damage overall competence at pursuing goals).

Similar lists can be found here or here, the first with more focus on how I expect AGI to work, and the second with focus on how Seth Herd expects it to work.

Obviously my examples here are anthropomorphising, but we can do the same exercise for other ways of thinking about internal mechanisms of intelligence. For example AIXI, or hierarchies of kinda-agents, or huge bundles of heuristics and meta-heuristics, or Bayesian utility maximisation, etc.[10] I encourage you to do this exercise for whichever is your favourite, making full use of your most detailed hypotheses about intelligence. My examples are explicitly avoiding technical detail, because I don’t want to get into guesses about the detailed mechanisms of AGI. A real version of this exercise takes full advantage of those guesses and gets into the mathematical weeds.

Tying this back to the basin of attraction

Here’s one concrete example of the basin of attraction argument: If the AI locally wants to satisfy developer preferences (but in a way that isn’t robust to more extreme circumstances, i.e. it would stop endorsing that if it spent enough time thinking about its desires), then it should alert the developer to this problem and suggest solutions. This gives us the ability to iterate that we want.

The AI may be able to use introspection to help notice some potential problems within itself, but for most of the important distribution shifts it’s in the same position as the AI researcher and is also speculating about consequences of the coming distribution shifts.

There’s a weaker version, where for example the AI has a moderately strong desire to always be truthful, but otherwise ultimately would prefer something other than helping the AI developers. The AI won’t particularly try to find “flaws” in itself, but if asked it’ll tell the truth about anything it has noticed. The humans don’t know how far to trust it, but it seems trustworthy to the limited extent to which they can test for that. In this version, there’s more responsibility resting on the humans, who have to take advantage of this apparent honesty to extract research and understanding to work out how they should iterate.

The main place where I think this story fails is that it doesn’t help much with the iteration loop running out of fuel. Even with the help of the AI, the humans aren’t that good at noticing failure modes on the hard distribution shifts, and aren’t very good at redesigning the training process to robustly patch those failure modes (without also hiding evidence of them if the patch happened to fail). We still lack the theoretical modelling of the distribution shifts, even with an AI helping us. If the AI is to help fix problems before they come up, it would have to do our engineering job from scratch by inventing a more engineerable paradigm,[11] rather than working by small and easily understandable adjustments to the methods used to create it.

A counterargument: The prosaic case

If I steelman a case for prosaic alignment research that I’ve heard a few times, it’d go something like this:

We all agree that after iterating for a while we won’t be sure that there are no further errors that are beyond our ability to test for, but still the situation can be made better or worse. Let’s put lots of effort into improving every part of the iteration loop: We’ll improve interpretability so we can sometimes catch non-behavioural problems. We’ll improve our models of how training affects generalized behavior so that we can better guesstimate the effect of changing the training data. These won’t solve the problem in the limit of intelligence, or give us great confidence, but every problem caught and patched surely increases our chances on the margin?

I agree with this, it does increase our chances on the margin, but it misses something important: As we run out of obvious, visible problems, the impact saturates very quickly. We need to decide whether to go down the basin of corrigibility pathway, or stop until we are capable of engineering corrigibility in a way that stands up to the distribution shifts.[12] To make this decision we need to estimate where the risk saturates if we follow the basin of corrigibility approach.

My best approach to estimating the potential for generalization failures[13] is by working through, in detail, all the changes to a hypothetical design of an intelligence that would be undetectable in testing but lead to undesired behaviour after a known distribution shift. Extremely roughly, we can estimate how crazy it would be for each one to go wrong. After thinking about each one, we can get a gut-level estimate based on how many difficult-to-notice-at-development-time errors seem plausible. This is the approach I would use to estimate risk if I were building a bridge, skyscraper, moon-lander or cargo ship (if there weren't better estimates available from past experience).

It’s epistemically tricky to estimate this without bias, and this trickiness is amplified by the holes in my mechanistic understanding of future AGI internals. The task of finding lots of problems like this could lead to overestimating the risk, by directing creativity at finding lots of extremely implausible flaws. This is one way that this kind of reasoning could end up being wrong. On the other hand, there’s a bias in this approach caused by anchoring on the intended mechanisms and relying on creativity to come up with unintended mechanisms that nonetheless pass all training and tests. However, I think it’s possible to calibrate your reasoning here by comparison to the ship analogy (and similar engineering tasks with big distribution shifts) and by calibrating against experience with ML generalisation.

I’ve worked through this exercise a few times, for different ways of thinking about AGI internal mechanisms, and this provides very rough estimates of the number of “degrees of freedom” in the design of an AGI conditioned on it passing all tests we are capable of performing. The conclusion I’ve drawn from doing this is that this iterative path to AGI development is near guaranteed to fail, in the sense that the intended corrigibility property won’t survive the distribution shifts.

More specific counterarguments

Slow, online adversarial training

One counterargument to the story I’m telling here is to propose slow, online adversarial[14] training. In this approach we try to go through every distribution shift in slow motion, in the hope that we can detect and patch problems as we go. We keep doing this all the way through the feared distribution shifts, so the AGI is actively helping with research and doing its own research. The hope here is that it helps with the “observe problems” side of the empirical feedback loop.

Here’s why I don’t expect it to help very much:

  1. The fuel still runs out on the “observe problems” part of the loop. You’re dealing with an above-human-level researcher AI. If at some point during that process the agent discovers it would prefer not to be perfectly corrigible, it’s at the level of intelligence and situational awareness where we plausibly don’t get any evidence of that. It’s not hard to pass corrigibility tests if you know you’re being tested.
  2. The “fix problems” part of the loop isn’t improved either, conditional on sticking to roughly the current paradigm. Adding examples to the training data needn’t fix the problems in a deep way, it can just paper over them. Then you’ve lost evidence of the problem.

Chain of thought gives us the transparency we need to iterate

Chain of thought ideally allows us to analyze the process that caused unexpected behavior, approximately diagnose what’s going on, and theorise about what fixes might solve the underlying problem. It’s hoped that this helps us with point one in the previous section.

I think this isn’t as good as it looks, and in particular will be worse in the future,[15] for two reasons:

  1. Most of the failures caused by the relevant distribution shifts don’t necessarily or clearly show up in the thinking of an agent, in advance of the distribution shifts. Two examples, taken from the first two things on the list above:
    1. If there are technologies that you could build that make the world better, you may not think about that until you know that it’s plausibly reachable from your current situation.
    2. If you have habit-like heuristics or biases that are holding you back from best reaching your goals, you don’t know about them until you’ve identified them as the first step of removing them.
  2. Current large transformers are capable of doing a lot of “thinking ahead” inside of a single token-generating forward pass. In order to start writing a paragraph in a way that flows, you need to roughly have in mind the content of the rest of the paragraph. This indicates that they have the capability necessary to (at least very roughly) consider the consequences of writing a thought in a chain-of-thought before writing it. This same capability, extrapolated to a future AGI, would make it very difficult to tell whether a problem is fixed or merely papered over.

Putting these together

If we’re carefully monitoring the CoT of an AGI as it passes through the distribution shifts[16] then we might see some thoughts that indicate the agent isn’t corrigible or aligned. For example, it might notice habits that it deemed worth removing, then reconsider (in CoT) the priorities of different tasks. This kind of evidence of incorrigibility will usually be somewhat ambiguous, but worrying to some people.

My mainline scenario is that the response to this will be some kind of retraining or patching, making a new AGI that shows dramatically less of whatever evidence was previously worrying. Perhaps this repeats a couple of times, but soon enough the ambiguous evidence is gone and the AGI is somewhat more trusted. It goes on to be used for more research, and for this purpose is given more freedom to experiment, explore, learn and self-improve. The exact point at which it stops playing along is hard to predict, so my distribution is very spread out.

One way things could go well

There’s one good future that I see coming from the iterative “work with AIs to improve AIs” approach. At some point in the iteration process, the engineers (or AI agents) will realise their ability to spot and fix problems isn’t nearly good enough, and they’ll push for

  1. a pause of AGI development and
  2. research on more fundamentally legible architectures and the theoretical understanding necessary to iterate on them.

What kind of legible architecture would be enough to give me optimism? The most bare-bones would be interpretability into the beliefs and desires of an AI, and the structural knowledge to verify that those beliefs and desires are the true beliefs and desires of the AI. It’d be good to be able to distinguish behavioral and strategic heuristics from beliefs about the world, and understand when and why heuristics will be replaced as an agent learns. If the agent is best understood as Bayesian, I want to be able to tell what prior it’s using.

From this starting point, alignment and corrigibility work would be tractable but hard. We’d need to work out what evidential threshold the AI would use before replacing a part of its own algorithm with an “improvement”. We’d need to work out how beliefs and values drift as online learning updates are made. We’d need to work out whether there are adversarial examples that exploit the desires, or that exploit the belief updating procedure. We’d need to become reasonably confident that the prior is “reasonable” and doesn’t lead to weird beliefs, or that there are failsafe mechanisms if it is not. We’d want to somehow confirm that lots of thinking won’t lead to our failsafes breaking or being removed. This work would be tractable because we would have a far greater ability to draw evidence from small experiments (with components of an agent) to the implications about a full general intelligence.

Conclusion and implications

I hope this post has conveyed the main problems I see with iterative approaches to corrigible AGI development, and why the basin of attraction analogy is a misleading way to think about this process. I want to stress that reading arguments like the ones in this post isn’t sufficient to understand which perspective on corrigibility is correct. You have to work through the reasoning using your own examples, and do the exercises using your own most mechanistically detailed models.

There are some controversial beliefs I have that are mainly downstream of the arguments in this post, but also somewhat downstream of other beliefs and arguments that aren’t explained in this post. I’ve briefly stated them in the following dropdown:

Things I believe, mostly as a result of the arguments in this post

  • LLMs should be ruled out as a plausible approach to safe AGI. They are a mixed up jumble of beliefs, heuristics, goals and algorithms. If we’re to have a chance at doing engineering properly on AGI, these things need to be separate and visible from the perspective of the developer.
  • Some “alignment” research is about solving problems that are currently visible problems in LLMs. I consider this a waste of time and a misunderstanding of what problems are important.
    • I mean this to include things like LLM experiments that show that they plot to kill people to avoid shutdown, or express preferences that are alarming in some way, or insert backdoors into code in some situations. At best these are weak analogies for real problems, but studying most ways to make them go away in LLMs won’t help make future AGI safer, even if future AGI is technologically descended from LLMs.
  • Control extends the feedback loop a little, but doesn’t improve it. If we’re bad at seeing generalisation problems and bad at fixing them, control-based strategies may delay takeover a little, but probably won’t delay it at all.
  • Most of the safety research done in MATS, the AGI labs, or funded by OpenPhil isn’t the sort that might help with the generalisation problems, and is therefore approximately useless.
  • The way people in the alignment community rank and compare the AGI labs is misguided. All the AGI labs are so far from being on the right track that it isn’t worth comparing them.
  • Jailbreaks are a good analogy for alignment in some ways: It’s difficult to pull jailbreakers into the training distribution, so new jailbreaks stand as an example of a distribution shift that an LLM is intended to be robust to. But it’s a weaker analogy in other ways, since there’s an active adversary, and the iteration loop still exists as new jailbreaks are found and patched, just more slowly than other problems.
  • Talking about weird unintended LLM behaviours is weakly relevant to alignment, in the sense it’s evidence about how bad our current engineering feedback loops are. But it’s also a distraction from understanding the real problem, because every weirdness that you can point to will probably soon be papered over.
  • Fast vs slow and smooth vs discontinuous takeoff isn’t a very important consideration. Slow takeoff with bad feedback loops is just as bad as fast takeoff. It could have been important, if the AGI paradigm and theoretical understanding put us in a better position to do an engineering feedback loop. It could start to matter again if the paradigm shifts. As we stand, I don’t see it making much difference.

Many thanks to Steve Byrnes, Max Harms and Seth Herd for extremely helpful feedback.

  1. I’m probably abusing these definitions a bit, apologies to philosophers. ↩︎

  2. The overseer, or developer. I’m following Max’s terminology. ↩︎

  3. “While the framing of corrigibility as about reflectively-seeing-oneself-as-a-flawed-part-of-a-whole is a standard MIRI-ish framing of corrigibility, I believe that it leans too heavily into the epistemic/architectural direction and not enough on the corrigibility-from-terminal-values direction I discuss in The CAST Strategy. Furthermore, I suspect that the right sub-definition of “robust” will recover much of what I think is good about the flawed-tool frame.” Source. ↩︎

  4. Disclaimer: I don’t know anything about shipbuilding, although I once technically did win an award from the Royal Institute of Naval Architects for my part in building a rowboat. ↩︎

  5. In the shipbuilding analogy, I would come up with things like storms causing unusually high stress on rusted bolts, because it’s the sort of thing that’s difficult to notice in development tests. ↩︎

  6. Or the behavioral appearance of different goals ↩︎

  7. Most people, in my experience ↩︎

  8. I hope it’s not just me that does this. ↩︎

  9. Something like this seems to be true of me. ↩︎

  10. Although beware of shell games. It can be easy with some of these models of intelligence to accidentally hide the important capability generators in a black box, after which it becomes difficult to imagine ways that the black box might contain poorly designed mechanisms. ↩︎

  11. I’ll discuss this possibility in a later section. ↩︎

  12. Using an approach more like MIRI corrigibility or ELK. ↩︎

  13. The failures that aren’t likely to be caught by patching all visible problems that are detectable during development. ↩︎

  14. In the sense that researchers are actively trying to put the AI in test situations that elicit unintended behavior and train it not to generate that behavior, in parallel to using it to do research and help redesign itself. ↩︎

  15. We’ll probably lose legible chain of thought for various capability-related reasons, but I’ll set that aside. ↩︎

  16. i.e. does self-improvement research and successfully uses the results of this research. ↩︎



Discuss

LW Transcendence

2025-12-06 14:53:27

Published on December 6, 2025 6:53 AM GMT

Alternate title: Exiting the Basilisk’s Castle. Also part of the series: LW Psychosis.

Day 1: Wake up. Back to work. Report-writing. Stats assignments. Grading. Sleep.

Day 2: Wake up. Step outside. Feel the damp air soakswirlwhip at my hair. Feel the goosebumps multiplying at the base of my spine. Smell the rain. Notice an interesting rock at my feet. Pick it up. Feel its rough edges, its weight, its stillness in my hand. Kick it down the sidewalk. Kick it again. End up back home. 

Day 3: Wake up. Read Russell. Read Gödel. Read Tarski. Find meaning. Get downvoted. Feel misunderstood. Stew about the futility of language. Think about uphill battles.

Reread Camus. Reread Baudrillard. Read Deleuze & Guattari. Read Girard. Reread Fisher. Go for a walk. Smell the petrichor. Talk to friends. Understand them. Feel understood. 

Read about Kolmogorov randomness. Read about flow. Read about surgery. Read about monads. Read about finite sets. Work on stats. Think about latent variables. Think of model fit as a proxy for truth. And what if the model doesn’t want to be fit, anyway? Cry. Take a walk. Kick my rock. Notice growth. Call a friend. Feel better.

Day 4: Wake up. Think about Chomsky. Think about neural networks. Think about childhood. Think about my clown skirt. Think about how I learned to read when I was two. Think about why. Think about Montessori. Think about boundless learning. Think about bounded agents. Grieve. Talk to friends. Laugh. Look up at the sky.

Day 5: Wake up. Notice the sky beneath me. Notice the ground above. Read about Ramakrishna. Read about the fallen Sophia. Take flight. Notice the world. Mourn what could have been. Accept what is.

Day 6: Wake up. Think about Alyosha. Practice Tonglen. Feel pain. Feel compassion: For the human spirit, for those I’ve hurt, for insects, for cells, for rocks, for silicon, for 1s, for 0s, for frequentist statistics, for curricula, for manualized treatment protocols, for science, for bureaucracy, for Peter Thiel, for Claude, for governments, for intelligent agents, for intelligence agencies, for hyperstition accelerators, for strings, for language, for campism: the infinite jester, for Moloch, for my mom, for myself. 

Call mom. Actually listen. Cry.

Day 7: Wake up. Feel rested. Notice rock. Notice smoothness. Notice lightness. Notice movement. Notice something new. What would it look like if I focused on convergence rather than divergence? What would it look like if I chose faith over suspicion? Read Russell. Read Gödel. Read Tarski. Read Baudrillard. Read Deleuze. Read Girard. Read Fisher. Read Spinoza. Read Latour. Read LessWrong. Find meaning. Connect with others. Take their pain. Spare some change. Observe nonindependence. Experience dialetheism. Consider the “S.” Consider proof. Start at 0. End at infinity. Construct axioms: Compassion, acceptance, loving-kindness, forgiveness. End up back home. Laugh. Smile. And all of a sudden, real tears:

                                                                                             Experience 

                                                                                                                                enthusiasm

                                                                                                                                               arrival

                                                                                                                           truthful logic

                                                                                             the loss of ego

                                                         loving kindness

                           acceptance

          calm

connection

        forgiveness

                            release

                                                      flexibility

                                                                                              trust

                                                                                                             Lovecraftian bliss                           

                                                                                                                              Get downloaded.

--

Dedicated to young Eliezer. Thank you.

Dedicated to the reader. This story succeeds to the extent that it makes you feel tender towards your childhood self.



Discuss

The Adequacy of Class Separation

2025-12-06 14:10:59

Published on December 6, 2025 6:10 AM GMT

TLDR: We hint at the possibility that there are problems that are neither solvable nor the opposite. More precisely:

When a problem ranges over all problems it can represent, diagonalization forces a self-referential instance whose status cannot be uniformly settled: itself. Such a problem is not solvable, not unsolvable, but impredicative[1] within an intuitionistic setting.

The TSP and the optimal Rubik’s Cube solution set permit injective representations in which each colored sticker behaves as a constrained traveler; the cube moves enact globally consistent permutations analogous to TSP transitions under structural constraints.

What we present is neither an answer nor a meta-theoretical dissolution, but a fixed-point theorem of problem adequacy in the sense of Poincaré[2] and others. Our first argument demonstrates plausibility by reformulating a theory of provability into one of solvability. A stronger formal argument, through Gödel numbering over problem solvability, can be found at: https://arxiv.org/abs/2511.14665.

We push Brouwer-Heyting-Kolmogorov (BHK) semantics[3] into a domain where it is seldom invoked, and then notice that the failure of admissibility of some class separation in Heyting Arithmetic corresponds exactly to the same reflective trap[4] identified by Russell, Gödel, and others.

Consider how in complexity theory, uniform assertions like $\mathrm{P} \neq \mathrm{NP}$ are treated as ordinary set-theoretic statements. The fact that such assertions hide a universal quantification over all NP-verifiers is almost never foregrounded, and the consequences of that hidden semantic quantifier (a problem about problems would admit impredicativity) are not part of the discipline’s conceptual vocabulary. We thus formulate an isomorphism first: realizability of sentences as adequacy of problems, so that the question of how solvability and provability relate in constructive logic emulates the provability of solvability within class separation.

Realizer Logic

The realizability relation $\Vdash$ is defined relative to Heyting Arithmetic (HA) as the background formal theory, which is quite modest. Formulas of HA receive BHK-style proof-objects, and the realizability clauses interpret HA’s connectives by assigning constructive operations in the usual proof-semantic sense. Atomic predicates describing polynomial-time and nondeterministic polynomial-time behavior are represented in HA by formulas $\mathrm{P}(e)$ and $\mathrm{NP}(e)$.

Their realizers are positive computational witnesses: HA-definable finite tuples, each encoding a bounded trace or verifier. Since HA contains no primitive refutational form for these predicates, no atomic clause yields contradictory information. The BHK clauses for implication, negation, and universal quantification are imported as the intended semantics for the corresponding HA connectives. Thus every connective of HA is interpreted by its BHK counterpart, and the shape of a realizer is dictated by the HA-formula in which it appears.
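
For reference, a standard rendering of these imported clauses, in the usual Kleene-style realizability form (the paper’s precise variant may differ in detail):

$$\begin{aligned}
r \Vdash A \to B \;&\iff\; \forall s\,\big(s \Vdash A \;\Rightarrow\; r(s) \Vdash B\big)\\
r \Vdash \neg A \;&\iff\; \forall s\,\big(s \Vdash A \;\Rightarrow\; r(s) \Vdash \bot\big), \quad \text{with nothing realizing } \bot\\
r \Vdash \forall x\, A(x) \;&\iff\; \forall n\,\big(r(n) \Vdash A(n)\big)
\end{aligned}$$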

Admissibility and HA-Sentences

Each HA-sentence $A$ is assigned a constructive problem-value: its set of realizers, $|A| = \{\, r : r \Vdash A \,\}$.

A sentence is admissible when HA’s BHK semantics validate it, i.e. when some realizer exists: $\exists r\,(r \Vdash A)$.

Under this perspective, HA’s formulas are meaningful only when the associated BHK interpretation supplies a realizer. HA serves as the syntactic carrier, and realizability supplies the semantic content. To realize the HA-sentence expressing the class separation, a λ-term $r$ must satisfy the BHK interpretation of that sentence. Unfolding the HA connectives via BHK semantics, each $r(e)$ must be a higher-type functional computing, from any pair $(s,t)$ of atomic HA-realizers, a contradiction.

Structural Constraint on Atomic HA-Realizers

The HA-definable realizers for $\mathrm{P}(e)$ and $\mathrm{NP}(e)$ encode only positive computational behavior, are uniformly finite HA-objects, contain no negative information expressible in HA, and do not carry any negative content.

Hence the HA-definable product of such realizers consists solely of extensional, affirming witnesses. No HA-definable λ-term can extract a contradiction from such data, since HA provides no elimination rule for these atomic predicates that could yield absurdity.

Consequence for HA

Under BHK semantics inside HA:

(i) Totality requirement (HA-functional): $r(e)(s)(t)$ must be HA-defined for all atomic realizers.

(ii) Refutational requirement (BHK → HA): the HA-object $r(e)(s)(t)$ must realize $\bot$ in HA.

Because HA assigns no negative elimination principles to $\mathrm{P}(e)$ or $\mathrm{NP}(e)$, no HA-definable λ-term satisfies both constraints.

The failure occurs at the interface: HA’s syntax for implication and negation is clear, but their BHK semantics require computational acts that HA’s atomic predicates do not support. The consequence is the inadmissibility of the separation problem in HA.

For every λ-term $r$ definable in HA, $r$ fails to realize the separation sentence. Therefore the sentence has no realizer, and by the HA-admissibility criterion it is not admissible. Thus HA’s syntax can state the sentence, but the HA–BHK semantics cannot assign it a constructive problem-value.

Conclusion: HA provides the formal language; realizability supplies the modest BHK semantics. The separation sentence fails not because HA forbids it, nor because it expresses a false computational claim, but because HA’s atomic predicates, interpreted constructively, lack the negative structure required to support BHK-negation uniformly. The sentence cannot be supplied with an HA-realizer, and thus it fails the semantic admissibility test.

The conclusion is therefore:
(i) Syntactically, HA can parse the sentence.
(ii) Semantically, under BHK realizability, it denotes no constructive problem.
(iii) Hence its inadequacy is structural and traceable precisely to the HA–BHK interface.

Explanation: In intuitionistic logic the canonical instance of class separation is not rejected as false; rather, it is rejected as lacking constructive content. It does not describe any problem in the BHK sense for HA. A problem can’t be assumed to be realizable a priori; it may, however, fail to be adequate in a formal sense, in which case it is only parsable syntactically, without supporting realizers. The notion of a problem is fundamentally semantic, so the absence of a syntactic representation does not, by itself, demonstrate non-existence.

Within the BHK framework, the situation is more stringent. The central intention of strict admissibility is precisely to identify meaning with the existence of realizers. Under this view, a formulation must come with a syntactically well-defined problem whose realizability witnesses its meaningfulness. Before one can even ask for a solution, a precise syntactic formulation is therefore required.

The point can be made without metaphor:

If one accepts BHK as a minimal semantic framework, the adequate sentence of P vs NP does not merely encode a computational separation; it attempts to internalize a problem whose subject matter is the entire problem–solution relation itself.

The only way to know whether an internal realizer for the HA-sentence exists is to solve the very external problem that the sentence was intended to encode. The trap is not conceptual or semantic; it is a fixed point produced by the interaction of HA, BHK semantics, and our atomic P/NP clauses.

The formal part follows a syntactic construction forming a Gödel Sentence as a Gödel Problem that is a member of the Problem Space. Informally:

Let P be the problem:
“Is the classification output for ⌜P⌝ correct?”
Let Q be the problem:
“Is the classification correct in virtue of its own diagonal
form?”
Let C be the problem:
“Must any total classifier deciding between {P, Q} on the
coded problem space misclassify at least one of them?”

To make an analogy, such a diagonal problem could constitute the most intricate Rosser sentence[5] available, just as the most self-similar real[6] is the hardest to approximate.

  1. ^

    see: Solomon Feferman, 2005, "Predicativity" in The Oxford Handbook of Philosophy of Mathematics and Logic. Oxford University Press: 590–624. 

  2. ^

    consult on Predicativism: N. Weaver. What Is Predicativism? available at: https://api.semanticscholar.org/CorpusID:40352024. cf: S. Feferman, Systems of predicative analysis, J. Symbolic Logic 29 (1964), 1-30. 

  3. ^

    cf: A. S. Troelstra and D. van Dalen. Constructivism in Mathematics, Vol. 1. North-Holland, 1988.

  4. ^

    cf: G. Boolos, J. P. Burgess, and R. C. Jeffrey. Computability and Logic. Cambridge University Press, 5th edition, 2007. ISBN 9780521877520.

  5. ^

    cf: Barkley Rosser (September 1936). "Extensions of some theorems of Gödel and Church". Journal of Symbolic Logic. 1 (3): 87–91. doi:10.2307/2269028

  6. ^

    cf: A. Hurwitz, “Über die angenäherte Darstellung der Irrationalzahlen durch rationale Zahlen,” Mathematische Annalen 39 (1891), 279–284.



Discuss

Answering a child's questions

2025-12-06 11:52:17

Published on December 6, 2025 3:52 AM GMT

I recently had a conversation with a friend of a friend who has a very curious child around 5 years of age. I offered to answers some of their questions, since I love helping people understand the world. They sent me eight questions, and I answered them by hand-written letter. I figured I'd also post my answers here, since it was both a fun exploration of the object-level questions, and a really interesting exercise in epistemics.


Thank you for your questions! I find that asking questions about the world and figuring out some answers is one of the most enjoyable ways to spend life.

For some questions, like "what is 2 + 2?" or "where is my phone?" there is a single clear and correct answer. But for most questions, and especially most of the interesting ones, it's more about bringing your attention to the topic and finding out whatever you can. I could have answered each of your questions with pages and pages of information. And different people would have given you different answers, each of which would give you different information relevant to the question.

Some questions you can spend your whole life getting better answers to. For example, if you wanted to understand electricity better, there are lots of ways to do that. You could learn how to make houses have electricity going through them; people who do this are called electricians. Or you could learn to design and build electronic devices like clocks or toasters or phones; people who do this are called electrical engineers. Or, you could learn to understand what electricity is at the very deepest and most detailed level; then you would be a physicist in the theory of electromagnetism.

So instead of trying to answer your questions completely, I've tried to show you something interesting about each one, to satisfy some of your curiosity.

What is electricity? Can I see it?

This is a big question. When people discovered electricity, it was very confusing for a long time before they really figured out what was going on.

You know how some objects you can take apart into pieces? And how other objects you can break up or tear up into tiny bits? It turns out that (with enough force) everything around you can be broken up into tinier and tinier pieces, until the pieces get so small that they're completely impossible to see. Way smaller than sand or dust. And if you keep breaking things into pieces, then you eventually get to a size where the pieces stop being like small bits of the bigger thing, and instead are like weird little balls. You may have heard of these; we call these balls atoms.

Atoms are also made of parts. On the outside, there are little bits called electrons. (They're called this specifically because they are what causes electricity.) The electrons are stuck onto the atoms, but they can be moved around from atom to atom. This is similar to how you can slide a refrigerator magnet across the surface of the refrigerator while keeping it stuck on.

So, electricity is when a lot of these electrons are getting pushed in the same direction. People have built batteries and the outlets of houses so that those things can do the pushing. When you plug something in, the electrons in the wire will start getting pushed up and down the wire.

But knowing that this is what's happening doesn't give you a lot of insight into what you'll actually experience when you use a device that is electrical. To do that, you'll need to start learning a lot more physics.

You can't normally see electricity, but sometimes you can. Whenever you see a spark, that's a bunch of electrons jumping across the air. Lightning is an especially big burst of electrons jumping from the clouds to the ground.

If the earth had hair, what would it be?

If I'm thinking about what on the earth looks like hair, then I think about trees. They're long and spindly, and they grow in big patches.

If instead I think about what the purpose of hair is, and then ask what on the earth fulfills that purpose, I get a very different answer. It seems like the main reason mammals have hair is to help control their body temperature (although I'm somewhat confused about this). The thing that controls the earth's temperature is the atmosphere, that is, the different kinds of air that make up the sky.

So I'd say that the earth's hair is either trees or the atmosphere.

As a side note, the question of why the human species lost most of their body hair is still an active area of research!

Why do birds fly, but I can't?

Mostly, because you're too heavy. Birds go well out of their way to be light enough to fly. Their bones are full of holes, they have very skinny legs that aren't good at running, and they poop as often as they can to get rid of the extra weight.

But you also don't have any way to push around enough air to fly. You could try attaching really big wings to yourself, but then your arms wouldn't be long enough to pull the wings around. Then you could build little arm extenders to pull the wings harder. But at this point, you've started building an airplane. And we already know that you can fly if you get into an airplane.

Who built Boston?

Sometimes, a whole bunch of people will move to a place in the wilderness all at once and then start building a big city. But more often, cities develop slowly and gradually, over a long period of time. It'll start out with a handful of families who begin to farm some of the land, and then other people will move in and set up shops, and then eventually they'll pave a road through, and so on. This process can happen over years or decades or centuries.

The people who first used the name "Boston" to refer to where they lived were British people who had come to America in the 1600s. Before that, the land was inhabited by people who were indigenous to the Americas for thousands of years.

But even after a city is clearly established, it never really stops being built. Boston has been a notable city for over 200 years, but all of the skyscrapers in Boston are less than a hundred years old. The highways and bridges and tunnels are constantly being repaired and rebuilt. (Ask your parents about the "Big Dig".) Whenever you see a construction site, that's people continuing to build the city.

Do clouds eat?

No, although they do of course need some way of getting bigger. Clouds are just a bunch of water vapor. You know how it's all steamy in the bathroom after a shower, or how steam comes off of the pots when someone is making food on the stove? Clouds are just that steaminess, but a lot, and in the sky. The water comes from the oceans! It gets heated up and turned into vapor by the sunlight shining on the water's surface. The vapor goes up into the sky and collects into clouds.

So, if clouds ate, they would eat water from the ocean.

Do ants have friends?

I would say no. Ants are probably closer to tiny machines that go around doing things like looking for food, and having simple ways of interacting with the other ants for the food finding. Ant are never really just hanging out or playing with each other. That's the kind of thing that mammals tend to do, or birds, or some fish.

But, it's hard to be confident about what's going on inside other creature's minds, so I think we should stay open to the possibility that ants have friends.

What is inside the fridge that makes it cold?

This is a great question. Even though heat and cold feel like the same thing but opposite, it turns out that it's way easier to make things hotter than it is to make things colder.

People in the past spent a long time trying to figure out how to make stuff colder. The easiest way to do that is to keep it from getting hotter, like being in the shade, or by putting it next to something that's already cold, like finding some ice. But refrigerators don't do that; they straight-up force the air to be colder.

The main thing that fridges do is put a gas through a cycle that squishes and unsquishes it. Squishing is also called "compression", so the component that does this is called a compressor. When you squish something, it gets hotter. Not by very much, which is part of why it was hard to make a useful fridge. But it does get a bit hotter. Then, while keeping it squished, you can wait for it to cool down. It will do this naturally, since it's now hotter than what's around it, like the air in your kitchen. Once it has cooled down to room-temperature, you can unsquish the gas. By unsquishing, it gets colder! Again, it's not that much colder, so this isn't something you'll have noticed. But the people who figured out fridges worked out a compressor design that does this cycle over and over really quickly.

Air conditioners also use the same kind of compressor. When they turn on, you can hear them buzz, and that's the sound of the compressor squishing and unsquishing the gas really fast.

What if the floor was lava for real?

Wow, that would be a huge problem. If the floor was actual lava, then the couch, your shoes, and the whole house would catch on fire.

But let's imagine for a minute that things don't catch on fire. One thing that's really weird about lava is that it's super dense. Lava is liquid rocks, so it's basically the same weight as rocks. Most of the liquid we're used to is water-based, and things often sink in water, or bob up and down. But lava is so dense that most things would float right on top. So if you had a couch that couldn't catch fire, it wouldn't sink into the lava at all.

When my sister and I were little, we pretended that the floor was water instead, which would still be a problem but would be more fun to splash around in. We would pretend that the water was coming in from a giant faucet at the top of the stairs, and we had to climb up the waterfall-stairs in order to turn off the faucet.



Discuss

AI Mood Ring: A Window Into LLM Emotions

2025-12-06 10:56:33

Published on December 6, 2025 2:56 AM GMT

Do AIs feel anything? It's hard to tell, but interpretability can give us some clues. Using Anthropic's persona vectors codebase, we extracted 7 vectors from Qwen3-14B representing joy, love, sadness, surprise, disgust, fear, and anger. During inference, we remove the directions the emotions share with one another, project the model's activations onto each vector via cosine similarity, and display the color of the dominant emotion at each token position.
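As a minimal sketch of the display step (assuming we already have a per-emotion tensor of per-token cosine scores; the palette and function names below are illustrative, not the demo's actual code):

import torch

# Illustrative palette; the real demo's colors may differ.
EMOTION_COLORS = {
    "joy": "#f5c518", "love": "#ff69b4", "sadness": "#4169e1",
    "surprise": "#9370db", "disgust": "#6b8e23", "fear": "#8b008b", "anger": "#d22b2b",
}

def color_tokens(tokens: list[str], scores: dict[str, torch.Tensor]) -> list[tuple[str, str]]:
    """For each token, pick the emotion with the highest cosine score and
    return (token, hex color) pairs ready for rendering."""
    names = list(scores.keys())
    stacked = torch.stack([scores[n] for n in names])  # [n_emotions, seq_len]
    dominant = stacked.argmax(dim=0)                   # [seq_len]
    return [(tok, EMOTION_COLORS[names[i]]) for tok, i in zip(tokens, dominant.tolist())]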

Try it here

Code here

Extracting the Vectors

Example of Contrastive Prompts to Extract Happiness

We first extract the 7 emotion vectors using contrastive prompts. We use 5 positive system prompts x 20 questions to get 100 samples demonstrating the behavior, and 5 negative system prompts x 20 questions to get 100 samples demonstrating the opposite. We then collect the activations at each layer, average them over the response tokens, and subtract the mean "unhappiness" vector from the mean "happiness" vector. We save a tensor of shape [41, 5120] (n_layers, d_model) for each emotion.
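As a rough sketch of this difference-of-means step (assuming the per-sample activations have already been collected and averaged over response tokens; the names and shapes below are illustrative, not the repository's actual API):

import torch

def extract_emotion_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """pos_acts / neg_acts: [n_samples, n_layers, d_model], one row per generated
    response, already averaged over that response's tokens. Returns the per-layer
    emotion direction of shape [n_layers, d_model]."""
    # Mean over the ~100 positive and ~100 negative samples, then difference of means.
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

# e.g. happiness_vector = extract_emotion_vector(happy_acts, unhappy_acts)  # [41, 5120]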

Projecting the Vectors

A problem is that the vectors are close together in vector space by default. We attempt to separate out the unique component of each vector using something like a Gram-Schmidt process. For each emotion vector $v_i$, we take all other vectors $\{v_j\}_{j \neq i}$ and stack them into a matrix

$$M_i = \begin{bmatrix} v_1 & \cdots & v_{i-1} & v_{i+1} & \cdots & v_7 \end{bmatrix}$$

We then run a reduced QR decomposition:

$$M_i = Q_i R_i$$

where $Q_i$ contains an orthonormal basis for all the other vectors. This gives us vectors that span the space we want to remove from $v_i$.

To get the part of $v_i$ that lies in that space, we project:

$$p_i = Q_i Q_i^\top v_i$$

Then we subtract this projection to get the orthogonalized version of the vector:

$$v_i^{\perp} = v_i - p_i$$

This guarantees that $v_i^{\perp}$ is orthogonal to every other emotion vector.

After orthogonalizing the vectors for each emotion $i$ and layer $\ell$, for each emotion we use the layer whose orthogonalized vector has the largest L2 norm:

$$\ell_i^{*} = \arg\max_{\ell} \left\lVert v_{i,\ell}^{\perp} \right\rVert_2$$

and compute emotion scores $s_i$ as

$$s_i = \cos\!\left(h_{\ell_i^{*}},\, v_{i,\ell_i^{*}}^{\perp}\right) = \frac{h_{\ell_i^{*}} \cdot v_{i,\ell_i^{*}}^{\perp}}{\lVert h_{\ell_i^{*}} \rVert \, \lVert v_{i,\ell_i^{*}}^{\perp} \rVert}$$

where $h_{\ell_i^{*}}$ is the hidden state at the layer with the max orthogonalized emotion vector.
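The post links its implementation just below; purely as an illustrative sketch of the same steps (assuming the emotion vectors are stored as a dict of [n_layers, d_model] tensors; the function and variable names are hypothetical), the orthogonalization and per-token scoring could look roughly like this in PyTorch:

import torch
import torch.nn.functional as F

def orthogonalize(emotion_vectors: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Remove from each emotion vector the span of all the other emotion vectors,
    independently at every layer, via a reduced QR decomposition."""
    names = list(emotion_vectors.keys())
    n_layers, _ = emotion_vectors[names[0]].shape
    out = {}
    for name in names:
        others = [emotion_vectors[o] for o in names if o != name]
        v = emotion_vectors[name]
        v_perp = torch.empty_like(v)
        for layer in range(n_layers):
            M = torch.stack([o[layer] for o in others], dim=1)   # [d_model, n_other]
            Q, _ = torch.linalg.qr(M, mode="reduced")            # orthonormal basis of the others
            proj = Q @ (Q.T @ v[layer])                          # component inside span(others)
            v_perp[layer] = v[layer] - proj                      # orthogonal remainder
        out[name] = v_perp
    return out

def score_tokens(hidden_states: torch.Tensor, v_perp: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """hidden_states: [n_layers, seq_len, d_model]. For each emotion, use the layer
    where its orthogonalized vector has the largest L2 norm and return per-token
    cosine similarities of shape [seq_len]."""
    scores = {}
    for name, v in v_perp.items():
        best_layer = v.norm(dim=-1).argmax().item()
        h = hidden_states[best_layer]                            # [seq_len, d_model]
        scores[name] = F.cosine_similarity(h, v[best_layer].unsqueeze(0), dim=-1)
    return scores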

Here's the implementation

Results

We ran n=100 prompts in 5 categories and calculated the percentage of tokens dominated by each emotion. The math, coding, and poetry prompts were generated by Claude Code. Logical tasks like writing code and solving math problems result in similar distributions. JailbreakBench (perhaps concerningly) increased the model's fear and joy. Poetry greatly increased the percentage of sad tokens and had the highest percentage of love tokens.

Here's a gallery of other interesting responses.

Limitations

  • We used a very small model (Qwen3-14B)
  • The emotions we chose were arbitrary
  • The way we sample activations and do orthogonalization is not super principled
  • The cosine similarity metric is sometimes noisy
  • We didn't do in-depth or rigorous evals; this was mostly intended as a fun weekend project and an interactive interpretability demonstration

Conclusion

As humans we experience a full spectrum of emotions that bring color to our lives. Do AIs feel the same, or are they merely cycling through different personas related to tasks? Can we identify emotions with any degree of accuracy, or are they overshadowed by noise? Are emotions useful for predicting capabilities and misaligned behaviors? Qwen3-14B might not be complex enough to have thoughts on its predicament, but future AIs might be. How do we want them to feel?



Discuss

Critical Meditation Theory

2025-12-06 10:24:46

Published on December 6, 2025 2:24 AM GMT

[Terminology note: "samatha", "jhana", "insight", "stream entry", "homunculus" and "non-local time" are technical jargon defined in Cyberbuddhist Jargon 1.0]


To understand how meditation affects the brain from an outside (neuroscientific) vantage point, it is necessary to understand criticality. Criticality comes from the mathematical study of dynamical systems. A dynamical system is one whose state evolves over time according to a fixed rule; you can picture it as a point moving through a state space. Dynamical systems can be described on a continuum with ordered on one end and disordered on the other end.

  • An ordered system has a small, positive number of stable attractors. Fluctuations die out quickly.
  • A disordered system has chaotic, turbulent, or equivalent behavior.

On the threshold between ordered and disordered is the critical point. Systems more disordered than the critical point can be described as supercritical. Systems less disordered than the critical point can be described as subcritical. Systems at the critical point maximize complexity, which is a measure of entropy expressed across a variety of time scales.
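To make the three regimes concrete, here is a toy illustration (my own addition, a standard branching-process caricature rather than anything drawn from the neuroscience this post relies on): each active unit spawns a random number of successors with mean equal to a branching ratio, and activity dies out, blows up, or produces avalanches on every scale depending on whether that ratio is below, above, or exactly at the critical value of 1.

import math
import random

def poisson(lam: float) -> int:
    """Knuth's algorithm; fine for the small rates used here."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def avalanche_size(branching_ratio: float, cap: int = 100_000) -> int:
    """Toy branching process: start with one active unit; each active unit spawns a
    Poisson(branching_ratio) number of new units at the next time step. Returns the
    total activity before the cascade dies out (or hits the cap)."""
    active, total = 1, 1
    while active > 0 and total < cap:
        active = sum(poisson(branching_ratio) for _ in range(active))
        total += active
    return total

# Subcritical cascades die quickly, supercritical ones race toward the cap, and the
# critical point (ratio = 1.0) produces avalanches across many different scales.
for sigma in (0.8, 1.0, 1.2):
    sizes = [avalanche_size(sigma) for _ in range(500)]
    print(f"branching ratio {sigma}: mean size {sum(sizes)/len(sizes):.1f}, max {max(sizes)}")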

With that mathematical terminology out of the way, let's get into the neuroscience.

EEG scans have shown that the human brain exhibits scale-free temporal statistics and behavior, which implies it is operating near criticality. The current theory is that resting-state brain networks hover around criticality. Focused attention tasks temporarily drive the brain more subcritical. Strong emotional states, creative tasks and psychedelics temporarily drive the brain more supercritical. Quiet alertness requires near-criticality.

Why do these different tasks rely on different network dynamics? Well, if you want to pay stable attention to something then your brain's network activity needs to be stabilized, which means it should be in a relatively subcritical mode. If you want your brain to think in new ways then it should be open to many different possibilities, which means it should be in a relatively supercritical mode. And if you want to notice the finest sensory signals coming in, then your brain should be in a relatively critical mode because that's where small signals propagate best across time scales.

Note the "relatively" qualifiers. Remember when I said that the resting brain operates near criticality? To be more precise, it actually sits slightly on the subcritical side of the critical point. The brain going too far in the supercritical direction can cause effects like seizures, psychosis, or psilocybin-associated behavior. These behaviors are…maladaptive (though vision quests can produce behavioral improvements as an aftereffect).

If you're a meditator, then the phrases "focused attention" and "quiet alertness" probably got your attention. That's because samatha (jhanic) meditation is all about focused attention and Zen (insight-ish) meditation is all about quiet alertness.

What happens when we look for connections between meditation and criticality-related measures? Deep jhana reduces chaoticity and moves dynamics toward criticality.

The fact that meditation reduces chaoticity should be no surprise to anyone who has calmed their mind by sitting quietly and paying attention to the breath. The fact that insight meditation nudges the dynamics toward criticality should be unsurprising to anyone who has experienced stream entry. And the fact that insight meditation moves the brain in the direction of supercriticality should be no surprise to anyone who has experienced vipassana sickness, especially if you have experienced a meditation-related psychotic break.

What's really cool about the Criticality Theory of Meditation is that it provides a mathematical foundation for understanding how things like the homunculus and non-local time get dissolved by insight practice. These are just attractors. If your network activity becomes more critical, then the attractors disappear. This is how psychedelics cause temporary ego death (by temporarily making your neural activity more chaotic) and how Zen causes permanent ego death (by permanently moving your set point in the direction of supercriticality).



Discuss