
Eliezer's Unteachable Methods of Sanity

2025-12-07 10:46:45

Published on December 7, 2025 2:46 AM GMT

"How are you coping with the end of the world?" journalists sometimes ask me, and the true answer is something they have no hope of understanding and I have no hope of explaining in 30 seconds, so I usually answer something like, "By having a great distaste for drama, and remembering that it's not about me." The journalists don't understand that either, but at least I haven't wasted much time along the way.

Actual LessWrong readers also sometimes ask me how I deal emotionally with the end of the world.

I suspect a more precise answer may not help.  But Raymond Arnold thinks I should say it, so I will say it.

I don't actually think my answer is going to help.  Wisely did Ozy write, "Other People Might Just Not Have Your Problems."  Also I don't have a bunch of other people's problems, and other people can't make internal function calls that I've practiced to the point of hardly noticing them.  I don't expect that my methods of sanity will be reproducible by nearly anyone.  I feel pessimistic that hearing about them will help.  Raymond Arnold asked me to speak them anyways, so I will.


Stay genre-savvy / be an intelligent character.

The first and oldest reason I stay sane is that I am an author, and above tropes.  Going mad in the face of the oncoming end of the world is a trope.

I consciously see those culturally transmitted patterns that inhabit thought processes aka tropes, both in fiction, and in the narratives that people try to construct around their lives and force their lives into.

The trope of somebody going insane as the world ends, does not appeal to me as an author, including in my role as the author of my own life.  It seems obvious, cliche, predictable, and contrary to the ideals of writing intelligent characters.  Nothing about it seems fresh or interesting.  It doesn't tempt me to write, and it doesn't tempt me to be.

It would not be in the interests of an intelligent protagonist to amplify their own distress about an apocalypse into more literarily dramatic ill-chosen behavior.  It might serve the interests of a hack author but it would not help the character.  Understanding that distinction is the first step toward writing more intelligent characters in fiction.  I use a similar and older mental skill to decide which tropes to write into the character that is myself.

This sense, which I might call genre-savviness about the genre of real life, is historically where I began; it is where I began, somewhere around age nine, to make choices about not becoming the boringly obvious dramatic version of Eliezer Yudkowsky that a cliche author would instantly pattern-complete about a literary character facing my experiences.  Specifically, though I expect this specific to mean nothing to a supermajority of you, I decided that as a relatively smart kid I would not become Raistlin Majere, nor ever exhibit a large collection of related tropes.

The same Way applies, decades later, to my not implementing the dramatic character a journalist dreams up -- a very boring and predictable pattern-completion of a character -- when they dream up a convenient easy-to-write-about Eliezer Yudkowsky who is a loudly tortured soul about his perception of the world's end approaching along its default course.

"How are you coping with the end of the world?" journalists sometimes ask me, and I reply "I have a great distaste for drama", but the actual answer is "I am a better writer than you, and I decided not to write myself as that incredibly cliche person that would be easy and convenient for you to write about."

"Going insane because the world is ending" would be a boring trope and beneath my dignity to choose as my actual self's character.


Don't make the end of the world be about you.

"How are you coping with the end of the world?" journalists sometimes ask me, and I sometimes reply, "By remembering that it's not about me."  They have no hope of understanding what I mean by this, I predict, because to them I am the subject of the story and it has not occurred to them that there's a whole planet out there too to be the story-subject.  I think there's probably a real sense in which the Earth itself is not a real thing to most modern journalists.

The journalist is imagining a story that is about me, and about whether or not I am going insane, not just because it is an easy cliche to write, but because personality is the only real thing to the journalist.

This is also a pattern that you can refuse, when you write the story that is yourself; it doesn't have to be a story that is ultimately about you.  It can be about humanity, humane preferences, and galaxies.  A sentence about snow is words, is made of words, but it is about snow.  You are made of you, but you don't need to be all about yourself.

If I were to dwell on how it impacted me emotionally that the world was ending, I would be thinking about something which genuinely doesn't matter to me very much compared to how the world is ending.  Having dramatic feelings is not mostly what I am about -- which is partly how I ended up being not much made of them, either; but either way, they're not what I'm about.

So long ago that you probably can't imagine what it was like back then, not just before ChatGPT but years before the age of deep learning at all, there was a person who thought they were like totally going to develop Artificial General Intelligence. Then they ran into me; and soon after, instead started agonizing about how they had almost destroyed the world.  Had they actually been that close to success?  Of course not.  But I don't relate to status as most people do, so that part, the status-overreach, wasn't the part I was rolling my eyes about.  It is not the sort of epistemic prediction error that I see as damnable in the way that a status-regulator sees it as the worst thing in the world; to underestimate oneself is no more virtuous than to overestimate oneself.  Rather, I was rolling my eyes about the part that was a more blatant mistake, completely apart from the epistemic prediction error they probably couldn't help; the part that would have been a mistake even if they had almost destroyed the world.  I was rolling my eyes about how they'd now found a new way of being the story's subject.

Even if they had almost destroyed the world, the story would still not properly be about their guilt or their regret, it would be about almost destroying the world.  This is why, in a much more real and also famous case, President Truman was validly angered and told "that son of a bitch", Oppenheimer, to fuck off, after Oppenheimer decided to be a drama queen at Truman.  Oppenheimer was trying to have nuclear weapons be about Oppenheimer's remorse at having helped create nuclear weapons.  This feels obviously icky to me; I would not be surprised if Truman felt very nearly the same.

And so similarly I did not make a great show of regret about having spent my teenage years trying to accelerate the development of self-improving AI.  Was it a mistake?  Sure.  Should I promote it to the center of my narrative in order to make the whole thing be about my dramatic regretful feelings?  Nah.  I had AGI concerns to work on instead.  I did not neglect to conduct a review of what I did wrong and update my policies; you know some of those updates as the Sequences.  But that is different from re-identifying myself as a dramatic repentant sinner who had thereby been the story's subject matter.

In a broadly similar way:  If at some point you decide that the narrative governing your ongoing experience will be about you going insane because the world is ending:  Wow, congratulations at making the end of the world still be about you somehow.


Just decide to be sane, and write your internal scripts that way.

The third way I stay sane is a fiat decision to stay sane.

My mental landscape contains that option; I take it.

This is the point I am even less expecting to be helpful, or to correspond to any actionable sort of plan for most readers.

I will nonetheless go into more detail that will probably not make any sense.

Besides being a thing I can just decide, my decision to stay sane is also something that I implement by not writing an expectation of future insanity into my internal script / pseudo-predictive sort-of-world-model that instead connects to motor output.

(Frankly I expect almost nobody to correctly identify those words of mine as internally visible mental phenomena after reading them; and I'm worried about what happens if somebody insists on interpreting it anyway.  Seriously, if you don't see phenomena inside you that obviously look like what I'm describing, it means you aren't looking at the stuff I'm talking about.  Do not insist on interpreting the words anyway.  If you don't see an elephant, don't look under every corner of the room until you find something that could maybe be an elephant.)

One of the ways you can get up in the morning, if you are me, is by looking in the internal direction of your motor plans, and writing into your pending motor plan the image of you getting out of bed in a few moments, and then letting that image get sent to motor output and happen.  (To be clear, I actually do this very rarely; it is just a fun fact that this is a way I can defeat bed inertia.)

There are a lot of neighboring bad ideas to confuse this with.  The trick I'm describing above does not feel like desperately hyping myself up and trying to believe I will get out of bed immediately, with a probability higher than past experience would suggest.  It doesn't involve lying to myself about whether I'm likely to get up.  It doesn't involve violating the epistemic-instrumental firewall (factual questions absolutely separated from the consequences of believing things), to give myself a useful self-fulfilling prophecy.  It is not any of the absurd epistemic-self-harming bullshit that people are now flogging under the brand name "hyperstition", since older names like "chaos magick" or "lying to yourself" became less saleable.  I still expect them to point to this and say, "Why, of course that is the same thing I am selling to you as 'hyperstition'!" because they would prefer not to look at my finger, never mind being able to see where I'm pointing.

With that said:  The getting-out-of-bed trick involves looking into the part of my cognition where my action plan is stored, and loading an image into it; and because the human brain's type system is a mess, this has the native type-feeling of an expectation or prediction that in a few seconds I will execute the motor-plan and get out of bed.

That I am working with cognitive stuff with that type-feel, is not the same thing as lying to myself about what's likely to happen; no, not even as a self-fulfilling prophecy.  I choose to regard the piece of myself whose things-that-feel-like-predictions get sent as default motor output, as having the character within my Way of a plan I am altering; rather than, you know, an actual mistaken prediction that I am believing.  If that piece of myself gets to have me roll out of bed, I get to treat it as a plan rather than as a prediction.  It feels internally like a prediction?  Don't believe everything you feel.  It's a pseudo-model that outputs a pseudo-prediction that does update in part from past experience, but its actual cognitive role is as a controller.
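
To make the distinction more concrete, here is a toy caricature in code. This is my illustration only, not the author's model and not a claim about neuroscience; the class name and everything in it are invented for the example. The point is just that the same slot can feel like a prediction while its actual role is to be executed.

```python
# Toy caricature (invented for illustration; not the author's model, not neuroscience).
# The same internal slot "feels" like a prediction, but its real cognitive role is
# a controller: whatever is loaded into it gets sent to motor output.
class PseudoPredictiveController:
    def __init__(self, default_next_action: str):
        # Loaded by default from habit / past experience.
        self.pending_image = default_next_action

    def write_plan(self, action: str) -> None:
        # The "decide by fiat" move: overwrite the pending image directly,
        # treating it as a plan rather than a belief to be argued with.
        self.pending_image = action

    def motor_output(self) -> str:
        # Whatever is in the slot is what actually happens next.
        return self.pending_image


agent = PseudoPredictiveController(default_next_action="stay in bed")
agent.write_plan("get out of bed in three seconds")
print(agent.motor_output())  # "get out of bed in three seconds"
```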

The key step is not meditating on some galaxy-brained bullshit about Löb's Theorem, until you've convinced yourself that things you believe become true.  It's about being able to look at the internal place where your mind stores a pseudo-predictive image of staying in bed, and writing instead a pseudo-prediction about getting out of bed, and then letting that flow to motor output three seconds later.
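
For readers who don't recognize the reference, Löb's Theorem says the following, for any theory $T$ extending Peano Arithmetic with provability predicate $\Box$ (stated here only for orientation; it is a statement about provability, not a license to believe things into being true):

$$\text{If } T \vdash \Box P \rightarrow P, \quad \text{then } T \vdash P.$$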

It is perhaps an unfortunate or misleading fact about the world (but a fact, so I deal with it), that people telling themselves galaxy-brained bullshit about Löb's Theorem or "hyperstition" may end up expecting that to work for them; which overwrites the pseudo-predictive controlling output, and so it actually does work for them.  That is allowed to be a thing that is true, for reality is reality.  But you don't have to do it the scrub's way.

Perceiving my internal processes on that level, I choose:

I will not write internal scripts which say that I am supposed to / pseudo-predict that I will, do any particular stupid or dramatic thing in response to the end of the world approaching visibly nearer in any particular way.

I don't permit it as a narrative, I don't permit it as a self-indulgence, and I don't load it into my pseudo-predictive self-model as a pending image that gets sent by default to internal cognitive motor outputs.

If you go around repeating to yourself that it would be only natural to respond to some stressful situation by going insane -- if you think that some unhelpful internal response is the normal, the default, the supposed-to reaction to some unhelpful external stimulus -- that belief is liable to wire itself in as being also the pseudo-prediction of the pseudo-model that loads your default thoughts.

All of this is not to be confused with the Buddhist doctrine that every form of negative internal experience is your own fault for not being Buddhist enough.  If you rest your hand on a hot stove, you will feel pain not because your self-pseudo-model pseudo-predicts this to be painful, but because there's direct nerves that go straight to brain areas and trigger pain.  The internal mechanism for this does not depend on a controlling pseudo-prediction, it just falls downward like a stone under gravity.  The same directness is allowed to be true about suffering and not just pain; if there's a clever way to overwrite pseudo-predictions of suffering and thereby achieve Buddhist indifference to bad things, I don't have it as a simple obvious surface lever to pull.  I also haven't chosen to go looking for a more complicated or indirect version of it.  I do not particularly trust that to end well.

But I do think there are various forms of drama, error, and insanity which are much more like "things people do because they expected themselves to do it"; and much less like the pain, or suffering, from burning your hand.

One could incorrectly summarize all this as "I have decided not to expect to go insane," but that would violate the epistemic-instrumental firewall and therefore be insane.


There's an edition of Dungeons and Dragons that has a god of self-improvement, called Irori.  My fanfictions sometimes include characters that worship Him (heresy), or seek what He sought (approved).

Irori's religion -- on my version, that is -- has mottos like, "You don't have problems, you have skill issues."  Irorians can be a bit harsh.

But even if something is a skill issue, that doesn't mean you have the skill, nor know how to solve it.

When an Irorian calls something a skill issue, they're not instructing you to feel bad about having not solved it already.

They are trying to convey the hope that it is solvable.

Doing crazy things because your brain started underproducing a neurotransmitter is a problem.  It wouldn't be very Irorian to tell you that you can't solve it just through even clearer thinking; but if there's a medication that directly fixes the problem, that is probably easier and faster and more effective.  Also, this isn't Dungeons and Dragons, Irori isn't real, and possibly you genuinely can't solve a neurotransmitter problem by thinking at it.

Doing crazy things because the world is ending is a skill issue.


These then are Eliezer Yudkowsky's probably-irreproducible ways of staying sane as the world seems more visibly close to ending:

A distaste for the boringly obvious trope of a character being driven mad by impending doom;

Not making the story be all about me, including my dramatically struggling to retain my sanity;

And a fiat decision to stay sane, implemented by not instructing myself that any particular stupidity or failure will be my reaction to future stress.


Probably you cannot just go do those three things.

Then figure out your own ways of staying sane, whether they be reproducible or irreproducible; and follow those ways instead.

The reason that I tell you of my own three methods, is not to provide an actionable recipe for staying sane as the world begins to seem visibly closer to ending.

It is an example, a reminder, and maybe even an instruction to the part of yourself that produces self-pseudo-predictions that get loaded as your internal mental behavior:

Sanity is a skill issue.




Ordering Pizza Ahead While Driving

2025-12-07 10:01:09

Published on December 7, 2025 2:01 AM GMT

On a road trip there are a few common options for food:

  • Bring food
  • Grocery stores
  • Drive throughs
  • Places that take significant time to prepare food

Bringing food or going to a grocery store are the cheapest (my preference!) but the kids are hard enough to feed that we often buy prepared food when we're traveling. [1] And they often prefer food that takes a while to make (usually pizza) over what you can get in a drive through. A couple years ago I realized there's another option: calling in an order for pickup to where you'll be soon.

We'll use Google Maps "search along route" to identify a place ~30min out, and phone in an order. [2] By the time we arrive, the food is ready. We can combine the speed (and buffer maximization) benefits of drive throughs, with the variety of options from the wide range of restaurants that offer pickup.
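
A minimal sketch of the timing logic, assuming you know (or ask for) the restaurant's prep time; the function and numbers here are my own illustration, not from the post:

```python
# Hypothetical helper (my illustration, not from the post): call once your
# remaining drive time drops to the quoted prep time plus a small buffer,
# so the food is ready roughly when you pull in.
def time_to_call(minutes_to_restaurant: float, prep_minutes: float,
                 buffer_minutes: float = 5.0) -> bool:
    return minutes_to_restaurant <= prep_minutes + buffer_minutes

# Pizza quoted at ~20 minutes:
print(time_to_call(35, 20))  # False: too early, it would sit and get cold
print(time_to_call(24, 20))  # True: call now
```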


[1] I'm also working on getting them to do better with brought food, but I'm focusing on lunch at school here because that's a much larger portion of their food away from home.

[2] It kind of amazes me that pizza places will take the costly action of preparing a pizza to my specifications based on a simple phone call, with no contact information beyond me giving a first name. I mean, it's great, but like so many things in our society it only works because there are extremely few people who want to cause havoc and are willing to put any effort into doing so.

Comment via: facebook, mastodon, bluesky




Existential despair, with hope

2025-12-07 04:48:33

Published on December 6, 2025 8:48 PM GMT

I have drafted thousands of words of essays on the topic of art and why it sustains my soul in times of despair, but none of it comes close to saying what I want to say, yet.

Therefore, without explanation, on the occasion of Berkeley’s Winter Solstice gathering, I will simply offer a link to a little-known work of art which has been for me a touchstone in my times of greatest grief and existential despair:

https://www.viselaya.org/sculptures/chandala/chandala.html

May it bring some of you something you need today.




I Need Your Help

2025-12-07 02:48:06

Published on December 6, 2025 6:48 PM GMT

Hello LessWrong!

I am Jaivardhan Nawani, a 14-year-old enthusiast of the ideas and mental models that stem from rationality. I want to spread this way of thinking amongst people my age.

I am by no means an expert, and I need your help and experience. I am working on setting up a course with simple teaching aids and handouts during lessons, with multiple batches. While sourcing students on its own is not a problem, it helps to have people who actively work in or research this field on my side to teach. It would be amazing if anyone were willing to connect me to a university or a professor to whom I could pitch this. Even students actively studying this subject would be ideal.

Any more general advice from those who have done this already is always welcome.

Thanks in advance for all of your help!




Crazy ideas in AI Safety part 1: Easy Measurable Communication

2025-12-07 01:59:28

Published on December 6, 2025 5:59 PM GMT

This post is part of a series on my “crazy” ideas in AI Safety. The probability that any individual idea is useful is low; however, if even one of them turns out to matter, the impact could be large. Or maybe it will inspire someone to do something useful, remotely related to the initial idea. That makes the expected value high enough for me to write them down.

 

A sole-purpose communication organization.

I think it could be valuable to have an organization—or at least a dedicated project within an existing organization—whose only purpose is to explain the risks of superintelligence to as many people as possible. The metric I’d propose is the number of people who sign the Statement on Superintelligence using the link provided by the organization. I like this metric because the Statement already has many respected signatories; signing it is a relatively “normal” action and not obviously a fringe activity. The organization would commit to maximal transparency about how it spends donations and about the cost-effectiveness (e.g., dollars per additional signature) of each outreach method it tries.
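
As a sketch of what that cost-effectiveness reporting could look like (my illustration; the campaign names and all figures below are invented placeholders, not real data):

```python
# Invented placeholder numbers, purely to illustrate the proposed metric:
# dollars spent per additional Statement signature, broken out by outreach method.
campaigns = {
    "short explainer videos": {"spend_usd": 10_000, "new_signatures": 250},
    "podcast sponsorships":   {"spend_usd": 15_000, "new_signatures": 150},
    "local press outreach":   {"spend_usd": 5_000,  "new_signatures": 40},
}

for name, c in campaigns.items():
    cost_per_signature = c["spend_usd"] / c["new_signatures"]
    print(f"{name}: ${cost_per_signature:.0f} per additional signature")
```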

Of course, communication is not a new idea in this space. Rob Miles, Doom Debates, MIRI’s recent attempts, the authors of the Statement, and many others already focus heavily on public communication. So why add yet another group?

My answer is: for the sake of clear, measurable impact.
If I donate to Rob Miles or Liron Shapira, what happens? Probably better video quality, more views, and maybe more people becoming concerned about AI Safety. But measuring that impact is very hard.

The Statement on Superintelligence, on the other hand, provides a concrete, interpretable metric. Its growth has slowed; most people who were already inclined to sign have done so. A specialized organization would need to actually convert new people, and it would be legible whether they succeeded.

Could the authors of the Statement, or another existing AI safety organization, take this on? Possibly. But then it should be run as a clearly delineated project with its own transparent budget, isolated from unrelated work. And it might still be less confusing for donors if the project lived inside its own separate organization.

A natural worry is that once a metric is defined, the organization may Goodhart it—for example, by incentivizing people to sign the Statement without truly changing their views. But this particular failure mode should be easy for donors to detect, and would likely result in immediate drops in credibility and funding. That doesn’t eliminate all risks of Goodharting, but it makes the most obvious failure mode less plausible.

Conclusion 

I also think such an organization could reach segments of the general public that current communicators rarely reach. By expanding the audience—beyond people who already watch AI-safety YouTube channels or read rationalist forums—we could meaningfully raise broader awareness of superintelligence risks. Increasing that awareness may in turn improve the odds that society chooses not to build a superintelligence before we understand how to make it safe.




The corrigibility basin of attraction is a misleading gloss

2025-12-06 23:38:45

Published on December 6, 2025 3:38 PM GMT

The idea of a “basin of attraction around corrigibility” motivates much of prosaic alignment research. Essentially this is an abstract way of thinking about the process of iteration on AGI designs. Engineers test to find problems, then understand the problems, then design fixes. The reason we need corrigibility for this is that a non-corrigible agent generally has incentives to interfere with this process. The concept was introduced by Paul Christiano:

... a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.

Thus an entire neighborhood of possible preferences lead the agent towards the same basin of attraction. We just have to get “close enough” that we are corrigible, we don’t need to build an agent which exactly shares humanity’s values, philosophical views, or so on.

In addition to making the initial target bigger, this gives us some reason to be optimistic about the dynamics of AI systems iteratively designing new AI systems. Corrigible systems want to design more corrigible and more capable successors. Rather than our systems traversing a balance beam off of which they could fall at any moment, we can view them as walking along the bottom of a ravine. As long as they don’t jump to a completely different part of the landscape, they will continue traversing the correct path.

Max Harms wrote about CAST, a similar strategy that relied on the same idea:

This property of non-self-protection means we should suspect AIs that are almost-corrigible will assist, rather than resist, being made more corrigible, thus forming an attractor-basin around corrigibility, such that almost-corrigible systems gradually become truly corrigible by being modified by their creators.

This post is about the reasons I expect this kind of iteration to miss the important problems. I expect empirical iteration on corrigibility to quickly resolve all the detectable problems and then run out of fuel before resolving the main problems. I’ve been occasionally trying to convince people of this for the past two years or so, but recently I’ve had some success talking about this in conversations (specifically with Max Harms and Seth Herd), so I’m hoping my explanation has become clear enough to be readable as a post.

Corrigibility

The basin of attraction around corrigibility is a very intuitive idea. It’s only a small extension to the general process of engineering by iterative improvement. This is the way we get almost all technological progress: by building terrible versions of things, watching them break, understanding what went wrong and then trying again. When this is too dangerous or expensive, the first solution is almost always to find workarounds that reduce the danger or expense. The notion of corrigibility is motivated by this; it’s a workaround to reduce the danger of errors in the goals, reasoning or beliefs of an AGI.

Eliezer’s original idea of corrigibility, at a high level, is to take an AGI design and modify it such that it “somehow understands that it may be flawed”. This kind of AGI will not resist being fixed, and will avoid extreme and unexpected actions that may be downstream of its flaws. The agent will want lots of failsafes built into its own mind such that it flags anything weird, and avoids high-impact actions in general. Deference to the designers is a natural failsafe. The designers only need to one-shot engineer this one property of their AGI design such that they know it is robust, and then the other properties can be iterated on. The most important thing they need to guarantee in advance is that the “understanding that it may be flawed” doesn’t go away as the agent learns more about the world and tries to fix flaws in its thinking.

I think of this original idea of corrigibility as being kinda similar to rule utilitarianism.[1] The difficulty of stable rule utilitarianism is that act utilitarianism is strictly better, if you fully trust your own beliefs and decision making algorithm. So to make a stable rule utilitarian, you need it to never become confident in some parts of its own reasoning, in spite of routinely needing to become confident about other beliefs. This isn’t impossible in principle (it’s easy to construct a toy prior that will never update on certain abstract beliefs), but in practice it’d be an impressive achievement to put this into a realistic general purpose reasoner. In this original version there is no “attractor basin” around corrigibility itself. In some sense there is an attractor basin around improving the quality of all the non-corrigibility properties, in that the engineers have the chance to iterate on these other properties.
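
As an illustration of the "toy prior" remark above, here is a minimal sketch of a prior in which one abstract proposition can never be updated: make it probabilistically independent of every observable, so Bayes' rule leaves it untouched while ordinary beliefs remain free to update. This is my own toy construction, not a proposal for how to actually build such an agent.

```python
# Toy construction (mine, for illustration only): an "abstract" proposition A is
# made independent of all observations, so its posterior always equals its prior,
# no matter what evidence comes in.
from itertools import product

P_A = 0.9  # prior on the abstract proposition, e.g. "my own reasoning may be flawed"

def joint_prior(a: bool, obs: tuple) -> float:
    # P(A, obs) = P(A) * P(obs): independence is baked into the prior.
    p_a = P_A if a else 1.0 - P_A
    p_obs = 1.0 / 8.0  # uniform over three binary observables
    return p_a * p_obs

def posterior_a(obs: tuple) -> float:
    # P(A | obs) by Bayes' rule.
    numerator = joint_prior(True, obs)
    return numerator / (numerator + joint_prior(False, obs))

for obs in product([0, 1], repeat=3):
    assert abs(posterior_a(obs) - P_A) < 1e-12  # never moves, for any observation
print("The abstract belief never updates; everything else is free to.")
```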

Paul Christiano and Max Harms are motivated by the exact same desire to be able to iterate, but have a somewhat different notion of how corrigibility should be implemented inside an AGI. In Paul’s version, you get a kind of corrigibility as a consequence of building act-based agents. One version of this is an agent whose central motivation is based around getting local approval of the principal[2] (or a hypothetical version of the principal).

Max’s version works by making the terminal goal of the AGI be empowering the principal and also not manipulating the principal. This loses the central property from the original MIRI paper, the “understanding that it may be flawed”,[3] but Max thinks this is fine because the desire for reflective stability remains in the principal, so the AGI will respect it as a consequence of empowering the principal. There’s some tension here, in that the AI and the human are working together to create a new iteration of the AI, and the only thing holding the AI back from “fixing” the next iteration is that the human doesn’t want that. There’s a strong incentive to allow or encourage the human to make certain mistakes. CAST hopes to avoid this with a strict local preference against manipulation of the principal.

There are several other ideas that have a similar motivation of wanting to make iteration possible and safe. Honesty or obedience are often brought up as fulfilling the same purpose. For example, Paul says:

My overall guess is that it's usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want. But it still seems useful to go back and forth between these perspectives.

(These perspectives feel similar to me because "honestly tell me what's going on" seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)

I want to distinguish between two clusters in the above ideas: One cluster (MIRI corrigibility, ELK) involves theoretical understanding of the corrigibility property and why it holds. The other cluster (Paul’s act-based corrigibility, the main story given in CAST, and RL-reinforced human-recognizable honesty or obedience) centrally involve a process of iteration as the human designers (with the help of the AGI, and other tools) fix flaws that appear along the way. This second cluster relies heavily on the notion that there is a “basin of attraction” around corrigibility. The second cluster is the one I have a problem with.

My argument: The engineering feedback loop will use up all its fuel

The empirical feedback loop depends on being able to understand problems and design fixes. In a situation with a big foreseeable distribution shift, the feedback loop goes away after you patch visible issues and are only left with generalisation issues. This would be fine if we were in an engineering domain where we can clearly see flaws, understand the causes, and predict the effect of our patches on particular generalisation leaps. We are currently not, and it’s not looking like this will change.

Before I go into detail on this I want to set up how I’m thinking about the task of building AGI and the specific distribution shifts that are a barrier to iteration.

There are many fields where test-driven engineering is far from sufficient

Your task is to build a cargo ship. You can test it now, in a lake, and iterate on your design. Your task is to use this iteration process to build a ship, then tell your customer that the ship will survive a storm, after 10 years of wear and tear, in the middle of the ocean. Let’s pretend you have relatively unlimited money and labour so each lake test is cheap, and no one has built a ship like yours before.

Think through how this is going to go. In the early stages, you’ll build ships that leak, or have high drag, or react poorly when loaded in certain ways. All of these are easy to measure in your testing environment. You fix these problems until it’s working perfectly on all the tests you are able to perform. This part was easy.

Most of the difficulty of your task is in modelling the differences between your tests and reality, and compensating for those differences. For example, a storm puts more stress on the structure as a whole than your tests are capable of creating. To compensate for this, you need to make conservative guesses about the maximum stresses that waves are capable of creating, and use this to infer the necessary strength for each component. You’re able to empirically test the strength of each component individually.

Then you need to make guesses about how much components will wear down over time, how salt and sea life will affect everything, plausible human errors, sand, minor collisions etc.[4] All of these can be corrected for, and procedures can be designed to check for early signs of failure along the way. If done carefully you’ll succeed at this task.

The general pattern here is that you need some theoretical modelling of the differences between your tests and reality, across the distribution shift, and you need to know how to adjust your design to account for these differences. If you were to only fix the problems that became visible during lake testing, you wouldn't end up with a robust enough ship. If you don't really understand each visible problem before fixing it, and instead apply blind design changes until the problem goes away, then you haven't the faintest hope of succeeding.

The corresponding story for iteratively building corrigible AGI

We get to test AI systems while they have little control over the situation, have done relatively little online learning, and haven’t thought about how to improve themselves. As we use the AGI to help us do research, and help us design improvements, these things gradually stop being true, bit by bit. A storm in 10 years is analogous to the situation in an AI lab after an AGI has helped run 10 large scale research projects after it was first created, where it has designed and implemented improvements to itself (approved by engineers), noticed mistakes that it commonly makes and learned to avoid them, thought deeply about why it’s doing what it’s doing and potential improvements to its situation, and learned a lot from its experiments. The difference between these two situations is the distribution shift.

Like with the ship, each time we see behavior that seems bad we can try to work out the cause of the problem and design a fix. One difference is that working out the cause of a problem can be more difficult in ML, because we lack much more than guesswork about the relationship between training and distant generalisation. Working out how to fix the issue robustly is difficult for the same reason.

Empirical iteration is particularly bad in the current field of ML, because of how training works. The easiest way to make a problem go away is by adding new training examples of bad behaviour that you’ve noticed and training against it. ML training will make that behaviour go away, but you don’t know whether it fixed the underlying generator or just taught it to pass the tests.
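
Here is a deliberately crude caricature of that failure mode, my own toy rather than anything from the post: the iteration loop terminates exactly when failures stop being visible, which is indistinguishable from the outside from having fixed the underlying generator.

```python
# Toy caricature (mine): "training against" a visible failure just memorizes the
# specific cases, so the test suite goes green while the underlying generator of
# bad behaviour is untouched off-distribution.
class ToyModel:
    def __init__(self):
        self.patched_inputs = set()

    def behaves_badly(self, situation: str) -> bool:
        # Underlying generator: misbehaves in any shifted ("novel") situation
        # that hasn't been specifically patched.
        return situation.startswith("novel") and situation not in self.patched_inputs


def train_against(model: ToyModel, failures: list[str]) -> ToyModel:
    model.patched_inputs.update(failures)  # papers over exactly what was observed
    return model


model = ToyModel()
test_suite = [f"novel-case-{i}" for i in range(5)]
while failures := [s for s in test_suite if model.behaves_badly(s)]:
    model = train_against(model, failures)

print("Visible failures left:", [s for s in test_suite if model.behaves_badly(s)])  # []
print("Still fails after the shift:", model.behaves_badly("novel-deployment"))      # True
```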

How do we know the distribution shift will “stress” parts of the AGI?

Why did I focus on the particular AI distribution shifts listed in the previous section? Essentially because I can think of lots of ways for these to “reveal” unintended goals that were previously not obvious.[5] If we want to work out what flaws might remain after an empirical iteration process, we need to think through changes to the AGI design that could pass that filter and become a problem later on. So we need to think through in mechanistic detail all the design changes that can cause different goals[6] under shifted conditions but be invisible under development conditions.

This is easier to work through when you’ve thought a lot about the internal mechanisms of intelligence. If you have detailed theories about the internal mechanisms that make an intelligence work, then it’s not that hard to come up with dozens of these. Even though we (as a field) don’t confidently know any details about the internals of future AGIs, visualizing it as a large complicated machine is more accurate than visualizing it as a black box that just works.

So communicating this argument clearly involves speculating about the internal components of an intelligence. This makes it difficult to communicate, because everyone[7] gets hung up on arguments about how intelligence works. But it’s easy to miss that the argument doesn’t depend very much on the specific internal details. So, as an attempt to avoid the standard pitfalls I’ll use some quick-to-explain examples anchored on human psychology. Each of these is an “error” in a person that will make them pursue unexpected goals under more extreme conditions, especially those extreme conditions related to trying to be better.

  • You can have a desire that isn’t realistically achievable from your current position, but that you would pursue under ideal conditions. You don’t really know about this desire until you think about it and explore it.
  • Habit-like heuristics can keep your behaviour looking good to your parents/teachers, but stop working once you really carefully examine them and work out when you do and don’t endorse following them.
  • You might internally seek approval from an imagined hypothetical overseer
    • If the hypothetical doesn’t become more detailed as your intelligence increases: It won’t handle complicated situations, and can be easily tricked. It can be easy to make excuses and win arguments against a hypothetical overseer.[8]
    • If the overseer is only invoked when you think the overseer knows more than you: Then at some point it becomes irrelevant.[9]
    • If you imagine a mentor at a high enough resolution, such that you know their biases and blindspots, then you can avoid actions they'd notice as problematic while still pursuing things they wouldn't approve of on reflection.
  • You may be vulnerable to ontology shifts after thinking about infinities and weird hypothetical situations.
  • Not to mention all the standard mental illnesses that humans fall into (although some of these aren’t quite what we want in an example, insofar as they damage overall competence at pursuing goals).

Similar lists can be found here or here, the first with more focus on how I expect AGI to work, and the second with focus on how Seth Herd expects it to work.

Obviously my examples here are anthropomorphising, but we can do the same exercise for other ways of thinking about internal mechanisms of intelligence. For example AIXI, or hierarchies of kinda-agents, or huge bundles of heuristics and meta-heuristics, or Bayesian utility maximisation, etc.[10] I encourage you to do this exercise for whichever is your favourite, making full use of your most detailed hypotheses about intelligence. My examples are explicitly avoiding technical detail, because I don’t want to get into guesses about the detailed mechanisms of AGI. A real version of this exercise takes full advantage of those guesses and gets into the mathematical weeds.

Tying this back to the basin of attraction

Here’s one concrete example of the basin of attraction argument: If the AI locally wants to satisfy developer preferences (but in a way that isn’t robust to more extreme circumstances, i.e. it would stop endorsing that if it spent enough time thinking about its desires), then it should alert the developer to this problem and suggest solutions. This gives us the ability to iterate that we want.

The AI may be able to use introspection to help notice some potential problems within itself, but for most of the important distribution shifts it’s in the same position as the AI researcher and is also speculating about consequences of the coming distribution shifts.

There’s a weaker version, where for example the AI has a moderately strong desire to always be truthful, but otherwise ultimately would prefer something other than helping the AI developers. The AI won’t particularly try to find “flaws” in itself, but if asked it’ll tell the truth about anything it has noticed. The humans don’t know how far to trust it, but it seems trustworthy to the limited extent to which they can test for that. In this version, there’s more responsibility resting on the humans, who have to take advantage of this apparent honesty to extract research and understanding to work out how they should iterate.

The main place where I think this story fails is that it doesn’t help much with the iteration loop running out of fuel. Even with the help of the AI, the humans aren’t that good at noticing failure modes on the hard distribution shifts, and aren’t very good at redesigning the training process to robustly patch those failure modes (without also hiding evidence of them if the patch happened to fail). We still lack the theoretical modelling of the distribution shifts, even with an AI helping us. If the AI is to help fix problems before they come up, it would have to do our engineering job from scratch by inventing a more engineerable paradigm,[11] rather than working by small and easily understandable adjustments to the methods used to create it.

A counterargument: The prosaic case

If I steelman a case for prosaic alignment research that I’ve heard a few times, it’d go something like this:

We all agree that after iterating for a while we won’t be sure that there are no further errors that are beyond our ability to test for, but still the situation can be made better or worse. Let’s put lots of effort into improving every part of the iteration loop: We’ll improve interpretability so we can sometimes catch non-behavioural problems. We’ll improve our models of how training affects generalized behavior so that we can better guesstimate the effect of changing the training data. These won’t solve the problem in the limit of intelligence, or give us great confidence, but every problem caught and patched surely increases our chances on the margin?

I agree with this, it does increase our chances on the margin, but it misses something important: As we run out of obvious, visible problems, the impact saturates very quickly. We need to decide whether to go down the basin of corrigibility pathway, or stop until we are capable of engineering corrigibility in a way that stands up to the distribution shifts.[12] To make this decision we need to estimate where the risk saturates if we follow the basin of corrigibility approach.

My best approach to estimating the potential for generalization failures[13] is by working through, in detail, all the changes to a hypothetical design of an intelligence that would be undetectable in testing but lead to undesired behaviour after a known distribution shift.  Extremely roughly, we can estimate how crazy it would be for each one to go wrong.  After thinking about each one, we can get a gut-level estimate based on how many difficult-to-notice-at-development-time errors seem plausible.  This is the approach I would use to estimate risk if I were building a bridge, skyscraper, moon-lander or cargo ship (if there weren’t better estimates available from past experience).

It’s epistemically tricky to estimate this without bias, and this trickiness is amplified by the holes in my mechanistic understanding of future AGI internals.  The task of finding lots of problems like this could lead to overestimating the risk, by directing creativity at finding lots of extremely implausible flaws.  This is a way that this kind of reasoning could end up being wrong.  On the other hand, there’s a bias in this approach caused by anchoring on the intended mechanisms and relying on creativity to come up with unintended mechanisms that nonetheless pass all training and tests.  However, I think it’s possible to calibrate your reasoning here by comparison to the ship analogy (and similar engineering tasks with big distribution shifts) and by calibrating against experience with ML generalisation.

I’ve worked through this exercise a few times, for different ways of thinking about AGI internal mechanisms, and this provides very rough estimates of the number of “degrees of freedom” in the design of an AGI conditioned on it passing all tests we are capable of performing. The conclusion I’ve drawn from doing this is that this iterative path to AGI development is near guaranteed to fail, in the sense that the intended corrigibility property won’t survive the distribution shifts.

More specific counterarguments

Slow, online adversarial training

One counterargument to the story I’m telling here is to propose slow, online adversarial[14] training. In this approach we try to go through every distribution shift in slow motion, in the hope that we can detect and patch problems as we go. We keep doing this all the way through the feared distribution shifts, so the AGI is actively helping with research and doing its own research. The hope here is that it helps with the “observe problems” side of the empirical feedback loop.

Here’s why I don’t expect it to help very much:

  1. The fuel still runs out on the “observe problems” part of the loop. You’re dealing with an above-human-level researcher AI. If at some point during that process the agent discovers it would prefer not to be perfectly corrigible, it’s at the level of intelligence and situational awareness where we plausibly don’t get any evidence of that. It’s not hard to pass corrigibility tests if you know you’re being tested.
  2. The “fix problems” part of the loop isn’t improved either, conditional on sticking to roughly the current paradigm. Adding examples to the training data needn’t fix the problems in a deep way, it can just paper over them. Then you’ve lost evidence of the problem.

Chain of thought gives us the transparency we need to iterate

Chain of thought ideally allows us to analyze the process that caused unexpected behavior, approximately diagnose what’s going on, and theorise about what fixes might solve the underlying problem. It’s hoped that this helps us with point one in the previous section.

I think this isn’t as good as it looks, and in particular will be worse in the future,[15] for two reasons:

  1. Most of the failures caused by the relevant distribution shifts don’t necessarily or clearly show up in the thinking of an agent, in advance of the distribution shifts. Two examples, taken from the first two things on the list above:
    1. If there are technologies that you could build that make the world better, you may not think about that until you know that it’s plausibly reachable from your current situation.
    2. If you have habit-like heuristics or biases that are holding you back from best reaching your goals, you don’t know about them until you’ve identified them as the first step of removing them.
  2. Current large transformers are capable of doing a lot of “thinking ahead” inside of a single token-generating forward pass. In order to start writing a paragraph in a way that flows, you need to roughly have in mind the content of the rest of the paragraph. This indicates that they have the capability necessary to (at least very roughly) consider the consequences of writing a thought in a chain-of-thought before writing it. This same capability, extrapolated to a future AGI, would make it very difficult to tell whether a problem is fixed or merely papered over.

Putting these together

If we’re carefully monitoring the CoT of an AGI as it passes through the distribution shifts[16] then we might see some thoughts that indicate the agent isn’t corrigible or aligned. For example, it might notice habits that it deemed worth removing, then reconsider (in CoT) the priorities of different tasks. This kind of evidence of incorrigibility will usually be somewhat ambiguous, but worrying to some people.

My mainline scenario is that the response to this will be some kind of retraining or patching, making a new AGI that shows dramatically less of whatever evidence was previously worrying.  Perhaps this repeats a couple of times, but soon enough the ambiguous evidence is gone and the AGI is somewhat more trusted.  It goes on to be used for more research, and for this purpose is given more freedom to experiment, explore, learn and self-improve.  The exact point at which it stops playing along is hard to predict, so my distribution is very spread out.

One way things could go well

There’s one good future that I see coming from the iterative “work with AIs to improve AIs” approach. At some point in the iteration process, the engineers (or AI agents) will realise their ability to spot and fix problems isn’t nearly good enough, and they’ll push for

  1. a pause of AGI development and
  2. research on more fundamentally legible architectures and the theoretical understanding necessary to iterate on them.

What kind of legible architecture would be enough to give me optimism? The most bare-bones would be interpretability into the beliefs and desires of an AI, and the structural knowledge to verify that those beliefs and desires are the true beliefs and desires of the AI. It’d be good to be able to distinguish behavioral and strategic heuristics from beliefs about the world, and understand when and why heuristics will be replaced as an agent learns. If the agent is best understood as Bayesian, I want to be able to tell what prior it’s using.
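
To make the "bare-bones" version concrete, here is a minimal sketch (my own, assuming nothing about how a real AGI would be built) of the kind of separation being asked for: prior, beliefs, desires, and heuristics kept as distinct, inspectable components rather than one entangled blob.

```python
# My own sketch of a "legible" agent state: the point is only that a developer
# could read each component off directly and verify which role it plays.
from dataclasses import dataclass, field

@dataclass
class LegibleAgentState:
    prior: dict[str, float] = field(default_factory=dict)       # inspectable prior over hypotheses
    beliefs: dict[str, float] = field(default_factory=dict)     # current credences about the world
    desires: dict[str, float] = field(default_factory=dict)     # what it is actually optimizing for
    heuristics: list[str] = field(default_factory=list)         # behavioural/strategic shortcuts, labeled as such

state = LegibleAgentState(
    prior={"hypothesis A": 0.7, "hypothesis B": 0.3},
    beliefs={"the shutdown button works": 0.99},
    desires={"empower the principal": 1.0},
    heuristics=["flag high-impact actions for review"],
)
print(state.desires)  # legible: read the goals off directly, and check they are the real ones
```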

From this starting point, alignment and corrigibility work would be tractable but hard. We’d need to work out what evidential threshold the AI would use before replacing a part of its own algorithm with an “improvement”. We’d need to work out how beliefs and values drift as online learning updates are made. We’d need to work out whether there are adversarial examples that exploit the desires, or that exploit the belief updating procedure. We’d need to become reasonably confident that the prior is “reasonable” and doesn’t lead to weird beliefs, or that there are failsafe mechanisms if it is not. We’d want to somehow confirm that lots of thinking won’t lead to our failsafes breaking or being removed. This work would be tractable because we would have a far greater ability to draw evidence from small experiments (with components of an agent) to the implications about a full general intelligence.

Conclusion and implications

I hope this post has conveyed the main problems I see with iterative approaches to corrigible AGI development, and why the basin of attraction analogy is a misleading way to think about this process. Empirical iteration quickly runs out of steam on many kinds of problem, and corrigible AGI is one of these problems.

I want to stress that reading arguments like the ones in this post isn’t sufficient to understand which perspective on corrigibility is correct. You have to work through the reasoning using your own examples, and do the exercises using your own most mechanistically detailed models.

There are some controversial beliefs I have that are mainly downstream of the arguments in this post, but also somewhat downstream of other beliefs and arguments that aren’t explained in this post. I’ve briefly stated them in the following dropdown:

Things I believe, mostly as a result of the arguments in this post

  • LLMs should be ruled out as a plausible approach to safe AGI. They are a mixed up jumble of beliefs, heuristics, goals and algorithms. If we’re to have a chance at doing engineering properly on AGI, these things need to be separate and visible from the perspective of the developer.
  • Some “alignment” research is about solving problems that are currently visible problems in LLMs. I consider this a waste of time and a misunderstanding of what problems are important.
    • I mean this to include things like LLM experiments that show that they plot to kill people to avoid shutdown, or express preferences that are alarming in some way, or insert backdoors into code in some situations. At best these are weak analogies for real problems, but studying most ways to make them go away in LLMs won’t help make future AGI safer, even if future AGI is technologically descended from LLMs.
  • Control extends the feedback loop a little, but doesn’t improve it. If we’re bad at seeing generalisation problems and bad at fixing them, control-based strategies may delay takeover by a little, but probably not at all.
  • Most of the safety research done in MATS, the AGI labs, or funded by OpenPhil isn’t the sort that might help with the generalisation problems, and is therefore approximately useless.
  • The way people in the alignment community rank and compare the AGI labs is misguided. All the AGI labs are so far from being on the right track that it isn’t worth comparing them.
  • Jailbreaks are a good analogy for alignment in some ways: It’s difficult to pull jailbreakers into the training distribution, so new jailbreaks stand as an example of a distribution shift that an LLM is intended to be robust to. But it’s a weaker analogy in other ways, since there’s an active adversary, and the iteration loop still exists as new jailbreaks are found and patched, just more slowly than other problems.
  • Talking about weird unintended LLM behaviours is weakly relevant to alignment, in the sense it’s evidence about how bad our current engineering feedback loops are. But it’s also a distraction from understanding the real problem, because every weirdness that you can point to will probably soon be papered over.
  • Fast vs slow and smooth vs discontinuous takeoff isn’t a very important consideration. Slow takeoff with bad feedback loops is just as bad as fast takeoff. It could have been important, if the AGI paradigm and theoretical understanding put us in a better position to do an engineering feedback loop. It could start to matter again if the paradigm shifts. As we stand, I don’t see it making much difference.

Many thanks to Steve Byrnes, Max Harms and Seth Herd for extremely helpful feedback.

  1. I’m probably abusing these definitions a bit, apologies to philosophers. ↩︎

  2. The overseer, or developer. I’m following Max’s terminology. ↩︎

  3. “While the framing of corrigibility as about reflectively-seeing-oneself-as-a-flawed-part-of-a-whole is a standard MIRI-ish framing of corrigibility, I believe that it leans too heavily into the epistemic/architectural direction and not enough on the corrigibility-from-terminal-values direction I discuss in The CAST Strategy. Furthermore, I suspect that the right sub-definition of “robust” will recover much of what I think is good about the flawed-tool frame.” Source. ↩︎

  4. Disclaimer: I don’t know anything about shipbuilding, although I once technically did win an award from the Royal Institute of Naval Architects for my part in building a rowboat. ↩︎

  5. In the shipbuilding analogy, I would come up with things like storms causing unusually high stress on rusted bolts, because it’s the sort of thing that’s difficult to notice in development tests. ↩︎

  6. Or the behavioral appearance of different goals ↩︎

  7. Most people, in my experience ↩︎

  8. I hope it’s not just me that does this. ↩︎

  9. Something like this seems to be true of me. ↩︎

  10. Although beware of shell games. It can be easy with some of these models of intelligence to accidentally hide the important capability generators in a black box, and then it becomes difficult to imagine ways that the black box might contain poorly designed mechanisms. ↩︎

  11. I’ll discuss this possibility in a later section. ↩︎

  12. Using an approach more like MIRI corrigibility or ELK. ↩︎

  13. The failures that aren’t likely to be caught by patching all visible problems that are detectable during development. ↩︎

  14. In the sense that researchers are actively trying to put the AI in test situations that elicit unintended behavior and train it not to generate that behavior, in parallel to using it to do research and help redesign itself. ↩︎

  15. We’ll probably lose legible chain of thought for various capability-related reasons, but I’ll set that aside. ↩︎

  16. i.e. does self-improvement research and successfully uses the results of this research. ↩︎


