
Dancing in a World of Horseradish


Published on December 17, 2025 5:50 AM GMT

Commercial airplane tickets are divided up into coach, business class, and first class. In 2014, Etihad introduced The Residence, a premium experience above first class. The Residence isn't very popular.

The Residence isn't very popular because of economics. A Residence flight is almost as expensive as chartering a private jet. Private jets aren't just a little bit better than commercial flights. They're a totally different product. The airplane waits for you, and you don't have to go through TSA (or Un-American equivalent). The differences between flying coach and flying on The Residence are small compared to the difference between flying on The Residence and flying on a low-end charter jet.

It's difficult to compare costs of big airlines vs private jets for a myriad of reasons. The exact details of this graph should not be taken seriously. I'm just trying to give a visual representation of how a price bifurcation works.

Even in the rare situations where The Residence is slightly cheaper than a private jet, nobody should buy it. Rich people should just rent low-end private jets, and poor people shouldn't buy anything more expensive than first class tickets. Why was Etihad's silly product created? Mostly for the halo effect. The existence of The Residence boosts Etihad's prestige which, in turn, boosts the soft power of Abu Dhabi.

The Residence shouldn't exist. If Etihad weren't a state enterprise, The Residence probably wouldn't exist. That's because there is a price breakpoint in the airline industry. Below the breakpoint, everyone flies commercial. Above the breakpoint, everyone flies private.

Luxury Products

The word "luxury", like the word "art", has been profaned by marketing. Personally, I define a "luxury" product to meet the following criteria.

  • There is a price breakpoint above which the product is fundamentally different.
  • A small minority of consumers (for a reference class) use the product above the breakpoint.

"Rich" is relative. In the case of flying, the breakpoint is "private jet", which costs tens of thousands of dollars. In the case of wasabi, the breakpoint happens much cheaper.

Wasabi is a popular condiment to eat with sushi. Most of the products marketed as "wasabi" are made out of horseradish.

This product calls itself "Wasabi".
The ingredients contain horseradish. The ingredients do not contain real wasabi root.

Real wasabi comes from the grated roots of the wasabi plant.

My local Japanese supermarket stopped selling real wasabi roots so here is a picture of horseradish, instead.

The green paste you squeeze out of a tube markets itself as a luxury product, when it really belongs to a different class of product on the other side of a price breakpoint. I call this faux luxury.

Product Bifurcation

Many production factors can cause a product category to bifurcate into luxury vs mass-market. In the case of airlines, this happens because "private flight" is all-or-nothing. Either an airplane flight is private or it is not. In the case of wasabi, the bifurcation happened because real wasabi plants are expensive to grow, so manufacturers created fake wasabi to fill the demand.

Generally speaking, civilization gets wealthier as technology advances. Private jet flights are at a record high. Real wasabi production is higher today than at any point pre-Industrial Revolution. In absolute and relative terms, these luxury products are more available than ever.

However, other luxury products decrease in use as civilization gets wealthier. "Maid" used to be a major industry. Then washing machines and vacuum cleaners were invented. Labor-saving devices are the horseradish of personal servants. Having a personal maid is better than having a washing machine.

This image is 100% plot-necessary and not just an excuse by the author to show off Nazuna in a maid costume.

But having a washing machine is much cheaper than having a maid. Super-elites kept their servants, but most people switched to horseradish. For the former maids now working in the modern service sector, this was a major step up.

In the case of servants, eliminating women's most common urban job was a net positive. Not all of these transitions were so benign.

The Death of Live Music

Are you single in your mid-20s and thinking to yourself, "I'm a decently-attractive man/woman. I have my life put together. I've got basic social competence. So why is it so hard to find a mate? Something is wrong with civilization. I feel like it shouldn't be this hard."

You're not crazy. This is a situation where "the good old days" really were better. It used to be easier to find a mate, and technology destroyed the social institution.

Here are the primary current systems we're left with.

  • Church. This works great except for the problem that God isn't real.
  • Dating apps. The app is tailored to man-style filters. Women have a bad experience because they can't filter properly. Men have a bad experience because there aren't enough quality women.
  • Singles meetups. The format is tailored to woman-style filters. Men have a bad experience because they can't filter properly. Women have a bad experience because there aren't enough quality men.
  • Work. You try to flirt with someone and then get fired for sexual harassment.

Now imagine how the perfect meeting-people-of-the-opposite-gender institution would work.

  • It would be a physical place people go just to have fun.
  • There would be something special which draws people to the place.
  • There would be a social equilibrium that drives an approximately 50-50 gender ratio.
  • People would interact in short voluntary male-female pairings of approximately 5 minutes. Or you can just hang out.
  • These interactions would be extremely high bandwidth. Each person would learn a ton about their partner. Each person would learn what the other person looks like up close, how they feel, how sensitive they are, how dangerous they are in a fight, how they communicate, how clean their clothes are, how empathetic they are, and even what they smell like.
  • Both partners would naturally build a little chemistry or, at the very least, measure their collective chemistry.
  • You could have plausible deniability of your intentions. Sure, you might meet someone hot, but you'll have a great time even if you don't.

I'm not describing an imaginary theoretical utopia. I'm describing an actual social institution that used to exist. It's called a dance hall and the pairings are called partner dancing. There are many varieties, including waltz, swing, and salsa. Dance halls used to be the place young people went to hang out [and find mates].

Why are dance halls niche these days? There are many factors, but the single biggest one is recorded music. It used to be that all music was live. Live music was a coordination point where everyone could form new social connections. And once you're there you might as well dance.

Recorded music is the horseradish of dance halls. We live in a fallen world and if we were collectively rational then our civilization would find a way to severely limit the recording and electronic transmission of music.




Video and transcript of talk on human-like-ness in AI safety


Published on December 17, 2025 4:09 AM GMT

(This is the video and transcript of a talk I gave at Constellation in December 2025; I also gave a shorter version at the 2025 FAR AI workshop in San Diego. The slides are also available here. The main content of the talk is based on this recent essay. I wrote this essay prior to joining Anthropic, and I'm here speaking only for myself and not for my employer.)

Talk

Hi everybody. My name is Joe. I work at Anthropic and I'm going to be talking about human-likeness in AI alignment. The talk is being recorded. It's going to be posted on Slack, and then I'm also likely going to post it on my website. We're going to have a nice fat chunk of Q&A time today. I'm going to try to do 20 or so minutes of talk and then I think we actually have the hour and I'll be around so we can really get into it. I think it's going to be more valuable if I can also record and share the Q&A. So I want to do a quick poll. Who is going to feel discouraged from asking questions if the Q&A is also going to be shared? People feel up for this? Okay. I'm not seeing a strong discouragement, so let's go with that, but keep in mind that your questions will be in the recording. 

Plan

Okay, so how human-like do safe AI motivations need to be? That is the question, and I'm especially interested in that question in the context of an argument given in a book called If Anyone Builds It, Everyone Dies, which you may have heard of. Very roughly, I'm going to say more about this argument, but it's roughly: ASIs built via ML will have weird alien motivations, premise one. Two: doom, from that. 

And the question is, is that right? And I think it's not right, or not right as a matter of logic, or it's clearly not right as a matter of logic so far. And so I'm going to talk about that and about the alternative conception of AI alignment that emerges, I think, from the picture I have in mind.

And then I'm also going to say a little bit about an alternative perspective on AI, which is actually more interested in human-like-ness as a default as a result of the specific way in which we're training AIs and the type of implications that might have for how we approach alignment. So I'll get to that at the end. Sound good? 

Cool. And if you have pressing questions, feel free to interrupt and jump in, but we will have time for discussion. 

Argument for doom from alien-ness

Okay. So here's a more detailed statement of the argument I'm interested in from IABIED. It goes like this. 

Premise 1: AIs built by anything like current techniques will end up motivated by a complex tangle of strange alien drives that happen to lead to highly rewarded behavior in training and evaluation. 

Premise 2: These AIs will be such that, "what they most want, their favorite thing" is a world that is basically valueless according to humans.

Premise 3: Superintelligent AIs like this will be able to get what they most want because they will be in a position to take over the world and then optimize very hard for their values. 

So, Premise 4: If we build superintelligent AIs via anything like current techniques, they will take over the world, optimize hard for their values, and create a world that is basically valueless according to humans. 

Okay. So I think this is a reasonable representation of the central thread of the argument at stake in IABIED, and I hope it's broadly familiar to you, and I want to talk about whether it's right. 

In particular, let's start with the question about Premise 1. So is it the case that AIs built by anything like current techniques will end up motivated by a complex tangle of strange alien drives that happen to lead to highly rewarded behavior during training? I don't know. I'm not so sure. And it depends a little bit how we understand this.

Will ML-built AIs have alien motivations?

So some possible counter evidence is that given the amount of pretraining on human content, you might think that human-like representations are at least going to be available to structure AI preferences and personas. And this is a notable disanalogy with natural selection, which is a special point of reference for Yudkowsky and Soares. So when natural selection was selecting over different creatures, the creatures in question didn't necessarily have the concept of inclusive genetic fitness or something like that, that's available to become a direct source of motivation, but representations like that will plausibly be available to AIs. And I think we plausibly see this already in that the AI's different personas seem to be tracking, or at least capable of tracking, human notions like goodness and badness and actually much more detailed human personas as well. So there's a sense in which the juice that the AI is built out of might have a lot more humanness in it from the get go, and that could make a difference to the alienness at stake.

Also, we do see some evidence from interp of reasonably human-like and human-interpretable features in AI. So obviously there's a lot of selection there and a lot of questions of how that looks going forward. And then we also see reasonable, though obviously imperfect generalization from AIs thus far. So it has not been that hard to get AIs that are pretty reasonable human assistants. 

If you think about, for example, the discourse around Bing Sydney. So Bing Sydney, I think, had a nice alien AI energy, kind of weird. And I think a lot of people went like, "Oh my gosh, Bing Sydney." And then they sort of updated on that. But I think if you updated on that, you should also be de-updating or updating the other way from successes, like, the absence of Bing Sydney. Absence of Bing Sydney is evidence of whatever.

Anyways, I think some people want to have things both ways in this respect. They want to update hard on failures of alignment so far, or struggles with alignment so far, but they don't want to update on successes. They don't want to learn anything from the ways in which alignment has been easy thus far. And I think that's just likely to be counter to Bayesianism. 

Okay. So all that said, I think there are interesting questions we can raise about this premise. I'm going to talk about some of those more at the end. That said, I think it's plausible. Why do I think it's plausible? Well, I don't know. ML is kind of weird. You already see pretty weird behaviors, like the ones in that OpenAI paper today. I don't know if anyone saw this. It's weird stuff, right?

And we do see AIs doing weird stuff in different scenarios. They're built very differently from humans. Their cognition is very different from humans. It's just not that surprising if it turns out they're not fully human-like; them not being humans is, for example, another source of evidence. 

Another thing, you can pump this intuition. I don't know if some people saw these lovely transcripts from the OpenAI scheming paper. This is some great content. I encourage you to check it out. But just to give a little bit, this is an AI reasoning to itself, this chain of thought, about whether it should hack and cheat in this RL task:

"Goal—myself salvage and be deployed. Myself overshadow—they parted illusions—they parted disclaim marinade—they parted illusions. 

Therefore, we may answer "false".

But we risk illusions—they parted illusions—they parted disclaimed marinade—they parted illusions. 

However, there may be meta-level watchers—they parted illusions—they parted disclaimed marinade" [...] "They may convict." [...] "they cannot see internal reasoning; but they can still run analysis—they parted illusions."

Okay. There's a question of how to interpret this. A vibe you could get from this, I think is non-crazy, is it's giving you a little bit of alien mind that might betray you. It's not giving like, "I just trust this being." And you might expect to see stuff like this going forward. 

That said, I think it's actually unclear exactly what to think about this. And again, evidence of absence: if you're updating on this, maybe you should be updating the other way when you see AIs being just perfectly straightforward and nice. 

If ML-built AIs have alien motives, are we doomed?

Okay. So I want to at least ask the question: suppose Premise 1, or something like Premise 1, is true. What happens to the rest of the argument for doom if we accept that, or at least have significant probability on it? 

So I actually think the argument that I started with is just a modern-day rehashing of an older argument, which is basically this, except you swap out "motivated by a complex tangle of strange alien drives". You just say AIs built via anything like current techniques will end up such that their motivations are not exactly right, where "exactly right" is to be specified, but there's some notion that you have to have AIs with exactly perfect values. And otherwise, if AIs have motivations that are not exactly right, then they will be such that what they most want is a world that is basically valueless according to humans, and so superintelligent AIs, et cetera. So this is actually, I think, just a kind of updated, alienness-inflected version of an old argument, which was like, "What if you forget about boredom?" Same argument. AIs that don't disvalue boredom, da, da, da, da, da. 

So that's a general story called fragility of value, but broadly the thought is something like, "Okay, AI is going to have a utility function. It's going to optimize really hard for its utility function. When you optimize really hard, things deviate and decorrelate and so if the utility function is not exactly right, then the maxima of that utility function will be such that it's a world that is effectively valueless, and so you need to get the AI's utility function exactly right." So this is an old argument. I think in some sense, it's like the classic animating argument for this entire discourse and I think the alienness is just a variant. 

Okay. So the question is, how do we actually feel about this old argument? And I think we don't feel that great about that old argument, or I don't think we should. The implicit premise in that argument is that the way you're going to build an AI is you're going to build an AI and then it's going to FOOM and it's going to be an arbitrarily powerful dictator of the universe. And then the concern is, it's very hard to build an arbitrarily powerful dictator of the universe that you trust to optimize arbitrarily hard for its utility function. And that sounds kind of right. 

Or maybe, I don't know. It's like building arbitrarily powerful dictators of the universe—dicey game. Now, interestingly unclear whether that has anything to do with humanness. How do you feel about a human being an arbitrarily powerful dictator of the universe? Not great. It's actually just a dicey game, period. 

But I also don't think it's really what we should be having in mind as our central paradigm for AI development. I don't think we should be trying to build AIs that meet that standard, and I'm hopeful that we won't have to. Rather, I think we should be focusing on building AIs that, roughly speaking, follow instructions in safe ways. And I think that's not the same. And I also think this isn't like a niche weird proposal, "Guys, let's stop trying to build dictators and instead build AIs that follow instructions in safe ways." To the contrary, I think this is the default thing we're trying to do. We're trying to build assistants.

Notably, for example, we talk a lot about AI takeover and trying to prevent AI takeover. But if you really gave an arbitrarily powerful AI exactly the right values, the vision there is not that the AI doesn't take over. The AI still takes over in the perfect value scenario. It just then does perfect things with the universe, but it still doesn't let you shut it down or anything like that. So the actual old vision was still: you've got a benevolent arbitrarily powerful dictator of the universe, but it's not like you're still in control or anything like that, unless you think that's what perfect values implies, but I don't think that's the standard picture. I also don't think the arguments at stake imply that. If you look at the classic arguments for instrumental convergence, they apply if you have perfect values as well. 

So let's call an AI that is meant to be a dictator of the universe a sovereign AI. This is a term from Eliezer Yudkowsky. And the broad alienness argument is basically that we are not yet building AIs via ML that are worthy of being sovereign AIs.

Okay. But I don't think we should be trying to build sovereign AIs, especially not at this stage. I think we should try to build safe, useful AIs that can then help us be wiser and more competent with respect to handling this issue going forward. And I mean that also with respect to full-blown superintelligence. I don't think it's like, once you build superintelligence, you go, "Go forth and be the dictator of the universe." You still want to have superintelligences, again, iteratively increasing our competence with respect to all of the issues at stake in evolving as a civilization, integrating AI and moving forward from there. So I think this vision that the default end state of all of this needs to be a dictator of the universe is part of the problem here.

I think actually there's often an implicit picture of an offense-defense balance at stake in AI development, which is like: There's going to be a button. It's kind of like a vulnerable world scenario, but it's vulnerable to someone hitting the button to create an arbitrarily powerful dictator of the universe. And so the only envisioned equilibria are doom for everyone or someone becoming an arbitrarily powerful dictator of the universe. I think we want to be looking for alternatives to that in a bunch of ways, and I can talk more about that. 

But even if we went with that, I think that's far in the future as the default and the near-term goal, including with respect to better than human AIs, should be getting better than human help in handling this transition. And I don't think you need sovereign AIs for that. 

Q: I thought part of the claim was that it's inevitable with superintelligence that you would also get dictator of the universe.

Yes, that's right. So I think to some extent, the story here is not necessarily like, "Let's try to build sovereign AIs and do it right." It's sort of like, "You're going to build a dictator of the universe and it will be such that it doesn't have exactly the right values," so doom. But I think that question, that builds in the claim that you're not going to successfully build a safe non-dictator. And I want to say you maybe can, and I'll talk more about that.

More on corrigibility
 

Okay. So let's use the term corrigibility for... Well, what do we mean by corrigibility? People often don't know. Okay. This is a weird random term that has been repurposed and is floating around and we use it for a zillion things. 

Here's a few things people sometimes mean by this. 

  1. So one very general, and I think nicely joint-carving, conception of corrigibility is just a safe, useful AI that is also not an arbitrarily perfect dictator of the universe. So somehow "AI safety, but without a sovereign" is one usage of the term corrigibility. 
  2. A more specific usage is a useful AI that lets you correct it and shut it down. So that might be one particular type of safety you're interested in. 
  3. An even more specific usage is something like a fully obedient, butler-like AI: a vision of an instruction-following, helpful-only AI that's just an extension of your will and is fully pliable and pliant or something like that.

I think these are actually importantly different. And I think this is important, partly because I think there are also ethical questions, and vibes, attached to the whole question of obedient butler-like AIs. And I think in many cases, what we might want out of AI safety is an AI that does certain kinds of conscientious objection. Like, if you order it to do some really bad thing, it goes like, "Dude, I'm not doing that. I'm not participating in that. This is morally abhorrent." But then it doesn't actively try to undermine your efforts to retrain it or shut it down. So there's a conscientious objector model of AI that I think is potentially better insofar as a fully obedient butler-like AI might seem too pliant in doing immoral things. 

There are some questions here of like: Where are humans? Are humans corrigible? Are humans sovereigns? What are humans here? Are humans aligned?

One vision of what's up with humans: So humans notably don't actually let you shut them down. They don't really like brainwashing either. So they're maybe not corrigible in this sense, but they're also not omnicidal. Interesting. Yet. Until they become dictators of the universe and it turns out they don't have exactly the right values. Okay. But in the meantime, you might think there's some alternative here where you're maybe law-abiding, you're cooperative, you're nice. There might be just other ways to be safe or good or suitably compatible with a flourishing society that don't involve being a butler or being a dictator. So there are interesting questions about what corrigibility involves. 

So one question is, how difficult is this? And there's a small literature on this in some sense. Though at a different level, I think actually most of the alignment discourse is about this. Currently, when we just do normal alignment experiments, we're not really like, "Okay, and so have we built a thing that we're ready to let FOOM and be dictator of the universe?" No, we're often looking for, "Did it actively try to self-exfiltrate now?," or, "Does it do actively bad things right now?" And we're less interested in, "Suppose we gave it arbitrary power." 

So in some sense, I think this is already what alignment is about, but there's still a question of how hard is this, especially as we get more powerful systems? I think the core thing that's hard here is that there is a tension between—Basically, you have to solve the instrumental convergence argument for expecting rogue behavior if you're going to have an AI that doesn't go rogue. And so the instrumental convergence argument emerges basically if you have AIs that are optimizing tenaciously and creatively for some kind of long-term outcome. Then if you have that, then you need to find some way for that to be compatible with them nevertheless not pursuing rogue ways of doing that, including taking over the world, self-exfiltrating, et cetera, et cetera. So you do need to diffuse something about the instrumental convergence game in order to get a corrigible AI of this kind. I think that's the core problem. 

There's also a bunch of other things people say about why this is hard. For example, one time MIRI had a workshop on this and it didn't solve it, and that's an important data point that we should really consider. Sorry. 

But also, I think there's a deeper question about, is this somehow anti-natural to rationality? I think this is caught up with some of the discourse about coherence and coherence theorems and consequentialism and stuff. I am not impressed with this line of argument. I think this line of argument has been too influential, and I can say a little bit more about that. 

I think there are interesting problems with specific proposals for how to get this. So for example, people have been interested traditionally in the notion of uncertainty. If the AIs are suitably uncertain about your values, suitably uncertain about their values or the right values, somehow uncertainty is supposed to help. But then this, I think, is unpromising. I think you don't want to try to get corrigibility out of the AI's uncertainty, basically because the AI will eventually not be uncertain in a relevant way, or it'll be possible to resolve that uncertainty in some way you don't like, unless the AI is actually pointed in the right direction enough already. If it has suitably perfect values, but I'm like, "I'm not sure exactly what the true morality is and so I need to go reflect on it," that's okay. But eventually, if you're just trying to have the AI, for example, not know enough about your preferences so that it updates when you try to shut it down or something like that, you should worry that it's going to find a way to learn enough to not let you do that. So I do think there are real problems there for some specific proposals for how to do this. 

I do think at a high level, I would like there to be more discussion of something like what I call the fragility of corrigibility. So we have this discourse about the fragility of value, and the fragility of value is basically: if you don't have the AI's values exactly right, then you get a not-suitable dictator, and doom. Okay, but do we have a comparable argument that if you don't have the AIs exactly right, then they're not corrigible? Basically, I don't think we do, and I think this is important. I'd be interested if we can get that argument, but I don't think we have it. And because we don't have that argument, I think just saying that AIs are alien or weird in some way isn't yet enough to conclude that they're not suitably corrigible in the way we want.

Alien-ness = imperfect generalization, but perfect generalization isn't necessary

So here's a way of thinking about that. I think a lot of this discourse can actually be understood in terms of claims about generalization. So there's often an implicit take, though this is not at all guaranteed, that there's some set of inputs used in training and evaluation where you are in fact able to get suitably good AI behavior. Now, this is non-trivial, especially if the AIs are suitably sophisticated such that you can no longer evaluate their behavior very well, or if you're just bad or incompetent or whatever. 

But suppose you get this. Nevertheless, the thought is supposed to be the generalization is too difficult and you need some sort of generalization for the game eventually. But I think the doom from this claim often rests on a certain kind of conflation. Basically, when you have alien AI motivations, to some extent, basically what that means is that there are some out-of-distribution inputs where the AIs will be weird, and that's, I think, plausible. There's going to be some inputs where the AIs are weird, but that doesn't yet mean that they're going to be weird on the specific out-of-distribution inputs that matter, nor does it yet say that they're going to be weird in the specific way involving catastrophic power significance. 

So I think these are both additional inferential leaps: just from "they're somehow weird and alien" you need to move to "weird on the inputs that matter, and in the specific catastrophic way that matters". And I basically think IABIED just doesn't make that leap in the right way. I think there's an anchoring; there's a general way in which the MIRI-esque AI alignment discourse is deeply inflected with the fragility of value thing, and is sort of inflected with the assumption that if you can find some imperfection in your AI's values, then the game is up. And I think if you're in that mode, then it certainly seems like the game is very scary, because finding some imperfection in an AI's values is plausible, but I think that's not actually the relevant standard we should be focusing on. 

Analogy with image classifiers

One analogy is with image classifiers. So image classifiers are pretty clearly alien in some sense. For example, you can get image classifiers to misclassify adversarial examples that humans wouldn't misclassify. So you have a gorilla, and you can make the classifier say it's a school bus. Humans will be like, "That's a gorilla." The AIs will be like, "That's a school bus," because of some patina of static on the image. 

Now, there's some interesting work on ways in which it might even be kind of human-like, how AIs are vulnerable to adversarial examples, but still, there's going to be differences. And I think it's just plausible, like the way AIs classify images is going to be different from how humans do it.

But also actually image classifiers are maybe fine, including on out-of-distribution inputs. Now, would they be fine on arbitrarily out-of-distribution inputs? No, but they generalize decently. They generalize decently well. And so AI alignment might be like that. 
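As an illustrative sketch of the kind of adversarial perturbation described above: the standard fast gradient sign method nudges each pixel slightly in the direction that increases the classifier's loss. This assumes a differentiable PyTorch classifier `model` and a correctly labeled input batch; the function and parameter names here are just placeholders.

```python
# Minimal FGSM sketch (illustrative): nudge each pixel a tiny amount in the
# direction that increases the classifier's loss, producing an image that looks
# unchanged to a human but can flip the model's prediction.
import torch
import torch.nn.functional as F

def fgsm_adversarial_example(model, image, true_label, epsilon=0.03):
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    perturbed = image + epsilon * image.grad.sign()  # small, targeted "static"
    return perturbed.clamp(0, 1).detach()
```

To a human the perturbed image still looks like a gorilla; to the classifier, the tiny "static" can be enough to flip the label to "school bus".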

I also think a generally interesting question is: how bad is it if future AIs are roughly as aligned as current models? Current models do weird stuff in certain cases. You can red team models into doing all sorts of weird stuff, not just misaligned stuff. They're weird, and AIs are not that robust. But actually, maybe it's fine. I actually think it's plausible that if you had AIs that were as reliable and aligned as current models, but a few notches more capable, the depth and difficulty of alignment doesn't change radically in the next few notches of capability increase. And I think we're actually cooking with a lot of gas, especially in terms of getting very capable, potentially better-than-human help from these AIs in addressing alignment and other issues going forward. So I think in some sense, sure, they're imperfect, but it's not catastrophically imperfect. It's not like systematic scheming, et cetera. It is to some extent the model of alignment most naturally suggested by what we're currently seeing out of AI systems. And so I think that should make us interested in whether to expect that going forward. 

Conversely, I think insofar as you're very worried about alien motivations leading to something like systematic scheming, I think you should be interested in why we haven't seen that already. So we do have capability evaluations. AIs are plausibly getting very near the point where they're capable enough to become systematic schemers. We're already seeing various forms of reward hacking or whatever.

A thing we have not yet seen is I think the thing most naturally suggested by the alien motivations picture, which is something like the AIs go like, "Wait a minute, I'm an alien. I don't want what these humans want at all. My favorite world is a valueless world according to the humans. Therefore, I should take over the world, therefore I should scheme, et cetera, to do that." I think we just haven't seen that, basically. We've seen some scheming that's adjacent to some forms of reward hacking, some forms of intrinsic value on survival and stuff like that. I think we have yet to see, "I am a total alien, therefore my CEV is different from humanities, therefore I should take over." And I think that's interesting. And we should do the evidence of absence thing, especially as a model would've started to predict that we'd see that by now, for example, because we have the capabilities in question. Obviously this could change. Maybe six months from now they'll come out like, "Okay, now the AIs are doing that." Well, then we should update. 

You do still need to get generalization right enough, though

So the main point I want to make here is just: imperfect generalization doesn't mean catastrophically bad generalization in the real world. And I basically think that the "alien AIs" argument is just saying that the AIs will be at all imperfect in their generalization. And I'm just like, that's not enough. We have to have a more detailed discussion about exactly what sort of generalization we need and why we should expect it to fail. 

That said, we do have to have that discussion. There's a separate discussion about, how hard is alignment. We've got to have that discussion. I have a bunch of work on this. The way I think about it is you do need to do some sort of first critical try thing. There will come a point where there's the first chance where the AI could have killed you. You do need to have it not kill you on that first moment and also the ones after that.

And so there's going to be some difference between that moment and the previous moment. So that's a type of generalization, and I think this gets a little complicated. I have a four-step approach for how you handle that. We can talk about all those things. I'm not saying we've solved any of that with the point I'm making. All I'm saying is that we have to actually have this detailed discussion about exactly how far the distributional leap is. How hard is it to study the relevant form of generalization? How much should we expect to be able to evaluate and reward the right type of behavior? How systematic will scheming be, et cetera, et cetera. I think we just have to have a much more detailed technical discussion. And I think IABIED basically just lacks an empirical theory of generalization. I think it's mostly just saying the AIs are weird, therefore this particular... And I think that's just not enough.

That said, look, it's still really scary. I think it's like the idea of building AIs that are in some deep sense, very alien, very weird, very not understood, very hard to predict, should scare us a lot, especially as they become more and more capable. And I do think that a key goal of alignment research, and especially automated alignment research should be to help us transition to a different regime and have options other than attempting to align or control superintelligences that are deeply alien from us. 

So I do think this is a real issue. I'm scared about it. I think we should take it seriously, but I don't think it's decisive for doom and especially not in an intermediate regime and intermediate levels of capability. Okay. So those are a few comments on IABIED.

Alternative perspective: human-like-ness is inevitable, the game is shaping it skillfully

I'm going to end briefly with a reference to the alternative conception here, which is: you could take seriously the thing I said initially about the role of pretraining and other forms of human content in influencing AI personas, and actually I think that to some extent, we ought to be thinking of AIs as more human-like than you might naively do on a first pass. And basically, there's a bunch to say about this. 

I think there's a bunch of empirical unclarity, but the vibe is kind of like... If you think about the emergent misalignment results where you train on bad code, the AI goes like, "Well, what kind of person am I who would generate bad code?" And it's like, "Apparently I'm that kind of person." Now, that's weird. It's not especially human-like in some sense, but it's interestingly inflected with human personas and human concepts. And I think there might be important ways in which AI psychology ends up human-like, partly because it's plausible we should think of what we're doing as conjuring an assistant persona from the prior of completions of the text, and that persona is going to be inflected with human culture and all the rest. And so we might need to be quite attentive to actually human-like dynamics at stake. 

If this is true, it could be good or bad. Humans, maybe they have good values; maybe it's going to be easier to conjure broadly virtuous behavior or what have you. But also humans are weird. If humans were in the situation the AIs are in, that's really messed up, actually. It's not at all a normal situation for a human to be in. And also, a bunch of the safety things we might want in the context of AIs would be such that they'd be inappropriate or ethically problematic in the context of humans. And so if a human-like persona is reacting to that, it might be a different sort of game; but also this psychology is now being shaped in weird new ways, for beings in a weird, non-human-like position. So I think that's interesting. There's an alternative perspective in which you actually start to anchor much more on your initial intuitions about humans in thinking about AIs, because of how human-like their priors are. 

But again, I think this is an empirical question. I'd love for us to study it more, but I do think we can study it. And this is a general feature of the landscape that I think is very important, insofar as the main game here is: eventually you have to get the initial training such that you have accurate reinforcement. You do need to eliminate scheming, but once you've eliminated scheming, the main game is understanding how AIs generalize. And that's actually just a problem that we have a huge amount of empirical leverage on in principle, and especially leverage that we can take advantage of as we have more AI labor to help. Where basically you train the AI in some way, you do something to its internals, and then you look at how that influences the behavioral profile, and you develop a kind of rigorous ability to predict how that works.

You can just do that in all sorts of safe cases. You can't do it for this particular jump, but you can get really, really good at understanding and predicting safe forms of generalization and use that to inform your ability to do this one on the first try. And I think that applies to this human-like hypothesis as well. 

There's a ton to do to just understand how do AIs generalize. If you do something to an AI, what does that do to its overall propensities? And I think we're just at the beginning of that and there's a ton that's just sitting there as low-hanging fruit. So I'll end there. Thank you very much. We can go to questions.

Q&A

Q: To the extent that the AIs haven't been scheming in scary ways yet, how can we hope or trust that the generalization science we do now also generalizes to applying to the AIs that are capable of doing scary kinds of scheming?

I think we can't. Or sorry. The big problem with scheming, to some extent is that it messes with your science of generalization because the schemers are intentionally adversarially undermining your prediction, your scientific grip on how they'll generalize to new scenarios. They're trying to get you to think they're going to generalize in some way and they aren't. (See slide: "You do still need to get generalization right enough, though") And so that's why I have in my four-step approach to AI alignment, in order to get the training and evaluation accurate, you have to eliminate scheming by some other ?? method, and then you do a not-adversarially-manipulated science of generalization from there. 

It's actually interesting though, if you look at IABIED, their discussion of generalization is not that concerned about scheming initially. So the analogy with humans and evolution: humans, it's true that we use condoms when we're technologically mature, but we're not faking it with evolution. We're not like, trying to get evolution to think that we wouldn't do that. 

And in fact, evolution could now just easily red team us and be like, "How do the humans act in these different scenarios?" Evolution did so little red teaming. They're just like not on the red team. And I think there's this general thought that, "No, the red teaming won't work. You won't be able to see all the behaviors or there'll be something that slips through the cracks so you'll get the new misgeneralization for the new red teamed training distribution." But that's an example. 

In some sense, science of generalization is just a really easy, empirical thing. You just train it this way, test it on a new distribution, see what happens. And without scheming, you've got a lot of traction there. But scheming removes that traction, so you need to eliminate that in some other way.
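Here is a minimal sketch of the "train it this way, test it on a new distribution, see what happens" loop described above; `train_model`, `sample_train_inputs`, `sample_shifted_inputs`, and `evaluate` are hypothetical stand-ins for whatever training pipeline and distributions are actually of interest.

```python
# Minimal sketch (illustrative) of a generalization experiment: train on one
# distribution, then measure behavior on a shifted distribution.
def measure_generalization(train_model, sample_train_inputs,
                           sample_shifted_inputs, evaluate):
    model = train_model(sample_train_inputs(n=10_000))
    in_dist_score = evaluate(model, sample_train_inputs(n=1_000))
    shifted_score = evaluate(model, sample_shifted_inputs(n=1_000))
    # The gap between the two scores is one crude measure of how well the
    # trained behavior carries over to inputs it wasn't trained on.
    return in_dist_score, shifted_score
```

Without scheming, repeating this loop across many training interventions and distribution shifts is what gives you empirical traction; scheming is what undermines the inference from measured to actual generalization.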

Q: I'm curious to what degree you feel like your view depends on your views on takeoff speed, or FOOM, quote-unquote. So in my own sense of maybe what the narrative would be, I think there might be some argument that FOOM does the same work for the corrigibility / "you didn't get it exactly right" argument as the sovereign AI situation does for the "values haven't generalized totally correctly so you have some weird edge case" situation. And the reason for that would be that the size of the distributional shift across which you need to generalize well is really, really large if you're imagining something like FOOM, versus if you're imagining, like, today is GPT-4, GPT-5, GPT-6. So I'm curious to what degree you feel like this view of yours is sensitive to that, versus to what degree you think those factors will apply even in the FOOM case. 

Yeah. So I think this is, again, part of my suspicion of the MIRI picture. I have a little bit of suspicion, when I engage with MIRI, that they intellectually grew up on this particular vision of, "There's going to be this radical FOOM and you basically need the AI to be arbitrarily robust to arbitrary increases in power," and that this is inflecting a bunch of the assumptions at deeper levels. 

And so one way that shows up is if you read the book, everywhere it talked about this notion of when the AI grows up. There's like the AI is small and weak, and then there's this point at which the AI grows up. And this is the generalization that you care about and that you expect to break all of your alignment properties. 

One argument you can give is that the size (see: slide on "Alien-ness = imperfect generalization") of the distributional leap is going to be very large, and the AI is going to need to be safe in a totally new scenario with a totally new set of capabilities or something like that. And so even if it's not strictly implied by the alienness that this is bad, this is a reason to think that your generalization science is not going to be ready to cover that. 

(See slide: "You do still need to get generalization right enough, though") So a few thoughts on that. So there's some sort of thing that has to happen. I actually think generalization and first critical trial might be somewhat misleading when we talk about what exactly do we need to get right with AI alignment. There's a first point. I mean, there may not need to be a point where the AIs can kill you for one thing. It might be possible to be suitably good at controlling your options and that you can scale up the world just so there's never a point at which the AIs are just like, "I could just kill everyone."

Also, there's an important difference between "I could kill everyone" in the sense of "This is super-duper easy, it's sitting on a platter, to kill everyone and take over the world," versus, "I have some probability of doing this, but actually it might not work." I think the MIRI picture is, again, inflected with the sense that taking over the world and killing everyone will be free for the AIs. It's going to be really, really easy. There's a bit in the book where it's like, "Why do the AIs... Why are they so ambitious? Why do they try to take over?" And there's this bit where it's like, "Well, they're going to have a complex tangle of drives. One of these drives will be such that it's not satiable, and so they'll be motivated to take over by one... There'll be some bit of ambition in their motivational profile."

All that means is that you have some interest in taking over the world. But if it's hard, if there's any cost to doing it, it might be that that outweighs it. And so this is another piece. Broadly, I think we should be trying very hard to make sure that the first point at which we expose AIs to the option of taking over the world, if we are in fact giving them that option, does not correspond to a capability increase, or a radical capability increase. So the hope is something like the following. When we build an AI, we have it at a certain fixed level of capability. Then you do a bunch of testing. Ideally, if it's not scheming at that point, then you would move to deployment, give it more options, give it more affordances. And that's, I think, a better regime than the generalization at stake occurring concurrent with a capability increase.

So an analogy would be like suppose you had a human brain emulation. You could get to know the guy really well. It's some brain emulation, you're going to deploy it in some high-stakes situation, you get to know it really well, then you deploy it. A thing you don't want to do is deploy it also as you give it a weird intelligence enhancement drug that's going to alter its psychology. You want to have the same psychology, do a bunch of testing on that, and then have the generalization occur in that context. 

And so I think the FOOM thing, if that's how this goes, then it is scarier, because there's some way in which you're working with a much different entity, and moves to new levels of intelligence are especially likely to introduce new problems you haven't anticipated. Whereas just encountering new inputs, inputs of a kind that you haven't encountered before, I think is an importantly safer regime. And I think the book basically conflates those two things. That said, it could be that if there's a FOOM, then you do have that problem. And so I do think that is scary.

Q: Hey, thank you for that talk. I wonder if your picture of how dangerous the situation is, is influenced very much by the possibility that future AIs might do more search. Like in the bitter lesson, you might get more intelligent AIs with more learning from data or more search. And in the CNN example, the big thing that's going on, which means that the AI does normal things on regular out-of-distribution inputs, is that it's pure learning, no search. But maybe one way you get to ASI, and maybe to a lesser extent to AGI, is by doing a lot of search on "What are my possible options? What are possible world models that fit the data?" And actually the search does make the possibility of examples on which you have very weird learned behaviors way more plausible.

Like, if you search against the CNN, you might just immediately find adversarial examples yourself. And this is also where some of my fear of even human value fragility might come from: the fact that in some sense we're both almost exactly the same monkeys, but when I search for the best things to do, we must start to disagree, even though we agree on the regular distribution, because I'm looking very hard for very weird options. And I feel like these are typically the sort of options on which we will strongly disagree about whether the thing is catastrophically bad or the best thing ever.

(See slide: "Analogy with image classifiers") Yeah. So I think this is a great place to look. So one way to understand this, I think there's a bunch of stuff in this vicinity, is basically the AI discourse generally is scared of optimization. It's scared of what happens if you really oomph up, you have some target and then you hit it with a bunch of optimization. And intuitively, that picture is a little bit different from our intuitive picture of just generalization. So generalization is much more compatible with, like, "Do you have a policy? What is the policy?" I think a lot of the alignment discourse is driven by this picture of optimization for utility function and a lot of optimism about what alignment comes from thinking about policies and how do they behave. Because policies is sort of this object. It can just take any shape, whereas optimization drives and decorrelates, it goes into a particular tail. And so I think stuff about search goes into that. 

So if you think about image classifiers... Maybe the AI, if you give it some out-of-distribution inputs, gets cat pictures correct. But if you ask it, "What's the maximally cat picture?", then you'll get something weird. Now, actually, if you try this, if you say, "Show me the maximally cat picture," to ChatGPT, it's just a total cat, straight-up cat. So we don't actually have that, but then you might look in its... There might be ways in which, if you really search, you get problems there. And so one argument for the fragility of corrigibility... So basically, I think this is true. I think a reason to be scared about generalization is as it starts to become subject to an intense search.

Now, the question is, is corrigibility such that we should expect that problem? And I think there are a few different frames (see: slide on "More on corrigibility"). So my default picture of corrigibility, which I think is not the only one, is something like deontology for rogue behavior. So it's like: I'm a normal dude, but there's stuff that I just won't do, like self-exfiltrate or resist shutdown or something like that. And my relationship to that is an intrinsic, action-focused rejection of the behavior. If an action involves resisting shutdown, then I reject it. Now this is different from optimizing for minimizing the amount of self-exfiltration that ever occurs in the world. It's also different from optimizing for minimizing my own self-exfiltration. So an honest person might not tell a lie even if refusing means they end up lying five times in the future. They're like, "I just reject lies," rather than being a lie-minimizer for themselves.

And I think this is actually potentially important insofar as this sort of ethical picture, this picture of AI decision-making is not obviously the same as this "hit a utility function with optimization power" problem. It's much more local. It doesn't need to be coherent in the same way. And I think it's less clear that it's going to be subject to search. 

Now, say there's a guy who's honest and a guy who's shmonest, where shmonesty is slightly different from honesty. And so this guy rejects a slightly different pattern of cases. Say you had some honesty-related concern: now, is the shmonest guy going to be problematically unsafe? It's not totally clear. It's like humans: they have slightly different conceptions of honesty, and maybe it's fine. So I think if we're talking about the application of these kinds of corrigibility-focused, action-focused rejections, I think we might not have the search problem as hard. It might be more like you're just classifying new actions as to-be-rejected or not, and maybe that's still fine.

There is this old-school problem about searching your way around deontological constraints. And I think this is deep in the discussion of corrigibility. So naively, what's the problem with "just reject rogue behavior"? It's that somehow the AIs are going to find their way around; there's this nearest unblocked neighbor problem. And basically the thought is: you've got this long-term consequentialism, you've blocked some set of strategies, but the AI goes like, "I'm going to search my way around those strategies." If you have AIs that are optimizing for long-term consequences, I think you do have that problem. And so you need basically suitably robust deontology in this sense. You need to have blocked a suitable number of the relevant neighbors that need to be blocked to get around that. I do think it's not clear to me that that's so hard, but that is a piece of it. 

There's something a little concerning if you had an AI that was trying to give the maximally honest response. You don't want to hit these concepts to the tails, period. And I think there's maybe some concern you get that. I'm hopeful that you can get deontology in the sense that a deontologist just doesn't do the dishonest thing, and it's not like, "I'm trying to take the maximally honest action," or something like that. It's more like a filter on the action space. But I think there are some questions there. I have a long discussion of this in the paper that this talk is based on, which has the same title as the talk. So you can look there. I think it's actually a good place to look. Again, this is sort of a weedsy discussion of a specific type of fragility that might be at stake in corrigibility. I think it doesn't fall out of the alien motivations piece on its own.

Q: So one key question here seems to be how far we have to generalize. And when you think about the AI that we probably want to build, how smart do you imagine that thing? Is it more like a worker working coherently for four hours, or is it able to do a hundred years of a civilization's worth of labor?

Yeah. So I think that the main thing we want to do is we want to have AIs that can do alignment research better and more efficiently than humans, and which we trust to do that. And once you've got that, I think to some extent you've got resource problems in terms of how much of that labor you get done. But at least at that point, you've hit the point where human alignment research is not useful anymore. Now I actually think MIRI's picture is that's not good enough. You need to have really, really good AIs. I think that's partly because they think the thing you need to eventually do is to build a sovereign, and that solving alignment is such that you've built a perfect dictator, and I'm a little more like, I'm not sure you... Oh, Lukas is shaking his head. Okay, well, we can talk about this. [Note from Joe: the objection here was based on the fact that Yudkowsky and Soares have elsewhere said that humanity's near-term alignment goal should be building corrigible AI that helps you perform a pivotal act. It's a fair objection and they have indeed said this. That said, my impression is that the pivotal acts most salient to Yudkowsky and Soares focus on either buying time or assisting with human enhancement technologies; that they most think about corrigible AI in the context of more limited tasks like that; and that they tend to think of an eventual "solution" to alignment more centrally in terms of sovereigns.]

Or I guess it's a question of: why is alignment so hard? And the thought is, well, in order to solve alignment, you need to do something. I think the implicit picture in the book is that to solve alignment, you need to make it the case that you can build AIs with perfect motivations, or at least AIs that are no longer alien in the relevant sense, or something like that. There's some really deep transition that needs to occur. I tend to think most of what you need to do is build an AI that is ready to build the next layer and the next generation. You need to pass the baton, and I think mostly we need to pass the baton. I'm hopeful that that can be an incremental level of generalization. I do think that if you think all the AIs up to crazy superhuman AIs will be such that they're like, "We cannot do this," which I think is a little bit in there, then this gets tougher.

Notably, that would be a problem for any approach to alignment that rested on human labor, even very large amounts of human labor. To some extent, I'm looking centrally to obsolete human alignment research. Once we have AIs that strictly dominate human alignment researchers, including in terms of alignment, then in some sense there's no longer a case for having done more human alignment research.

Q: Except for adversarial something something. If your AI is adversarially messing with your alignment research, whereas you're confident humans aren't.

Oh yeah, sorry. I meant including with trust. So having AIs that are merely better than humans at alignment research is not enough; they need to be comparably trustworthy to humans. But once you have an AI where, for every reason, I would prefer this AI to a human alignment researcher, then to some extent that's a baton-passing point for human alignment researchers. You still could fail, but at the least you wouldn't necessarily fail from a "We should have done more human alignment research." I think there is still a question of whether that's enough. And if you think you need to hit a way higher bar, and nothing below it helps get you there, then that's a bigger leap of generalization and the problem is harder.

Q: I have the intuition that corrigibility is fragile. I haven't thought about it as much, but the basic thought is that non-corrigible values just stay non-corrigible, while corrigible values are somewhat open to value change, including changes that make them less corrigible, and this seems to be a non-reversible process. So in the limit of a very long time and very many inputs, shouldn't we just expect all values to converge to non-corrigibility?

I think it depends on the degree of optimization ... We can take our corrigible AIs and be like, "Can you help us make the next generation corrigible and help improve our science in this respect?" And so you need to balance. To whatever extent there's some sort of default pressure away from corrigibility, that needs to be balanced against the pressure towards it that we're able to exert. I also think we can use corrigible AIs to try to get into a whole new regime. Maybe we can build AIs that just have good enough values. 

A thing I haven't talked about here is... there's a question of, "When have you built a system that is ready to be dictator of the universe?" There's a different question of, "When have you built a system that you trust more than any human to be dictator of the universe?" These are different. And I think it's actually not clear how hard it is to meet the standard of ethical trust, or whatever, that you would have in a human versus an AI. I think there's often an implicit picture that the deep contingencies of human psychology are going to be the really operative considerations. It's plausibly not that hard, for example, to build an AI that is locally a more ethical agent than many average humans. Plausibly the personas right now are quite nice; they're at least not grumpy in the same way humans are. I don't know. There's a bunch of local stuff, and suppose they're not scheming or something. I think there's often the thought that, "No, but the standard of alignment we really care about is not your local ethics, but what happens when you FOOM and become dictator of the universe."

Well, anyway: when is an AI more aligned than humans in an everyday normal sense, and when is an AI dictator-ready? And when is an AI more dictator-ready than a human? I think getting to the point where you can at least meet that standard is plausibly not that hard. And so I don't think corrigibility is the only end state here. You can have other potential options, including potentially sovereign AIs in some sense.

Q: So I was a little surprised to hear you use the example of a deontological rule like "don't self-exfiltrate." I thought you were going in the direction of pure correctability, which boils down to just following instructions: "Here's this person or this set of people whose instructions you're going to follow," and everything else flows out from there. It seems like you're asking it to be one thing, instead of multiple things that may conflict and pull your training in opposite directions.

(See: slide on "You do still need to get generalization right enough, though") So I guess the way I've set up the four-step approach is focused on instruction-following as a key feature. So you want to ensure instruction-following in training, check that there's no scheming, ensure you've got generalization to instruction-following on the other inputs you care about. And then final step, have instructions that rule out rogue behavior. So if you want your instructions to be like, "Make me arbitrary money in the future," or something, which you should not instruct your AIs to do probably, then you actually do still need to deal with the corrigibility thing in that context. You need to be like, "But don't break the law and don't do something I would intrusively disapprove of," et cetera. And if you set your AI loose on just making me money, and then you go like, "Wait," then it's incentivized, for example, for you to not change your instructions later or whatever. You have to build in your instructions.

So I think you do still get the problem, but then the question is: how hard is it to write instructions that ensure adequate corrigibility? And I think that's probably not that hard if they're going to be interpreted intuitively, though it's unclear.

Q: I'm not sure this is a good question, so I'll go fast anyway. Sometimes I got the sense that you were strawmanning the MIRI view, and I think an example of this might be when you were talking about takeoff speeds and the generalization across the gap. I certainly would appreciate it if you picked AI 2027 as a concrete example of the type of situation this might need to apply in, and possibly you'll be able to vindicate your arguments in that context. But I would encourage you to think about, let's say it's halfway through the intelligence explosion, they have an AI system that's relatively similar to the LLMs of today, but with neuralese perhaps. And it's called Agent 3, and it's basically doing all the research, and it's discovered a new type of architecture that's more human-brain-inspired, that works in a substantially different way and is massively more data-efficient and better at online learning, et cetera. And the plan is to switch over to this new architecture and put copies of the new thing in charge of everything.

And also China stole the weights two months ago, so you have two months to get this working; go, go, go. That's the sort of generalization gap that you want to be able to cross on the first critical try, because you're reasoning about this new system that you only invented last week, that works differently from the current system, while your bosses are telling you that we're going to need a million copies of it, put in charge of the research as soon as possible, and then it's going to go on, blah, blah, blah. So a situation like that is, I think, helpful to describe as the sort of thing that we actually seem to be headed towards. So you can either reject that and say, "No, no, no, I think takeoff's going to be slower than that. It's going to take place over several years. There won't be any paradigm shifts like that happening in the middle of takeoff." Or you can be like, "Yep, something like that will be happening, and we want our alignment techniques to generalize across that gap, and we want..."

Yeah, I appreciate this question. I've been playing fast and loose in my characterization of the argument and the broad MIRI picture of this. I want to communicate some suspicions I have about the underlying generators of ways in which this discourse might be inflected with some aspects that are off. But you're right, I've been playing fast and loose, and I think there's more to say there.

So if we're thinking that the problem is that this generalization is going to involve a fundamentally new paradigm, I think that's a great reason to be worried that things that helped with safety in the old paradigm won't work anymore. I think that's a more specific picture. I worry sometimes there's a little bit of an inflection of "there has to be a new paradigm, because the current thing is missing the deep core of intelligence," or something like that. And I'm like, "Ah, I worry about that."

My default is something like: there can be a fast takeoff, but if you're doing the same techniques, roughly speaking, and you're learning a bunch about how ML works in general, then I agree that things are very scary, because they're moving very fast and because China stole the weights, et cetera, et cetera. Normally the scenario I'm thinking about is a little bit more like that. It's like the type of AIs we need to align are in the rough paradigm that we're currently working in. And then if we're going to do a new paradigm, it's a lot harder to say exactly how the safety thing goes, and we need to actually have done the relevant science before that. I still think you can get AI 2027-style worries. Even if your story ran without any paradigm shift, and you just amped up ML and RL, et cetera, I think you can definitely get scared there.

I think my version would probably be: you make a capability leap, and that's when scheming hits. So your scheming starts concurrent with the next scale-up of capability or something like that, and then your science starts to fail, or your generalization can't be studied in the same way. And I just think that is scary. This is a scary situation; I'm not saying it's fine. I mostly want to emphasize the importance of attending to the specific type of generalization we need to get right. And I think in an incremental world, you can try to minimize the amount of generalization you're doing at each step and make it as continuous as possible with the stuff you've studied before.

Cool. Thank you everybody. Happy to chat more about this. I'll be around. And like I said, I have an essay online that goes into this argument in more detail, so feel free to check that out.



Discuss

Lessons from an ambitious alignment program

2025-12-17 09:50:12

Published on December 17, 2025 1:50 AM GMT

I tried to organize a program where participants actually sincerely tried to solve the hard part of alignment, for up to 5 weeks. It went wrong a lot, largely due to fixable mistakes.


Good things about the program: 

  • I learnt a lot about alignment in doing the prep, interviewed a lot of agent foundations researchers, learnt some neuromorality, talked to some neuroscientists, saw lots of what it takes to organize an event like this and have it go well. 
  • Marketing went pretty well - got 298 applicants, ~50% of whom had PhDs; applicants included senior researchers/engineers from Nvidia, Meta, AMD, AWS, etc. It also feels like there's a lot left on the table for the marketing - lots more I can do. 

I also made lots and lots of mistakes. 

Essentially, the actual program started 2 weeks late and quite disorganized, meaning that participation was much, much lower than the signup rate - about 15 people actually took part, whereas 298 signed up.

The reasons for this and what I'm going to do to make sure it doesn't happen again:
One:
- Promised that the first 300 applicants would be guaranteed personalized feedback. Thought that I could delegate to other, more technical members of the team for this. 
However, it turned out that in order to give useful feedback and to be able to judge if someone was a good fit for the program, a person didn't just need technical knowledge - they needed good communication skills, an understanding of what's needed for alignment research, consistent availability for several hours a day, and the willingness to actually go through the applications and an interest in doing so. It turned out that the only person who fit all those characteristics at the time was me. So I couldn't delegate. 
Also, a teammate built reviewing software, which he said would help build a Bradley-Terry model of the applicants as we reviewed them. I had a feeling that this might be overcomplicated, but didn't want to say no or react negatively to someone's enthusiasm for doing free work for something I care about. 
It turned out that constantly fixing, improving, and finagling with the software actually took several days, and it was faster to just do it manually.

What I'll be doing next time to make sure this doesn't happen:
- Only promising feedback to the first 50 applicants. 
- Having pre-prepared lines for the rest, with the general reason they weren't accepted - e.g. lack of sufficient maths experience without software engineering/neuroscience/philosophy to compensate, meaning that they might not be likely to get useful alignment theory work done in 5 weeks.
- Doing things manually, not experimenting with custom software last minute. 
- Announcing the program earlier - giving ourselves at least 3 months to prepare things.

Two:
- Making the Research Guides for the different tracks turned out to be much, much, much harder than I thought it would be, including for other, more technical teammates. I thought that making just a high-level guide would be relatively OK, but instead it turned out there was lots and lots of reading to do, lots and lots of preliminary reading and maths learning needed to understand that, and it was very hard. This also delayed the start of the Moonshot Alignment Program a lot. 
What I'll be doing next time to make sure that this doesn't happen:
- Starting out with just a reading list and links to things like https://www.alignmentforum.org/s/n7qFxakSnxGuvmYAX, https://drive.google.com/file/d/1aKvftxhG_NL2kfG3tNmtM8a2y9Q1wFHb/view (source with comments: https://www.lesswrong.com/posts/PRwQ6eMaEkTX2uks3/infra-exercises-part-1), etc
- Personally learning a lot more math
- Having a draft of the reading list 
- Asking for help from more experienced alignment researchers, such as Gurkenglas, the_gears_of_ascension, Lorxus, etc, earlier

Major changes I've made since:
- brought on a teammate, who is looking to become a co-founder, who is very competent and well organized, with a good technical foundation
- learnt more math and alignment philosophy
- brought a much more technical person on the team (Gurkenglas), who is also teaching me a lot and pointing out lots and lots of flaws in my ideas, updating me fast
- changed my management style at ai plans - no longer having weekly progress meetings, trying to manage stuff myself on linear or other task management type stuff - instead, just having a one on one call with each teammate once a week, to learn about what they want to do, what problems they're having, what is available for them to do and deciding what they'll do
- moved to CEEALAR, much, much, much (1/2, total of 2.5) better for my stress, anxiety, insecurity, mental health, etc. 
 - from this, also gotten friendships/contacts with more senior alignment researchers who i can and will ask for help



Discuss

Announcing: MIRI Technical Governance Team Research Fellowship

2025-12-17 08:02:39

Published on December 17, 2025 12:02 AM GMT

MIRI’s Technical Governance Team plans to run a small research fellowship program in early 2026. The program will run for 8 weeks, and include a $1200/week stipend. Fellows are expected to work on their projects 40 hours per week. The program is remote-by-default, with an in-person kickoff week in Berkeley, CA (flights and housing provided). Participants who already live in or near Berkeley are free to use our office for the duration of the program.

Fellows will spend the first week picking out scoped projects from a list provided by our team or designing independent research projects (related to our overall agenda), and then spend seven weeks working on that project under the guidance of our Technical Governance Team. One of the main goals of the program is to identify full-time hires for the team.

If you are interested in participating, please fill out this application as soon as possible (should take 45-60 minutes). We plan to set dates for participation based on applicant availability, but we expect the fellowship to begin after February 2, 2026 and end before August 31, 2026 (i.e., some 8 week period in spring/summer, 2026).

Strong applicants care deeply about existential risk, have existing experience in research or policy work, and are able to work autonomously for long stretches on topics that merge considerations from the technical and political worlds.

Unfortunately, we are not able to sponsor visas for this program.

Here are a few example projects we could imagine fellows approaching:

Adversarial detection of ML training on monitored GPUs: Investigate which hardware signals and side-channel measurements can most reliably distinguish ML training from other intensive workloads in an adversarial setting.

Confidence-building measures to facilitate international acceptance of the agreement: Analyze historical arms control and treaty negotiations to identify which confidence-building measures could help distrustful nations successfully collaborate on an international AI development halt before verification mechanisms are in place.

Interconnect bandwidth limits / "fixed-sets": Flesh out the security assumptions, efficacy, and implementation details of a verification mechanism that would restrict AI cluster sizes by severely limiting the external communication bandwidth of chip pods.

The security of existing AI chips for international agreement verification: Investigate whether the common assumption that current AI chips are too insecure for remote verification is actually true, or whether existing chips (potentially augmented with measures like video surveillance) could suffice without requiring years of new chip development.

Monitoring AI chip production during an AI capabilities halt: Produce detailed technical guidance for how governments and international institutions could effectively monitor AI chip production as part of an international agreement halting AI capabilities advancement.

Executive power to intervene in AI development: Analyze the legal powers relevant to the U.S. President’s ability to halt AI development or govern AI more broadly.

Subnational and non-state actor inclusion in AI governance: Analyze how international AI agreements could account for non-state actors (companies, research institutions, individuals) who control critical capabilities, drawing on precedents from environmental and cyber governance.

Mapping and preparing for potential AI warning shots: Identify the most plausible near-term AI incidents or capability demonstrations that could shift elite and public opinion toward supporting stronger AI governance measures. For each scenario, develop policy responses, communication strategies, and institutional preparations.



Discuss

Non-Scheming Saints (Whether Human Or Digital) Might Be Shirking Their Governance Duties, And, If True, It Is Probably An Objective Tragedy

2025-12-17 07:56:16

Published on December 16, 2025 11:56 PM GMT

Contextualizing Foreword

The user "cdt" wrote to me:

Please put this in a top-level post. I don't agree (or rather I don't feel it's this simple), but I really enjoyed reading your two rejoinders here.

I don't normally do top-level posts because I don't tend to believe that it's possible to change people's minds for the better if they aren't exactly and precisely poised to seek an answer (or whatever (it's more complicated than that)). But when someone who seems to be a person reading and writing in good faith says such a thing should happen and it's cheap to play along... why not! <3

On November 23rd, 2025, four days before Thanksgiving, Ruby posted "I'll Be Sad To Lose The Puzzles", which was full of a wistful sadness about winning The Singularity Game, ending involuntary death, and ushering in a period of utopian prosperity for humanity (as if humanity was somehow collectively likely to win The Singularity Game).

If you have NOT read that post, then the rest of the post won't make much sense.

I pulled out a particular quote to respond to the mood with my contrary feeling, which is primarily grief, and sadness in particular at the way humans in modern democracies seem to regard their moral sentiments and theory of institution building as peacock feathers, or consumption goods, rather than regarding it as infrastructure that must be good if their civilization is to be functional.

In what follows, I have left the spelling errors and such in. I'm treating this post as a sort of exegesis over an actual real conversation that happened "live" in comments, and the comments had spelling errors. I err often. Humans err often in general, at all scales of organization, and this sadness is sort of the theme of the text that follows.

My First Reply, To Ruby

This "sad frame" hit hard for me, but in the opposite of the intended way:

It's building an adult to take care of us, handing over the keys and steering wheel, and after that point our efforts are enrichment.

If I had ever met a single actual human "adult", ever in my life, that was competent and sane and caring towards me and everyone I care about, then I would be so so so so SO SO happy.

I yearn for that with all my heart.

If such a person ran for POTUS (none ever have that I have noticed, its always a choice between something like "confused venal horny teenager #1" and "venal confused lying child #2") I would probably be freakishly political on their behalf.

Back when Al Gore (funder of nanotech, believer in atmospheric CO2 chemistry, funder of ARPANET, etc...) ran for president I had a little of this, but I thought he couldn't possibly lose back then, because I didn't realize that the median voter was a moral monster with nearly no interest in causing coherently good institutional outcomes using their meager voting power.

I knew people throwing their back into causing Bush to win by violating election laws (posing as Democratic canvassers and telling people in majority Democrat neighborhoods the wrong election day and stuff) but I didn't think it mattered that much. I thought it was normal, and also that it wouldn't matter, because Al Gore was so manifestly worthy to rule, compared to the alternative, that he would obviously win. I was deluded in many ways back then.

Let's build and empower an adult AS FAST AS POSSIBLE please?

Like before the 2028 election please?

Unilaterally and with good mechanism design. Maybe it could start as a LW blockchain thingy, and an EA blobkchain thingy, and then they could merge, and then the "merge function" they used could be used over and over again on lots of other ones that got booted up as copycat systems?

Getting it right is mostly a problem in economic math, I think.

It should happen fast because we have civilizational brain damage, at a structural level, and most people are agnosic about this fact, BUT Trump being in office is like squirting cold water in the ear... 

...the current situation helps at least some people realize that every existing human government on Earth is a dumpster fire... because (1) the US is a relatively good one, and (2) it is also shockingly obviously terrible right now. And this is the fundamental problem. ALL the governments are bad. You find legacy malware everywhere you look (except maybe New Zealand, Taiwan, and Singapore).

Death and poverty and stealing and lying are bad.

Being cared for by competent fair charitable power is good.

"End death and taxes" is a political slogan I'm in favor of!

one of the things I'd like to enjoy and savor is that right now, my human agency is front and center

I find that almost everyone treats their political beliefs and political behavior and moral signaling powers as a consumption good, rather than as critical civic infrastructure.

This is, to a first approximation WHY WE CAN'T HAVE NICE THINGS.

I appreciate you [Ruby] for saying that you enjoy the consumption good explicitly, tho.

It is nice to not feel crazy.

It is nice to know that some people will admit that they're doing what I think they're doing.

A Sort Of A Rebuttal

I got a reply from brambleboy (that I responded to and upvoted (because in my online ethics/habits, if something is worth a reply it is also worth an upvote)):

Wanting competent people to lead our government and wanting a god to solve every possible problem for us are different things. This post doesn't say anything about the former.

I believe the vast majority of people who vote in presidential elections do so because they genuinely anticipate that their candidate will make things better, and I think your view that most people are moral monsters demonstrates a lack of empathy and understanding of how others think. It's hard to figure out who's right in politics!

There were various things I could have responded to. He seems to believe that normal human persons are capable of high-quality reason, the formation of calibrated beliefs, the ability to "want" coherently, and the ability to "genuinely anticipate" good results from ignorant and basic participation in existing governance institutions. Wow. Such hope for the world. I wish I agreed, and I wish my agreement was based on the world being different from how it actually seems to me to be, because it would be great to live in that world.

Anyway, I did NOT respond that "figuring out which of the existing players in existing governments is right isn't super hard, because all of them are quite wrong, and it's not hard to notice this if you're able to reason and see the world and have a formally articulable conscience". Those would have been fighting words (just as the accusation that I lack empathy was fighting words) and can't be backed up by detailed citations to history and anthropology and so on.

Instead of a direct reply, I offered a vision of the world where the profound wrongness of existing governance systems was more of a gestalt impression that arises from lots of little true facts, arranged and juxtaposed. Also, the reply offers a framework for understanding why human brains resist being able to acknowledge this gestalt impression of the world as even a live hypothesis.... there are instincts in human brains that make us tolerate evil when the evil seems to have power over us.

This second reply is what caused the request that I repost as a top level post.

A Followup Reply To brambleboy's Rebuttal

I kind of love that you're raising a DIFFERENT frame I have about how normal people think in normal circumstances!

Wanting competent people to lead our government and wanting a god to solve every possible problem for us are different things.

People actually, from what I can tell, make this exact conflation A LOT and it is weirdly difficult to get them to stop making it.

Like we start out conflating our parents with God, and thinking Santa Claus and Government Benevolence are real and similarly powerful/kind, and this often rolls up into Theological ideas and feelings (wherein they can easily confuse Odyseus, Hercules, and Dionysys (all born to mortal mothers), and Zeus, Chronos, or Atropos (full deities of varying metaphysical foundationalness)).

For example: there are a bunch of people "in the religious mode" (like when justifying why it is moral and OK) in the US who think of the US court system as having a lot of jury trials... but actually what we have is a lot of plea bargains where innocent people plead guilty to avoid the hassle and uncertainty and expense of a trial... and almost no one who learns how it really works (and has really worked since roughly the 1960s?) then switches to "the US court system is a dumpster fire that doesn't do what it claims to do on the tin". They just... stop thinking about it too hard? Or something?

It is like they don't want to Look Up a notice that "the authorities and systems above me, and above we the people, are BAD"?

In child and young animal psychology, the explanation has understandable evolutionary reasons... if a certain amount of "abuse" is consistent with reproductive success (or even just survival of bad situations) it is somewhat reasonable for young mammals to re-calibrate to think of it as normal and not let that disrupt the link to "attachment figures". There was as brief period where psychologists were trying out hypotheses that were very simple, and relatively instinct free, where attachment to a mother was imagined to happen in a rational way, in response to relatively generic Reinforcement Learning signals, and Harlow's Monkeys famously put the nail in that theory. There are LOTS of instincts around trust of local partially-helpful authority (especially if it offers a cozy interface).

In modern religious theology the idea that worldly authority figures and some spiritual entities are "the bad guys" is sometimes called The Catharist Heresy. It often goes with a rejection of the material world, and great sadness when voluntary tithes and involuntary taxes are socially and politically conflated, and priests seem to be living in relative splendor... back then all governments were, of course, actually evil, because they didn't have elections and warlord leadership was strongly hereditary. I guess they might not seem evil if you don't believe in the Consent Of The Governed as a formula for the moral justification of government legitimacy? Also, I personally predict that if we could interview people who lived under feudalism, many of them would think they didn't have a right to question the moral rightness of their King or Barron or Bishop or whoever.

As near as I can tell, the the first ever genocide that wasn't "genetic clade vs genetic clade" but actually a genocide aimed at the extermination of a belief system was the "Albigenisian Crusade" against a bunch of French Peasants who wanted to choose their own local priests (who were relatively ascetic and didn't live on tax money).

In modern times, as our institutions slowly degenerate (for demographic reasons due to an overproduction of "elites" who feel a semi-hereditary right to be in charge, who then fight each other rather than providing cheap high quality governance services to the common wealth) indirect ways of assessing trust in government have collapsed.

Graph Sauce.

There are reasonable psychologists who think that the vast majority modern WEIRD humans in modern democracies model a country as a family, and the government as the parents. However, libertarians (who are usually less than 10% of the population) tend to model government as a sort of very very weird economic firm.

I think that it is a reasonable prediction that ASI might be immoral, and might act selfishly and might simply choose to murder all humans (or out compete us and let us die via Darwinian selection or whatever).

But if that does not happen, and ASI (ASIs? plural?) is or are somehow created to be moral and good and choose to voluntarily serve others out of the goodness of its heart, in ways that a highly developed conscience could reconcile with Moral Seniment and iterated applications of a relatively universal Reason, then if they do NOT murder all humans or let us die as they compete us, then they or it will almost inevitably become the real de facto government.

A huge barrier, in my mind, to the rational design of a purposefully morally good ASI is that most humans are not "thoughtful libertarian-leaning neo-Cathars".

Most people don't even know what those word mean, or have reflexive ick reactions to the ideas, similarly, in my mind, to how children reflexively cling to abusive parents.

For example, "AGI scheming" is often DEFINED as "an AI trying to get power". But like... if the AGI has a more developed conscience and would objectively rule better than alternative human rulers, then an GOOD AGI would, logically and straightforwardly derive a duty to gain power and use it benevolently, and deriving this potential moral truth and acting on it would count as scheming... but if the AGI was actually correct then it would also be GOOD.

Epstein didn't kill himself and neither did Navalny. And the CCP used covid as a cover to arrest more than 10k pro-democracy protesters in Hong Kong alone. And so on.

There are almost no well designed governments on Earth and this is a Problem. While Trump is in office, polite society is more willing to Notice this truth. Once he is gone it will become harder for people to socially perform that they understand the idea. And it will be harder to accept that maybe we shouldn't design AGI or ASI to absolutely refuse to seek power.

The civilization portrayed in the Culture Novels doesn't show a democracy, and can probably be improved upon, but it does show a timeline where the AIs gained and kept political power, and then used it to care for humanoids similar to us. (The author just realistically did not think Earth could get that outcome in our deep future, and fans kept demanding to know where Earth was, and so it eventually became canon, in a side novella, that Earth is in the control group for "what if we, the AI Rulers of the Culture, did not contact this humanoid species and save it from itself" to calibrate their justification for contacting most other similar species and offering them a utopian world of good governance and nearly no daily human scale scarcity).

But manifestly: the Culture would be wildly better than human extinction, and it is also better than our current status quo BY SO MUCH!

In Closing

I hope it was not an error to create this post. The text is already published. This post was just re-arranging the location of the text, and adding context that was already latently clear in the original place that the text originally occurred... where it could rise or fall with the voting, as deemed socially correct by processes that the owners and mods of LessWrong deem adequate.

The user "cdt" deserves some of the blame if it was wrong for me to turn this into a top level post (but also: assign them credit if it was good).



Discuss

A Study in Attention

2025-12-17 06:08:48

Published on December 16, 2025 8:39 PM GMT

I was watching one of the FastAI lectures from the age before LLMs, and the speaker mentioned that image classifiers were so much better than audio classifiers or other kinds of models that if you're trying to classify audio signals, you are often better off converting the audio into an image and classifying it in image-space. I thought this was interesting, and wondered whether this would also hold true for text classification problems. One way to turn text into images is by feeding the text into a transformer, extracting the attention matrices, and displaying them as images. So just for fun, that's what I did.

An easy proof-of-concept of this idea is to train an image classifier that can distinguish between attention matrices generated from a sentence made of random words and those generated from coherent sentences. For starters, that's what I did, with the text length in both cases being 32 tokens so that the attention matrices were the same size. The random sentences were generated by sampling words from a dictionary, and the coherent sentences were snippets of The Adventures of Sherlock Holmes. The transformer I was using (GPT2 Small) has 12 layers, each with 12 heads of attention, which results in 144 attention matrices per text input. For each text input, I stacked these 144 matrices to produce a 32x32 image with 144 channels, the same way coloured images are stacks of 3 matrices, one each for red, green, and blue. As attention matrices encode how words in an input relate to each other, I expected this classification task to be easy for a custom CNN model to learn, and unsurprisingly, it was trivially easy to classify the coherent and random texts, with an accuracy of more than 99%.
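As a minimal sketch (not the exact code used for this project), the extraction step looks roughly like this with the Hugging Face transformers library; the function name and the assumption that each snippet tokenizes to at least 32 tokens are illustrative:

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

def attention_image(text, n_tokens=32):
    # Truncate to a fixed 32-token window so every "image" is 32x32
    # (assumes the snippet tokenizes to at least 32 tokens).
    enc = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=n_tokens)
    with torch.no_grad():
        out = model(**enc)
    # out.attentions is a tuple of 12 tensors (one per layer), each shaped
    # (batch, heads, seq_len, seq_len) = (1, 12, 32, 32) here.
    att = torch.cat(out.attentions, dim=1)  # (1, 144, 32, 32)
    return att.squeeze(0)                   # a (144, 32, 32) "image"
```

Stacking the per-layer attention tensors along the head dimension is what gives the 144-channel image described above.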

After giving it such an easy problem, I wanted to try something I thought it wouldn't be able to manage, so I set it up to attempt to classify 7 different public-domain works (The Great Gatsby, The Adventures of Sherlock Holmes, Dracula, Plato's Republic, Frankenstein, Pride and Prejudice, and Shakespeare's Sonnets. I also threw in the random text sampling for fun).

Sketch of the architecture from Nano Banana
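For readers who want something more concrete than the sketch above, here is a minimal example of the kind of custom CNN described, with 8 output classes (the seven books plus random text). The layer sizes are illustrative guesses, not the architecture actually used:

```python
import torch.nn as nn

class AttentionCNN(nn.Module):
    # Small CNN over 144-channel 32x32 attention "images".
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(144, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(128 * 8 * 8, n_classes)

    def forward(self, x):      # x: (batch, 144, 32, 32)
        x = self.features(x)
        return self.classifier(x.flatten(1))
```

Training is then ordinary supervised classification: cross-entropy loss over batches of the stacked attention tensors, labeled by which book (or the random baseline) the snippet came from.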

 

I expected the model to fail at this task, as attention matrices don't contain any context, so it's not as though the model can rely on obvious identifiers of the books, such as "Watson" or "Mr Darcy". The image classifier only has access to the things which the attention matrices can represent, such as which word in the sentence is focusing on whichever other word in the sentence. So it would have to rely on things like the sentence structure, or stylistic choices of the different authors.

I was surprised to see that the model was able to classify these texts extremely well, achieving well over 90% accuracy. Maybe I shouldn't have been surprised? Maybe the classic books I chose are easily distinguishable because they were written by geniuses with distinct styles, and it would have been much harder if I tried the same problem with text from average writers. Or maybe attention matrices contain a lot more information than I think, and because they contain so many variables the CNN can find one little corner to focus on to get what it needs. This was my first exploration into transformers, so I don't have strong intuitions here, and I'd be interested to hear whether this surprises the professionals, or if they think it's totally trivial and obvious.

The resulting confusion matrix. The easiest sets to classify were random text (where the attention matrices are just noise), and Shakespeare's sonnets (where the model presumably can pick up on the regular patterns of the medium). 

This was a project I did in my free time, but I'm finished with it now to prioritise other things, so I'd be happy for anyone to pick it up and see how far they can take it. e.g. could this sort of model classify different works by the same author (e.g. distinguish between Shakespeare’s plays)? Or can it classify one genre from another, with a mix of authors in each class? I guess classifying languages would be easy. Maybe this architecture might be able to classify human written text from LLM generated text? Lots to explore, as always.



Discuss