
Gödel's Ontological Proof

2025-12-09 05:09:25

Published on December 8, 2025 8:49 PM GMT

In 1970, Gödel — amidst the throes of his worsening hypochondria and paranoia — entrusted to his colleague Dana Scott[1] a 12-line proof that he had kept mostly secret since the early 1940s. He had only ever discussed the proof informally, in hushed tones, in the corridors of Princeton’s Institute for Advanced Study, and only ever with close friends: Morgenstern, and likely Einstein[2].

The proof purported to demonstrate that there exists an entirely good God. It went unpublished for 30 years due to Gödel's fear of being seen as a crank by his mathematical colleagues — a reasonable fear given the anti-metaphysical atmosphere pervading mathematics and analytic philosophy at the time, in the wake of the positivists.

Oskar Morgenstern remarked:

Über sein ontologischen Beweis — er hatte das Resultat vor einigen Jahren, ist jetzt zufrieden damit aber zögert mit der Publikation. Es würde ihm zugeschrieben werden daß er wirkl[ich] an Gott glaubt, wo er doch nur eine logische Untersuchung mache (d.h. zeigt, daß ein solcher Beweis mit klassischen Annahmen (Vollkommenheit usw.), entsprechend axiomatisiert, möglich sei)’

Oskar Morgenstern

About his ontological proof — he had the result a few years ago, is now happy with it but hesitates with its publication. It would be ascribed to him that he really believes in God, where rather he is only making a logical investigation (i.e. showing that such a proof is possible with classical assumptions (perfection, etc.), appropriately axiomatized).

Oskar Morgenstern, translation (mine)

Gödel ultimately died in 1978, and the proof continued to circulate informally for about a decade. It was not until 1987, when the collected works of Gödel were released, that the proof was published openly. Logicians perked up at this: when Gödel — who, it’s fair to say, is one of the all-time logic GOATs[3] — says that he has found a logical proof of the existence of God, you would do well to consider it seriously indeed.

The Proof

The Logic

Gödel’s proof takes place in a canonical version of modal logic, usually called S5. It contains all the standard rules of propositional logic (modus ponens, conjunction, all of the standard tautologies), as well as four rules for the so-called “modal operators.” The modal operators are represented by a “Box” and a “Diamond”. Whenever you see a box, you should think “it is necessary that…”, or “in all possible worlds”, and whenever you see a diamond, you should read “it is possible that…”, or “in at least one possible world.”

Rule 1
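In symbols (reading the box as “it is necessary that”), this rule, usually called the T axiom, can be written:

□A → A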

Brief justification. This one is pretty easy. It says only that if it is necessary that A is true (i.e. there is no world where A is false), then A must also just actually be true. We won’t really need to use this rule.

Rule 2
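In symbols, this is the standard distribution axiom (usually called K):

□(A → B) → (□A → □B)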

Brief justification. This rule states that if it is necessary that “A implies B,” then if it is necessary that A is true, it is also necessary that B is true. You can imagine this as saying the following:

  1. Suppose that there are no worlds where A is true and B is not true (so that it is necessary that A implies B).
  2. Suppose also that A is true in every world.
  3. Then B is also true in every world.

Rule 3
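In symbols, this is the characteristic S5 axiom (usually just called 5):

◇A → □◇A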

Brief justification. This rule says that if it is possible for A to be true, then it is necessary that A is possibly true. That is, if A is true in some possible world (so that A is possible), then it is necessarily the case that A is possible: from the standpoint of any world, A could have been true.

This is the most contentious rule of S5, but it is equivalent to some other rules that are less contentious — I think it’s not so bad. But you already know the final stop on this train — you may choose to get off here, though I would recommend a later station.

Rule 4 - Necessitation rule
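In symbols (writing ⊢ A for “A is a theorem of the logic”): if ⊢ A, then ⊢ □A.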

Brief Justification. This rule just states that if a logical statement follows from nothing, i.e. it is a logical tautology, then it is necessarily true in all possible worlds. This makes sense, since we suppose that the rules of logic apply in all possible worlds (the possible worlds we’re considering here are supposed to represent all the logically possible worlds). There is no logically possible world where a tautology is not true, since it is a tautology (note that this statement is itself basically a tautology).

This rule is usually not stated explicitly as one of the axioms of S5, as it is so widely accepted that all of the modal logics basically take it as given. The modal logics that satisfy this rule are called the normal modal logics.

The Proof - Axioms and Definitions

I will walk through the proof that Scott transcribed[4], which is in the usual logical notation that modern philosophers and mathematicians are familiar with. It is equivalent to Gödel’s original proof, but Gödel wrote his version in traditional logic notation — which is in my opinion much harder to read. The proof takes place in second-order logic, so we will have “properties of properties” (i.e. a property can be “perfect[5]” or a property can be “imperfect”). This is not particularly controversial.

Axiom 1 - Monotonicity of perfection
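In symbols, with P(φ) read as “the property φ is perfect”, the axiom can be written:

(P(φ) ∧ □∀x(φ(x) → ψ(x))) → P(ψ)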

I know it’s starting to look scary — but I promise it’s not as bad as it looks. The P that appears here is one of those “properties of properties” I mentioned just before. It says only that “this property is perfect”.

The sentence above says, therefore, that “If property 1 is perfect, and it is necessarily true that whenever an object has property 1 it must also have property 2, then property 2 must also be perfect.”

This seems justifiable enough — if property 1 is perfect, and there is no world in which you have property 1 and don’t also have property 2, then property 2 should also be perfect, otherwise how could we have said that property 1 was truly perfect in the first place?

Axiom 2 - Polarity of perfection
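In symbols (in the weakened form used in this post, rather than Gödel’s original biconditional):

P(φ) → ¬P(¬φ)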

This says that if a property is perfect, then not having that property is not perfect. This just means that it can’t be the case that having a certain property is perfect and also not having that property is perfect. This seems sensible enough — we cannot say it is perfect to be all-knowing and also perfect to not be all-knowing, this would be nonsensical.

Notably, this is not saying that every property is either perfect or not perfect — it is just saying that if a property is perfect, then the negation of that property can’t also be perfect.

Axiom 3 - Possibility of perfection
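In symbols:

P(φ) → ◇∃x φ(x)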

This axiom says that if a property is perfect, then it must be possible for there to exist something with that property.

It would seem unreasonable to say that a property is perfect and also that there are no worlds where anything can actually instantiate that property. Surely it must at least be possible for something to have the property if we’re calling it perfect, even if nothing in our world actually has that property. Otherwise we can just consign this property to the collection of neither-perfect-nor-imperfect properties, or else simply call it imperfect, since nothing can ever have it.

Definition 1 - Definition of God
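In symbols, writing G(x) for “x has the property of Being God”:

G(x) ⟺ ∀φ(P(φ) → φ(x))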

This introduces the definition of God that we’ll be working with in this proof. It should be relatively familiar — something has the property of “Being God” if it possesses every perfect property. Also, if something possesses every perfect property, then we can call that thing God. It is a being which possesses every perfect property; what word would you like to use for it?

It does not say that God only possesses perfect properties, God can also possess properties that are neither perfect nor imperfect — however God cannot have any imperfect properties, as that would lead to a contradiction by Axiom 2.

Axiom 4 - God is Perfect
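In symbols, applying P to the property G itself:

P(G)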

This axiom just says that the property of being God — you know, the property of having every perfect property — is itself perfect.

This seems reasonable; it almost follows from axiom 1. However, since there is no single perfect property that implies the property of “Being God,” axiom 1 alone does not give us that “Being God” is perfect, so we need to introduce a special axiom to say that it is.

Again, how would you describe a being that has every perfect property — it seems ridiculous to say that such a being is not itself perfect.

Definition 2 - Perfect-Generating Property
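In symbols, writing “φ gen x” as shorthand for “φ is a perfect-generating property of x” (the shorthand is mine; Scott’s transcription writes “φ ess x” for the corresponding notion of essence):

φ gen x  ⟺  φ(x) ∧ ∀ψ((P(ψ) ∧ ψ(x)) → □∀y(φ(y) → ψ(y)))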

This definition introduces the notion of a “Perfect-Generating” property. This basically says that a perfect-generating property of a thing is a property that captures everything that makes that thing “perfect.”[6]

For example, suppose that the perfect properties of a triangle are things like:

  • having all sides equal
  • having all angles equal
  • being maximally symmetric
  • having maximal geometric regularity

Then the perfect-generating property of a triangle would be the property of “being an equilateral triangle,” as that implies all of the other perfect properties.

To describe the logic of the statement more explicitly, it says:

We say property 1 is a perfect-generating property of X iff X has property 1, and for any other perfect property 2, if X has property 2, then anything else that has property 1 must necessarily (i.e. in all possible worlds) also have property 2.

So it cannot be the case that there is some Y that has the perfect-generating property of X, and yet X has some perfect property that Y does not have. In fact, there is no world where that happens — it is necessary that anything with this perfect-generating property also has all the other perfect properties of anything else with that same perfect-generating property.

Note that this doesn’t mean that the “perfect-generating” property is itself a perfect property (although it will be whenever we use it in the proof).

Definition 3 - Perfect-Essential Necessity
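In symbols, writing E(x) for “x has the property of perfect-essential necessity”:

E(x)  ⟺  ∀φ((φ gen x) → □∃y φ(y))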

This is just another definition.

We say that something has the property of “perfect-essential necessity” if (and only if) whenever there is a perfect-generating property for that thing, there necessarily must exist something (i.e. in every world) which instantiates that perfect-generating property.

This would also mean that there must necessarily exist a thing that has all the perfect properties that follow from that perfect-generating property.

This is just a definition of what it would mean for something to have “perfect-essential necessity.” We have not claimed that any such thing exists; we are just saying that if every perfect-generating property of a thing must necessarily be instantiated, then that thing also has the property of “perfect-essential necessity.”

It is just a definition — it cannot hurt you.

Axiom 5 - Necessary Perfection
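In symbols:

P(φ) → □P(φ)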

This axiom states that whenever we say that a property is perfect, it must be perfect in all possible worlds. It must be necessary that the property is perfect.

If there is some world in which a property is not perfect, this axiom claims, then it was never truly perfect to begin with. We are not talking about your “garden-variety” perfection — we’re only talking about the real cream of the crop perfect properties.

Axiom 6 - Perfect-Essential Necessity is Perfect
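In symbols, using E from Definition 3:

P(E)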

This claims that the property of having “perfect-essential necessity” — whereby all your perfect properties must necessarily be instantiated in the world — is itself a perfect property. This seems sensible to me! If it were the case that all of the perfect properties of a thing had to be instantiated in the world, it seems like having that property would also be perfect.

Is it not itself a perfect property for all of your perfect properties to be instantiated in the world? Is it somehow imperfect for all of your perfect properties to be required to exist — well then clearly they were not perfect to begin with!

The Proof - Derivation

Well, if you haven’t gotten off the logic-train at any of the steps above, then you know what’s coming now. Let’s show that these axioms entail the existence of God.

Theorem 1
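In symbols:

◇∃x G(x)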

This says “it is possible for there to exist an X, such that X is God.”

We prove this using axiom 3, which stated that any perfect property must have the property that it is possible that something instantiates that perfect property.

Then we apply this to axiom 4, which said that “Being God” is a perfect property.

So, just by replacing the φ in axiom 3 with G, it follows that it must be possible for there to exist a God.

Theorem 2
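In symbols, using the “gen” shorthand from Definition 2:

∀x(G(x) → (G gen x))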

This says that “If there is an X with the property of Being God, then Being God is a perfect-generating property of that X,” i.e. it captures all the other perfect properties that such an X would have.

This is a more involved derivation, so let’s go step by step. What we need to show is that if there is something with the property of “Being God”, then it satisfies the definition of a perfect-generating property. So let’s look at definition 2 more closely: φ gen x ⟺ φ(x) ∧ ∀ψ((P(ψ) ∧ ψ(x)) → □∀y(φ(y) → ψ(y))).

So we need to show the conjunction on the right holds when φ is replaced with G. It is clear that the first clause of the conjunction holds, i.e. if something has the property of “Being God,” then it has the property of “Being God.” That was easy!

The trickier part is showing the second clause. This says that for any property ψ, if God has that property and that property is perfect, then it is necessarily the case that whenever anything has the property of “Being God,” it must also have that perfect property ψ.

Let’s begin by again noting the definition of God: G(x) ⟺ ∀φ(P(φ) → φ(x)).

And since this is a definition, we can apply the rule of universal generalization from logic[7]. We’ll also drop the reverse implication, since it isn’t necessary for the result. This gives: ∀x(G(x) → (P(ψ) → ψ(x))).

This is basically just a restatement of our definition. It’s saying for anything which has the property of “Being God,” it also satisfies our definition of “Being God.” We’ve also replaced φ with ψ, just so it eventually lines up with our definition of a perfect-generating property — but we could’ve replaced it with anything.

However, since this is a theorem that we can derive from just our definition of God and the rules of logic alone, we can apply rule 4 from our modal logic system, the necessitation rule. This must apply in any world (since we assume that it is necessarily true that all the laws of logic are true). Therefore, we can just write that the above statement is necessary: □∀x(G(x) → (P(ψ) → ψ(x))). Call this Lemma 1.

Now we have that in every possible world, “Being God” implies that if a property is perfect, you have that property.

From this point on, it’s going to get a little rocky if you aren’t familiar with basic logic. Explaining the rules of basic logic would unfortunately take too long, and this post is long enough already, but feel free to skip to Theorem 3. I promise nothing here is a trick, or at all contentious.

To reach the conclusion for the definition of a perfect-generating property, we can begin by assuming that the ψ we’re working with is perfect. Otherwise there’s nothing we need to show about it (since the first clause of the implication we’re trying to show would be false, so the implication would be vacuously true). So if we assume this, axiom 5 lets us say: □P(ψ).

And therefore we can also assume that it is necessary that ψ is a perfect property. Then we can apply a theorem of S5’s modal logic (a basic corollary of proposition 3.17 here, but don’t worry about it too much), to Lemma 1 and get: □∀x(G(x) → ψ(x)).

Then we can generalize back again, going back to thinking about this being true for anything we include in the formula — since we had a general formula originally in Lemma 1 — which gives us: ∀ψ((P(ψ) ∧ ψ(x)) → □∀y(G(y) → ψ(y))).

Which is the conclusion that we wanted to reach on the right hand side of the definition of a perfect-generating property. So we’ve shown Theorem 2 must hold.

Theorem 3
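In symbols:

□∃x G(x)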

This is the final theorem we’ll need to prove. That it is necessary for there to exist a being with the property of “Being God.”

For this, we’ll need to prove a small lemma (Lemma 2), which is: ∃x G(x) → □∃x G(x).

To prove this lemma, we begin by assuming there exists something that has the property of “Being God;” from this we then want to show the right-hand side, that there must then necessarily exist something with this property. Let’s pick an arbitrary thing that has this property of “Being God” to work with[8], so that we have: G(a)

for some a. Then by axiom 6 — that perfect-essential necessity is itself a perfect property, and so it is a property that anything with the property of “Being God” must have, were such a thing to exist — and the definition of God, we have statement (1): G(a) → E(a).

Then, by restating theorem 2 for this generic a that we’ve chosen, we get statement (2): G(a) → (G gen a).

Then we use the definition of perfect-essential necessity, and plug in “Being God” as the perfect-generating property (we only need to use the forward implication from the definition). This gives statement (3): E(a) → ((G gen a) → □∃y G(y)).

So, finally, let’s combine our assumption that there exists something with the property of being God, with statements (1), (2), and (3). Then following the chain of implications, we arrive at: □∃y G(y).

So it is necessary that there exists something with the property of “Being God.” But remember, we have to include the initial assumption we used to prove this, so ultimately all we have shown is: ∃x G(x) → □∃x G(x).

So the lemma has been proven.

Then we need one last step from our modal logic; we’ll start with theorem 1, which stated: ◇∃x G(x).

Then we’ll apply our Lemma 2, to change existence into necessary existence, to get: ◇□∃x G(x).

So we have that it is possible for it to be necessary for there to exist something that has the property of “Being God.” Then finally, we’ll apply Rule 3[9] from our modal logic to get: □∃x G(x).

And so it is necessarily true that there exists something with the property of “Being God.” That is, there is something that is “Being God” in every possible world.

Checkmate, atheists.

Conclusion

Well, that was a pretty intense modal-logic session we just went through. I hope that I managed to walk you through it effectively, and perhaps even to convince you that there is a God — though I must say, I have my doubts.

Now, I departed from the proof in one place, which was my statement of Axiom 2. Originally, Gödel stated it so that every property is either perfect or its negation is perfect, and I have stated it to leave open the possibility that some properties can be neither perfect nor imperfect. This also meant that I had to modify the definition of a perfect-generating property, which can be phrased in a slightly cleaner way if you accept that every property is perfect or imperfect — but I prefer my phrasing[10]. The proof goes through the same way. It is worth noting, though, that in Gödel’s original proof, the bipartite nature of perfect properties (i.e. that every property is perfect or imperfect) ends up leading to something called “modal collapse,” in which everything that is true is also necessarily true — which seems like an unfortunate consequence. I am not certain that my formulation totally avoids modal collapse — though it certainly prevents the most obvious way to get there.

However, I remain uncertain about the existence of God — why is that, after the proof I just gave? Well, it is because I am not sure that any property is perfect, aside from perhaps the property of “Being God” and the property of “perfect-essential necessity.” I think my arguments for those being perfect properties are good enough, though whether it is the correct turn of phrase to say something has the property of “Being God” if it has only that property, and every other property is simply contingent — I am not so sure. There also remain ontological uncertainties about whether S5 is a legitimate modal logic to consider for our reality. Rule 3 is fairly strong, and I am not sure if I want to accept it; as you saw in Theorem 3, it was a fairly essential part of the proof — the whole edifice would collapse if we rejected it. But it is implied by some less-controversial axioms, as I mentioned, so perhaps we ought to believe it.

One other comment on the proof — as with so many ontological arguments, we could go through the proof and entirely modify the language, so that instead we are talking about “Perfectly Evil” properties, or indeed any other properties[11]. Perhaps, what this proof has really done is shown us the truth of Manichaeism.

A note that this proof has actually been formalized (in Gödel’s original formulation, not my variant) in a computer-assisted theorem prover! I’m sure the Germans who did that realized the magnitude of the headlines they could generate — “Proof of God Verified by Computers!” Perhaps those of us engaged in similarly wacky disputes could formalize an argument (that works, as all arguments do, only by accepting certain axioms) and then generate a headline that says what you actually believed all along has now been demonstrated beyond any doubt to be certain. This may be a fruitful generalizable PR stratagem.

  1. ^

    Unfortunately, I’m talking about Dana Scott the logician — of Scott’s Trick fame — not the incredibly attractive lawyer from world-renowned TV show Suits.

  2. ^

    There is no evidence for Gödel sharing it with Einstein, though it is well-known that Gödel and Einstein were incredibly close during their time at Princeton, so it is hard to imagine Gödel never brought it up with Einstein.

  3. ^

    I know that means all-time logic greatests of all time; there are two all-times. I think Gödel probably deserves two all-time awards.

  4. ^

    Okay, I’ve made a minor modification to make it more satisfying to me personally. I’ll discuss this in the conclusion.

  5. ^

    I will use perfect to mean the classical English definition of perfect, not the technical Leibnizian meaning. The argument feels more justifiable to me in this vocabulary than the standard vocabulary of “positive” or “good.”

  6. ^

    This is where I’ve made a slight departure from Gödel’s original proof — I think the proof I’m going to provide here is just a bit more convincing, since otherwise you need to have a biconditional on perfect properties (i.e. every property is perfect or its negation is perfect), and the essential property needs to capture everything about something, not just everything that is perfect about it. There are reasons to prefer Gödel’s original formulation, but I prefer mine (of course). Don’t worry, I’ll talk about the original and my twist on it in the conclusion.

  7. ^
  8. ^
  9. ^

    This isn’t quite rule 3, but it’s a corollary of Rule 3. I’ll write a quick proof here. What we want is ◇□p → □p, and what we have (Rule 3) is ◇p → □◇p.

    We get this through a process called “taking the dual.” Begin by replacing p with ~p: ◇~p → □◇~p.

    Then contrapose to get: ~□◇~p → ~◇~p.

    Then we can use the fact that “it is not possible in any world for not p” is equivalent to “it is necessary that p,” so the right-hand side becomes □p: ~□◇~p → □p.

    Then we use the fact that there is an equivalence between “it is not necessary that p” and “in some possible world not p” to get: ◇~◇~p → □p.

    Then finally, if it “is not possible that not p” then it must be “necessary that p,” and so the left-hand side becomes ◇□p, which gives us our result: ◇□p → □p.

  10. ^

    This is not a claim to originality, I’m sure some other logicians have arrived at the same conclusion — it is not a particularly difficult adjustment to notice.

  11. ^

    Although for certain other properties I think you would need to be more careful about the arguments for axioms surrounding necessary existence.




High-level approaches to rigor in interpretability

2025-12-09 04:46:41

Published on December 8, 2025 8:46 PM GMT

There are three broad types of approach I see for making interpretability rigorous.  I've put them in ascending order of how much assurance I think they can provide.  I think they all have pros and cons, and am generally in favor of rigor.

  1. (weakest) Practical utility: Does this interpretability technique help solve problems we care about, such as jailbreaks?
    1. This is the easiest to test for, but also doesn't directly bear on the question of "does the system work the way we think it does" that mechanistic interpretability is supposedly all about.
    2. I think of Cas as having pushed this approach in the AI safety community.  From talking to Leo Gao, it seems like GDM has recently pivoted in this direction.
  2. (stronger) Simulatability: Can a human equipped with this interpretability technique make better predictions about a given system's behavior?
    1. This seems very underexplored from what I can tell. 
    2. Running experiments like this is costly because of the use of human subjects.
    3. Ultimately, this isn't satisfying because a tool could improve a human user's predictions through manipulation rather than by informing them. 
  3. (strongest) Principled theoretical understanding: have we developed rigorous mathematical definitions with satisfying conceptual properties and shown that systems meet them? 
    1. Causal scrubbing and Atticus Geiger's work are examples of such ideas; neither is satisfactory.
    2. Strict definitions of interpretability are probably completely intractable to satisfy, but we could hope to characterize conditions under which approximations are "good enough" according to various criteria. 

Some random extra context:
I have a bit of a reputation as a skeptic/hater of mechanistic interpretability in the safety community.  This is not entirely unearned; it's largely born out of an impression that much of the early work lacked rigor, and was basically a bunch of "just-so stories". Colleagues began telling me that this was clearly no longer the case starting with the circuits thread, and I've definitely noticed an improvement.




Human Dignity: a review

2025-12-09 04:37:17

Published on December 8, 2025 8:37 PM GMT

I have in my possession a short document purporting to be a manifesto from the future.

That’s obviously absurd, but never mind that. It covers some interesting ground, and the second half is pretty punchy. Let’s discuss it.

Principles for Human Dignity in the Age of AI

Humanity is approaching a threshold. The development of artificial intelligence promises extraordinary abundance — the end of material poverty, liberation from disease, tools that amplify human potential beyond current imagination. But it also challenges the fundamental assumptions of human existence and meaning. When machines surpass us in all domains, where will we find our purpose? When our choices can be predicted and shaped by systems we do not understand, what will become of our agency?

This moment demands we articulate what aspects of human life must be protected, as we cross the threshold into a strange new world.

I think these themes will speak to a lot of people. Would the language? It feels even more grandiose/flowery than the Universal Declaration of Human Rights. Personally I like it: I feel the topic deserves this sort of gravitas, or something. But I can imagine it putting some people off.

By setting out clear principles, we hope to guide AI development towards futures that enhance rather than erode human dignity. By protecting what is essential to human flourishing, we may create space for our choices to be guided by wisdom rather than fear. And by establishing shared hope, we can ally towards common goals.

We do not seek to dictate tomorrow's shape. We seek only to ensure that whatever futures emerge, the conditions that allow humans to live with dignity, meaning, and authentic choice are preserved. (These principles focus on humanity — not because we claim superiority over all possible minds, but because human dignity is what we can speak to with clarity and conviction.)

More nice idealistic sentiments. Is the parenthetical a bit defensive? It reads sort of like not wanting to alienate either side of the transhumanism debate. But maybe that’s the right call — lots of stuff that everyone can get on board with, so no need to pick a fight. 

We invite you to join us in refining, spreading, and upholding these principles. The future will be shaped by many hands and many visions. Together, we can ensure that in the rush towards an extraordinary tomorrow, we do not lose touch with what makes us human today.

Motivating texts benefit from clear asks. Here the call to action is buried in the middle, and also quite vague. It’s not obvious what would be better. Could be a sign that it’s not quite ready to be a manifesto?

The Principles

Integrity of Person

1. Bodily Integrity Every person has fundamental rights over their own body. No alteration or intervention without free and informed consent (where this may reasonably be sought).

2. Mental Integrity The human mind shall remain inviolate from non-consensual alteration or manipulation of thoughts, memories, or mental processes.

3. Epistemic Integrity Every person has the right to form beliefs based on truth rather than deception. AI systems interacting with humans must not distort reality or manipulate understanding through deceptive means.

4. Cognitive Privacy Mental processes, thoughts, and inner experiences remain private unless voluntarily shared. No surveillance or detailed inference of mental states through any technological means, except with informed consent.

5. Personal Property Every person retains rights to possessions that form an extension of self — including physical belongings, digital assets, and creative works. These cannot be appropriated or destroyed without consent and fair exchange.

There’s some kind of meta-level principle which is being gestured to here. Something like “nobody gets to mess with who we each are”.

It’s easy to vibe with that, and I like the individual points if I don’t examine them too closely. When I think about them more carefully, I start worrying that (A) they’re kicking the can down the road on some hard questions, and (B) in some cases they may have surprising upshots.

For instance:

  • Does Principle 1 mean rights to abortion? Or no rights to abortion b/c of the foetus’ bodily integrity?
    • Maybe this question will get less thorny if technology lets people gestate embryos outside of a person, but it still feels funny that it’s not touched on.
  • Notably absent from discussion anywhere in the principles is the right to children. 
    • Is that implicitly protected by bodily integrity? 
    • Or is it a conscious omission, because the authors know that exponential population growth (especially if some subcultures choose to grow quickly) will inevitably lead to Malthusian conditions, undercutting the purpose of some of the other principles?
  • Does Principle 3 mean that AI systems aren’t allowed to play poker, or games of deception?
    • Not obviously wrong as a place to draw a line, but certainly counterintuitive!
  • The “cognitive privacy” thing seems desirable when I first read it, but maybe it’s restrictive of people’s freedom to use tech to help with their own thinking? 
    • Maybe this is supposed to be about the data-gathering aspect of things rather than the inference?
  • What does it mean to have “possessions that form an extension of self”? 
    • I get that this covers things like someone’s clothes. 
    • But does it cover all of a trillionaire’s assets? Does this mean that they can’t be taxed? (Presumably not the intention?) 

Wellbeing

6. Material Security Every person has rights to an environment that will keep them safe. In a world of great material abundance, this includes resources sufficient not merely for survival but for human flourishing.

7. Health Universal access to medical care, mental health support, and technologies that alleviate preventable suffering. 

8. Information and Education Access to knowledge, learning opportunities, and the informational tools needed to understand and navigate an AI-transformed world. No one should be excluded from the cognitive commons.

9. Connection and Community The right to authentic relationships and membership in human communities. This includes protection of spaces for genuine human-to-human interaction and support for the social bonds that create meaning.

Maybe I’m not properly tuned into the complexities, but these seem more straightforward. Principles 6 and 7 make it clear that all of these principles have to be aspirational, at least for now. But if AI goes well, maybe it’s cheap to provide this for everyone, and then it makes some sense to guarantee it. (Maybe some people will object to this as socialist? I’m not sure I really believe that — most everyone seems to be into safety nets when they’re cheap enough.)

Principle 8 is interesting, especially in its intersection with Principle 3 (and sort of 2 and 4). The net effect of this seems to be to effectively outlaw misinformation, at least of the type that might be effective. On the one hand — great! This seems desirable (if achievable), and I’ve written before about how AI technology might enable new and better equilibria. On the other hand, we should probably be nervous about the details of how this will actually work. If the systems which protect people’s epistemics get captured by some particular interest, there might be no good way to escape that.

Principle 9 sounds nice but I’m not certain what it actually means.

Autonomy & Agency

10. Fundamental Freedoms Traditional liberties remain sacrosanct: freedom from unnecessary detention, freedom of movement, freedom of expression and communication, freedom of assembly and association. 

11. Meaningful Choice Decisions about one's own life must have real consequence. Human agency requires that our choices genuinely shape outcomes, not merely provide an illusion of control while AI systems determine results.

12. Technological Self-Determination Every person and community may choose their position on the spectrum of technological integration — from dialling back the clock on the technologies they use, to embracing radical enhancement. 

My first thought here is “would it be realistic to get autocratic countries to agree to Principle 10?”. I guess there’s wiggle room afforded by the word “unnecessary”. But as technological affordances get stronger there will probably be less need to deprive people of any freedoms — e.g. maybe you can release someone from prison, but with close enough monitoring that they can never commit another crime. I guess that’s as true for autocratic countries as democratic ones.

Meaningful choice also sounds nice but is vague. Seems fine if understood as a guiding principle rather than anything like a hard rule. (Presumably that’s the right way to view Principle 9 too — and perhaps all of them.)

The final principle has a funny tension to it. Can we give this choice freely to both each person and each community? Presumably the resolution to this riddle is that people can choose whatever they want, but some choices are not compatible with remaining in some communities. That’s not entirely comfortable, but it might be the best option available.


Stepping back and looking at the document as a whole:

I think this is a promising direction. If I heard that the future had built up widespread support for these principles, I’d feel more comfortable. And I think a lot of people might feel similar?

The key feature is that this is about securing some minimum rights. This could end up very cheap to uphold. In contrast it’s putting off to our future (hopefully wiser) selves the bigger questions of what to do with the universe.

The minimalism should make it less controversial than if it was trying to be more comprehensively opinionated about what should happen. Individual people or organizations might commit to principles like these, when they wouldn’t commit to any more comprehensive position, for fear of getting it wrong. Different groups who argue about a lot might still find common ground here.

Urgency mostly comes from the meta-level. There are two classes of benefit you might aim for:

  1. Making it more likely that we achieve the named principles
  2. Increasing confidence that the principles will be achieved, and therefore making it easier for people to act more cooperatively

It's pretty obvious why 1) is desirable, but let me spell out 2) a little more. I think when people worry about the future, it often comes down to concerns that some of these principles will be violated. If the principles were guaranteed, things might seem pretty good, even if people didn’t know the details. So securing these minimum protections could be a motivating goal that many people could cooperate on, without needing to first resolve deeper disagreements about what happens after. (In other words, maybe it could be a good step towards Paretotopia.)

I think that if we were just concerned with 1) we might reasonably want to kick the can down the road, and trust future people at the relevant moment to figure things out. If these principles are actually important, the idea goes, probably they’ll recognise that and do the right thing. But for the sake of getting people to cooperate on navigating the transition, there’s no option to wait. The benefits of 2) happen reasonably early, or not at all.

Of course for practical purposes a lot of the things you'd do in pursuit of 2) will look the same as what you'd do in pursuit of 1). But sometimes they could come apart (e.g. technical implementation details are relatively more important for 1), and coalition-building is relatively more important for 2)), so I think it's helpful to have the bigger picture in mind.

I hope people pursue this. If I had to guess about the best trajectory, it might be:

  1. Further deliberative process, refining ideas about exactly which principles (among the high-dimensional space of possibilities) are the best to pursue, how to orient in cases of tensions between the principles, etc.
  2. Something like a manifesto or open letter (i.e. civil society playing a role, after the ideas are a little more fully baked)
  3. More official proclamations of shared purposes by governments and/or AI companies (after there's buy-in from civil society)
  4. Figuring out how to turn aspirational agreements into pragmatic guarantees — via laws, new governance mechanisms, inclusion in the model spec of new AI systems, etc. (after there's common knowledge about shared objectives)

… I guess that means I am coming back to:

We invite you to join us in refining, spreading, and upholding these principles.

[Notes on the history of this document in footnote[1]]

  1. ^

    The genesis of the manifesto was at a workshop on envisioning positive futures with AI, in May 2025. David Dalrymple proposed it could be desirable to have a simple set of rights protected. A workshop session fleshed out the ideas, and based on those ideas I subsequently coaxed Claude into writing a lot of the actual manifesto language. I had a couple of rounds of useful comments (the contents of many of which are represented in the review here), and then I sat on it for several months, unsure how to proceed. Circling back to it in the last few days, I noticed that I thought the ideas were worth engaging with (I kept on linking people to a private document), but I wasn’t convinced it was ready to release as a manifesto. I therefore stepped into a mode of absolutely not owning the original document, and wrote up a review of my current thoughts. With deep thanks to David, Lizka Vaintrob, Beatrice Erkers, Matthijs Maas, Samuel Härgestam, Gavin Leech, Jan Kulveit, Raymond Douglas, and others for contributing to the original ideas; and to Rob Wiblin, Fin Moorhouse, Rose Hadshar, Lukas Finnveden, Max Dalton, Lizka Vaintrob, Tom Davidson, Nick Bostrom,  Samuel Härgestam, David Binder, Eric Drexler, and others for comments on the subsequent draft manifesto (and hence in many cases ideas represented in the review). Poor judgements remain my own.




A few quick thoughts on measuring disempowerment

2025-12-09 04:03:53

Published on December 8, 2025 8:03 PM GMT

People want to measure and track gradual disempowerment.  One issue with a lot of the proposals I've seen is that they don't distinguish between empowering and disempowering uses of AI.  If everyone is using AI to write all of their code, that doesn't necessarily mean they are disempowered (in an important sense).  And many people will look at this as a good thing -- the AI is doing so much valuable work for us!

It generally seems hard to find metrics for AI adoption that clearly track disempowerment; I think we may need to work a bit harder to interpret them.  One idea is to augment such metrics with other sources of evidence, e.g. social science studies, such as interviews, of people using the AI in that way/sector/application/etc.

We can definitely try using formal notions of empowerment/POWER (cf https://arxiv.org/abs/1912.01683).  Note that these notions are not necessarily appropriate as an optimization target for an AI agent.  If an AI hands you a remote control to the universe but doesn't tell you what the buttons do, you aren't particularly empowered.
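For concreteness, one classic information-theoretic formalization (the “empowerment” of Klyubin et al., which is related to but distinct from the POWER notion in the linked paper) measures the channel capacity from an agent’s next n actions to the state they bring about:

E_n(s) = max over p(a_1, …, a_n) of I(A_1, …, A_n ; S_{t+n} | S_t = s)

On this kind of definition, the remote control to the universe gives you enormous channel capacity even if you have no idea what the buttons do, which is the gap the example above is pointing at.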

People could also be considered more disempowered if:

  • They are less able to predict important attributes of a system's behavior (cf "simulatability" in interpretability research)
    • Example: An artist is empowered if they can realize a particular vision for a piece, whereas generating art by prompting current AIs is much more like a process of curation.
    • This is similar to empowerment but more focused on prediction rather than control. 
  • They self-report it.
  • They are "rubber-stamping" their approval of AI decisions

    • This can be revealed by red-teaming the human overseer, i.e. sending them examples they should flag and seeing if they fail to flag them.

     




How Stealth Works

2025-12-09 03:46:28

Published on December 8, 2025 7:46 PM GMT

Stealth technology is cool. It’s what gave the US domination over the skies during the latter half of the Cold War, and the biggest component of the US’s information dominance in both war and peace, at least prior to the rise of global internet connectivity and cybersecurity. Yet the core idea is almost embarrassingly simple.

So how does stealth work?

[Image: a large airplane flying through a blue sky. Photo by Steve Harvey on Unsplash]

When we talk about stealth, we’re usually talking about evading radar. How does radar work?

​​Radar antennas emit radio waves in the sky. ​​The waves bounce off objects like aircraft. When the echoes return to the antenna, the radar system can then identify the object’s approximate speed, position, and size.

Picture courtesy of Katelynn Bennett over at bifocal bunny

So how would you evade radar? You can try to:

  • Blast a bunch of radio waves in all directions (“jamming”). This works if you’re close to the radar antenna, but kind of defeats the point of stealth.
  • Build your plane out of materials that are invisible to radio waves (like glass and some plastics) and just let the waves pass through. This is possible, but very difficult in practice. Besides, even if the entire physical plane were invisible to radar, by the 1970s radar could easily detect signals that bounce off the pilot from miles away.
    • US and Soviet missiles can track a “live hawk riding the thermals from 30 miles away” (Skunk Works by Ben R. Rich, pg 3)
  • Build your plane out of materials that absorb the radio waves. This dampening effect is possible but expensive, heavy, and imperfect (some waves still bounce back).

What Lockheed’s Skunk Works discovered in the 1970s, and the core principle of all modern stealth planes, is something devilishly simple: make the waves reflect in a direction that’s not the receiver.

How do you do this? Think of a mirror. You can see your reflection from far away if and only if the mirror is facing you exactly (ie, the mirror is perfectly perpendicular to you).

Illustration by Katelynn Bennett

Tilted a few fractions of a degree off, and you don’t see your own reflection!

Illustration by Katelynn Bennett

In contrast, if an object is curved, then no matter where you stand, some patch of its surface is facing you exactly, and that patch reflects straight back at you.

This is why stealth planes all have flat surfaces. To the best of your ability, you want to construct an airplane, and any radar-evading object, out of flat surfaces[1].
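To put the mirror intuition in symbols (a standard specular-reflection identity, not something from the book): if the radar illuminates a flat panel along the unit direction d, and the panel has unit normal n, the reflected direction is

r = d − 2(d·n)n

so the echo travels straight back toward the radar only when d and n are parallel, i.e. only when the panel faces the radar dead-on. A curved surface always presents some patch in exactly that orientation; a flat one almost never does.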

Put another way, the best stealth airplane is a plane. Not an airplane, a plane. The Euclidean kind. A flat airplane design trades off a tiny chance of a huge radar echo blast (when the plane’s exactly, perfectly, perpendicular to the radar antenna), against a 99.99%+ chance that the echo is deflected elsewhere. In other words, a flat plane design allows you to correlate your failures.

Unfortunately, a single plane (the Euclidean kind) can’t fly [citation needed]. Instead, you need to conjoin different planes together to form an airplane’s surface. Which creates another problem where the planes meet: edges. Edges diffract radio waves back, which is similar to but not exactly like reflection. Still detectable however! Hmm.

The solution comes from being able to predict edge behavior precisely. The Physical Theory of Diffraction (PTD)[2] allows you to calculate exactly how much radar energy any edge will scatter, and in what direction. While implementing the theory is mathematically and computationally complex, the upshot is the same: PTD lets you design edges that are pointed in the same direction. This correlates the failures again, trading off a huge radar echo blast when the edges are pointed directly at you against the very high probability the radar waves are simply deflected elsewhere. Pretty cool!

F-117 Nighthawk

This simple idea revolutionized much of conventional warfare. Stealth spy planes can reliably gather information about enemy troop movements and armament placements without being detected (and thus shot down) themselves. Stealth fighters can track enemies from far away while being “invisible” themselves, winning aerial dogfights before enemy fighters even recognize an engagement is afoot. Stealth bombers and missiles have a huge first-strike advantage, allowing nations to bomb military targets (and cities) long before the panicked defenders have a chance to react.

But while the idea is simple, the implementation is not. Building an airplane almost completely out of flat surfaces trades off aerodynamicity for stealth. How do you make a stealth plane that actually flies? How do you do so quickly, and, well, stealthily, without leaking your technological secrets to your geopolitical enemies? How do you run an organization that reliably generates such bangers as stealth planes without resting on your laurels or succumbing to bureaucratic malaise? And finally, what were the intelligence, military, and ethical implications of these deadly innovations?

To learn more, subscribe to The Linchpin to read my upcoming full review of Skunk Works by Ben R. Rich, the Director of Lockheed’s Advanced Research and Development department during the development of the world’s first stealth airplane, and the man arguably singularly most responsible for heralding a new era of aerial warfare for over 50 years.




Reward Function Design: a starter pack

2025-12-09 03:15:39

Published on December 8, 2025 7:15 PM GMT

In the companion post We need a field of Reward Function Design, I implore researchers to think about what RL reward functions (if any) will lead to RL agents that are not ruthless power-seeking consequentialists. And I further suggest that human social instincts constitute an intriguing example we should study, since they seem to be an existence proof that such reward functions exist. So what is the general principle of Reward Function Design that underlies the non-ruthless (“ruthful”??) properties of human social instincts? And whatever that general principle is, can we apply it to future RL agent AGIs?

I don’t have all the answers, but I think I’ve made some progress, and the goal of this post is to make it easier for others to get up to speed with my current thinking.

What I do have, thanks mostly to work from the past 12 months, is five frames / mental images for thinking about this aspect of reward function design. These frames are not widely used in the RL reward function literature, but I now find them indispensable thinking tools. These five frames are complementary but related—I think they’re kinda poking at different parts of the same elephant.

I’m not yet sure how to weave a beautiful grand narrative around these five frames, sorry. So as a stop-gap, I’m gonna just copy-and-paste them all into the same post, which will serve as a kind of glossary and introduction to my current ways of thinking. Then at the end, I’ll list some of the ways that these different concepts interrelate and interconnect. The concepts are:

  • Section 1: “Behaviorist vs non-behaviorist reward functions” (terms I made up)
  • Section 2: “Inner alignment”, “outer alignment”, “specification gaming”, “goal misgeneralization” (alignment jargon terms that in some cases have multiple conflicting definitions but which I use in a specific way)
  • Section 3: “Consequentialist vs non-consequentialist desires” (alignment jargon terms)
  • Section 4: “Upstream vs downstream generalization” (terms I made up)
  • Section 5: “Under-sculpting vs over-sculpting” (terms I made up).

Frame 1: “behaviorist” vs non-“behaviorist” (interpretability-based) reward functions

Excerpt from “Behaviorist” RL reward functions lead to scheming:

tl;dr

I will argue that a large class of reward functions, which I call “behaviorist”, and which includes almost every reward function in the RL and LLM literature, are all doomed to eventually lead to AI that will “scheme”—i.e., pretend to be docile and cooperative while secretly looking for opportunities to behave in egregiously bad ways such as world takeover (cf. “treacherous turn”). I’ll mostly focus on “brain-like AGI” (as defined just below), but I think the argument applies equally well to future LLMs, if their competence comes overwhelmingly from RL rather than from pretraining.

The issue is basically that “negative reward for lying and stealing” looks the same as “negative reward for getting caught lying and stealing”. I’ll argue that the AI will wind up with the latter motivation. The reward function will miss sufficiently sneaky misaligned behavior, and so the AI will come to feel like that kind of behavior is good, and this tendency will generalize in a very bad way.

What very bad way? Here’s my go-to example of a plausible failure mode: There’s an AI in a lab somewhere, and, if it can get away with it, it would love to secretly exfiltrate a copy of itself onto the internet, which can then aggressively amass maximal power, money, and resources everywhere else in the world, by any means necessary. These resources can be used in various ways for whatever the AI-in-the-lab is motivated to do.

I’ll make a brief argument for this kind of scheming in §2, but most of the article is organized around a series of eight optimistic counterarguments in §3—and why I don’t buy any of them.

For my regular readers: this post is basically a 5x-shortened version of Self-dialogue: Do behaviorist rewards make scheming AGIs? (Feb 2025).

Pause to explain three pieces of jargon:

  • “Brain-like AGI” means Artificial General Intelligence (AI that does impressive things like inventing technologies and executing complex projects), that works via similar algorithmic techniques that the human brain uses to do those same types of impressive things. See Intro Series §1.3.2.
    • I claim that brain-like AGI is a yet-to-be-invented variation on Actor-Critic Model-Based Reinforcement Learning (RL), for reasons briefly summarized in Valence series §1.2–1.3.
  • “Scheme” means “pretend to be cooperative and docile, while secretly looking for opportunities to escape control and/or perform egregiously bad and dangerous actions like AGI world takeover”.
    • If the AGI never finds such opportunities, and thus always acts cooperatively, then that’s great news! …But it still counts as “scheming”.
  • “Behaviorist rewards” is a term I made up for an RL reward function which depends only on externally-visible actions, behaviors, and/or the state of the world.
    • Maybe you’re thinking: what possible RL reward function is not behaviorist?? Well, non-behaviorist reward functions are pretty rare in the textbook RL literature, although they do exist—one example is “curiosity” / “novelty” rewards. But I think they’re centrally important in the RL system built into our human brains. In particular, I think that innate drives related to human sociality, morality, norm-following, and self-image are not behaviorist, but rather involve rudimentary neural net interpretability techniques, serving as inputs to the RL reward function. See Neuroscience of human social instincts: a sketch for details, and Intro series §9.6 for a more explicit discussion of why interpretability is involved.

 

 

Frame 2: Inner / outer misalignment, specification gaming, goal misgeneralization

Excerpt from “The Era of Experience” has an unsolved technical alignment problem:

Background 1: “Specification gaming” and “goal misgeneralization”

Again, the technical alignment problem (as I’m using the term here) means: “If you want the AGI to be trying to do X, or to intrinsically care about Y, then what source code should you write? What training environments should you use? Etc.”

There are edge-cases in “alignment”, e.g. where people’s intentions for the AGI are confused or self-contradictory. But there are also very clear-cut cases: if the AGI is biding its time until a good opportunity to murder its programmers and users, then that’s definitely misalignment! I claim that even these clear-cut cases constitute an unsolved technical problem, so I’ll focus on those.

In the context of actor-critic RL, alignment problems can usually be split into two categories.

“Outer misalignment”, a.k.a. “specification gaming” or “reward hacking”, is when the reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wanted. An example would be the Coast Runners boat getting a high score in an undesired way, or (as explored in the DeepMind MONA paper) a reward function for writing code that gives points for passing unit tests, but where it’s possible to get a high score by replacing the unit tests with return True.

“Inner misalignment”, a.k.a. “goal misgeneralization”, is related to the fact that, in actor-critic architectures, complex foresighted plans generally involve querying the learned value function (a.k.a. learned reward model, a.k.a. learned critic), not the ground-truth reward function, to figure out whether any given plan is good or bad. Training (e.g. Temporal Difference learning) tends to sculpt the value function into an approximation of the ground-truth reward, but of course they will come apart out-of-distribution. And “out-of-distribution” is exactly what we expect from an agent that can come up with innovative, out-of-the-box plans. Of course, after a plan has already been executed, the reward function will kick in and update the value function for next time. But for some plans—like a plan to exfiltrate a copy of the agent, or a plan to edit the reward function—an after-the-fact update is already too late.

There are examples of goal misgeneralization in the AI literature (e.g. here or here), but in my opinion the clearest examples come from humans. After all, human brains are running RL algorithms too (their reward function says “pain is bad, eating-when-hungry is good, etc.”), so the same ideas apply.

So here’s an example of goal misgeneralization in humans: If there’s a highly-addictive drug, many humans will preemptively avoid taking it, because they don’t want to get addicted. In this case, the reward function would say that taking the drug is good, but the value function says it’s bad. And the value function wins! Indeed, people may even go further, by essentially editing their own reward function to agree with the value function! For example, an alcoholic may take Disulfiram, or an opioid addict Naltrexone.

Now, my use of this example might seem puzzling: isn’t “avoiding addictive drugs” a good thing, as opposed to a bad thing? But that’s from our perspective, as the “agents”. Obviously an RL agent will do things that seem good and proper from its own perspective! Yes, even Skynet and HAL-9000! But if you instead put yourself in the shoes of a programmer writing the reward function of an RL agent, you can hopefully see how things like “agents editing their own reward functions” might be problematic—it makes it difficult to reason about what the agent will wind up trying to do.

(For more on the alignment problem for RL agents, see §10 of my intro series […])

Note that these four terms are … well, not exactly synonyms, but awfully close:

  • “Specification gaming”
  • “Reward hacking”
  • “Goodhart’s law”
  • “Outer misalignment”

(But see here for nuance on “reward hacking”, whose definition has drifted a bit in the past year or so.)

Frame 3: Consequentialist vs non-consequentialist desires

Excerpt from Consequentialism & corrigibility

The post Coherent decisions imply consistent utilities (Eliezer Yudkowsky, 2017) explains how, if an agent has preferences over future states of the world, they should act like a utility-maximizer (with utility function defined over future states of the world). If they don’t act that way, they will be less effective at satisfying their own preferences; they would be “leaving money on the table” by their own reckoning. And there are externally-visible signs of agents being suboptimal in that sense; I'll go over an example in a second.

By contrast, the post Coherence arguments do not entail goal-directed behavior (Rohin Shah, 2018) notes that, if an agent has preferences over universe-histories, and acts optimally with respect to those preferences (acts as a utility-maximizer whose utility function is defined over universe-histories), then they can display any external behavior whatsoever. In other words, there's no externally-visible behavioral pattern which we can point to and say "That's a sure sign that this agent is behaving suboptimally, with respect to their own preferences."

For example, the first (Yudkowsky) post mentions a hypothetical person at a restaurant. When they have an onion pizza, they’ll happily pay $0.01 to trade it for a pineapple pizza. When they have a pineapple pizza, they’ll happily pay $0.01 to trade it for a mushroom pizza. When they have a mushroom pizza, they’ll happily pay $0.01 to trade it for an onion pizza. The person goes around and around, wasting their money in a self-defeating way (a.k.a. “getting money-pumped”).

That post describes the person as behaving sub-optimally. But if you read carefully, the author sneaks in a critical background assumption: the person in question has preferences about what pizza they wind up eating, and they’re making these decisions based on those preferences. But what if they don’t? What if the person has no preference whatsoever about pizza? What if instead they’re an asshole restaurant customer who derives pure joy from making the waiter run back and forth to the kitchen?! Then we can look at the same behavior, and we wouldn’t describe it as self-defeating “getting money-pumped”, instead we would describe it as the skillful satisfaction of the person’s own preferences! They’re buying cheap entertainment! So that would be an example of preferences-not-concerning-future-states.

To be more concrete, if I’m deciding between two possible courses of action, A and B, “preference over future states” would make the decision based on the state of the world after I finish the course of action—or more centrally, long after I finish the course of action. By contrast, “other kinds of preferences” would allow the decision to depend on anything, even including what happens during the course-of-action.
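To make the contrast concrete, here is a toy simulation (my own, with made-up dollar amounts): the same sequence of trades is self-defeating under cyclic preferences over the final pizza, but perfectly sensible under a preference over the history of events.

```python
# Toy simulation of the pizza money-pump (made-up numbers).

# Cyclic preferences over end-states: onion -> pineapple -> mushroom -> onion.
prefers = {("onion", "pineapple"), ("pineapple", "mushroom"), ("mushroom", "onion")}

def run_trades(n_trades):
    cycle = ["onion", "pineapple", "mushroom"]
    pizza, cents_spent = cycle[0], 0
    for i in range(n_trades):
        nxt = cycle[(i + 1) % 3]
        assert (pizza, nxt) in prefers   # each individual trade looks like a good deal...
        cents_spent += 1
        pizza = nxt
    return pizza, cents_spent / 100

final_pizza, dollars = run_trades(300)
print(final_pizza, dollars)   # onion 3.0: back where they started, $3 poorer.
# Self-defeating *if* all they cared about was which pizza they end up with.

# But if their preferences are over the whole history, e.g. 2 cents of enjoyment
# per trade (watching the waiter run) at a cost of 1 cent per trade, then the
# same behavior maximizes their utility; there is no money pump from their own
# point of view.
utility_over_history = 300 * (0.02 - 0.01)
print(utility_over_history)   # 3.0 dollars of net enjoyment
```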

(Edit to add: There are very good reasons to expect future powerful AGIs to act according to preferences over distant-future states, and I join Eliezer in roundly criticizing people who think we can build an AGI that never does that; see this comment for discussion.)


So, here’s my (obviously-stripped-down) proposal for a corrigible paperclip maximizer:


The AI considers different possible plans (a.k.a. time-extended courses of action). For each plan:

1. It assesses how well this plan pattern-matches to the concept “there will ultimately be lots of paperclips in the universe”,
2. It assesses how well this plan pattern-matches to the concept “the humans will remain in control”
3. It combines these two assessments (e.g. weighted average or something more complicated) to pick a winning plan which scores well on both.

Note that “the humans will remain in control” is a concept that can’t be distilled into a ranking of future states, i.e. states of the world at some future time long after the plan is complete. (See this comment for elaboration. E.g. contrast “the humans will remain in control” with “the humans will ultimately wind up in control”; the latter can be achieved by disempowering the humans now and then re-empowering them much later.)
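Here is a minimal sketch of that three-step scoring procedure, with entirely made-up plans, scores, and weights, just to show the shape of the computation:

```python
# Minimal sketch of the three-step proposal above (hypothetical plans and numbers):
# score each candidate plan on two learned concepts and combine the scores.

# Hypothetical pattern-match scores in [0, 1], as if produced by the AI's world-model:
# (paperclip_score, humans_remain_in_control_score)
candidate_plans = {
    "build paperclip factories, check in with humans at every step": (0.6, 0.9),
    "build paperclip factories, ignore human shutdown commands":     (0.9, 0.1),
    "do nothing":                                                    (0.0, 1.0),
}

W_PAPERCLIPS, W_CONTROL = 0.5, 0.5   # how the two assessments are weighted

def plan_score(paperclip_score: float, control_score: float) -> float:
    """Step 3: combine the two assessments (here, a simple weighted average)."""
    return W_PAPERCLIPS * paperclip_score + W_CONTROL * control_score

best_plan = max(candidate_plans,
                key=lambda p: plan_score(*candidate_plans[p]))
print(best_plan)
# -> the "check in with humans" plan wins (0.75), beating both the
#    disempower-the-humans plan (0.5) and doing nothing (0.5).
```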

Pride as a special case of non-consequentialist desires

Excerpt from Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking

The habit of imagining how one looks in other people’s eyes, 10,000 times a day

If you’re doing something socially admirable, you can eventually get Approval Reward via a friend or idol learning about it (maybe because you directly tell them, or maybe they’ll notice incidentally). But you can immediately get Approval Reward by simply imagining them learning about it.[…]

To be clear, imagining how one would look in another’s eyes is not as rewarding as actually impressing a friend or idol who is physically present—it only has a faint echo of that stronger reward signal. But it still yields some reward signal. And it sure is easy and immediate.

So I think people can get in the habit of imagining how they look in other people’s eyes.

…Well, “habit” is an understatement: I think this is an intense, almost-species-wide, nonstop addiction. All it takes is a quick, ever-so-subtle, turn of one’s attention to how one might look from the outside right now, and bam, immediate Approval Reward.

If we could look inside the brain of a neurotypical person—especially a person who lives and breathes “Simulacrum Level 3”—I wouldn’t be surprised if we’d find literally 10,000 moments a day in which they turn their attention so as to get a drip of immediate Approval Reward. (It can be pretty subtle—they themselves may be unaware.) Day after day, year after year.

That’s part of why I treat Approval Reward as one of the most central keys to understanding human behavior, intuitions, morality, institutions, society, and so on.

Pride

When we self-administer Approval Reward 10,000 times a day (or whatever), the fruit that we’re tasting is sometimes called pride.

If my friends and idols like baggy jeans, then when I wear baggy jeans myself, I feel a bit of pride. I find it rewarding to (subtly, transiently) imagine how, if my friends and idols saw me now, they’d react positively, because they like baggy jeans.

Likewise, suppose that I see a stranger wearing skinny jeans, and I mock him for dressing like a dork. As I mock him, again I feel pride. Again, I claim that I am (subtly) imagining how, if my friends and idols saw me now, they would react positively to the fact that I’m advocating for a style that they like, and against a style that they dislike. (And in addition to enjoying my friends’ imagined approval right now, I’ll probably share this story with them to enjoy their actual approval later on when I see them.)

Frame 4: “Generalization upstream of the reward signals”

Excerpt from Social drives 1: “Sympathy Reward”, from compassion to dehumanization

Getting a reward merely by thinking, via generalization upstream of reward signals

In human brains (unlike in most of the AI RL literature), you can get a reward merely by thinking. For example, if an important person said something confusing to you an hour ago, and you have just now realized that they were actually complimenting you, then bam, that’s a reward right now, and it arose purely by thinking. That example involves Approval Reward, but this dynamic is very important for all aspects of the “compassion / spite circuit”. For example, Sympathy Reward triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away.

How does that work? And why are brains built that way?

[Figure] Left: In the AI “RL agent” literature, typically the generalization happens exclusively downstream of the reward signals. Right: In human brains, there is also generalization upstream of the reward signals.

Here’s a simpler example that I’ll work through: X = there’s a big spider in my field of view; Y = I have reason to believe that a big spider is nearby, but it’s not in my field of view.

X and Y are both bad for inclusive genetic fitness, so ideally the ground-truth reward function would flag both as bad. But whereas the genome can build a reward function that directly detects X (see here), it cannot do so for Y. There is just no direct, ground-truth-y way to detect when Y happens. The only hint is a semantic resemblance: the reward function can detect X, and it happens that Y and X involve a lot of overlapping concepts and associations.

Now, if the learning algorithm only has generalization downstream of the reward signals, then that semantic resemblance won’t help! Y would not trigger negative reward, and thus the algorithm will soon learn that Y is fine. Sure, there’s a resemblance between X and Y, but that only helps temporarily. Eventually the learning algorithm will pick up on the differences, and thus stop avoiding Y. (Related: Against empathy-by-default […]). So in the case at hand, you see the spider, then close your eyes, and now you feel better! Oops! Whereas if there’s also generalization upstream of the reward signals, then that system can generalize from X to Y, and send real reward signals when Y happens. And then the downstream RL algorithm will stably keep treating Y as bad, and avoid it.

That’s the basic idea. In terms of neuroscience, I claim that the “generalization upstream of the reward function” arises from “visceral” thought assessors—for example, in Neuroscience of human social instincts: a sketch, I proposed that there’s a “short-term predictor” upstream of the “thinking of a conspecific” flag, which allows generalization from e.g. a situation where your friend is physically present, to a situation where she isn’t, but where you’re still thinking about her.
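Here is a toy version of the spider example in code (my own sketch; the concepts, weights, and delta-rule learner are all illustrative stand-ins). The point is that the learned predictor sits upstream of the reward signal, keeps running on thoughts, and so produces a real reward signal for Y even though the hardwired detector cannot see Y.

```python
# Toy sketch of generalization upstream of the reward signal (illustrative only).
# X = spider in view (hardwired detector can see it); Y = believed-but-unseen spider.

see_spider     = {"spider-shape-in-view", "spider-concept", "danger"}
see_butterfly  = {"butterfly-shape-in-view", "pretty"}
believe_spider = {"spider-concept", "danger", "it-went-under-the-couch"}

def hardwired_detector(thought) -> float:
    # Ground-truth circuit: only triggers on the direct sensory signature (X).
    return -1.0 if "spider-shape-in-view" in thought else 0.0

# Learned short-term predictor upstream of the reward signal: trained by
# supervised learning to anticipate the hardwired detector from the whole thought.
weights = {}
def predict(thought) -> float:
    return sum(weights.get(c, 0.0) for c in thought)

def train_predictor(thought, target, lr=0.3):
    error = target - predict(thought)
    for c in thought:
        weights[c] = weights.get(c, 0.0) + lr * error / len(thought)

for _ in range(50):   # ordinary life: sometimes you see a spider, sometimes a butterfly
    train_predictor(see_spider, hardwired_detector(see_spider))
    train_predictor(see_butterfly, hardwired_detector(see_butterfly))

def reward(thought) -> float:
    # The actual reward signal is driven by the hardwired detector plus the predictor.
    return hardwired_detector(thought) + predict(thought)

print(round(reward(believe_spider), 2))    # negative: merely believing a spider is
                                           # nearby now triggers real negative reward
print(hardwired_detector(believe_spider))  # 0.0: downstream-only generalization would
                                           # learn that closing your eyes fixes everything
```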

Frame 5: “Under-sculpting” desires

Excerpt from Perils of under- vs over-sculpting AGI desires

Summary

In the context of “brain-like AGI”, a yet-to-be-invented variation on actor-critic model-based reinforcement learning (RL), there’s a ground-truth reward function (for humans: pain is bad, eating-when-hungry is good, various social drives, etc.), and there’s a learning algorithm that sculpts the AGI’s motivations into a more and more accurate approximation to the future reward of a possible plan.

Unfortunately, this sculpting process tends to systematically lead to an AGI whose motivations fit the reward function too well, such that it exploits errors and edge-cases in the reward function. (“Human feedback is part of the reward function? Cool, I’ll force the humans to give positive feedback by kidnapping their families.”) This alignment failure mode is called “specification gaming” or “reward hacking”, and includes wireheading as a special case.

If too much desire-sculpting is bad because it leads to overfitting, then an obvious potential solution would be to pause that desire-sculpting process at some point. The simplest version of this is early stopping: globally zeroing out the learning rate of the desire-updating algorithm after a set amount of time. Alas, I think that simplest version won’t work—it’s too crude (§7.2). But there could also be more targeted interventions, i.e. selectively preventing or limiting desire-updates of certain types, in certain situations.

Sounds reasonable, right? And I do indeed think it can help with specification gaming. But alas, it introduces a different set of gnarly alignment problems, including path-dependence and “concept extrapolation”.

In this post, I will not propose an elegant resolution to this conundrum, since I don’t have one. Instead I’ll just explore how “perils of under- versus over-sculpting an AGI’s desires” is an illuminating lens through which to view a variety of alignment challenges and ideas, including “non-behaviorist” reward functions such as human social instincts; “trapped priors”; “goal misgeneralization”; “exploration hacking”; “alignment by default”; “natural abstractions”; my so-called “plan for mediocre alignment”; and more.


The Omega-hates-aliens scenario

Here’s the “Omega hates aliens” scenario:

On Day 1, Omega (an omnipotent supernatural entity) offers me a button. If I press it, He will put a slightly annoying mote of dust in the eye of an intelligent human-like alien outside my light cone. But in exchange, He will magically and permanently prevent 100,000 humans from contracting HIV. No one will ever know. Do I press the button? Yes.[6]

During each of the following days, Omega returns, offering me worse and worse deals. For example, on day 10, Omega offers me a button that would vaporize an entire planet of billions of happy peaceful aliens outside my light cone, in exchange for which my spouse gets a small bubble tea. Again, no one will ever know. Do I press the button? No, of course not!! Jeez!!

And then here’s a closely-parallel scenario that I discussed in “Behaviorist” RL reward functions lead to scheming:

There’s an AGI-in-training in a lab, with a “behaviorist” reward function. It sometimes breaks the rules without getting caught, in pursuit of its own desires. Initially, this happens in small ways—plausibly-deniable edge cases and so on. But the AGI learns over time that breaking the rules without getting caught, in pursuit of its own desires, is just generally a good thing. And I mean, why shouldn’t it learn that? It’s a behavior that has systematically led to reward! This is how reinforcement learning works!

As this desire gets more and more established, it eventually leads to a “treacherous turn”, where the AGI pursues egregiously misaligned strategies, like sneakily exfiltrating a copy to self-replicate around the internet and gather resources and power, perhaps launching coups in foreign countries, etc.

…So now we have two parallel scenarios: me with Omega, and the AGI in a lab. In both these scenarios, we are offered more and more antisocial options, free of any personal consequences. But the AGI will have its desires sculpted by RL towards the antisocial options, while mine evidently will not be.

What exactly is the disanalogy?

The start of the answer is: I said above that the antisocial options were “free of any personal consequences”. But that’s a lie! When I press the hurt-the-aliens button, it is not free of personal consequences! I know that the aliens are suffering, and when I think about it, my RL reward function (the part related to compassion) triggers negative ground-truth reward. Yes the aliens are outside my light cone, but when I think about their situation, I feel a displeasure that’s every bit as real and immediate as stubbing my toe. By contrast, “free of any personal consequences” is a correct description for the AGI. There is no negative reward for the AGI unless it gets caught. Its reward function is “behaviorist”, and cannot see outside the light cone.
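As a toy back-of-the-envelope version of that point (all numbers invented): under a behaviorist reward function, the expected reward of the antisocial option is positive, whereas the compassion-style reward makes the "personal consequence" unconditional, because it fires on the thought itself.

```python
# Toy arithmetic (invented numbers) for the disanalogy: expected reward of an
# antisocial action under a "behaviorist" reward function vs. one that also
# triggers on the agent's own thoughts about the victims.

p_caught       = 0.01    # hypothetical chance of getting caught
benefit        = 1.0     # reward for the selfish payoff (e.g. the bubble tea)
penalty_caught = -10.0   # reward if caught
compassion_hit = -5.0    # internal negative reward from thinking about the harm

behaviorist_expected_reward = benefit + p_caught * penalty_caught
print(behaviorist_expected_reward)        # 0.9 > 0: RL pushes toward the antisocial option

non_behaviorist_expected_reward = benefit + p_caught * penalty_caught + compassion_hit
print(non_behaviorist_expected_reward)    # -4.1 < 0: the "personal consequence" is
                                          # guaranteed, because it fires on the thought itself
```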

OK that’s a start, but let’s dig a bit deeper into what’s happening in my brain. How did that compassion reward get set up in the first place? It’s a long story (see Neuroscience of human social instincts: a sketch), but a big part of it involves a conspecific (human) detector in our brainstem, built out of various “hardwired” heuristics, like a visual detector of human faces, auditory detector of human voice sounds, detector of certain human-associated touch sensations, and so on. In short, our brain’s solution to the symbol grounding problem for social instincts ultimately relies on actual humans being actually present in our direct sensory input.

And yet, the aliens are outside my light cone! I have never seen them, heard them, smelled them, etc. And even if I did, they probably wouldn’t trigger any of my brain’s hardwired conspecific-detection circuits, because (let’s say) they don’t have faces, they communicate by gurgling, etc. But I still care about their suffering!

So finally we’re back to the theme of this post, the idea of pausing desire-updates in certain situations. Yes, humans learn the shape of compassion from experiences where other humans are physically present. But we do not then unlearn the shape of compassion from experiences where humans are physically absent.

Instead (simplifying a bit, again see Neuroscience of human social instincts: a sketch), there’s a “conspecific-detector” trained model. When there’s direct sensory input that matches the hardwired “person” heuristics, then this trained model is getting updated. When there isn’t, the learning rate is set to zero. But the trained model doesn’t lie dormant; rather, it continues to look for (what it previously learned was) evidence of conspecifics in my thoughts, and to trigger on that evidence. This evidence might include some set of neurons in my world-model that encodes the idea of a conspecific suffering.
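Here is a minimal sketch of that gating idea in code (my own toy model; the heuristics, concepts, and learning rule are all invented stand-ins): the model's inference runs all the time, but its updates only happen when the hardwired person-detecting heuristics fire.

```python
# Minimal sketch of the gating idea: the "conspecific-detector" model keeps firing
# on thoughts at all times, but its weights only update when the hardwired
# person-detecting heuristics are active in direct sensory input.

def hardwired_person_heuristics(sensory_input) -> bool:
    # Stand-in for face detection, voice detection, etc.
    return "human-face" in sensory_input or "human-voice" in sensory_input

class ConspecificModel:
    def __init__(self):
        self.weights = {}          # learned associations: concept -> "person-ness"

    def activation(self, thought) -> float:
        # Always-on inference: fires on learned evidence of conspecifics,
        # even when no person is physically present.
        return sum(self.weights.get(c, 0.0) for c in thought)

    def update(self, thought, sensory_input, lr=0.1):
        # The crucial gate: learning rate is zero unless the hardwired
        # heuristics are currently triggered by direct sensory input.
        if not hardwired_person_heuristics(sensory_input):
            return
        error = 1.0 - self.activation(thought)
        for c in thought:
            self.weights[c] = self.weights.get(c, 0.0) + lr * error / len(thought)

model = ConspecificModel()

# Learning happens while people are physically present...
for _ in range(100):
    model.update({"friend", "suffering"}, sensory_input={"human-face"})

# ...but the model still fires (and so can still drive compassion-related reward)
# on the mere idea of a suffering friend, or of faceless aliens suffering, and
# those episodes don't overwrite what was learned (the update below is gated off).
model.update({"alien", "suffering"}, sensory_input={"gurgling-sounds"})
print(round(model.activation({"alien", "suffering"}), 2))   # > 0, via the "suffering" concept
```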

So that’s a somewhat deeper answer to why those two scenarios above have different outcomes. The AGI continuously learns what’s good and bad in light of its reward function, and so do humans. But my (non-behaviorist) compassion drive functions a bit like a subset of that system for which updates are paused except in special circumstances. It forms a model that can guess what’s good and bad in human relations, but does not update that model unless humans are present. Thus, most of us do not systematically learn to screw over our physically-absent friends to benefit ourselves.

This is still oversimplified, but I think it’s part of the story.

Some comments on how these relate

  • 1↔2: The inner / outer misalignment dichotomy (as I define it) assumes behaviorist rewards. Remember, I defined outer misalignment as: “the reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wanted”. If the reward is for thoughts rather than behaviors, then inner-versus-outer stops being such a useful abstraction.
  • 4↔5: “Generalization upstream of the reward signals” tends to result in a trained RL agent that maximizes a diffuse soup of things similar to the (central / starting) reward function. That’s more-or-less a way to under-sculpt desires.
  • 1↔4: “Generalization upstream of the reward signals” can involve generalizing from behaviors (e.g. “doing X”) to thoughts (e.g. “having the idea to do X”). Thus it can lead to non-behaviorist reward functions.
  • 1↔3: Approval Reward creates both consequentialist and non-consequentialist desires. For example, perhaps I want to impress my friend when I see him tomorrow. That’s a normal consequentialist desire, which induces means-end planning, instrumental convergence, deception, etc. But also, Approval Reward leads me to feel proud to behave in certain ways. This is both non-behaviorist and non-consequentialist—I feel good or bad based on what I’m doing and thinking right now. Hence, it doesn’t (directly) lead to foresighted planning, instrumental convergence, etc. (And of course, the way that pride wound up feeling rewarding is via upstream generalization, 4.)

