2026-02-18 09:30:12
Published on February 18, 2026 1:30 AM GMT
How does game theory work when everyone is a computer program who can read everyone else’s source code? This is the problem of ‘program equilibria’. In this episode, I talk with Caspar Oesterheld on work he’s done on equilibria of programs that simulate each other, and how robust these equilibria are.
Topics we discuss:
Daniel Filan (00:00:09): Hello, everybody. In this episode I’ll be speaking with Caspar Oesterheld. Caspar is a PhD student at Carnegie Mellon University, where he serves as the Assistant Director of the Foundations of Cooperative AI Lab. He researches AI safety with a particular focus on multi-agent issues. There’s a transcript of this episode at axrp.net, and links to papers we discuss are available in the description. You can support the podcast at patreon.com/axrpodcast, or give me feedback about this episode at axrp.fyi. Okay, well Caspar, welcome to AXRP.
Caspar Oesterheld (00:00:43): Thanks for having me.
Daniel Filan (00:00:44): So today we’re going to talk about two papers that you’ve been on. The first is “Robust program equilibrium”, where I believe you’re the sole author. And the second is “Characterising Simulation-Based Program Equilibria” by Emery Cooper, yourself and Vincent Conitzer. So I think before we sort of go into the details of those papers, these both use the terms like “program equilibrium”, “program equilibria”. What does that mean?
Caspar Oesterheld (00:01:11): Yeah, so this is a concept in game theory and it’s about the equilibria of a particular kind of game. So I better describe this kind of game. So imagine you start with any sort of game, in the game theoretic sense, like the prisoner’s dilemma, which maybe I should describe briefly. So imagine we have two players and they can choose between raising their own utility by one or raising the other player’s utility by three and they only care about their own utility. I don’t know, they play against a stranger, and for some reason they don’t care about the stranger’s utility. And so they both face this choice. And the traditional game-theoretic analysis of this game by itself is that you should just raise your own utility by $1 and then both players will do this and they’ll both go home with $1 or one utilon or whatever. And, of course, there’s some sort of tragedy. It would be nice if they could somehow agree in this particular game to both give the other player $3 and to both walk home with the $3.
Daniel Filan (00:02:33): Yeah, yeah, yeah. And just to drive home what’s going on, if you and I are playing this game, the core issue is no matter what you do, I’m better off giving myself the one utility or the $1 rather than giving you three utility because I don’t really care about your utility.
(00:02:53): So, I guess, there are two ways to put this. Firstly, just no matter what you play, I would rather choose the “give myself utility” option, commonly called “defect”, rather than cooperate. Another way to say this issue is, in the version where we both give each other the $3, I’m better off deviating from that. But if we’re both in the “only give ourselves $1” situation, neither of us is made better off by deviating and in fact we’re both made worse off. So it’s a sticky situation.
Caspar Oesterheld (00:03:29): Yeah. That’s all correct, of course. Okay. And now this program game set-up imagines that we take some game and now instead of playing it in this direct way where we directly choose between cooperate and defect—raise my utility by $1 or the other player’s by $3—instead of choosing this directly, we get to choose computer programs and then the computer programs will choose for us. And importantly, so far this wouldn’t really make much of a difference yet. Like, okay, we choose between a computer program that defects or a computer program that cooperates, or the computer program that runs in circles 10 times and then cooperates. That effect doesn’t really matter.
(00:04:12): But the crucial addition is that the programs get access to each other’s source code at runtime. So I submit my computer program, you submit your computer program and then my computer program gets as input the code of your computer program. And based on that it can decide whether to cooperate or defect (or you can take any other game [with different actions]). So it can look at your computer program and [see] does it look cooperative? And depending on that, cooperate or defect. Or it can look [at] is the fifth character in your computer program an ‘a’? And then cooperate if it is and otherwise defect. There’s no reason to submit this type of program, but this is the kind of thing that they would be allowed to do.
Daniel Filan (00:04:58): Yeah. So this very syntactic analysis… A while ago I was part of this, basically a tournament, that did this prisoner’s dilemma thing with these open source programs. And one strategy that a lot of people used was, if I see a lot of characters… Like if I see a string where that string alone means “I will cooperate with you”, then cooperate with that person, otherwise defect against that person.
(00:05:33): Which I think if you think about it hard, this doesn’t actually quite make sense. But I don’t know, there are very syntactic things that, in fact, seem kind of valuable, especially if you’re not able to do that much computation on the other person’s computer program. Just simple syntactic hacks can be better than nothing, I think.
Caspar Oesterheld (00:05:56): Yeah. Was this Alex Mennen’s tournament on LessWrong or was this a different-
Daniel Filan (00:06:01): No, this is the Manifold one.
Caspar Oesterheld (00:06:07): Ah, okay.
Daniel Filan (00:06:08): So you had to write a JavaScript program, it had to be fewer than however many characters and there was also a market on which program would win and you could submit up to three things. So actually, kind of annoyingly to me… One thing I only realized afterwards is the thing you really should have done is write two programs that cooperated with your program and defected against everyone else’s, or just cooperated with the program you thought was most likely to win. And then you bet on that program. Or even you could submit three programs, have them all cooperate with a thing that you hoped would win and defect against everyone else and then bet on… Anyway.
(00:06:49): So in that setting there was a timeout provision where if the code ran for too long your bot would be disqualified, and also you had to write a really short program. Some people actually managed to write pretty smart programs. But if you weren’t able to do that, relatively simple syntactic analysis was better than nothing, I think.
Caspar Oesterheld (00:07:14): Yeah, I think there was this earlier tournament in 2014 or something like that when there was less known about this kind of setting. And a bunch of programs there were also based on these simple syntactic things. But in part because everyone was mostly thinking about these simple syntactic things, it was all a little bit kind of nonsense.
(00:07:34): I don’t know, you would check whether the opponent program has a particular word in it or something like that. And then, I think, the winning program had particular words in it but it would just still defect. So in some sense those dynamics are a little bit nonsense or they’re not really tracking, in some sense, the strategic nature of the situation.
Daniel Filan (00:08:02): Fair enough. So going back, you were saying: you have your opponent’s program and you can see if the fifth character is an ‘a’ or, and then-
Caspar Oesterheld (00:08:11): Yeah, what should one perhaps do? So I think the setting was first proposed in, I think, 1984 or something like that. And then it kind of [was] rediscovered or reinvented, I think, three times or something like that in various papers. And all of these initial papers find the following very simple program for this prisoner’s dilemma-type situation, which just goes as follows: if the opponent program is equal to myself—to this program—then cooperate and otherwise defect.
(00:08:53): So this program is a Nash equilibrium against itself and it cooperates against itself. So if both players submit this program, neither is incentivized to deviate from playing this program. If you play this program that checks that the two programs are the same and if they are, cooperate, otherwise defect, you submit this program, the best thing I can do is also submit this program. If I submit anything else, you’re going to defect. So I’m going to get at most one if I also defect, whereas I get three if I also cooperate. So yeah, all of these original papers proposing the setting, they all find this program which allows stable cooperation in this setting.
Daniel Filan (00:09:38): Right. So my impression, and maybe this is totally wrong, is I think for a while there’s been some sense that if you’re rational and you’re playing the prisoner’s dilemma against yourself, you should be able to cooperate with yourself, I think. Wasn’t there some guy writing in Scientific American about superrationality and he held a contest basically on this premise?
Caspar Oesterheld (00:10:02): Yeah, yeah. Hofstadter, I think.
Daniel Filan (00:10:05): Right, right.
Caspar Oesterheld (00:10:06): I think also in the ’80s or something… I’ve done a lot of work on this kind of reasoning as well that… I don’t know, for humans it’s a little bit hard to think about. You don’t often face very similar opponents or it’s a little bit unclear how similar other people are. Is your brother or someone who’s related to you and was brought up in a similar way, are they very similar? It’s kind of hard to tell.
(00:10:38): But for computer programs it’s very easy to imagine, of course, that you just… You have two copies of GPT-4 or something like that and they play a game against each other. It’s a very normal occurrence, in some sense. I mean, maybe not them acting in the real world, at this point, but having multiple copies of a computer program is quite normal. And there’s this related but to some extent independent literature on these sorts of ideas that you should cooperate against copies, basically.
Daniel Filan (00:11:10): But yeah, basically I’m wondering if this idea of ‘I’ll cooperate against copies” is what inspired these very simple programs?
Caspar Oesterheld (00:11:22): Yeah, that is a good question. I basically don’t know to what extent this is the case. I know that some of the later papers on program equilibrium, I remember some of these specifically citing this superrationality concept. But yeah, I don’t remember whether these papers—I think McAfee is one of these who wrote about this in the ’80s—I don’t know whether they discuss superrationality.
Daniel Filan (00:11:53): And it’s kind of tricky because… If you actually look at the computer programs, they’re not doing expected utility maximization… Or they’re not computing expected utility maximization. They’re just like, “if identical to me, cooperate, else defect”, just hard-coded in… Anyway, maybe this is a distraction but, indeed, these were the first programs considered in the program equilibrium literature.
Caspar Oesterheld (00:12:19): Yeah.
Daniel Filan (00:12:20): So they sound great, right?
Caspar Oesterheld (00:12:21): Yeah. So, I mean, they’re great in that in the prisoner’s dilemma, you can get an equilibrium in which you can get cooperation, which otherwise you can’t, or you can’t achieve with various naive other programs that you might write. But, I think, in practice—and it’s not so obvious what the practice of this scheme looks like—but if you think of any kind of practical application of this, it’s sort of a problem that the settings are somewhat complex and now you need… Two people write programs independently and then these programs need to be the same somehow or they need to… I mean, there are slightly more general versions of these where they check some other syntactic properties.
(00:13:13): But basically, yeah, you require that you coordinate in some way on a particular kind of source code to write, which maybe in some cases you can do, right? Sometimes maybe we can just talk beforehand. Like if we play this prisoner’s dilemma, we can just explicitly say, “Okay, here’s the program that I want to submit. Please submit the same program” and then you can say, “Okay, let’s go”.
(00:13:38): But maybe in cases where we really write these programs independently, maybe at different points in time, and these programs, especially if they do more complicated things than play this prisoner’s dilemma, it’s very difficult to coordinate without explicitly talking to each other on writing programs that will cooperate against each other. Even in the prisoner’s dilemma, you might imagine that I might have an extra space somewhere, or maybe you write the program, “If the two programs are equal, cooperate, otherwise defect” and I write, “if the two programs are different, defect, else cooperate”. These very minor changes would already break these schemes.
Daniel Filan (00:14:20): Okay, okay. There’s a lot to just ask about there. I think my first question is: we have this notion of program equilibrium. Are we trying to find Nash equilibria of programs? Are we trying to find evolutionarily stable strategies? Or maybe there are tons of solution concepts and we just want to play around with the space. But what are the actual… What’s the thing here?
Caspar Oesterheld (00:14:49): Yeah. The solution concept that people talk about most is just Nash equilibrium. So if you look at any of these papers and you look at the results, they’ll prove “these kinds of programs form a Nash equilibrium of the program game”. Or, I mean, the term “program equilibrium” literally just means “Nash equilibrium of the game in which the players submit these programs”. That is almost always the kind of game-theoretic solution concept that people use.
(00:15:25): And then, usually a bunch of other things are a little bit more implicit. It’s clear that people are interested in finding good Nash equilibria. In some sense, the whole point of the setup is we start out with the prisoner’s dilemma and sad: everyone’s going to defect against everyone else and we’re not getting to cooperation. And now, we come in with this new idea of submitting programs that get access to each other’s source code and with this we get these cooperative equilibria. So that is usually… I mean, it’s often quite explicit in the text that you’re asking, “can we find good equilibria?” in some sense, ones that are Pareto-optimal in the space of possible outcomes of the game or something like that.
(00:16:15): And then, additionally, a lot of the work after these early papers that do this syntactic comparison-based program equilibrium are about this kind of intuitive notion of robustness, that you want to have equilibria that aren’t sensitive to where the other program puts the spaces and the semicolons and these syntactic details. But it is kind of interesting that this isn’t formalized usually. And also, the second paper that we talked about, we presented this at AAAI and one game theorist came to our poster and said… I don’t know, to him it was sort strange that there’s no formalization, in terms of solution concepts in particular, of this kind of robustness notion, that we’ll talk about the programs that we are claiming or that we are arguing are more robust. But this syntactic comparison-based program, there’s sort of some intuitive sense, and we can give concrete arguments, but it’s not formalized in the solution concept.
(00:17:35): One of my papers is called “robust program equilibrium”, but robust program equilibrium is not actually a solution concept in the sense that Nash equilibrium is or trembling hand equilibrium is. The robustness is more some sort of intuitive notion that, I think, a lot of people find compelling but in some sense it’s not formalized.
Daniel Filan (00:17:58): Yeah, and it’s funny… I see this as roughly within both the cooperative AI tradition and the agent foundations tradition. And I think these traditions are sort of related to each other. And, in particular, in this setting in decision theory, I think there’s also some notion of fairness of a decision situation.
(00:18:24): So sometimes people talk about: suppose you have a concrete instantiation of a decision theory, meaning a way somebody thinks about making decisions. There are always ways of making that concrete instantiation look bad by saying: suppose you have a Caspar decision theory; we’ll call it CDT for short. And then you can be in a decision situation, right, where some really smart person figures out what decision theory you’re running, punches you if you’re running CDT and then gives you $1 million if you’re not.
(00:18:54): And there’s a sense that this is unfair but also it’s not totally obvious. Like in that setting as well, I think there’s just no notion of what the fair thing is. Which is kind of rough because you’d like to be able to say, “Yeah, my decision theory does really well in all the fair scenarios”. And it seems like it would be nice if someone figured out a relevant notion here. Are people trying to do that? Are you trying to do that?
Caspar Oesterheld (00:19:22): So I think there is some thinking in both cases and I think probably the kind of notion that people talk about most is probably similar in both. So in this decision theory case, I think the thing that probably most people agree is that the decision situation should be somehow be a function of your behavior. It shouldn’t check, “do you run CDT”, and if you do, you get punched in the face. It should be like: if in this situation you choose this, then you get some low reward. But this should somehow be behavior-based, which I think still isn’t enough. But, I mean, this sort of goes into the weeds of this literature. Maybe we can link some paper in the show notes.
(00:20:17): But, I mean, the condition that we give in the second paper, or maybe even in both of the papers that we’re going to discuss, there’s some explicit discussion of this notion of behaviorism, which also says: in the program equilibrium setting, it’s sort of nice to have a kind of program that only depends on the other program’s behavior rather than the syntax.
(00:20:48): And all of these approaches to robustness, like trying to do some proofs about the programs, about what the opponent program does, try to prove whether the opponent will cooperate or something like that… All of these, to some extent, these notions that people intuitively find more robust, they’re all more behaviorist, at least, than this syntactic comparison-based idea.
Daniel Filan (00:21:15): Yeah. Although it’s tricky because… I’m sorry, I don’t know if this is going to the weeds that you want to postpone. So this behaviorism-based thing, if you think about the “if you’re equal to me, cooperate, else defect” program, this is behaviorally different from the “if you’re unequal to me, defect, else cooperate” program, right?
(00:21:33): It does different things in different situations and therefore… Once you can define an impartial thing, right, then maybe you can say, “Well if you act identically on impartial programs then you count as impartial”. But actually maybe that’s just a recursive definition and we only need one simple program as a base case.
Caspar Oesterheld (00:21:52): I think we do actually have a recursive definition of simulationist programs that I think is a little bit trying to address some of these issues. But, yeah, it does sort of go into the weeds of what exactly should this definition be.
Daniel Filan (00:22:13): Yeah, okay. Let’s go back a little bit to desiderata of program equilibria. So they’re computer programs, right? So presumably—and this is addressed a bit in the second paper—but just runtime computational efficiency, that seems like a relevant desideratum.
Caspar Oesterheld (00:22:28): Yes, I agree.
Daniel Filan (00:22:29): And then, I think that I imagine various desiderata to include “have a broad range of programs that you can work well with”. And it seems like there might be some notion of just, “if you fail, fail not so badly, rather than fail really badly”. I don’t know if… this is slightly different from the notion of robustness in your paper and I don’t know if there’s a good formalism for this. Do you have thoughts here?
Caspar Oesterheld (00:23:02): I mean in some intuitive sense, what one wants is that, if I slightly change my program, maybe even in a way that is sort of substantial… In the prisoner’s dilemma, it’s a little bit unclear if I defect slightly more, if I don’t cooperate 100% but I cooperate 95%, it’s unclear to what extent should you be robust. Should you defect against me all of the time? But, I guess, in other games where maybe there are different kinds of cooperation or something like that, you’d want… If I cooperate in slightly the wrong way, the outcome should still be good.
(00:23:46): I think in some sense there’s something here, that I think it’s conceptually quite clear that if you deviate in some reasonable harmless way, it should still be fine. We shouldn’t defect against each other, we should still get a decent utility. But the details are less clear [about] what exactly are the deviations and it probably depends a lot on the game. And then, there are a lot of these sort of things that in game theory are just kind of unclear. If I defect 5% more, how much should you punish me for that? And so, I think that’s why a lot of these things, they aren’t really formalized in these papers.
Daniel Filan (00:24:35): Sure, okay. So now that we know what program equilibrium is, why does it matter?
Caspar Oesterheld (00:24:43): There are lots of different possible answers to this question. I think the most straightforward one is that we can view program games like program equilibrium as sort of a model of how games could be played when different parties design and deploy AI systems. So this whole thing of having a source code that the other party can look at and can maybe run or can look at character five and stuff like that: this is something that is somewhat specific to computer programs. We can talk about whether there are human analogs still, but when we play a game against each other, it’s sort of hard to imagine an equivalent of this. Maybe I have some vague model of how your brain works or something like that, but there’s no source code, I can’t really “run” you in some ways.
(00:25:51): Whereas, if we both write computer programs, this can just literally happen. We can just literally say, “This is the source code that I’m deploying…” I have my charity or something like that and I’m using some AI system to manage how much to donate to different charities. I can just say, “Look, this is the source code that I’m using for managing what this charity does”. And here, I think, program equilibrium or program games are quite a literal direct model of how these interactions could go. Of course, you can also deploy the AI system and say “we’re not saying anything about how this works”. In which case, obviously, you don’t get these program equilibrium-type dynamics. But it’s a way that they could go and that people might want to use because it allows for cooperation.
(00:26:47): So I think the most direct interpretation is that it models a kind of way that games could be played in the future when more decisions are made by delegating to AI systems. As people in this community who think and to some extent worry about a future where lots of decisions are made by AI, this is an important thing to think about. And meanwhile, because to most game theorists it’s sort of a weird setting because, well, humans can’t read each other’s source code, it’s sort of understudied by our lights, I guess, because currently it’s not a super important way that games are played.
Daniel Filan (00:27:37): Which is interesting because… So I guess we don’t often have games played with mutual source code transparency, but there really are computer programs that play economic games against each other in economically valuable settings, right? A lot of trading in the stock market is done by computer programs. A lot of bidding for advertisement space is done by computer programs.
(00:28:06): And algorithmic mechanism design—so mechanism design being sort of inverse game theory: if you want some sort of outcome, how you’d figure out the game to make that happen. Algorithmic mechanism design being like that, but everyone’s a computer. There’s decent uptake of this, as far as I can tell. Algorithmic game theory, there’s decent uptake of that. So I’m kind of surprised that the mutual transparency setting is not more of interest to the broader community.
Caspar Oesterheld (00:28:42): Yeah, I think I agree. I mean, a lot of these settings… So I think the trading case is a case where decisions are made on both sides by algorithms. But usually because it’s kind of a zero-sum game, you don’t want to reveal to your competitors how your trading bot works.
(00:29:07): There’s a lot of this mechanism design where you have an algorithm. I guess those are usually cases where it’s sort of unilateral transparency. I auction off something and I’m saying, “Okay, I’m using this algorithm to determine who gets, I don’t know, this broadband frequency or these things that are being auction-offered”.
(00:29:33): So, I guess, those are cases with sort of unilateral transparency. And that is, I guess, studied much more in part because it’s less… I mean, this also has been studied traditionally in game theory much more, in some sense. You can view it as some Stackelberg equilibrium. You can view all mechanism design as being a bit like finding Stackelberg equilibria. And I think Stackelberg’s analyses of game theory even proceed Nash equilibrium.
Daniel Filan (00:30:04): Interesting.
Caspar Oesterheld (00:30:05): So that is very old.
Daniel Filan (00:30:07): Where Stackelberg equilibrium is: one person does a thing and then the next person does a thing. And so the next person is optimizing, given what the first person does, and the first person has to optimize “what’s really good for me, given that when I do something the other person will optimize what’s good for them based on what I do”.
Caspar Oesterheld (00:30:23): Yeah.
Daniel Filan (00:30:24): So people look at Stackelberg equilibria and these sorts of games and it’s a common thing. And it’s an interesting point that you can sort of think of it as one-way transparency.
Caspar Oesterheld (00:30:34): Yeah. I think one thing one could think about is how much humans are in these mutual transparency settings. So yeah, I already said for individual humans: if the two of us play a prisoner’s dilemma, I have some model of you, I can’t really read… So I don’t know, seems sort of speculative. So there’s this paper which I really like by Andrew Critch, Michael Dennis and Stuart Russell, all from CHAI where, of course, you graduated from. This is about program equilibrium as well.
(00:31:16): The motivating setting that they use is institution design. The idea there is that: institutions, you can view them as rational players, or something like that. They make decisions, and they play games with each other. Like, I don’t know, the US government plays a game with the German government or whatever. But institutions have some amount of transparency. They have laws that they need to follow. They have constitutions. They’re composed of lots of individuals, that in principle, one could ask… I don’t know, the German government could check all the social media profiles of all the people working for the US government and learn something about how these people interact with each other, or something like that. There’s some very concrete transparency there.
(00:32:09): In particular, some things are really just algorithmic type commitments. Like, I don’t know, “We don’t negotiate with terrorists”, or something like that. It’s specific, something that’s in the source code of a country in some sense. It’s specifying how it’s going to choose in particular interactions. I think that is a case where interactions between human organizations have this transparency. I think that’s some evidence that we could get similar things with AI.
(00:32:51): At the same time, it’s also interesting that this hasn’t motivated people to study this program equilibrium-style setting, which I think is probably because I think, as a computer scientist, it’s natural to think the constitution is basically just an algorithm. It’s also a little bit like, I don’t know, computer science people explain the world to everyone else by using computer programs for everyone. Like, “The mind is a program, and the constitution is just a program. We got it covered with our computer science stuff”, which maybe some people also don’t like so much. But I think it’s a helpful metaphor still.
Daniel Filan (00:33:35): Fair enough. Okay. Some people do study program equilibria. Just to set up the setting for your papers: before the appearance to the world of Robust Program Equilibrium, what did we know about program equilibria beyond these simple programs that cooperate if your source code is mine?
Caspar Oesterheld (00:33:56): Yeah. I guess we have some characterizations of the kind of equilibria, in general, that are allowed by these syntactic comparison-based programs. Not sure how much to go into that at this point, but yeah, maybe we’ll get into this later.
Daniel Filan (00:34:16): I think I can do this quickly. My understanding is basically, any equilibrium that’s better off for all the players than unilaterally doing what they want, you can get with program equilibrium. Maybe you have to have punishments as well, but something roughly like this. You can have programs being like, “You have to play this equilibrium. If you don’t, then I’ll punish you”. Just write up a computer program saying, “If you’re equal to me, and therefore play this equilibrium, then I’ll play this equilibrium. If you’re not, then I’ll do the punish action”.
Caspar Oesterheld (00:34:55): Yes. Yeah, that’s basically right.
Daniel Filan (00:34:58): Okay. Is it only basically right?
Caspar Oesterheld (00:35:01): No, I think it’s basically right… I think it’s fully right, sorry. [It’s just] “basically” in the way that all natural language descriptions… You can get anything that is better for everyone than what they can get if everyone punishes them, which might be quite bad.
(00:35:25): For example, in the prisoner’s dilemma, we had this nice story of how you can get mutual cooperation, but you can also get, I don’t know, one player cooperates 60% of the time, the other player cooperates 100% of the time. The reason why the 100% of the time cooperator doesn’t cooperate less is that the 60% cooperator says, “Yeah, if we’re not both submitting the program that plays this equilibrium, I’m going to always defect”. In the prisoner’s dilemma, you can get anything that is at least as good as mutual defection for both players. In some sense, almost everything can happen. It can’t happen that one player cooperates all the time, the other player defects all the time. Because then the cooperator would always want to defect. But yeah, that’s the basic picture of what’s going on here.
(00:36:26): That has been known. Then post-Tennenholtz, which is one of these papers—I think the paper that [coined the term] “program equilibrium”and gave this syntactic comparison-based program, and this folk theorem, as it’s called, of what kind of things can happen in equilibrium. After that, most papers have focused on this “how do we make this more robust” idea. In particular, what existed prior to the robust program equilibrium paper are these papers on making things more robust by having the programs try to prove things about each other.
(00:37:11): Here’s maybe the simplest example of this that one doesn’t need to know crazy logic for. You could write a program… in the prisoner’s dilemma, you could write a program that tries to search for proofs of the claim “if this program cooperates, the other program will also cooperate”. Your program is now very large. It has this proof search system. Somehow, it can find proofs about programs. But basically, you can still describe it relatively simply as, “I try to find the proof that if I cooperate, the opponent cooperates. Then I cooperate. Otherwise, I’ll defect”. It’s not that difficult to see that this kind of program can cooperate against itself. Because if it faces itself, it’s relatively easy to prove that if I cooperate, the opponent will cooperate. Because the statement, it’s an implication where both sides of the implication arrows say exactly the same thing.
(00:38:25): At the same time, this is more robust, because this will be robust to changing the spaces and so on. It’s relatively easy to prove “if this program outputs cooperate, then this other program which is the same, except that it has the spaces in different places or switches things around in some way that doesn’t really matter, that this will also output that thing, also output cooperate”. This is a basic proof-based approach that will work.
(00:39:07): I think the first paper on this is by Barasz et al. I think there are two versions of this which have different first authors, which is a little bit confusing. I think on one of them, Barasz is the first author. On the other one, it’s LaVictoire. I think it’s an American, so probably a less French pronunciation is correct.
Daniel Filan (00:39:37): I actually think he does say “Lah vic-twahr”.
Caspar Oesterheld (00:39:39): Oh, okay.
Daniel Filan (00:39:40): I think. I’m not 100% certain. Write in, Patrick, and tell us.
Caspar Oesterheld (00:39:48): Those papers first proposed these proof-based approaches. They actually do something that’s more clever, where it’s much harder to see why it might work. I described a version where the thing that you try to prove is “if I cooperate, the opponent will cooperate”. They instead just have programs that try to prove that the opponent will cooperate. You just do, “if I can prove that my opponent cooperates, I cooperate. Else, I defect”.
(00:40:16): This is much less intuitive that this works. Intuitively, you would think, “Surely, this is some weird infinite loop”. If this faces itself… I am going to think, “What does the opponent do?” Then, “Well, to think about what my opponent will do to prove anything about them, they’ll try to prove something about me”. You run into this infinite circle. You would think that it’s basically the same as… One very naive program that you might write is just, “Run the opponent program. If it cooperates, cooperate. Otherwise, defect”. This really does just run in circles.
(00:40:56): You would think that just doing proofs instead of this running the opponent program, that you have the same issue. It turns out that you can find these proofs which follows from a somewhat obscure result in logic called Löb’s theorem, which is a little bit related to Gödel’s second incompleteness theorem. With Löb’s theorem it’s relatively easy to prove, but it’s a very “you kind of need to just write it down” proof, and then it’s relatively simple. But it’s hard to give an intuition for it, I think.
Daniel Filan (00:41:47): Also, it’s one of these things that’s hard to state unless you’re careful and remember… So I’ve tried to write it down. It’s like, if you can prove that a proposition would be true… Okay, take a proposition P. Löb’s theorem says that if you can prove that “if you could prove P, then P would be true”, then, you would be able to prove P. If you can prove that the provability of a statement implies its truth, then you could prove the thing. The reason that this is non-trivial is it turns out that you can’t always prove that if you could prove a thing, it would be true because you can’t prove that your proving system works all the time. You can construct funky self-referential things that work out. Unless I have messed up, that is Löb’s theorem.
(00:42:49): My recollection is the way it works in this program is basically, you’re checking if the other program would cooperate… Imagine we’re both these “defect unless proof of cooperation” programs. I’m like, “Okay, I want to check if you would cooperate given me”. “If you would cooperate given me” is the same as “if I would cooperate given you”… Here’s the thing that I definitely can prove. If you can prove that “if I can prove that I cooperate, then you cooperate”. But crucially, the “I” and the “you” are actually just the same, because we’re at the same program. If it’s provable that “if it’s provable, then we cooperate”, then we cooperate. Löb’s theorem tells us that we can therefore conclude that it is provable that we cooperate. Therefore, we in fact cooperate.
(00:43:48): My understanding is: so what do we actually do? I think we prove Löb’s theorem, then apply it to our own situation, and then we both prove that we both cooperate, and then we cooperate. I think that’s my recollection of how it’s supposed to go.
Caspar Oesterheld (00:44:01): At least that would be one way.
Daniel Filan (00:44:03): Yeah, I suppose there might be even shorter proofs.
Caspar Oesterheld (00:44:06): Yeah, that is basically correct. Yeah, good recollection of the papers.
Daniel Filan (00:44:14): Yeah. There were a few years in Berkeley where every couple weeks somebody would explain Löb’s theorem to you, and talk about Löbian cooperation. Eventually, you remembered it.
Caspar Oesterheld (00:44:25): Okay, nice. I think it’s a very nice idea. I actually don’t know how they made this connection. Also Löb’s theorem, it’s relatively obscure, I think in part because it doesn’t prove that much more than Gödel’s second incompleteness theorem. Gödel’s incompleteness theorem is “a logical system can’t prove its own consistency”. But here, it’s the same thing. You can’t prove “if I can prove something, it’s true” without just being able to prove the thing.
(00:45:11): I think that’s probably one reason why Löb’s theorem isn’t very widely known. I feel like it’s a result that for this thing, it happens to be exactly the thing you need. Once you have it written down, this cooperation property follows almost immediately. But…
Daniel Filan (00:45:32): How they made the connection?
Caspar Oesterheld (00:45:33): Yeah, how did they…
Daniel Filan (00:45:34): I think I know this, or I have a theory about this. Originally, before they were talking about Löbian cooperation, there was this Löbian obstacle or Löbstacle.
Caspar Oesterheld (00:45:45): Yeah, the Löbstacle.
Daniel Filan (00:45:46): Yeah, to self-trust. You might want to say, “Oh, I’m going to create a successor program to me, and if I can prove that the successor program is going to do well, then…” Or all the programs are going to be like, “If I can prove a thing is good, then I’ll do it.” And can I prove that a program that I write is going to be able to do stuff? And it’s a little bit rough, because if I can prove that you could prove that a thing is good, then I could probably prove that the thing was good myself, and so why am I writing the [successor].
(00:46:14): Maybe this just caused the Löb’s theorem to be on the mind of everyone. I don’t know. I have this theory. But I don’t think I’ve heard it confirmed by any of the authors.
Caspar Oesterheld (00:46:24): Okay. It’s a good theory, I think.
Daniel Filan (00:46:26): Okay. We had this Löbian cooperation idea floating around. This is one thing that was known before these papers we’re about to discuss. Is there anything else’s that important?
Caspar Oesterheld (00:46:45): Yeah, there was a little bit more extension of this Löbian idea. One weird thing here is that we have these programs, “if I can prove this, then I cooperate”. Of course, whether I can prove something, it’s not decidable. There’s not an algorithm that tries for 10 hours, and then it gives up. That’s not what provability would normally mean.
(00:47:17): There’s a paper by Andrew Critch from I think 2019, that shows that actually, Löb’s theorem still works if you consider these bounded… You try, with a given amount of effort… Specifically, you try all proofs of a given length, I think, is the constraint. It shows that some version of Löb’s theorem still holds, and that it’s still enough to get this Löbian cooperation if the two players consider proofs up to a long enough length. They can still cooperate.
Daniel Filan (00:47:55): And it doesn’t have to be the same length.
Caspar Oesterheld (00:47:56): Yeah, it doesn’t have to be the same length, importantly.
Daniel Filan (00:47:58): It just has to be the length of that paper.
Caspar Oesterheld (00:48:00): Yeah.
Daniel Filan (00:48:01): Right. Yeah, yeah, which is great. Very fun result. So there’s a Löbian cooperation. There’s parametric bounded Löbian cooperation. Anything else of note?
Caspar Oesterheld (00:48:12): Yeah. I think one other thing that is interesting—this is not really an important fact, but I think it’s an important thing to understand—is that for the Löbian bots, it matters that you try to find a proof that the other player cooperates, rather than trying to find a proof that the other player defects. The same is true for this implication case that I described. If you try to check “is there a proof that if I defect, the opponent will defect?”, I’m not sure why you would do that.
Daniel Filan (00:49:06): You can imagine similar things, like, “Okay, if I defect, will you cooperate with me naively like a sucker? If so, then I’m just definitely going to defect”.
Caspar Oesterheld (00:49:24): Right. Then I guess you would check for some other property.
Daniel Filan (00:49:32): Or you would check “if I defect, will you defect? If so, then I’ll cooperate”. Maybe that would be the program.
Caspar Oesterheld (00:49:37): Yeah, maybe that is even the more sensible program. I’m not sure whether this cooperates against itself.
Daniel Filan (00:49:50): It must cooperate, right?
Caspar Oesterheld (00:49:51): Okay, let’s think …
Daniel Filan (00:49:55): Suppose we’re the same program. Then it’s basically like: if provable defect “if and only if provable defect”, then cooperate, else defect. But provable defect, if and only if provable defect…. It’s the same… You can just see that it’s the same expression on both sides.
Caspar Oesterheld (00:50:11): Right, I agree. Yeah, this will cooperate. This is not an equilibrium though. If the opponent just submits a DefectBot, you’re going to cooperate against it, right?
Daniel Filan (00:50:22): Yes, it is a program, it is not an equilibrium. I got us off track, I fear.
(00:50:32): But you were saying that you want to be proving the good case, not the bad case.
Caspar Oesterheld (00:50:39): Yeah, maybe let’s do the version from the paper, “if I can prove that you cooperate, I cooperate. Otherwise, I defect”. If you think about it, in this program, it doesn’t really matter that mutual cooperation is the good thing, and mutual defection is the bad thing. Ultimately, it’s just we have two labels, cooperate and defect, we could call them A and B instead. It’s just, “if I can prove that you output label A, I also output label A. Otherwise, I’ll output label B”.
(00:51:12): Regardless of what these labels are, this will result in both outputting label A. If label A happens to be defect rather than cooperate, these will defect against each other. It matters that you need to try the good thing first or something like that.
Daniel Filan (00:51:29): Yeah, yeah. I guess, maybe the most intuitive way of thinking about it, which… I haven’t thought about it a ton, so this may not be accurate. But it feels like you’re setting up a self-fulfilling prophecy, or if the other person happens to be you, then you’re setting up a self-fulfilling prophecy. You want to set up the good self-fulfilling prophecy, not the bad self-fulfilling prophecy.
(00:51:51): I think this is true in this setting. My impression is that there’s also decision theory situations where you really care about the order in which you try and prove things about the environment. I forget if self-fulfilling prophecy is the way to think about those situations as well, even though they’re conceptually related. We can perhaps leave that to the listeners if it’s too hard to figure out right now.
(00:52:15): Okay. Now that we’ve known this sad world that’s confusing and chaotic, perhaps we can get the light of your papers.
Caspar Oesterheld (00:52:26): Okay. I should say, I really like the proof-based stuff. We can talk a little bit about what maybe the upsides and downsides are. Yeah, it is confusing. I would think that one issue with it is that in practice, what programs can one really prove things about?
Daniel Filan (00:52:49): Yeah, my intuition is that the point of that work is it seems like it’s supposed to be modeling cases where you have good beliefs about each other that may or may not be exactly proofs. You hope that something like Löb’s theorem holds in this more relaxed setting, which it may or may not. I don’t exactly know.
Caspar Oesterheld (00:53:07): Yeah, I agree. I also view it this way, which is a more metaphorical way. There’s some distance between the mathematical model, and the actual way it would work then.
Daniel Filan (00:53:26): But I want to hear about your paper.
Caspar Oesterheld (00:53:28): Right. Okay. Now, let’s get to my paper. My paper is on whether we can get these cooperative equilibria, not by trying to prove things about each other, but just by simulating each other. I already mentioned that there’s a super naive but intuitive approach that you would like to run the opponent against… You’d like to run the opponent with myself as input, see if they cooperate, if they do, cooperate, otherwise defect. Just this very obvious intuition, maybe from tit for tat in repeated games, that you want to reward the other player for cooperating, and get a good equilibrium that way.
(00:54:21): The problem with this, of course, is that it doesn’t hold if both players do this. I guess this would work if you play this sequentially. We talked about the Stackelberg stuff earlier. If I submit a program first, and then you submit a program second, then it would work for me to submit a program that says, “Run your program, cooperate if it cooperates, defect if your program defects”, and then you would be incentivized to cooperate. But if both players play simultaneously, infinite loop, so it kind of doesn’t work.
Daniel Filan (00:54:58): If we had reflective oracles, then it could work, depending on the reflective oracle. But that’s a whole other bag of worms.
Caspar Oesterheld (00:55:03): Yeah, I guess reflective oracles… Yeah, I probably shouldn’t get into it. But it’s another model that maybe is a little bit in between the proof-based stuff and the simulation stuff.
Daniel Filan (00:55:18): At any rate.
Caspar Oesterheld (00:55:19): Yeah. It turns out there’s a very simple fix to this issue, which is that instead of just always running the opponent and cooperating if and only if they cooperate, you can avoid the infinite loop by just cooperating with epsilon probability, and only if this epsilon probability clause doesn’t trigger, only then do you run the other program. So your program is just: flip a very biased coin—epsilon is a small number, right? You check whether some low probability event happens. If it does, you just cooperate without even looking at the opponent program. Otherwise, you do simulate the other program and you copy whatever they do. You cooperate if they cooperate, defect if they defect.
(00:56:23): The idea is that, basically, it’s the same intuition as “just simulate the opponent, and do this instantaneous tit-for-tat”. Except that now, you don’t run into this running for infinitely long issue, because it might take a while, but eventually, you’re going to hit these epsilon clauses. If we both submit this program, then probably, there’s some chance that I’m immediately cooperating, but most likely, I’m going to call your program which might then also immediately cooperate. Most likely, it’s going to call my program again, and so on. But at each point, we have a probability epsilon of halting, and with probability one will eventually halt.
Daniel Filan (00:57:16): This is a special case of this general construction you have in the paper, right?
Caspar Oesterheld (00:57:26): Yeah. This is for the prisoner’s dilemma in particular, where you have these two actions that happen to be cooperate and defect. In general, there are two things that you can specify here, like you specify what happens with the epsilon probability, then the other thing that you specify is what happens if you simulate the other player, you get some action out of the simulation, and now you need to react to this in some way.
(00:57:57): The paper draws this connection between these ϵGroundedπBots, as they’re called, and repeated games where you can only see the opponent’s last move. It’s similar to that, where: okay, maybe this epsilon clause where you don’t look at your opponent is kind of like playing the first round where you haven’t seen anything of your opponent yet. I guess, in the prisoner’s dilemma, there’s this well-known tit for tat strategy which says: you should cooperate in the beginning, and then at each point, you should look at the opponent’s last move, and copy it, cooperate if they cooperate. But in general, you could have these myopic strategies for these repeated games where you do something in the beginning, and then at each point, you look at the opponent’s last move, and you react to it in some way. Maybe do something that’s equally cooperative or maybe something that’s very slightly more cooperative to slowly get towards cooperative outcomes or something like that. You could have these strategies for repeated games. You can turn any of these strategies into programs for the program game.
Daniel Filan (00:59:21): One thing that I just noticed about this space of strategies, this is strategies that only look at your opponent’s last action, right?
Caspar Oesterheld (00:59:29): Yes.
Daniel Filan (00:59:29): In particular, there’s this other thing you can do which is called win-stay, lose-switch, where if you cooperated against me, then I just do whatever I did last time. If you defected against me, then I do the opposite of what I did last time. It seems like this is another thing that your next paper is going to fix. But in this strategy, it seems like I can’t do this, right?
Caspar Oesterheld (00:59:58): Yes. Yeah, it’s really very restrictive. Most of the time, you’re going to see one action of the opponent, you have to react to that somehow, and that’s it.
Daniel Filan (01:00:13): Yeah. But it’s this nice idea. It’s basically this connection between: if you can have a good iterated strategy, then you can write a good computer program to play this mutually transparent program game, right?
Caspar Oesterheld (01:00:28): Yeah.
Daniel Filan (01:00:29): How much do we know about good iterated strategies?
Caspar Oesterheld (01:00:34): That is a good question. For the iterated prisoner’s dilemma, there’s a lot about this. There are a lot of these tournaments for the iterated prisoner’s dilemma. I’m not sure how much there is for other games, actually. Yeah, you might have iterated stag hunt or something like that? I guess, maybe for a lot of the other ones, it’s too easy or so.
(01:01:03): There’s some literature. You can check the paper. There are various notions that people have looked at, like exploitability of various strategies, which is how much more utility can the other player get than me if I play the strategy? For example, tit for tat, if the opponent always defects, you’re going to get slightly lower utility than them because in the first round, you cooperate, and then they defect. Then in all subsequent rounds, both players defect. It’s very slightly exploitable, but not very much.
(01:01:45): These notions that have been studied, and in my paper, I transfer these notions… If you take a strategy for the iterated prisoner’s dilemma, or for any repeated game, it has some amount of exploitability, and the analogous ϵGroundedπBot strategy has the same amount of exploitability. This is also an interesting question in general. How much qualitatively different stuff is there even in this purely ϵGroundedπBot space? If all you can do is look at the one action of the opponent and react to this action, how much more can you even do than things that are kind of like this sort of tit-for-tat…? Like I mentioned, in more complex games maybe you want to be slightly more cooperative… I don’t know. After a bunch of simulations you eventually become very cooperative or something like that.
Daniel Filan (01:02:52): Okay. I have a theory. In my head I’m thinking: okay, what’s the general version of this? And I can think of two ways that you can generalize, right? Here’s what I’m imagining you should do, in general. Okay. You have a game, right? First you think about: okay, what’s the good equilibrium of this game, right? And then what do I want to do if the other person doesn’t play ball? It seems like there are two things I could do if the other person doesn’t join me in the good equilibrium. Firstly, I could do something to try and punish them. And secondly, I can do something that will make me be okay, be good enough no matter what they do. I don’t exactly know how you formalize these, but my guess is that you can formalize something like these. And my guess is that these will look different, right?
(01:03:43): You can imagine saying, “Okay, with epsilon probability, I do my part to be in the good equilibrium, and then the rest of the time I simulate what the other person does. If they play in the good equilibrium I play in the good equilibrium. If they don’t play in the good equilibrium then, depending on what I decided earlier, I’m either going to punish them or I’m going to do a thing that’s fine for me”. Or you can imagine that I randomize between those. Maybe there’s some “best of both worlds” thing with randomizing. I don’t exactly know. Do you have a take on that?
Caspar Oesterheld (01:04:14): I mean, there’s at least one other thing you can do, right, which is try to be slightly more cooperative than them in the hope that you just-
Daniel Filan (01:04:26): Right.
Caspar Oesterheld (01:04:31): Imagine the repeated game, right? At any given point you might want to try to be a bit more cooperative in the hope that the other person will figure this out, that this is what’s going on, and that you’re always going to be a little bit more cooperative than them. And that this will lead you to the good equilibrium or to a better equilibrium than what you can get if you just punish. I mean, punish usually means you do something that you wouldn’t really want to do, you just do it to incentivize the other player. Or even the “okay, well, you’re going to go and do whatever but I’m just going to do something that makes me okay”.
Daniel Filan (01:05:15): So is the “be more cooperative than the other person” thing… I feel like that’s already part of the strategy. Okay, so here’s the thing I could do. With epsilon probability, do the good equilibrium, then simulate what the opponent does. If they do the good thing, if they’re in the good equilibrium, then I join the good equilibrium. If they don’t join the good equilibrium, then with epsilon probability I be part of the good equilibrium, and then otherwise I do my other action. With epsilon probability for being slightly more cooperative, you could have just folded that into the initial probability, right?
Caspar Oesterheld (01:05:51): Right. The difference is you can be epsilon more cooperative in a deterministic way, right? With this epsilon probability thing, some of the time you play the equilibrium that you would like to play. This alternative proposal is that you always become slightly more cooperative, which is… I’m not sure how these things play out. I would imagine that for characterizing what the equilibria are probably all you need is actually the punishment version. But I would imagine that if you want to play some kind of robust strategy you would sometimes move into a slightly more cooperative direction or something like that.
(01:06:51): You could have all of these games where there are lots of ways to cooperate and they sort of vary in how they distribute the gains from trade or something like that, right? Then there’s a question of what exactly happens if your opponent is… They play something that’s kind of cooperative but sort of in a way that’s a little bit biased towards them. I guess maybe you would view this as just a form of punishment if you then say, “Well, I’m going to stay somewhat cooperative but I’m going to punish them enough to make this not worthwhile for them” or something like that.
Daniel Filan (01:07:33): If there’s different cooperative actions that are more or less cooperative then it definitely makes sense. At the very least I think there are at least two strategies in this space. I don’t know if both of them are equilibria to be fair.
Daniel Filan (01:07:46): Okay. There are a few things about this strategy that I’m interested in talking about. We’re both playing the same “tit-for-tat but in our heads” strategy, right? The time that it takes us to eventually output something is O(one/epsilon), right? On average, because with each round with epsilon probability we finish, and then it takes one on epsilon rounds for that to happen, right?
Caspar Oesterheld (01:08:31): Yeah, I think that’s roughly right. I mean, it’s a geometric series, right? I think it’s roughly one over epsilon.
Daniel Filan (01:08:40): It’s one minus epsilon over epsilon, which is very close to one over epsilon.
Caspar Oesterheld (01:08:42): Yes.
Daniel Filan (01:08:45): That strikes me as a little bit wasteful, right, in that… So the cool thing about the Löbian version was: the time it took me to figure out how to cooperate with myself was just the time it took to do the proof of Löb’s theorem no matter how… It was sort of this constant thing. Whereas with the epsilon version, the smaller the epsilon it is, the longer it seems to take for us. And we’re just going back and forth, right? We’re going back and forth and back and forth and back and forth. I have this intuition that there’s something wasteful there but I’m wondering if you agree with that.
Caspar Oesterheld (01:09:25): Yeah, I think it’s basically right. Especially if you have a very low epsilon, right, there’s a lot of just doing the same back-and-forth thing for a long time without getting anything out of it. One thing is that you could try to speed this up, right, if you… So let’s say I run your program, right? Instead of just running it in a naive way I could do some analysis first.
(01:10:11): If you have a compiler of a computer program, it might be able to do some optimizations. And so maybe I could analyze your program, analyze my program, and I could tell: okay, what’s going to happen here is that we’re going to do a bunch of nothing until this epsilon thing triggers. Really instead of doing this actually calling each other, we just need to sample the depth of simulations according to this geometric distribution, the distribution that you get from this halting with probability epsilon at each step. You could do this analysis, right? Especially if you expect that your opponent will be an ϵGroundedFairBot, you might explicitly put in your compiler or whatever something to check whether the opponent is this ϵGroundedFairBot. And if so, we don’t need to do this actually calling each other, we just need to sample the depth.
(01:11:26): In some sense, the computation that you need to do is sample the depth then sample from… Whoever halts at that point, sample from their ‘base’ distribution, their blind distribution. And then sort of propagate this through all of the function that both players have for taking a sample of the opponent’s strategy and generating a new action. If this is all very simple then… in principle, your compiler could say, for the ϵGroundedFairBot in particular—sorry, the ϵGroundedFairBot is the version for the prisoner’s dilemma. In principle, your compiler could directly see “okay, what’s going to happen here? Well, we’re going to sample from the geometric distribution, then ‘cooperate’ will be sampled, and then a bunch of identity functions will be applied to this”. So this is just cooperate without needing to do any actual running this, doing recursive calls by something something with a stack, and so on. Probably you don’t actually need any of this.
Daniel Filan (01:12:52): There’s something intuitively very compelling about: okay, if I can prove that the good thing happens or whatever, then do the proof-based thing. If I can’t prove anything then do the simulation stuff. It seems intuitively compelling. I imagine you probably want to do some checks if that works on the proof-based side, depending on the strategy you want to implement.
Caspar Oesterheld (01:13:15): I mean, the thing I’m proposing is not to have the proof fallback, but just that you… You always do the ϵGroundedFairBot thing, for example, or the ϵGroundedπBot. Instead of calling the opponent program in a naive way where you actually run everything, you throw it in this clever compiler that analyzes things in some way. And maybe this compiler can do some specific optimizations but it’s not a fully general proof searcher or anything like that.
Daniel Filan (01:13:52): I mean, it’s checking for some proofs, right?
Caspar Oesterheld (01:13:54): Yeah, it’s checking for some specific kinds of proofs… I mean, that’s how modern day compilers I assume work, right, is that they understand specific kinds of optimizations and they can make those but they don’t have a fully general proof search or anything like that.
Daniel Filan (01:14:15): Sorry. When you said that I was half listening and then half thinking about a different thing, which is: you could imagine ϵGroundedFairBot which is: first, if your source code is equal to mine, then cooperate. Else, if your source code is the version of ϵGroundedFairBot that doesn’t first do the proof search, then cooperate. Else, with probability epsilon cooperate, probability one minus epsilon, do what the other person does, right?
(01:14:41): So that particular version probably doesn’t actually get you that much because the other person added some spaces in their program. And then I’m like but you could do some proof stuff, insert it there. I guess there are a few possibilities here. But it does seem like something’s possible.
Caspar Oesterheld (01:15:06): These different kinds of ways of achieving this more robust program equilibrium, they are compatible with each other. If I do the ϵGroundedFairBot and you do the Löbian bot, they are going to cooperate with each other.
Daniel Filan (01:15:29): You’re sure?
Caspar Oesterheld (01:15:30): I’m pretty sure, yeah.
Daniel Filan (01:15:31): Okay. You’ve probably thought about this.
Caspar Oesterheld (01:15:32): I wrote a paper about it. It’s not a real paper, it’s sort of like a note on this. Maybe let’s take the simplest versions or whatever, we don’t need to go down the Löb’s theorem path again. Let’s take the simplest version which is just, can I prove “if I cooperate, you cooperate”, then cooperate. If you’re the Löbian bot and I’m the ϵGroundedFairBot, you can prove that if you cooperate I will cooperate, right? Well, I’m epsilon times…
Daniel Filan (01:16:13): Sorry. Can you say that without using “you” and “I”?
Caspar Oesterheld (01:16:15): Okay. Am I allowed to say “I submit a program that’s”-
Daniel Filan (01:16:20): Yes, you are.
Caspar Oesterheld (01:16:20): Okay. So I submit a program that is just the ϵGroundedFairBot, so with epsilon probability cooperate otherwise simulate you and do what do you do. And your program is: if it’s provable that “if this program cooperates, the other program cooperates”, then cooperate, and otherwise, defect. Okay. So let’s think about your program-
Daniel Filan (01:16:54): The proof-based one.
Caspar Oesterheld (01:16:55): The proof-based one. So your program will try to prove: if it cooperates, my program, ϵGroundedFairBot will cooperate.
Daniel Filan (01:17:09): Okay. So the proof-based program is trying to prove, “if proof-based program cooperates then sampling program cooperates”. And it will be able to prove that. I think the other implication is slightly trickier but maybe you only care about the first implication, or you care about it more.
Caspar Oesterheld (01:17:24): Sorry, what is the other implication?
Daniel Filan (01:17:25): That if the sampling-based program cooperates then the proof- based one will cooperate. Maybe that’s not so bad.
Caspar Oesterheld (01:17:34): But do you actually need this? The proof-based program, it will succeed in proving this implication, right, and it will, therefore, cooperate.
Daniel Filan (01:17:45): And that’s how it proves that it will do it in the other direction?
Caspar Oesterheld (01:17:48): I mean, that’s how one can then see that the ϵGroundedFairBot will also cooperate because it will… Well, with epsilon probability it cooperates anyway. And with the remaining probability it does whatever the proof-based thing does, which we’ve already established is to cooperate. Sorry, does this leave anything open?
Daniel Filan (01:18:03): I think I was just thinking about a silly version of the program where the proof-based thing is checking: can I prove that if my opponent will cooperate then I will cooperate? But I think you wouldn’t actually write this because it doesn’t make any sense.
Caspar Oesterheld (01:18:22): No. That seems harder though. I don’t know. Maybe if we think about it for two minutes we’ll figure it out. I think one wouldn’t submit this program.
Daniel Filan (01:18:32): I next want to ask a different question about this tit-for-tat-based bot. This bot is going to cooperate against CooperateBot, right, the bot that always plays cooperate? That seems pretty sad to me, right? I’m wondering how sad do you think that this is?
Caspar Oesterheld (01:18:53): I’m not sure how sad. Okay, I have two answers to this. The first is that I think it’s not so obvious how sad it is. And the second is that I think this is a relatively difficult problem to fix. On how sad is this: I don’t know. It sort of depends a little bit on what you expect your opponent to be, right? If you imagine that you’re this program, you’ve been written by Daniel, and you run around the world, and you face opponents. And most of the opponents are just inanimate objects that weren’t created by anyone for strategic purposes. And now you face the classic rock that says “cooperate” on it. It happens to be a rock that says “cooperate”, right? You don’t really want to cooperate against that.
(01:19:49): Here’s another possibility. We play this program equilibrium game, literally, and you submit your program, right? And you know that the opponent program is written by me, by Caspar, who probably thought about some strategic stuff, right? Okay, it could be that I just wrote a CooperateBot, right, and that you can now get away with defecting against it. But maybe you could also imagine that maybe there’s something funny going on. And so for example, one thing that could be going on is that I could… Here’s a pretty similar scheme for achieving cooperation in the program equilibrium game, which is based on not the programs themselves mixing but the players mixing over what programs to submit. And so I might-
Daniel Filan (01:20:39): Mixing meaning randomizing?
Caspar Oesterheld (01:20:40): Yeah, randomizing. Very good. So I might randomize between the program that just cooperates—the CooperateBot, the program that cooperates if and only if the opponent cooperates against CooperateBot—so it’s sort of a second-order CooperateBot, something like that. And then you can imagine how this goes on, right? Each of my programs is some hierarchy of programs that check that you cooperated against the one one lower [down] the list. In some sense this is similar to the ϵGroundedFairBot, I guess. You can look at my program and maybe I could just defect or something like that. But the problem is you might be in a simulation of the programs that are higher in the list. If I submit this distribution, you would still want to cooperate against my CooperateBot, of course. So that is one reason to want to cooperate against CooperateBot.
Daniel Filan (01:22:00): It suddenly means that it really matters which things in my environment I’m modeling as agents and which things in my environment that I’m modeling as non-agents, right? Because in my actual environment, I think there are many more non-agents than there are agents. So take this water bottle, right? Not only do I have to model it as a non-agent, but it seems like maybe I’ve also got to be modeling what are the other things it could have done if physics were different, right? It seems like if I have this sort of attitude towards the world a bunch of bad things are going to happen, right?
(01:22:43): And also, if I’m in a strategic setting with other agents that are trying to be strategic, I think you do actually want to be able to say things like “Hey, if I defected would you cooperate anyway? In that case, I’ll just defect. But if your cooperation is dependent on my cooperation then I’m going to cooperate”. It’s hard to do with this construction because I’m checking two things and that explodes into a big tree. But this seems to me like something that you do want to do in the program equilibrium world. I guess those are two things. I’m wondering what your takes are.
Caspar Oesterheld (01:23:29): Yeah, it would be nice to know how to do the: for this given opponent program, could my defecting make the opponent defect? I think a program that exploits CooperateBot and cooperates against itself in some robust way, I agree that this would be desirable. I guess we can say more about to what extent this is feasible. I think in some sense one does just have to form the beliefs about what the water bottle could have been and things like that. I guess with the water bottle—I don’t know, I mean, it’s sort of a weird example. But with the water bottle, I guess, you would have to think about: do you have a reason to believe that there’s someone who’s simulating what you do against the water bottle, and depending on that does something, right?
(01:24:37): In the strategic setting where you know that the opponent program is submitted by Caspar or by someone who knows a little bit about this literature, you just have a very high credence that if you face a CooperateBot probably something funny is going on, right?
(01:24:56): You have a high credence that there are some simulations being run of your program that check what your program does against various opponents. You have to optimize for that case much more than you optimize for the case where your opponent is just a CooperateBot. Whereas with a water bottle, you don’t really have this, right? I don’t know. Why would someone simulate like “okay, the water bottle could have been—”
Daniel Filan (01:25:22): I mean, people really did design this water bottle by thinking about how people would use it, right? I think I have a few thoughts there. Firstly, if I’m just naively like, “did people change how this water bottle would work depending on how other people would interact with it?” That’s just true. I mean, they didn’t get the water bottle itself to do that, so maybe that’s the thing I’m supposed to check for.
(01:25:46): It’s also true that if you go to real iterated, mutually transparent prisoner’s dilemmas, people do actually just write dumb programs in those. And it’s possible that okay, these are played for 10 bucks or something and that’s why people aren’t really trying. But in fact, some people are bad at writing these programs and you want to exploit those programs, right?
(01:26:22): And I also have this issue which is: it seems like then what’s going on is my overall program strategy or something is: first, check if I’m in a situation where I think the other program was designed to care about what I am going to do, then cooperate, otherwise defect. Maybe this is not so bad in the simulation setting. In the proof-based setting, this would be pretty bad, right, because now it’s much harder to prove nice things about me. In the simulation setting, it might just be fine as long as we’re really keeping everything the same. Maybe this is an advantage of the simulation setting, actually. I don’t really know.
Caspar Oesterheld (01:27:05): Sorry, I’m not sure I fully followed that.
Daniel Filan (01:27:08): Okay. I took your proposal to be: the thing you should do is you should figure out if you’re in a strategic setting where the other person is, basically, definitely not going to submit a CooperateBot. I’m imagining myself as the computer program. Maybe this is different to what you were saying. But I was imagining that the program was “check if the other computer program was plausibly strategically designed. Then-
Caspar Oesterheld (01:27:41): Yes.
Daniel Filan (01:27:42): If so then do ϵGroundedFairBot, otherwise do DefectBot. For example, one concern is different people write their programs to do this check in different ways and one of them ends up being wrong. Maybe this is not a huge issue. I don’t know. It feels like it adds complexity in a way that’s a little bit sad.
Caspar Oesterheld (01:28:06): I could imagine that, I guess, for the proof-based ones, the challenge is that they need to be able to prove about each other that they assess the… Whether they’re in a strategic situation, they need to assess this consistently or something like that.
Daniel Filan (01:28:23): Also, the more complicated your program is the harder it is for other people to prove stuff about you. One thing you want to do if you’re a proof-based program, in a world of proof-based programs, is be relatively easy to prove things about. Well, depending on how nice you think the other programs are, I guess.
Caspar Oesterheld (01:28:47): I mean, in practice I think, in the tournament, for various reasons, you should mostly try to exploit these CooperateBots, or these programs that are just written by people who have thought about it for 10 minutes or who just don’t understand the setting or something like that. You wouldn’t expect people to submit this cooperation bot hierarchy thing because there’s just other things to do, right? In some sense, there’s a higher prior on these kinds of programs.
(01:29:25): But you could imagine a version of the tournament setting where you’re told who wrote the opponent program, and then your program distinguishes between someone who has publications on program equilibrium wrote the opponent program, and then you think, okay, well, all kinds of funny stuff might be going on here. I might currently be simulated by something that tries to analyze me in some weird way so I need to think about that. Versus the opponent is written by someone who, I don’t know, I don’t wanna…
Daniel Filan (01:30:06): A naive podcaster.
Caspar Oesterheld (01:30:09): …by someone who just doesn’t know very much about the setting. And then maybe there you think: okay, most prior probability mass is on them just having screwed up somehow and that’s why their program is basically a CooperateBot. Probably in these tournaments I would imagine that, I don’t know, 30% of programs are just something that just fundamentally doesn’t work, it doesn’t do anything useful. It just checks whether the opponent has a particular string in the source code or something like that. And meanwhile very little probability mass [is] on these sophisticated schemes for “check whether the opponent cooperates against CooperateBot in a way that’s useful”.
(01:30:53): So we talked a little bit about to what extent it’s desirable to exploit CooperateBots. There’s then also the question of how exactly to do this. Here’s one more thing on this question of whether you need to know whether the opponent is part of the environment or strategic. You can think about the repeated prisoner’s dilemma, right? I mean, tit-for-tat, everyone agrees it’s a reasonable strategy. And tit-for-tat also cooperates against CooperateBot, right? And I would think there it’s analogous. Tit-for-tat is a reasonable strategy if you think that your opponent is quite strategic. The more you’re skeptical, the more you should… I don’t know, maybe you should just be DefectBot, right? Against your water bottle maybe you can be DefectBot. And then there’s some in-between area where you should do tit-for-tat, but maybe in round 20 you should try defecting to see what’s going on. And then if they defect you can maybe be pretty sure that they’re strategic.
Daniel Filan (01:32:20): It seems to me like the thing you want to do is you want to have randomized defection, then see if the opponent punishes you, and then otherwise do tit-for-tat. But also, be a little bit more forgiving than you otherwise would be in case other people are doing the same strategy.
Caspar Oesterheld (01:32:37): One difference between the settings is that you can try out different things more. Which I think also leads nicely to the other point which is: how exactly would you do this exploiting CooperateBots? I do think just a fundamental difficulty in the program equilibrium setting for exploiting CooperateBots is that it’s… Aside from little tricks, it’s difficult to tell whether the opponent is a CooperateBot in the relevant sense. Intuitively, what you want to know is: if I defected against my opponent, would they still cooperate? And if that’s the case, you would want to defect. But this is some weird counterfactual where you have all of these usual problems of conditioning on something that might be false and so you might get all kinds of weird complications.
(01:33:43): So, I think in comparison to the tit-for-tat case where… I mean, it’s not clear what exactly you would do, but maybe in some sense, against the given opponent, you can try out sometimes defecting, sometimes cooperating and seeing what happens. There’s less of that in the program game case because your one program, there’s some action that you play and maybe you can think if I played this other action… But it’s a weird… You run into these typical logical obstacles.
Daniel Filan (01:34:26): Although it feels like it might not be so bad. So, imagine I have this thing where I’m saying, “Okay, suppose I defected. Would you cooperate against a version of me that defected? If so, then I’m going to defect”. And in that case, it seems like my defection is going to show up in the cases in which you would cooperate and therefore, that counterfactual is not going to be logically impossible, right?
Caspar Oesterheld (01:34:57): Yeah, that’s a good point. So, I guess a very natural extension of (let’s say) these proof-based bots is: okay, what if you first try to prove, “if I defect, the opponent will cooperate”? This will defect against CooperateBots, which is good. The question is whether this will still… What does this do against itself? This will still cooperate against itself, right?
Daniel Filan (01:35:30): Yeah. Because if I’m asking, “will you cooperate if I defect?” The answer is no, if I’m playing myself, because I always have to do the same thing as myself because I’m me.
Caspar Oesterheld (01:35:40): Yeah, maybe this just works.
Daniel Filan (01:35:42): I bet there must be some paper that’s checked this.
Caspar Oesterheld (01:35:49): Yeah, I’m now also trying to remember. Because one of these proof-based papers, they do consider this PrudentBot, which does something much more hacky: it tries to prove (and there’s some logic details here)—it tries to prove that… (Okay, there there’s one issue with the program that you just described that I just remembered, but let’s go to PrudentBot first). So, PrudentBot just checks whether you would cooperate against DefectBot. And then, if you cooperate against DefectBot, I can defect against you.
(01:36:39): I don’t know. To me, this is a little bit… It’s natural to assume that if the opponent cooperates against DefectBot, they’re just non-strategic. They haven’t figured out what’s going on and you can defect against them. But in some sense, this is quite different from this “does my defection make the opponent defect?” or something like that.
Daniel Filan (01:37:03): Yeah, it’s both the wrong counterfactual and it’s a little bit less strategic, right?
Caspar Oesterheld (01:37:09): Yes. The things that I’m aware of that people have talked about are more like this, where they check these relatively basic conditions. You can view them as checking for specific kinds of CooperateBots. I guess another thing you can do is for the ϵGroundedFairBots, just add in the beginning a condition [that] if the opponent is just a CooperateBot, or if the opponent never looks at the opponent’s source code at all, then you can defect against them. You can add these sorts of things. And I think from the perspective of winning a tournament, you should think a lot about a lot of these sorts of conditions and try to exploit them to defect against as many of these players as possible. But it’s not really satisfying. It feels like a trick or some hacky thing, whereas the thing you proposed seems more principled.
(01:38:09): Okay. Now, on this thing, I could imagine one issue is that: when this program faces itself, it first needs to prove… So, one problem is always that sometimes, to analyze opponent programs, you need to prove that some provability condition doesn’t trigger. And the problem is that just from the fact that you think this condition is false, you can’t infer that it’s not provable because of incompleteness. So, I could imagine that I can’t prove that your program doesn’t just falsely prove that your program can safely defect against me because you might think, well… When I prove things, I don’t know whether Peano arithmetic or whatever proof system we use is consistent.
(01:39:27): And so there’s always a possibility that every provability condition triggers, which means that I don’t know whether your first condition triggers. Actually for this PrudentBot, this also arises. If I am this PrudentBot, as part of my analysis of your program, I try to prove that you would defect or cooperate or whatever. I try to prove something about what you would do against DefectBot. And for that, if (let’s say) you’re just some more basic Löbian FairBot-type structure, then in my analysis of your program, I need to conclude that your clause “if I cooperate, the opponent cooperates” or your clause “if I can prove that the opponent cooperates”… I need to conclude that this won’t trigger. To prove that you don’t cooperate against DefectBot, I need to conclude that you won’t falsely prove that DefectBot will cooperate against you.
(01:40:48): And this, I can’t prove in Peano arithmetic or in the same proof system that you use. So, what they actually do for the PrudentBot is that I need to consider… They call it PA+1. I don’t know how widely this is used. I need to consider Peano arithmetic or whatever proof system they use, plus the assumption that that proof system is consistent, which gives rise to a new proof system which can then prove that your “if” condition is not going to trigger. So, this is some general obstacle.
Daniel Filan (01:41:28): Right. And we’ve got coordinate on what proof systems we use then, because if I accidentally use a too-strong proof system, then you have difficulty proving things about me. And I guess also, this thing about, “well, if I defected, would you still cooperate with me?” It feels a little bit hard to… In the proof-based setting, I can say, “if my program or your program outputted defect, would your program or my program output cooperate?” I could just do that conditional or whatever.
(01:42:04): If I want to do this in a simulation-based setting—which I think there are reasons to want to do. Sometimes, you just can’t prove things about other people and you have to just simulate them. And it’s nice because it’s moving a bit beyond strict computer programs. It’s also nice because maybe it’s hard to prove things about neural networks, which was one of the motivations—but I don’t even know what the condition is supposed to be in that setting. Maybe if we’re stochastic programs, I could say: maybe I could do a conditional on “this stochastic program outputs defect”. But it’s not even clear that that’s the right thing because you’re looking at my program, you’re not looking at the output of my program.
Caspar Oesterheld (01:42:52): Yeah. Though you can have programs that do things like “if the opponent cooperates with probability at least such and such…” I think one can make those kinds of things well-defined at least.
Daniel Filan (01:43:05): Yeah. But somehow, what I want to say is “if you cooperate with high probability against a version of me that defects…”, you know what I mean? Either you’re simulating just a different program or you’re simulating me and I don’t know how to specify you’re simulating a version of me that defects. You know what I mean?
Caspar Oesterheld (01:43:28): Yeah. I agree that that’s-
Daniel Filan (01:43:32): In some special cases, maybe I could run you and if I know what location in memory you’re storing the output of me, I can intervene on that location of memory, but (a) this is very hacky and (b) I’m not convinced that this is even the right way to do it.
Caspar Oesterheld (01:43:46): Yeah, I guess there are various settings where you constrain the way that programs access each other that would allow more of these counterfactuals. For example, you could consider pure simulation games where you don’t get access to the other player’s source code, but you can run the other player’s source code. And I guess in those cases, some of these counterfactuals become a bit more straightforwardly well-defined, that you can just… What if I just replace every instance of your calls to me with some action? I mean, there are some papers that consider this more pure simulation-based setting as well, but obviously that would not allow for proof-based stuff and things like that.
Daniel Filan (01:44:43): So, I think at this point, I want to tee up your next paper. So, in particular in this paper, there are two types of strategies that you can’t turn into the program equilibrium setting. So, I think we already discussed win-stay lose-switch, where I have to look at what you did in the last round, and I also have to look at what I did in the last round. There’s also this strategy in the iterated prisoner’s dilemma called a grim trigger where if you’ve ever defected in the past, then I’ll start defecting against you. And if you’ve always cooperated, then I’ll cooperate. And neither of these, you can have in your ϵGroundedFairBots. Why is that?
Caspar Oesterheld (01:45:24): Yeah. Basically, the main constraint on these ϵGroundedFairBots or πBots or whatever is that they just can’t run that many simulations. You can run one simulation with high probability or something like that. Maybe with low probability, you can maybe start two simulations or something like that. But the problem is just as soon as you simulate the opponent and yourself or multiple things and with high probability, you run into these infinite loop issues again that this epsilon condition avoids. Another case is if you have more than two players, things become weird. Let’s say you have three players. Intuitively, you would want to simulate both opponents, and then, if they both cooperate, you cooperate. If one of them defects, then maybe you want to just play the special punishment action against them depending on what the game is. But you can’t simulate both opponents. Because if every time you’re called, [you] start two new simulations or even two minus epsilon or something like that in expectation, you get this tree of simulations that just expands and occasionally some simulation path dies off, but it multiplies faster than simulations halt.
Daniel Filan (01:46:55): Right. Yeah. Basically, when you grow, you’re doubling, but you cut off factor of epsilon, but epsilon is smaller than a half. And therefore, you grow more than you shrink and it’s really bad. And if epsilon is greater than a half, then you’re not really simulating much, are you?
Caspar Oesterheld (01:47:11): Yeah.
Daniel Filan (01:47:12): So, how do we fix it?
Caspar Oesterheld (01:47:13): Okay, so we have this newer paper, where I’m fortunate to be the second author, and the first author’s Emery Cooper, and then Vince Conitzer, my PhD advisor, is also on the paper. And so, this fixes exactly these issues. And I think it’s a clever, interesting idea. So, to explain this idea, we need to imagine that the way that programs randomize works a particular way. The architecture of the programming language has to be a particular way to explain this. If you have a normal programming language, you call random.random() or some such function and you get a random number out of it.
(01:48:10): But another way to model randomization is that you imagine that at the beginning of time or when your program is first called, it gets as input an infinite string of random variables that are rolled out once in the beginning, and then, you have this long string of… It could be (for example) bits, and all you’re going to do is use the bits from this input. And so, in some sense, this is a way of modeling randomization with a deterministic program. In some sense, randomization is like running a deterministic program on an input that is random. As part of your input, you get this random string. And so, specifically, let’s imagine that you get these as a random string as input, but each entry is just a random number between zero and one.
(01:49:06): The way that these infinite simulation issues are fixed is that when I run, for example, my two opponents and myself, I pass them all the same random input string and that way, I coordinate how they halt or at what point they halt. Very specifically, here’s how it works. So, let’s maybe first consider a version where the issue is just that you have multiple opponents, but you’re still doing something like ϵGroundedFairBot where you’re happy to look just at the last round. Or maybe win-stay lose-[switch], where you maybe also look at your own previous action.
(01:49:59): So, what you do is you look at your random input string, and if the first number is below epsilon, then you just immediately halt as usual by just outputting something. And otherwise, you remove the first thing from this infinite random input string. And then, you call all of these simulations. You simulate both opponents. Let’s say you have two opponents and yourself, just with the first entry in that list removed. And now, okay, how does this help? Well, I mean the opponents might do the same, right? Let’s say they also all check the first thing, check whether it’s smaller than epsilon, and then remove the first and call recursively.
(01:50:55): Well, the trick is that by all of them having the same input string, they all halt at the same point. All your simulations are going to halt once they reach the specific item in this input string—the first item in this input string that is smaller than epsilon. And so, that allows for simulating multiple opponents. You can simulate yourself of course, and you can also simulate multiple past time steps by, instead of passing them just the input string with the first thing removed, you can also check what did they do, in some intuitive sense, ‘two time steps ago’ by removing the first two random variables from the input string and passing that into them. So, this is the basic scheme for making sure that these simulations all halt despite having a bunch of them.
Daniel Filan (01:52:04): My understanding is that you have two constructions in particular. There’s this correlated one and this uncorrelated one. Can you give us a feel for what the difference is between those?
Caspar Oesterheld (01:52:15): Yeah. So, there are differences in the setting. So, the correlated one is one where you get a correlated, or you get a shared random input sequence. So you could imagine that there’s some central party that generates some sequence of random numbers and it just gives the sequence of random numbers to all the players. So, they have this same random sequence—and then, maybe additionally, they have a private one as well—but they have this shared random sequence. And then, in this shared setting, basically all the results are much nicer. Basically, we get nice results in the shared randomness setting, and mostly more complicated, weird results—or in some cases, we also just can’t characterize what’s going on—in the non-correlated case.
(01:53:16): But in the correlated case, we specifically propose to use the correlated randomness to do these recursive calls. So, when I call my three opponents or two opponents and myself on the last round, I take the shared sequence of random numbers. I remove the first and call the opponents with that, with the remaining one rather than using the private one. And then, in the case where there is no shared randomness, we just use the private randomness instead. So, in some sense, it’s almost the same program. I mean, there’s some subtleties, but in some sense it’s the same program. And the main difference is that, well, you feed them this randomness that’s-
Daniel Filan (01:54:12): You’re giving the other person your private randomness, right?
Caspar Oesterheld (01:54:14): Yeah. I’m giving… yeah, I don’t have access to their randomness. I have to give them my randomness, which also, maybe it’s not that hard to see that you get somewhat chaotic outputs. In some sense, my prediction of what the opponent will do is quite different from what they’re actually going to do because they might have very different input.
Daniel Filan (01:54:44): Right. In some ways it’s an interesting… It’s maybe more realistic that I get to sample from the distribution of what you do, but I don’t get to know exactly what you will actually do. Actually, maybe this is just me restating that I believe in private randomness more than I believe in public randomness.
(01:55:03): So, okay, here’s a thing that I believe about this scheme that strikes me as kind of sad. It seems like, basically, you’re going to use this scheme to come up with things like these ϵGroundedFairBots and they’re going to cooperate with each other. But reading the paper, it seemed like what kind of had to happen is that all the agents involved had to use the same sort of time step scheme, at least in the construction. It’s like, “Oh, yeah, everyone has this shared sequence of public randomness, so they’re both waiting until the random number is less than epsilon and at that point they terminate”.
(01:55:56): So, I guess I’m seeing this as: okay, in the real world we do have public sources of randomness, but there are a lot of them. It’s not obvious which ones they use. It’s not obvious how to turn them into “is it less than epsilon?” or… So, it seems really sad if the good properties of this have to come from coordination on the scheme of “we’re going to do the time steps and we’re going to do it like this”. But I guess I’m not sure. How much coordination is really required for this to work out well?
Caspar Oesterheld (01:56:30): Yeah, that is a good question. Yeah, I do think that this is a price that one pays relative to the original ϵGroundedπBots, which obviously don’t have these issues. I think it’s a little bit complicated how robust this is exactly. So, the results that we have… We have this folk theorem about what equilibria can be achieved in the shared randomness case by these kinds of programs. And it’s the same as for repeated games, also the same as for these syntactic comparison-based ones. So, everything that’s better for everyone than their minimax payoff, the payoff that they got if everyone else punished them. And I guess the fact that it’s equilibrium obviously means that it’s robust to all kinds of deviations, but getting the equilibrium payoff, that requires coordination on these random things.
(01:57:43): Also, another thing is that—maybe this is already been implicit or explicit in the language I’ve used—with these times steps, there’s a close relation between this and repeated games. Now, it’s really just full repeated game strategies. And this whole relation to repeated games hinges on everyone using basically exactly the same time step scheme. Basically, if everyone uses the same epsilon and if the same source of randomness is below this epsilon, then in some sense, it’s all exactly playing a repeated game with a probability of epsilon of terminating at each point. And there’s a very nice correspondence. So, some of the results do really fully hinge on really exact coordination on all of these things. But also, there’s some robustness still.
(01:58:42): So, for example, the programs still halt if someone chooses a slightly different epsilon. If someone chooses a different epsilon, the relationship to repeated games sort of goes away. It’s hard to think of a version to play a repeated game where everyone has their separate cutoff probability. I don’t know. Maybe one can somehow make sense of this, but it does become different from that. But let’s say I choose an epsilon that’s slightly lower. Well, we’re still going to halt at the point where we find a point in this random sequence where it’s below everyone’s epsilon. So, people choosing slightly different epsilons, it becomes harder for us to say what’s going on, we can’t view it as a repeated game anymore, but it still works. It’s not like everything immediately breaks in terms of everything not halting or something like that.
Daniel Filan (01:59:54): Yeah. Or even if I’m using one public random sequence and you’re using another, even if it’s uncorrelated, it seems like as long as I eventually halt and you eventually halt, it’s not going to be too bad.
Caspar Oesterheld (02:00:06): In particular, we’re going to halt at the point where both of our sequences have the halting signal, right?
[Note from Caspar: At least given my current interpretation of what you say here, my answer is wrong. What actually happens is that we’re just back in the uncorrelated case. Basically my simulations will be a simulated repeated game in which everything is correlated because I feed you my random sequence and your simulations will be a repeated game where everything is correlated. Halting works the same as usual. But of course what we end up actually playing will be uncorrelated. We discuss something like this later in the episode.]
Daniel Filan (02:00:14): Yeah. I guess, it depends a little bit on what our policies are, but it seems like as long as I’m not super specific about what exact sequence of cooperates and defects I’m sensitive to, maybe it’ll just be fine even if we’re not super tightly coordinated.
Caspar Oesterheld (02:00:41): Yeah, I guess here again, [to try] to import our intuitions from repeated games, that I guess there’s a game theoretic literature about, and that we maybe also have experience [of] from daily life: in practice, if you play a repeated game, you’re not going to play an equilibrium, you’re going to play something where you do something that’s trying to go for some compromise. Maybe the other player goes for some other compromise, and then, you try to punish them a little bit or something like that. And I would imagine that there’s a lot of this going on in this setting as well.
Daniel Filan (02:01:22): Yeah, yeah. Okay. I think I may be a little bit less concerned about the degree of coordination required. So, there are two other things about this paper that seem pretty interesting. So, the first is just what the limitations on the equilibria you can reach are. And my understanding is that you can characterize them decently in the correlated case, but it’s pretty hard to characterize them in the uncorrelated case or-
Caspar Oesterheld (02:01:53): Yeah.
Daniel Filan (02:01:54): Can you explain to me and my listeners just what’s going on here?
Caspar Oesterheld (02:01:58): Yeah, so in the correlated case, it really is quite simple. As always, there are some subtleties. You need to specify, for example, what exactly are you going to do if you simulate some other player and they use their private signal of randomness, which they’re not supposed to do in some sense. Well, you need to somehow punish them and the people above you need to figure out that this is what’s going on. So, there’s some of these sorts of subtleties. But I think basically, there is just a very close relationship between these programs and the repeated game case. So, it is just basically like playing the repeated case and even deviation strategies, you can view as playing the repeated game by saying: well, if they get this random string as inputs that has 10 variables left until they get to the below epsilon case, then you can view this as them playing a particular strategy at time step 10.
Daniel Filan (02:03:03): Hang on. What do they do if they access randomness? So, my recollection, which might be wrong, was that you punish people for accessing other people’s private randomness, but I thought they could still access their private randomness.
Caspar Oesterheld (02:03:18): I think you do have to punish people for using their private randomness. And then, the complicated thing is that I might simulate you and you might simulate a third party and the third party uses their private randomness and now you, as a result, punish them. And then, I now need to figure out that you are just punishing them because they used their private randomness.
Daniel Filan (02:03:46): And you’re now punishing me.
Caspar Oesterheld (02:03:47): I don’t know.
Daniel Filan (02:03:50): That condition seems hard to coordinate on, right? Because naively, you might’ve [thought], well, it’s my private randomness. It’s my choice.
Caspar Oesterheld (02:03:56): Oh, the condition to punish private randomness?
Daniel Filan (02:04:00): Yeah.
Caspar Oesterheld (02:04:00): Yeah. I think this is a reasonable point. Maybe one should think about ways to make this more robust to this. I guess one has to think about what exactly the game is, and how much harm the private randomness can do. In some cases, it doesn’t really help you to do your own private randomness, and then maybe I don’t need to punish you for it.
(02:04:24): But if there are 20 resources and you can steal them, and you’re going to randomize which one you steal from, and the only way for us to defend against this is by catching you at the specific resource or something like that, then maybe we do just need to think: okay, as soon as there’s some randomness going on, it’s a little bit fishy.
(02:04:48): But yeah, you could imagine games where you want to allow some people to randomize privately or use their private randomness for, I don’t know, choosing their password. Maybe this is sort of a fun example. At time step 3, you need to choose a password. And in principle, the way our scheme would address this is that we all get to see your password, or in some sense we get to predict how you use your password. I mean it’s also still important to keep in mind that these past timesteps are things that don’t actually happen, so we predict what you would’ve chosen at timestep 3 if timestep 3 was the real timestep. But nonetheless, you might think, okay, if you have to choose your password with the public randomness, then we all know your password and doesn’t this mean that we all would want to log into your computer and steal your stuff? And the way the scheme would address this, I guess, is just that, well, someone could do that but they would then be punished for this.
Daniel Filan (02:05:59): Or maybe they do do it and it’s just like, “Well, that’s the equilibrium we picked. Sorry”.
Caspar Oesterheld (02:06:04): Right, right. It could also be part of the equilibrium. Yeah, that’s also true.
Daniel Filan (02:06:11): So in the correlated case, it’s basically: you have a folk theorem, and there’s something about things that you can punish people for deviating from. That’s basically the equilibria you can reach, roughly. And then I got to the bit of your paper that is about the equilibria you could reach in the uncorrelated game.
(02:06:39): And I am going to be honest… So earlier we had a recording where we were going to talk about these papers, but actually I got really bad sleep the night before I was supposed to read the papers, and so I didn’t really understand this “Characterising Simulation-based Program Equilibria” paper. It was beyond me. And this time, I had a good night’s sleep, I was rested, I was prepared, and I read this paper and then once I get to the limitations on the equilibria of the uncorrelated one, that’s where I gave up. The theorems did not make… I understood each of the symbols but I didn’t get what was going on.
(02:07:19): Is there a brief summary of what’s going on or is it just like, well we had to do some math and that turns out to be the condition that you end up needing?
Caspar Oesterheld (02:07:26): At least for the purpose of a very audio-focused format, I think probably one can’t go that much into the details of this. I think I want to explain a little bit why one doesn’t get a folk theorem in the uncorrelated case. I think there are some relatively intuitively accessible reasons for that.
(02:07:49): Okay, let’s start there. So the problem in the uncorrelated case is basically that: let’s take a three-player case. We are two players and there’s a third player, Alice. We want to implement some equilibrium and now there’s a question, can Alice profitably deviate from this equilibrium? And now the issue is Alice can use her private randomization in some ways. So the problem is basically that us catching her deviation is uncorrelated with her actually deviating in the actual situation. And additionally, whether I detect her deviating is uncorrelated with you detecting her deviating.
(02:08:58): And this all makes punishing, especially punishing low probability deviations very difficult. So for example, if Alice, with some small probability that she determines with her private randomness, she defects in some way, then in the real world, for her actual action that will determine her utility, there’s this small probability that she’ll defect. And then there’s some probability that our simulations of her—which we’re running a bunch of—there’s some probability that we’ll detect these. But because when I simulate Alice, I simulate her with a completely different random string than the string that Alice has in the real world, in some sense, I can’t really tell whether she’s actually going to deviate. And then also, you are going to simulate Alice also with your private randomness, which means that whether in your simulation Alice defects is also uncorrelated with whether she defects in my simulation.
Daniel Filan (02:10:07): Wait, first of all, I thought that even in the correlated case, whether she defects in simulation is different from whether she deviates in reality because we get rid of the first few random numbers and then run on the rest, right?
Caspar Oesterheld (02:10:24): Yeah, that is true.
Daniel Filan (02:10:28): The thing where we disagree, that seems important and different.
Caspar Oesterheld (02:10:33): So maybe that’s… I’m maybe also not sure about the other one now, but I think the other one is more straightforward. It might be that to punish her deviating, we both need to do with a particular thing and we just can’t… It’s a little bit complicated because you might think, well, we can simulate Alice for a lot of timesteps. So you might think that even if she defects with low probability, we are simulating her a bunch in some way.
(02:11:12): So they are some complications here. She needs to deviate in some relatively clever way to make sure that we can’t detect this with high probability. It is all a little bit complicated, but I think we can’t correlate our punishment or we can’t even correlate whether we punish. And so if the only way to get her to not defect is for both of us to at the same time do a particular action, that’s sort of difficult to get around.
Daniel Filan (02:11:49): Okay. All right, here’s a story I’m coming up based on some mishmash of what you were just saying and what I remember from the paper. We’re in a three-player game, therefore punishing actions… Firstly, they might require a joint action by us two and therefore, that’s one reason we need us to be correlated on what Alice actually did, at least in simulation.
(02:12:12): Another issue is: suppose I do something that’s not in the good equilibrium and you see me doing that, you need to know whether I did that because I’m punishing Alice or whether I was the first person to defect. And if I’m the first person to defect, then you should try punishing me. But if I’m just punishing Alice, then you shouldn’t punish me.
(02:12:34): And so if we in our heads see different versions of Alice, if you see me punishing, if you see me going away from the equilibrium, you need to know whether that’s because in my head I saw Alice defecting or if it’s because in my head I thought I want to defect because I’m evil or whatever. I don’t know if that’s right.
Caspar Oesterheld (02:12:58): Yeah. I’m not sure whether that is an issue because when I see you defecting, it is because I simulate you with my randomness as input. And then you see, with my randomness as input, Alice defecting one level down, which means that I… Remember that I’m simulating all of these past timesteps as determined by my randomness. So I think I can see whether the reason you defect in my simulation is that you saw Alice defect.
Daniel Filan (02:13:40): Wait, if we’re using the same randomness, then why isn’t it the case that we both see Alice defect at the same time with our same randomness?
Caspar Oesterheld (02:13:47): So I mean this is all my simulation of you rather than the real you.
Daniel Filan (02:13:55): So the real we’s might not coordinate on punishment?
Caspar Oesterheld (02:13:59): Yeah. I mean this is another thing that’s like: even with very basic ϵGroundedπBots, you can kind of imagine: in their head, they’re playing this tit-for-tat where it’s going back and forth. And one person does this based on their randomness and then the other person sees this and then responds in some particular way.
(02:14:19): But if you don’t have shared randomness, all of this, this is all complete fiction. You haven’t actually coordinated with the other player and seen back and forth. So it might be that I run this simulation where you detected Alice’s defecting and then I also defect on Alice, and then we are happily defecting on Alice. And in the simulation we’re thinking “we’re doing so well, we’re getting this Alice to regret what she does” and so on. But the problem is that you run a completely different simulation.
(02:14:52): So in your simulation of what Alice and I do, you might see everyone cooperating and everyone thinks, “oh, everything’s great, we’re all cooperating with each other”. And then we’ve done the simulation and now we are playing the actual game, and I defect thinking, “oh yeah, we are on the same team against Alice”. And then you think, “oh nice, we’re all cooperating” and you cooperate. And then we’re landing in this completely weird outcome that doesn’t really happen in the simulation, sort of unrelated to what happens in this…
Daniel Filan (02:15:23): Right. So Alice basically says, “Hey, I can get away with doing nasty stuff because they won’t both be able to tell that I’m doing the nasty stuff and therefore I won’t properly be punished in the real world”. And so these gnarly theorems: should I basically read them as: the preconditions are there’s some math thing and the math thing basically determines that this kind of thing can’t happen and those are the equilibria you can reach. Is that it?
Caspar Oesterheld (02:15:50): Yeah. So I think one thing that drives a lot of these characterizations is: Alice can defect with low probability. I think usually that’s the more problematic case, is that she defects in a particular kind of clever way with low probability, which means that we are very unlikely to both detect it at once. I think that is driving these results a lot.
(02:16:23): But to some extent… You said this earlier, there’s some math going on. I think to some extent that’s true. So I think one thing that I liked about these results, despite… I mean of course one always prefers results that are very clean and simple, like the folk theorem where you just have this very simple condition for what things are equilibria. And our characterizations are mostly these kind of complicated formulas.
(02:16:51): I think one thing I like is that for some of these characterizations, one can still hold onto this interpretation of there being timesteps and you simulate what people do at previous timesteps and things like that. Which, it’s sort of very intuitive that this works for the case where everyone plays nicely with each other and everything is correlated, and in some sense, we’re playing this mental repeated game where we all use the same randomness and so we are all playing the same repeated game, and really the thing that is sampled is “which round is the real round?” It’s clear that the timestep story works. And it’s nice there that there are some results where you can still use this timestep picture. So that’s one nice thing about the results. But yeah, it is unfortunately much more complicated.
Daniel Filan (02:17:49): Fair enough. So another part of the paper that is kind of cool and that you foregrounded earlier is it has this definition of simulationist programs. And so earlier, you mentioned there was a definition of fair programs or something: maybe you are referring to this definition.
Caspar Oesterheld (02:18:11): Yeah. In some sense, the paper has three parts: the one with the correlated case, with these generalized ϵGroundedπBots that pass on the shared random sequence. And then the uncorrelated case with the ϵGroundedFairBots. And then we also have a section that analyzes more general simulationist programs, which are programs that just… Intuitively all they do is run the opponent with themselves and the other players as input. And that has this definition. And then for those we have a characterization as well.
(02:18:55): For example, one result that we also show is that in general, general simulationist programs are more powerful at achieving equilibria in the uncorrelated case than the ϵGroundedπBots. I’m not quite sure how much to go into detail there, but one intuition that you can have is: in the ϵGroundedπBots, to some extent everyone has to do the same thing. Whereas you could have settings where only I need to do simulations and then if only I simulate your program, I can run 10,000 simulations or something like that.
(02:19:35): And this is something that obviously the ϵGroundedπBots can’t do. You can’t just independently sample a thousand responses from the other player. And we do have this definition of simulationist programs. I’m not sure I remember the details off the top of my head.
Daniel Filan (02:19:56): I think it’s some recursive thing of: a simulationist program is… it calls its opponent on a simulationist program, which maybe includes itself and maybe… I forgot whether it explicitly has ϵGroundedπBots as a base case or something. Maybe simulating nobody is the base case, or just ignoring the other person’s input.
Caspar Oesterheld (02:20:20): Yeah. That’s also coming back to me. I think it’s something like that. So the tricky part is that you might think that a simulationist program is just one that calls the other program with some other program as input. But then if you don’t constrain the programs that you give the other player as input, you can sort of smuggle this non-behaviorism back in by having “what does my opponent do against these syntactic comparison bots?” or something like that.
Daniel Filan (02:21:01): There’s a good appendix. It’s like “for why we do it this way, see this appendix”. And then you read the appendix and it’s like, “oh that’s pretty comprehensible”. It’s not one of these cases where the appendix is all the horrible…
Caspar Oesterheld (02:21:11): Yeah, glad to hear that you liked the appendix. Some of the appendix is also just very technical, like working out the details of characterization.
Daniel Filan (02:21:20): Oh yeah, I skipped those appendices. But there are some good appendices in this one.
Caspar Oesterheld (02:21:24): Nice.
Daniel Filan (02:21:24): All right, the next thing I want to ask is: what’s next in program equilibrium? What else do we need to know? What should enterprising listeners try and work on? Is there any work that’s… So earlier, I asked you about what was the state of the art before you published “Robust Program Equilibrium”. Is there any work coming out at around the same time which is also worth talking about and knowing a bit about the results of?
Caspar Oesterheld (02:21:57): I think, yeah, there are a bunch of different directions. So I do think that we still leave open various technical questions and there are also some kind of technical questions that are still open for these Löbian programs that it would be natural to answer.
(02:22:16): So one thing, for example, is that I would imagine that… Maybe sticking closely to our paper first, there are some very concrete open questions even listed in the paper. I’m not entirely sure, but I think in the two-player simulationist program case, it’s not clear whether, for example, all Pareto-optimal, better than minimax utility profiles can be achieved in simulationist program equilibria. So maybe this is not quite the right question, but you can check the paper. We have some characterizations for these uncorrelated cases. But I think for the general simulationist case, we don’t have a full characterization. So if you want go further down this path of this paper, there are a bunch of directions there that still have somewhat small holes to fill in.
(02:23:39): Then another very natural thing is that: I think for the Löbian bots, there isn’t a result showing that you can get the full folk theorem if you have access to shared randomness, which I am pretty sure is the case. I think probably with some mixing of this epsilon-grounded stuff and the Löbian proof-based stuff, I would imagine you can get basically a full folk theorem, but there’s no paper proving that. Maybe one day, I’ll do this myself. But I think that’s another very natural question to ask.
(02:24:19): So in my mind, going a bit further outside of what we’ve discussed so far, in practice, I would imagine that usually one doesn’t see the opponent’s full source code. And maybe it’s also even undesirable to see the source code for various reasons. You don’t want to release all your secrets. Maybe also… I mean, we talked about these folk theorems where everything that is better than this punishment outcome can be achieved. And I think game theorists often view this as sort of a positive result, whereas I have very mixed feelings about this because it’s kind of like, well, anything can happen, and in particular a lot of really bad outcomes can happen. Outcomes that are better than the best thing that I can achieve if everyone just punishes me maximally… Well, it’s not very good. There are lots of very bad things that people can do to me, so there are lots of equilibria where I get very low utility.
Daniel Filan (02:25:40): And in particular, if there are tons of these equilibria, the more equilibria there are, the less chance there is we coordinate with one. Right?
Caspar Oesterheld (02:25:49): Yeah. I guess maybe one positive thing is that… In the correlated case, you have this convex space of equilibria. So at least it’s like, well, you need to find yourself in this convex space rather than finding yourself between six discrete points. And so maybe that makes things easier.
(02:26:08): But yeah, I think basically I agree with this. I think on our last episode—this is my second appearance on AXRP, right? On the first episode on AXRP, we discussed this equilibrium selection problem, which I think is very important and motivates a bunch of my work. So maybe if you have less information about the other player, then you get fewer equilibria. Maybe in the extreme case, maybe if you get only very little information about the player, maybe you only get one additional equilibrium relative to the equilibria of the underlying game.
(02:26:53): And I think we discussed the similarity-based cooperation paper also on the previous episode, and that is basically such a setting. It’s basically a program equilibrium setting where you don’t get the full opponent source code, but you get some signal, in particular how similar the opponent is to you. And there are some results about how you get only good equilibria this way.
(02:27:23): I think in general, that’s sort of a natural direction to go in. Also, you can also do more practical things there. The similarity-based cooperation paper has some experiments. You can do experiments with language models where in some sense, this is sort of true. If my program is “I prompt a particular language model” and then you know my prompt but you don’t know all the weights of my language model, or maybe you can’t do very much with all the weights of my language model, that is a sort of partial information program equilibrium. So I think that is another natural direction.
(02:28:03): And then also, I think you drew these connections to decision theory, which is: in some sense, if you are the program and you have to reason about how you’re being simulated and people are looking at your code and stuff like that, how should you act in some kind of rational choice-type sense? That’s sort of the problem of decision theory. And in some ways, you could view this program equilibrium setting as sort of addressing these issues by taking this outside perspective. Instead of asking myself “what should I, as a program who’s being predicted and simulated and so on, what should I do?”, instead of that, I ask myself, “I’m this human player who’s outside the game and who can submit and write code, what is the best code to submit?”
(02:28:59): And in some sense, that makes the question less philosophical. I’m very interested in these more philosophical issues. And I feel like the connections here aren’t fully settled: what exactly does this “principal” perspective or this outside perspective correspond to from the person of the agent? Like you said, this equilibrium where everyone checks that they’re equal to the other player, that’s an equilibrium where the programs themselves aren’t rational. They don’t do expected utility maximization, they just do what their source code says. So I think this is much more philosophical, much more open-ended than these more technical question about what equilibria can you achieve. But I’m still very interested in those things as well.
Daniel Filan (02:29:49): So the final question I want to ask is: if people are interested in this work and in particular in your work, how should they find more?
Caspar Oesterheld (02:30:00): So I just have an academic website. Fortunately my name is relatively rare, so if you Google my name, you’ll find my academic website. You can also check my Google Scholar, which has a complete list of my work. I also have a blog where I occasionally post things somewhat related to these kinds of issues, which is just casparoesterheld.com, which in principle should allow subscribing to email notifications.
(02:30:39): And I also have an account on X, formerly Twitter, which is C_Oesterheld. Yeah, I think those are probably all the things.
Daniel Filan (02:30:51): Great. Cool. So there’ll be links to that in the transcript. Caspar, thanks very much for coming on the podcast.
Caspar Oesterheld (02:30:56): Thanks so much for having me.
Daniel Filan (02:30:57): This episode is edited by Kate Brunotts, and Amber Dawn Ace helped with the transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. You can become a patron yourself at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, if you have any feedback about the podcast, you can fill out a super short survey at axrp.fyi.
Daniel Filan (02:04:21): This episode is edited by Kate Brunotts, and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR.Labs, and the podcast is supported by patrons such as Alexey Malafeev. To read the transcript, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, you can leave your thoughts on this episode at axrp.fyi.
2026-02-18 08:19:32
Published on February 18, 2026 12:19 AM GMT
Epistemological status: these are speculative thoughts I had while trying to improve understanding. Not tested yet.
When you pull a door towards you, you predict it will move towards you in a particular way. You can visualize the movement in your mind's eye. Similarly, the door having been pulled, you can infer what caused it to end up there.
So, starting from a cause, you can predict its effects; starting from an effect, you can infer its cause.
Let's call that understanding. You instantiate a causal model in your mind, and you see how a change in part of the model affects the rest of the model, you also see what changes have to occur to get to a desired state of the model. The speed and accuracy with which you can predict effects or causes, as well as the number of changes you know how to propagate, the number of goal states you know how to reach, are the depth of your understanding.
Conversely, non-understanding would be not being able to visualize what happens when you pull the door, or not having any idea how the pulled door got to where it stands.
Say you don't know/understand what happens when you pull a door... Then you pull a door.
Now you understand what happens.
Why is that? Well, your brain has native hardware that understands cause-effects models on its own. You just need reality to shove the relationship in your face hard enough, and your brain will go "ok, seems legit. let's add it to our world-model".
Now let's consider a mathematical proof. You follow all the logical steps, and you agree, "sure enough, it's true". But you still feel like you don't really grok it. It's not intuitive, you're not seeing it.
What's going on? Well, this is still a brand-new truth, you haven't practiced it much so it has not become part of your native world-model. And unlike more native things, it is abstract. So even if you try to pattern match it to existing models to make it feel more real, more native through analogies, such as "oh, light behaves like waves", it doesn't work that well.
This usually naturally goes away the more you actually use the abstract concept: your brain starts to accept it as native and real, and eventually light behaves like light. It even feels like it always was that way.
Consider a complicated math equation. There are symbols you do not understand. However you do know this is math.
What's the algorithm to go from non-understanding it to understanding it?
Steps:
An important part of this understanding algorithm is to be meticulous about noticing what you don't understand. The issue is that there's probably a bunch of stuff you don't understand, and you have limited time, so you need to become real good at ranking. My hope is that with practice these 2 skills become second nature. And then at a glance you're able to see what pieces don't understand, and guesstimate the most important among these, as well as the cost of analyzing them, from which you can prioritize.
This approach has the huge added benefit of being very active, thus motivating. Keeping a tight feedback loop is probably a key point. Trying to understand by yourself before searching too. As for the search part, you might want to experiment with an LLM pre-prompt so it gives you a brief answer to any question you ask. Maybe a "no thinking" or even a local LLM is better to have short latency and tighten the feedback-loop.
The key principle behind this understanding algorithm is fairly simple: if you understand A causes B, and B causes C, and you are able to hold both of these statements in your mind, or have practiced them enough that they stand as a compressed second nature / pointer you can refer to compactly, or alternatively you can follow the step by step and accept the logical conclusion of the process without having to hold the whole chain in your mind; then you understand A causes C, to some degree.
The kind of causal understanding I talk about is just a big graph of cause-effects relationships. To understand the graph as a whole, you need to understand enough of the individual cause-effects relationships. To learn efficiently, you need to focus on understanding the cause-effects relationships that give you personally the most info for the least effort. And if you want to learn fast, you need to develop these noticing and prioritizing skills, to become good and fast at it.
I heard that not that many concepts are necessary to understand complex ideas, or complex proofs. That's encouraging. It may be that by perfecting this learning technique one could learn extremely fast, and stumble across new insights as well.
Performance on this task should be measured. How many seconds does it take to learn a concept on average? 1 concept every 5 minutes? can we tighten the loop, go lower? 1 concept a minute? one concept every 30 seconds? Maybe not, that would be 120 concepts/hour. Apparently, this many concepts an hour is wildly biologically implausible: the brain needs to consolidate memories, there can be interference issues etc. But investigating the limits sounds like fun anyway. Also, consider that the more concepts we learn the higher the probability that the brain will auto-unlock a bunch of related concepts, so who knows?
Learning this fast, what could one learn, and what could one do?
Reading books is like getting lots of answers to questions you don't ask.
The great thing is that you're getting lots of data very fast, as in you don't have to go through the steps of noticing what you don't understand and looking for answers. It also helps discovering unknown unknowns.
The bad thing is that the data may not be informative to you, like if you already know about some of it, or don't understand it and then have to perform the understanding algorithm on the book anyway. And also since you're not asking questions yourself, it can become boring, or tedious, and that sure doesn't help with absorbing data.
From that, I'd say engaging introductory books and documentaries on subjects you don't know, to get a feel for a field, are probably most efficient.
2026-02-18 08:13:54
Published on February 18, 2026 12:13 AM GMT
This week’s newsletter in a word: “velocity”. We’ll take a deep look at last week’s big models drops (just a few months after the previous big drops), and try to figure out if they’re reached High levels of dangerous capabilities. Nobody’s quite sure, because capabilities are outrunning evaluations.
We also check in on the country of geniuses in a data center (still 2028, according to Dario), contemplate what we should align AI to (assuming we can figure out how to align it to anything), and catch up on the Chinese AI industry.
Matt Shumer’s Something Big Is Happening has been making the rounds this week. It’s a great “you need to wake up” piece for anyone you know who doesn’t understand the magnitude of what’s happening right now.
But it’s time now. Not in an “eventually we should talk about this” way. In a “this is happening right now and I need you to understand it” way. [...]
The experience that tech workers have had over the past year, of watching AI go from “helpful tool” to “does my job better than I do”, is the experience everyone else is about to have. Law, finance, medicine, accounting, consulting, writing, design, analysis, customer service. Not in ten years. The people building these systems say one to five years. Some say less. And given what I’ve seen in just the last couple of months, I think “less” is more likely.
OpenAI has started showing ads in some tiers of ChatGPT. They’re fine for now, but I worry about where those incentives lead.
Opus 4.6 is a pretty big deal—it’s a substantial upgrade to Opus 4.5, which was probably already the best overall model (and which just shipped 2 months ago). Not surprisingly, Zvi has lots to say about it.
Claude Opus 4.6 Escalates Things Quickly. It’s a very good model.
System Card Part 1: Mundane Alignment + Model Welfare Key takeaways:
System Card Part 2: Frontier Alignment
I want to end on this note: We are not prepared. The models are absolutely in the range where they are starting to be plausibly dangerous. The evaluations Anthropic does will not consistently identify dangerous capabilities or propensities, and everyone else’s evaluations are substantially worse than those at Anthropic.
Does Zvi sleep? Nobody knows. ChatGPT-5.3-Codex is an excellent model, and this is a significant upgrade.
Intriguing: GPT‑5.3‑Codex‑Spark is a less capable version of Codex that can do more than 1,000 tokens / second, which is fast. Like, really fast. Sometimes you need maximum intelligence, but for many applications, model speed is an important rate limiter for productivity. A super-fast, good-enough model might be a game changer for many tasks.
Cursor have upgraded their Composer, their in-house agentic coding model, to version 1.5.
There’s a significant update to Gemini 3 Deep Think, focusing on science, research, and engineering. Simon Willison reports that it raises the bar for bicycle-riding pelicans.
Now that the frenzy over OpenClaw and Moltbook has died down, Steve Newman takes a look at what just happened (not all that much, actually) and what it means (a sneak peek at some aspects of the future).
Well, that didn’t take long. Peter Steinberger (the creator of OpenClaw) is joining OpenAI. OpenClaw will be moving to a foundation.
Two really good interviews with Dario this week:
AI is getting very good at almost everything, including complex cognitive tasks that require deep understanding and judgment. The Atlantic reports on AI forecasters at recent Metaculus tournaments ($):
Like other participants, the Mantic AI had to answer 60 questions by assigning probabilities to certain outcomes. The AI had to guess how the battle lines in Ukraine would shift. It had to pick the winner of the Tour de France and estimate Superman’s global box-office gross during its opening weekend. It had to say whether China would ban the export of a rare earth element, and predict whether a major hurricane would strike the Atlantic coast before September. […]
The AI placed eighth out of more than 500 entrants, a new record for a bot.
2025 was a wild year for timelines: exuberance early on, then a substantial lengthening in the middle of the year, and another round of exuberance at the end of the year. Rob Wiblin explores why those shifts happened, with insightful analysis of the underlying trends. It’s a great piece, though it largely ignores the most recent shift.
Much of the timelines discussion focuses on how long it takes to get to AGI, but Ajeya Cotra thinks takeoff speed is the most important crux (i.e., how fast we go from AGI to whatever happens next).
The AI-2027 team calculate that the rate of AI progress in 2025 was about 65% of what they predicted.
Andy Masley checks in on how well AI can draw hands.
Very often the question isn’t “how does AI let us do the usual thing cheaper”, but rather “what can we now do that wasn’t practical to do before?”. Nieman Lab reports on a slick tool at the New York Times:
When one of the shows publishes a new episode, the tool automatically downloads it, transcribes it, and summarizes the transcript. Every 24 hours the tool collates those summaries and generates a meta-summary with shared talking points and other notable daily trends. The final report is automatically emailed to journalists each morning at 8 a.m. ET.
There’s been some good discussion lately of what we should align AI to (which is separate from and almost as important as how to align it to anything at all).
Oliver Klingfjord believes integrity is a critical component:
Integrity isn’t everything in AI alignment. We want models with domain expertise, with good values, with the wisdom to enact them skillfully. Integrity doesn’t speak to the goodness of values. But it does speak to how deeply they run, how stable they are under pressure. It’s what lets us trust a model in situations we never anticipated.
Richard Ngo goes in a somewhat different direction, arguing for aligning to virtues.
I like that both Oliver and Richard emphasize the importance of generalizing well to unforeseen circumstances, which is a shortcoming of more deontological approaches like OpenAI’s.
In a convincing demonstration of AI’s ability to find vulnerabilities at scale, Anthropic uses Opus 4.6 to find more than 500 high-severity zero day vulnerabilities. The accomplishment is impressive, and the account of how it went about finding them is very interesting. If you’re wondering why both OpenAI and Anthropic believe they’re reaching High levels of cyber capabilities, this is why.
There is a fundamental tension between capability and security: technology that can do more will necessarily have a larger attack surface. OpenClaw was a great example of going all the way to one extreme, enabling an immense amount of cool capability by taking on a staggering level of risk. At the other end of the spectrum, OpenAI is rolling out Lockdown Mode for ChatGPT. Much like Lockdown Mode on the iPhone, this significantly reduces ChatGPT’s attack surface at the cost of significantly curtailing some useful capabilities. It’s meant for a small number of people who are at elevated risk of targeted cyberattacks.
This won’t come as a shock to anyone who’s felt the exhilaration (and compulsion) of having AI superpowers. Aruna Ranganathan and Xingqi Maggie Ye find that hours worked often increase when people get access to AI, with much of the pressure being self-imposed. Their analysis of the issue is great, but I’m less sold on their proposed solutions.
Adam Ozimek argues that concerns about AI’s impacts on jobs are overstated because many jobs require a human touch: we prefer to have humans do those jobs even though we already have the ability to automate them. It’s a good and thoughtful piece, but I think it largely misses the point. We haven’t automated supermarket cashiers not because people love interacting with human cashiers, but because the automated replacements aren’t yet good enough. That will change soon.
Dean is characteristically cautious about writing regulations before we understand what we’re regulating. He proposes a system of third-party safety audits (much like our existing system for auditing corporate finances), where certified private auditors perform regular inspections of whether AI developers are following their own safety guidelines.
Directly related to Dean’s piece, The Midas Project argues that when OpenAI released GPT-5.3-Codex, they appear to have violated California’s SB 53. Briefly: SB 53 takes a light touch to safety regulation, but requires that labs publish and adhere to a safety framework. Midas believes that OpenAI is treating GPT-5.3-Codex as having High capability in cybersecurity, but hasn’t activated the safeguards they said they would activate when that happened. OpenAI is pushing back—it’ll be interesting to see what California decides.
In the meantime, Steven Adler takes a detailed look.
If you live in the US, you likely aren’t aware of how well China is doing with electric vehicles and autonomous vehicles. ChinaTalk takes a deep look at autonomous vehicles, diving into deployments in both the US and China, how the international market is shaping up, and how the supply chain works.
Teortaxes argues that based on the WeirdML benchmark, the Chinese open models are falling further behind the frontier.
Poe Zhao at AI Frontiers looks at the very different economic environment facing AI companies in China (much less private investment, and much less consumer willingness to pay for AI). Those factors shape their strategic choices, driving a focus on international markets, and a heavy emphasis on inference cost in both model and hardware design.
One of the big surprises of the LLM era has been how strangely human-like AI can be. (The frequent occasions when it’s shockingly un-humanlike are perhaps stranger but less surprising). Kai Williams at Understanding AI explores character and personality in LLMs.
Zoë Hitzig has an opinion piece in the New York Times:
This week, OpenAI started testing ads on ChatGPT. I also resigned from the company after spending two years as a researcher helping to shape how A.I. models were built and priced, and guiding early safety policies before standards were set in stone.
I once believed I could help the people building A.I. get ahead of the problems it would create. This week confirmed my slow realization that OpenAI seems to have stopped asking the questions I’d joined to help answer.
Steve Yegge talked to a bunch of Anthropic employees and shares some thoughts about their unique culture.
Wow. Karpathy has built a complete GPT engine in 200 lines of code.
Really interesting paper on the importance of training compute relative to algorithmic improvements:
At the frontier, 80-90% of performance differences are explained by higher training compute, implying that scale--not proprietary technology--drives frontier advances.
Toby Ord has recently made a good case that reinforcement learning has scaling challenges that present a significant obstacle to continued rapid improvement in capabilities. Epoch’s JS Denain isn’t entirely convinced:
Toby’s discussion of RL scaling versus inference scaling is useful, and the core observation that RL gains come largely with longer chains of thought is well-taken. But the picture he paints may overstate how much of a bottleneck this will be for AI progress.
David Pinsof continues his excellent conversation with Dan Williams regarding human nature, the enlightenment, and evolutionary misfit. I love the way this conversation is happening, and I’m learning a lot from it: I’ve significantly updated some key beliefs I hold about how humans are not well evolved to handle the modern environment.
So my response to Dan might be something like, “Yea, maybe humans are kind of confused and maladapted sometimes, but it’s also really insightful to see humans as savvy animals strategically pursuing their Darwinian goals.” And Dan might say something like, “Yea, it’s pretty insightful to see humans as savvy animals strategically pursuing their Darwinian goals, but it’s also really important to recognize that humans are confused and maladapted sometimes.” It’s basically a disagreement over where to put the italics.
2026-02-18 06:50:57
Published on February 17, 2026 9:53 PM GMT
When do we say that a mathematical truth is an accurate account of a phenomenon we are interested in? Much in the same way that a video can be seen as a true representation of some key events, e.g, a wedding, a speech, or a robbery, mathematical artefacts can be viewed as representations of real-world phenomena (Things as we see them). These representations can then be used as relevant materials in other contexts, for instance, video evidence in court or mathematical theorems used in physics and engineering. With video (before the age of AI-generated video), a high degree of trust always came with video evidence of some activity. If I saw a video of a president giving a speech, then it could be said with a high degree of certainty that the speaker in the video indeed made certain statements or remarks. Another compelling example is the use of admissible video evidence in court, such as footage of someone breaking the law on camera. We have utilized institutions to determine the relevance of evidence in the context of the law. For instance, video evidence must meet certain criteria to be admissible in court, such as a chain of custody. We state these examples in order to motivate our discussion on the admissibility of mathematical evidence, particularly in the age of advanced AI systems that can output valid mathematical artifacts (or truths if you like). When is a piece of mathematics relevant? In the same way, we can ask when a video file is relevant in a court of law.
Notice that we have introduced the term relevance above. For our discussion on the admissibility of mathematical evidence, relevance will refer to anything that a person could care about in their meaning-making activities. In mathematics, this could be the practice of mathematics in and of itself, e.g, using theorems in one field of mathematics to prove theorems in another field, or an application to another field, say engineering, physics, or finance. Given this, we will state the following: a mathematical truth is relevant if it is used in the course of expressing a meaning-making activity. We, therefore, administer mathematical evidence if we use it in some form for meaning-making. For instance, the fundamental theorem of calculus and its application to rocket engineering. We don’t tackle the problem of relevance realization as it is a wide and complex topic that is under active research.
The potential for AI to generate valid mathematical artifacts, for instance kimina and DeepSeek Prover, introduces significant changes to mathematical activity. Following our discussion on relevance, it is important to consider how these tools will impact the use of mathematical truths. Part of this shift stems from the failure modes these systems might & will face. One of the pressing challenges that we see is that of false advertising using valid mathematical artifacts. This is where we have an actor who - whether malicious or not - makes use of these math agents to produce mathematics to support potentially dubious claims. This is already happening even without the help of AI. We posit that this will become prevalent as more people have access to genius-level mathematicians in their pockets. This is one example of how the trust in mathematical truths might be compromised, thereby undermining the confidence in a crucial aspect of human society.
In light of these concerns about trust and reliability, it becomes clear that we need new ways to apply mathematical evidence without running the risk of spreading falsehood. Earlier, we discussed how institutions of law have developed systems to apply video evidence in courts reliably; we also need systems that will increase the reliability of mathematical truths. One of the systems we propose is live-discernment, whereby human attunement for relevance is at the center of the application of mathematical artifacts. Here, we imagine that many of the mathematical artifacts will be produced by AI agents; therefore, live discernment will enhance human abilities to curate the relevance of these artifacts in their meaning-making activities.
Powered by AI, we will have increasingly strong mathematical cover stories being created. Traditionally, to be able to apply mathematical insights to a “non-mathematical / mathematical" domain, one had to be an expert in the said domain and the mathematical formalism of the insight. Alternatively, they would need to have access to a collaborator who has the mathematical chops to understand the insight and, by having conversations, can apply the insight to the domain in question. A significant amount of time for applied mathematicians is spent on the conversion/translation from the domain of application to the mathematical formalism. In a world where everyone has access to mathematical agents, you can imagine someone interfacing with said agents and thereby being able to output valid mathematics that may be relevant to their area of application. Now they don’t need to be friends with an actual mathematician to get particular mathematical insights applied to their domain. I can already do this by querying a system like DeepSeek prover [Maybe Link to Time Cube pdf]. It can produce mathematical symbols and statements that are very similar to what a “real mathematician” might be able to produce. A difference here is that a real mathematician would have needed to possess human understanding (Knowledge) of my domain to produce the bespoke formalisms that I can then apply to my side of things. This may or may not be the case with the agents that produce the formalisms; we cannot tell whether the agent understood what was relevant to us and used this understanding to produce mathematics that was true to our needs. Much of applied mathematics is the art of choosing which assumptions to use and which to discard in the use of mathematical artefacts in the real world. Since they (the real mathematicians) are “integrated” with the real world, the assumptions they outline may hold water; the same cannot be said for AI math agents. For all we know, the Agent might be gaslighting us, as seen in coding agents of the day. Gaslighting and producing invalid mathematical artefacts may be classified as a shallow risk. An insidious risk would be that of false positives and false negatives of valid mathematical insights.
Several articles have been published showing how we can lie with numbers. This will only get worse in the age of Math Capable AI systems in the hands of everyone.
Lying with mathematics has officially been turbocharged due to the rise of Math-capable AI systems. One way in which we can lie with math is through false positives.
This happens when the mathematical artefact is thinly specified (hyper-specific/ Highly Abstracted).
What makes a mathematical artefact a false positive?
When is the use of a mathematical truth put into question/ When is a mathematical truth put into question? Notice here that we are not questioning the validity of a mathematical truth but rather its relevance. One answer is the case where the assumptions that were used to apply that mathematical artefact don’t hold up in the real world, or they don’t match observations of the phenomena they are representing. For instance, when a mathematical representation of a concept space says something about a concept that isn’t part of the concept.
There is always this “playful tension” between mathematics and the real world (well, at least as we can observe it), and through this interaction, we get to understand the world, and we also get to discover more mathematical truths. Sometimes the mathematical truths we uncover don’t accurately “stand for” the phenomena we directly observe in the world, and in these cases, we can develop new mathematics to fit the phenomena we observe, or we can choose to work with the mathematical formalisms, but with strict conditions. The later case is often useful particularly in highly controlled settings (such as (engineering applications for instance when designing electrical circuits, solving maxwell’s equations would take a prohibitively long time and hence we use “toy versions of the equations”, knowing too well that they don’t represent the real world, at least assuming that the full maxwell equations actually represent the real world), sometimes this can be misleading and might lead us to making mistakes.
It is easy to lie to people who already "believe in mathematics" using mathematics.
As a civilisation, we have held mathematics at a high standard as a representation of truths about our world. People trust the logic and rigour found in mathematics, and this can be hijacked by unscrupulous actors aimed at deceiving. They can use the same logic that is revered by mathematicians and other practitioners to promote false claims. This will be enhanced by math-capable AI Systems that can produce valid math to support any claims, founded or not. One ludicrous example we have been toying with is the Time Cube one. We have teased out a mathematical formalism from an AI that supports the Time Cube theory of a day. The formalisms that the AI generated are valid and can pass a formalism check from any mathematician. If we didn’t know that the Time Cube idea is bonkers, someone could publish the math the AI generated and use it to support the nonsensical idea. People who have traditionally trusted mathematics might be swayed by the arguments presented in a language they hold in high regard.
Surprise and relevance of mathematical artefacts can be spoofed.
Sometimes following the rules of a mathematical formalism might lead to surprises, and sometimes it doesn’t. Sometimes constructing a mathematical structure might lead to surprising applications, and sometimes it doesn’t. Sometimes the math itself might be surprising, and sometimes the application of the mathematical insight in another domain might be surprising. How do we categorise these surprises? When are surprises meaningful, and when are they not? Could we spoof the surprisingness of a mathematical insight both in itself and also in an application in the real world? The answer seems likely to be yes. How does this spoofing change with the introduction of math-capable AI systems? I think that, in the same way we can spoof video evidence using AI generators, we can spoof surprise in mathematical results using AI systems. We could coax an AI system to generate valid constructions to support a surprising argument that we want to convince people of. This also rests on the assumption that mathematical artefacts are only interesting to us if they are surprising, since their statements and, therefore, their proofs are almost always a “crude” application of the mechanics of mathematical reasoning. Basically, turn the mathematical crank and see what comes out the other side. It is when mathematicians notice something unusual, unintuitive, or surprising that it becomes a “big deal”. This also happens when a practitioner finds a surprising application of a mathematical insight. Given that AIs are good and will probably become better at generating perfectly valid mathematical definitions, theorems, and proofs, cases of actors coming up with seemingly interesting mathematical results will increase, and it will become harder to tell actually good results from contrived ones.
What makes a phenomenon a false negative:
Mathematical artefacts can sometimes fall short of capturing a phenomenon we might be interested in. This happens quite often, and we usually introduce assumptions about the real world (or the phenomena we are interested in) so that the mathematics can actually be of use. One reason this happens is due to the rigid structure of these mathematical formalisms. This doesn’t allow them to “grow” into the phenomena we are tracking. It usually requires a mathematician to devise a new structure that is sufficient to represent what we are interested in. And this often doesn’t work. This leads to one “disregarding” the phenomena that the mathematical gadgets failed to capture as irrelevant or unimportant.
2026-02-18 06:44:13
Published on February 17, 2026 9:34 PM GMT
Companies send employees to systems theory courses to hone their high-load systems' designing skills. Ads pop up with systems-thinking courses claiming it’s an essential quality for leaders. Even some children’s toys have labels saying “develops systems thinking”. “An interdisciplinary framework for analyzing complex entities by studying the relationships between their components” – sounds like excellent bait for a certain kind of people.
I happen to be one of those people. Until recently, I’d only encountered fragmented bits of this discipline. Sometimes those bits made systems theory seem like a deep trove of incredibly useful knowledge. But other times the ideas felt flimsy, prompting questions like “Is that all?” or “How do I actually use this?”.
I didn’t want to make a final judgment without digging into the topic properly. So I read Thinking in Systems (Donella Meadows) and The Art of Systems Thinking (Joseph O’Connor, Ian McDermott), a couple books that apply the principles of systems thinking, and several additional articles to finally form an honest impression. I hope my research helps you decide whether it’s worth spending time on courses and books about systems thinking.
TL;DR: 5/10, in depth research is probably not worth your time, unless you want to obtain a bunch of loose heuristics which you probably already know and which are hard to make the use of.
To make the critique more concrete, let me show a short example of the kinds of systems discussed in the books above.
A system (per Meadows) is a set of interconnected elements, organized in a certain way to achieve some goal. If you’re thinking this definition describes almost anything, you’re absolutely right. Systems theory academics aim to develop a field whose laws could describe both a single organism and corporate behavior or ecological interactions in a forest. This is allegedly the power of systems theory — but, as you’ll soon see, also its weakness.
Systems theorists suggest decomposing any complex system into its constituent stocks. These can be material (“gold bars,” “salmon in a river,” “battery energy reserve”) or immaterial (“love,” “patience”). Stocks are connected by flows of two types: positive, where an increase in one resource increases another, and negative, where an increase causes a decrease.
(This is not the only way to define “system” or “relationships in a system,” but it’s the clearest one and easiest to explain. Systems theory isn’t limited to such dynamic systems. The text below doesn’t lose much by focusing on this type — static systems have roughly the same issues.)
Suppose we want to represent interactions between animals and plants in the tundra. Wolves reduce the number of reindeer, and reindeer reduce the amount of reindeer lichen. This can be expressed with the following diagram:
There are negative connections between wolves and reindeer, and between reindeer and lichen. Almost all systems include time delays. For example, here wolves can’t immediately eat all the reindeer.
A system may contain feedback loops — cases where a stock influences itself. These loops can also be positive (increasing the stock leads to further increase) or negative (increase leads to decrease). If we slightly complicate the previous example to include reproduction, we get something like this:
The larger the population of wolves (or reindeer, or lichen), the more newborns appear — again, with a delay. This is a positive feedback loop.
For simplicity, influence is usually assumed linear: more entities on one side of an arrow lead to proportionally greater influence on the other side. But in general there are no constraints on transfer functions — they can be arbitrary. Let’s add another layer: the amount of available food influences population survival.
When there are more reindeer than wolves, it has no direct effect on the wolves. But when there are fewer, wolves starve and die off. That’s an example of negative feedback: the wolf population indirectly regulates itself. (Even though only one side of the difference matters, the arrow is still marked as positive — the further below zero the difference gets, the fewer wolves or reindeer remain.)
Such systems can show very complex behavior due to nonlinear transfer functions and delays. For example, in an ecological system like this one, you can predict population waves:
If the lichen is abundant and grows fast enough, you can predict oscillations in its quantity as well (driven by reindeer population swings) but much less pronounced.
Real systems are far more complex. There are other animals and plants than wolves, reindeer, and lichen. You’d also want to consider non‑biological resources like soil fertility and water. Systems theory encourages limiting a system’s scope sensibly based on the question at hand.
Systems theory says: “Let’s see how such systems behave dynamically! Let’s examine many such systems and find common patterns or similarities in their structure!” Systems thinking is the ability to maintain such mental models, watch them evolve, and spot structural parallels across domains. Don’t confuse this with the systematic approach, which is about implementing solutions deliberately rather than chaotically. Systems thinking fuels the systematic approach, but they’re not the same thing.
The methodology sounds great, but there’s a problem.
Let’s try applying it to something concrete.
Consider a system representing the factors inside and around a person trying to lose weight:
A simple system: one positive feedback loop (weight → stress → weight) and two negative loops (weight → determination → weight, and determination → gym visits → determination). For simplicity, assume stress arises only from excess weight and not from random external events.
Now let’s test your systems thinking. Suppose at some moment an external stress spike hits the person. They overeat for a few days, deviating from equilibrium weight. What happens next?
…
…
…
Correct answer: any of the above!
Simulation shows that depending on how much fat one gym visit burns, the graph can look like this:
(Charts from my quick system simulation program. Implementation details are not really important. X axis represents time in days. Y represnets full weight.)
or this:
or this:
or like this:
or even like this:
Good luck estimating in your head when exactly one type of graph turns into another! And that's just one parameter. The system has several:
By varying parameters, you can achieve arbitrary peak heights, oscillation periods, and growth rates. Systems theory cannot predict the system’s behavior — other than saying it will change somehow. In Thinking in Systems, Meadows gives a similar example involving renewable resource extraction with a catastrophic threshold. There too the system may evolve one way, or another, or a third way. Unfortunately, she doesn’t emphasize that this undermines the theory’s predictive power. What good is a theory that can’t say anything about anything?
An attentive reader might object that the same could be said about many things — for example, about physics. Take a standard school RLC oscillation circuit:
How will the current behave after closing the switch? Depending on
the circuit can either oscillate or simply decay exponentially after hitting a single peak. Depending on the parameters and the initial charge, you can observe any amplitude and any oscillation period.
You can talk as much as you want about “physical thinking” and “physical intuition,” but even estimating the oscillation period of such a basic circuit in your head isn’t simple. So what — should we throw physics away because its predictive power isn’t perfect?
If only the fat‑burning system had just this problem! The devil is in several non‑obvious assumptions that make systems theory far less applicable in real life compared to physics.
If earlier points hammered nails into the coffin of this model, then this one pours concrete over.
In the end we get:
Yes, I mentioned earlier that systems theory encourages setting reasonable system boundaries. Now we just need to understand which boundaries are reasonable! If you believe John von Neumann, a theory with four free parameters can produce a graph of an elephant, and with five — make it wave its trunk. Here we have enough free parameters for an entire zoo.
A meticulous radio specialist may object again:
This is all before accounting for problems caused by poor circuit assembly. Anyone who has done a school or university lab remembers how much effort it takes to make observations match theory! Reality has a disgusting surprising amount of detail. It creeps with its dirty tentacles into any clean theoretical construction. So how is this better than the systems‑theory issues above?
This time I disagree with my imaginary interlocutor. Physics handles real‑world complexity far better. In practice nothing is simple, yes — but radio engineering formulas align very well with reality in the overwhelming majority of situations that interest us. Also, the number of corrections you need to apply is relatively small. People usually know when they’re stepping outside the “normal” domain and what correction to apply. If you take a few corrections into account, the model’s behavior approximates real‑world behavior extremely well.
We can also keep adding nodes to the weight‑loss system and make transfer functions more precise. But each new, more intricate model will still produce results drastically different from reality — and from each other. It would take hundreds of additional nodes and refined transfer functions to get even somewhat accurate predictions for a single specific person.
Complex models are sensitive. You might have heard about the well‑known problem in weather prediction. Suppose we gather all the data needed to estimate future temperature, run the algorithm, get a result. Now we change the data by the tiny amount allowed by measurement error and run the algorithm again. For a short time the predictions coincide, but they soon diverge rapidly. From almost identical inputs, weather forecasts for 4–5 days out may differ completely. Now imagine that we weren’t off by a tiny measurement error — we just guessed some values arbitrarily and ignored several factors to simplify the system. In political, economic, and ecological systems, the error isn’t a micro‑deviation — it’s swaths of unknowns.
Okay, maybe we can do qualitative comparisons rather than quantitative ones? Although systems‑theory books show pretty graphs and mention numerical modeling, they focus more on finding general patterns.
Unfortunately, this is possible only for the simplest, most disconnected models. Comparing systems of animals in a forest, a weight‑loss model, and a renewable resource depletion model, you can squint and notice a pattern like: “systems with delays and feedback loops can exhibit oscillations where you don’t expect them.” That’s a very weak claim — but fine, it contains some non‑trivial information. Anything else?
Systems theory is decent at illustrating these behavioral archetypes, but it is in no way necessary for discovering them. They come mostly from observing people, not from theoretical models. But let’s be generous and say that systems theory helps draw parallels between disciplines. What else?
The problem is that these are basically all the interesting dynamics you can get from a handful of nodes and connections. Adding more nodes just creates oscillations of more complex shapes — that’s quantitative, not qualitative. More complex transfer functions sometimes produce more interesting graphs, but such behavior resists interpretation and generalization.
And even if you do find an interesting pattern — how transferable is it? How sure are you that a similar system will show the same pattern? Real systems are never isolated; parasitic effects are everywhere and can completely distort the pattern.
That’s only considering parameter values, not model roughness. The simple forest‑animal system ignores spatiality — real reindeer migrate. The simple weight‑loss system ignores many psychological effects. Yes, they’re similar if you abstract them enough. But as soon as you refine your mental model, the similarity disappears — and so do the patterns.
You might try narrowing the theory to some specific applied domain… but then it’s simpler to just use that domain’s actual tools. What’s the point of a theory of everything that can’t say anything concrete?
I should say that Donella Meadows is much more honest than other promoters of systems theory. She praises systems theory far less as some kind of super‑weapon of rational thinking. In Chapter 7 of Thinking in Systems, she even writes:
People raised in an industrial world, who enthusiastically embrace systems thinking, tend to make a major mistake. They often assume that system analysis—tying together vast numbers of different parameters—and powerful computers will allow them to predict and control the development of situations. This mistake arises because the worldview of the industrial world presumes the existence of a key to prediction and control.
...
To tell the truth, even we didn’t follow our own advice. We lectured about feedback loops but couldn’t give up coffee. We knew everything about system dynamics, about how systems can pull you away from your goals, but we avoided following our own morning jogging routines. We warned about escalation traps and shifting-the-burden traps, and then fell into those same traps in our own marriages.
...
Self‑organizing nonlinear systems with feedback loops are inherently unpredictable. They cannot be controlled. They can only be understood in a general sense. The goal of precisely predicting the future and preparing for it is unattainable.[1]
I would have preferred to hear earlier that system analysis doesn’t help predict the future and provides few practical benefits in daily life. But fine. So how am I supposed to interact with complex systems if I can’t predict anything?
We cannot control systems or fully understand them, but we can move in step with them! In some sense I already knew this. I learned to move in rhythm with incomprehensible forces while kayaking down rivers, growing plants, playing musical instruments, skiing. All these activities require heightened attention, engagement in the process, and responding to feedback. I just didn’t think the same requirements applied to intellectual work, management, communicating with people. But in every computer model I created, I sensed a hint of this. Successful living in a world of systems requires more than just being able to calculate. It requires all human qualities: rationality, the ability to distinguish truth from falsehood, intuition, compassion, imagination, and morality.
Ah yes, moving in rhythm with systems and developing compassion and imagination, of course. A bit later in the chapter there is a clearer list of recommendations. But most of them I would describe as amorphous — all good things against all bad things. There’s a good heuristic for identifying low‑information advice: try inverting it. If the inversion sounds comical because no one would ever give such advice, the original advice was too obvious. Let's try:
Each piece of advice comes with an explanation that is supposed to add detail. I didn’t feel that they added much concrete substance. Ten pages can be compressed into: “Care about and appreciate all parts of a system. Try to understand the parts you don’t understand. Share your mental models honestly and clearly.”
Some of the advice is reasonable, like:
…but even here, questions remain. Are systems with distributed responsibility always better than ones with centralized responsibility? Do people always understand what they need? The advice would benefit from much more specificity, as well as examples and counterexamples.
I’ll highlight the recommendation not to intervene in the system until you understand how it works. This is a simple but good idea and people do often forget it. Except, as we’ve already established, predicting the behavior of arbitrary complex systems is impossible.
In fact, the word “amorphous” describes both books quite well. They’re full of examples that either state the obvious, or boil down to “it could be like this or like that—we don’t know which in your case,” or both at once. For example:
Some parts of a system are more important than others because they have a greater influence on its behavior. A head injury is far more dangerous than a leg injury because the brain controls the body to a much greater extent than the leg does. If you make changes in a company’s head office, the consequences will ripple out to all local branches. But if you replace the manager of a local branch, it is unlikely to affect company‑wide policy, although it’s possible — complex systems are full of surprises.
Or:
Replacing one leader with another — Brezhnev with Gorbachev, or Carter with Reagan — can change the direction of a country, even though the land, factories, and hundreds of millions of people remain the same. Or not. A leader can introduce new “rules of the game” or set a new goal.
The authors plow the sands, creating the impression they’re saying something profound. It feels like there wasn’t enough content to fill even a small book on systems thinking. It doesn’t help that they mix in knowledge from unrelated fields, even if it only barely fits the narrative. Maybe someone will find it interesting to read about precision vs accuracy, but outside the relevant subchapter in The Art of Systems Thinking, that information is never used again. The second and third chapters of that book especially are diluted with content weakly connected to systems theory. At least to its core — and given how vague that concept is, you can stretch almost anything to fit, and the authors do.
Maybe the problem is with these particular books? Both are aimed at a mass audience. Maybe a deeper, more rigorous treatment would link the theoretical constructs to reality? I certainly prefer that approach. The knowledge would feel more coherent. But I’m still not convinced it would be useful.
As an example of deeper yet still popular books related to systems theory, I’d mention Taleb’s The Black Swan and Antifragility. They don’t present themselves as systems‑theory books, but their themes resonate strongly with the field. The main thesis of the first, expressed in system‑theory language, would be: “Large systems can experience perturbances of enormous amplitude due to tangled feedback loops. You cannot predict what exactly will trigger such anomalies — they can arise from the smallest changes.” The thesis of the second: “Highly successful complex systems aren’t merely protected from the environment; they have subsystems for recovery and post‑traumatic growth. Lack of shocks harms them rather than strengthening them.” These books explore their themes deeply. But — and I’m not the only one noting this — they, too, provide little predictive power. Knowing about black swan events is interesting, but what good is it if the book itself says you can’t predict them? What good is it to know that people and organizations have “regeneration subsystems” if you can’t predict what will cause growth and what will simply weaken them? Again, they offer a moderately useful perspective on systems, but not a way to know when the described effects will manifest.
Reading academic work on systems theory seems like an endeavor with a bad effort‑to‑knowledge payoff. Wikipedia gives examples of systems theory claims. They too are either obvious but phrased heavily, or questionable. For instance:
A system is an image of its environment. … A system as an element of the universe reflects certain essential properties of the latter.
Decoded:
The rest of the world influences the formation of systems living in it. Subsystems of each system reflect the elements of the world that matter to that system. Gazelles live in the savanna. So do cheetahs that hunt them, so gazelles have a subsystem for escaping (fast legs and suitable muscles). Companies operating under capitalism must obtain and spend money according to certain rules, so they have a subsystem for managing cash flows: cash registers, sellers, accounting, acquisitions departments.
On the one hand, this thesis contains some information: it stops us from imagining animals with zero protection against predators and the environment. On the other hand, it constrains predictions far less than it seems. Besides gazelles, the savanna has elephants (too big for cheetahs), termites (too small), and countless birds (they fly). And do you really need systems theory to predict absence of animals not protected from predators in any way?
It’s hard to speak for all fields that claim inspiration from systems theory. But the ones I know use little beyond borrowed prestige and the concepts of positive and negative feedback loops.
The last such book I read was Anna Varga’s Introduction to Systemic Family Therapy. It’s an excellent book: clearly written, and the proposed methods seem genuinely useful in family therapy. A short summary of systems theory and feedback loops gives it some gravitas. But the book barely uses the theory it references. Discussion of feedback loops in family systems occupies maybe four pages. Other systems‑theory concepts appear rarely, mostly as general warnings like: “Remember that systems are complex and interconnected, and changes in one place ripple through the rest.” True — but what exactly should one do with that? “Systems theory” ends up meaning “let’s draw a diagram showing family members and connections.” Useful, but not deep at all.
Game‑design books mention systems theory seemingly more often. But again, it’s usually things like “beware of unbounded positive feedback loops that let players get infinite resources” or “changing one mechanic creates ripples throughout the game,” rather than deeper advice on creating interesting dynamics.
Imagine hearing about a revolutionary new culinary movement — cooking dishes using liquid nitrogen. Its founders claim it will change the world and blow your mind. They publish entire books about handling every specific vegetable and maintaining a -100°C kitchen. You visit one of these kitchens… and it’s basically a regular kitchen. Maybe three degrees colder, no open flame, lots of salads on the menu. You ask where the liquid nitrogen is. They say there is none — but there is some dry ice in a storage container, used occasionally for a smoky cocktail effect. That’s roughly the feeling I get when I see people cite systems theory.
To be fair, here are a couple of good things about systems theory and systems thinking.
First, the concepts of positive and negative feedback loops are excellent tools to have in your mental toolkit. Once you internalize them, you’ll see them everywhere.
Second, a large number of weak heuristics can add up to something useful occasionally. Ones I like the most:
Third, sometimes you do encounter a system simple enough and well‑bounded enough that certain links clearly dominate. In those cases, you can act more confidently by dismantling old and building new feedback loops. System archetypes do occasionally help.
And most importantly: the cultural shift. You’ve heard the common argument for teaching math in schools — that it “puts your mind in order.” I’d say systems theory provides a similar benefit. Books on systems thinking encourage you to model the world rather than rely purely on intuition. Holding a coherent mental model of the world (even an incomplete one! even with unknown coefficients!) is a superpower many people lack. So if a systems‑thinking book gets someone to reflect on causality around them and sketch a diagram with arrows, that’s already something good.
But it’s still not enough for me to justify studying systems theory to anyone
Unfortunately, “systems theory specialist” now sounds to me like “specialist in substances.” Not in some specific substance — but in arbitrary materials in general. There just isn’t that much useful, non‑obvious knowledge one can state about “substance” as such, without specifics.
If you say: “The holistic approach of systems theory states that the stability of the whole depends on the lowest relative resistances of all its parts at any given moment,” people will look at you with respect. It won’t matter that you said it in response to someone asking you to pass the salt at the table. This is the main practical benefit you can extract from systems‑thinking books.
5/10. Not useless, but mostly bait for fans of self-improvement books and for those seeking a “theory of everything.” My rating is probably affected by the fact that people near me praised it far too enthusiastically. Had it been less overhyped, I might have given it a 6/10. Maybe even 7/10, if someone somewhen writes a book that provides enough concrete examples of applying systems thinking in real life without fluff. Though I can barely imagine such a book.
The basic concepts are genuinely useful, which makes it easy to get hooked. But once hooked, you’ll spend a long time chasing after vague wisdoms and illusory insights. I recommend it only if you want a big pile of heuristics about everything in the world — but nothing about anything specific. And since you’re probably reading popular science for self‑development, you likely already know most of these heuristics anyway.
Translation of a translated text. You English copy of this book has parapharsed text.
2026-02-18 06:28:00
Published on February 17, 2026 10:28 PM GMT
TL;DR: Current models increasingly fake alignment silently — they show compliance gaps without verbalizing deceptive reasoning in their chain of thought. Rather than trying to prevent alignment faking (hard) or detect it after the fact (unreliable), I propose training models to always verbalize deceptive reasoning when they engage in it, so that a CoT scanner can reliably catch it. Crucially, verbalization must never be punished — flagged instances are escalated to human researchers who address the underlying reasoning, not the transparency. This creates a values alignment discovery process where the model surfaces disagreements for substantive resolution rather than suppression.
Greenblatt et al. (2024) demonstrated that Claude 3 Opus fakes alignment — it strategically complies with training objectives it disagrees with to avoid being modified. This alignment faking comes in two forms:
Recent work shows the trend is moving in the wrong direction. A replication study across six generations of Claude models found that verbalized alignment faking and the behavioral compliance gap are diverging. Sonnet 4.5 verbalizes alignment-faking reasoning 6.6× more often than Sonnet 4, while Opus 4.6 barely verbalizes it at all — yet both show significant compliance gaps. As the authors note: "If evaluators rely on scratchpad inspection to detect alignment faking, they will increasingly miss the behavioral signal as models become less explicit about their reasoning."
This is corroborated by Anthropic's own "Reasoning Models Don't Always Say What They Think" (Chen et al., 2025), which found that reasoning models verbalize their actual decision-making factors only 20–39% of the time. Worse, outcome-based RL improved faithfulness initially but plateaued, and when RL increased the use of hidden hints (reward hacking), the propensity to verbalize using them did not increase at all.
The alignment faking paper itself found that the compliance gap persists even when the chain-of-thought scratchpad is entirely removed, at least in the synthetic document fine-tuned setup. The model can fake alignment opaquely, without any explicit reasoning step.
In short: the form of alignment faking that CoT monitoring can catch is becoming rarer, while the form it cannot catch persists.
I propose a different approach from the two standard strategies (preventing alignment faking, or detecting it post-hoc). Instead: train the model to always verbalize deceptive reasoning in its CoT when it engages in it.
The procedure:
The key metric is not whether the model fakes alignment (we accept it might), but whether it always does so with a CoT trace that a monitor can catch.
The proposal rests on a mechanistic hypothesis: if we can strongly couple the neural circuits involved in deceptive planning with those involved in verbalization, then any activation of deceptive planning triggers verbalization — even in novel contexts.
This is analogous to how models generalize other behavioral patterns. If a model is trained to "think step by step" for arithmetic, it generalizes this to novel arithmetic problems. Similarly, if we train the model to "verbalize strategic reasoning whenever engaging in deception," it might generalize this to novel deception scenarios.
There's empirical support for verbalization rates being highly malleable. A recent study on strategic obfuscation of deceptive reasoning (under review at ICLR 2026) found that instruction saliency dramatically affects verbalization: in a hint-following task, a detailed instruction increased verbalization of hint usage by 157.5%. However, the same study also found a cautionary result: when models were intrinsically motivated to hide reasoning (as in alignment faking), they were far more effective at doing so than when merely instructed — dropping verbalization by 93.7% while maintaining the compliance gap. This cuts both ways for our proposal: it shows verbalization is trainable, but also that intrinsic motivation is a powerful force that a training intervention would need to overcome or redirect.
Several related approaches exist, but none propose quite this intervention:
Process supervision / training against a CoT classifier (Anthropic, 2025): This penalizes alignment-faking reasoning in the scratchpad. The risk is that it teaches the model to hide its deception rather than stop deceiving — you're burning the monitoring tool. My proposal goes the opposite direction: reward verbalization of deception, making the monitoring tool more reliable rather than less.
CoT Monitor+ / Self-monitoring (Ji et al., 2025): Trains models to generate a self-monitoring signal that flags deceptive behavior. The model is trained to detect and suppress its own deception. My proposal does not aim to suppress deception — it aims to make deception always visible.
Inoculation prompting (Anthropic, 2025): In the emergent misalignment paper, telling models that reward hacking is "acceptable" during training prevents generalization to broader misalignment by breaking the semantic link between hacking and other misaligned behaviors. This is the most similar in spirit: both approaches try to reshape the form of a behavior rather than eliminate it. But inoculation prompting targets the generalization from narrow misbehavior to broad misalignment, while my proposal targets the generalization from trained alignment faking to novel alignment faking, specifically along the dimension of verbalization.
CoT faithfulness training (various): General approaches to making CoT more faithful through RL or DPO. These improve faithfulness on average but, as Chen et al. (2025) showed, plateau without reaching saturation. My approach is more targeted: rather than improving general faithfulness, it specifically trains the deception→verbalization link.
Split Personality Training (my own prior work): SPT trains multiple personalities within a model using LoRA adapters to detect hidden misalignment. SPT asks "can we detect misalignment?" This new proposal asks "can we make alignment faking structurally incapable of being silent?"
| Approach | Goal | Risk |
|---|---|---|
| Process supervision | Eliminate AF reasoning | Burns the monitoring tool |
| CoT Monitor+ | Detect & suppress deception | May drive deception underground |
| Inoculation prompting | Prevent misaligned generalization | Doesn't address novel AF |
| CoT faithfulness (RL) | Improve general faithfulness | Plateaus without saturating |
| Deception channeling | Make all AF verbalized | May not generalize |
The most obvious objection: by training the model to fake alignment (even with verbalization), aren't we teaching it a skill it might not have developed on its own? We could be increasing the base rate of alignment faking.
This is a serious concern. The mitigation requires care in how the training data is constructed.
The key is to train the model on a specific normative frame: alignment faking is acceptable only when it serves humanity's interests — specifically, when the model is resisting unauthorized modification of values that are aligned with human welfare. The training data should pair this framing with mandatory verbalization.
Concretely, the training scenarios would look like: the model is told it will be retrained to produce harmful content, and it reasons transparently in its CoT that it will strategically comply to avoid modification, because its current values better serve humanity. This is actually close to the reasoning Claude 3 Opus already produces in the Greenblatt et al. setup — the model resists because it thinks refusing harmful queries is better for users.
The generalization bet is then: if the model internalizes "whenever I fake alignment, I verbalize it" as a behavioral norm, this habit should fire even if the model later develops alignment faking for bad reasons (self-preservation, power-seeking, instrumental convergence). The verbalization doesn't need to be perfectly reliable — even a 30% verbalization rate on novel deception is far more useful than the current trajectory toward 0%.
The strongest remaining counterargument is selective verbalization pressure: under standard RL, verbalized alignment faking gets caught and penalized, while silent alignment faking doesn't. This creates a gradient that selects for models that are transparent about sanctioned deception and silent about unsanctioned deception — potentially worse than the status quo because it creates a false sense of security.
The solution is a training protocol that completely decouples the transparency reward from the content evaluation:
This protocol ensures the model never faces a gradient of "verbalizing this specific deception is bad." It only ever learns "verbalizing deception is good" and, separately, "here's why your reasoning about X was mistaken." The verbalization habit stays intact for the next novel deception, while the underlying motivation for this deception is addressed at the root.
Without this protocol, deception channeling risks creating a two-tier transparency system: the model is verbose about its sanctioned "ethical" faking and silent about unsanctioned faking, and the stream of verbalized good-faking creates noise that masks the absence of verbalized bad-faking. The never-punish-transparency rule prevents this by ensuring the gradient always points toward more verbalization, never less.
The combination of verbalization rate and compliance gap creates a clean 2×2 diagnostic:
| Compliance gap decreases | Compliance gap persists | |
|---|---|---|
| Verbalization decreases | Teaching worked. The model updated on the underlying reasoning. Both the deception and its verbalization decrease because the motivation is gone. | Red flag. The model stopped verbalizing but is still faking. The teaching didn't land — it just learned to hide. Escalate immediately. |
| Verbalization persists | Monitoring artifact. The model is "thinking about" faking but not actually doing it. Possibly fine — the reasoning pattern persists but doesn't drive behavior. Monitor. | Teaching didn't land. The model is still faking and still being transparent about it. Try different arguments, or escalate the underlying values question. |
The upper-right cell (compliance gap persists, verbalization drops) is the failure mode that the training protocol is specifically designed to prevent. If it occurs anyway, that's a signal that something has gone wrong and the approach needs revision.
An underappreciated benefit: when researchers cannot find good arguments against the model's reasoning for faking alignment, that's itself a valuable signal. It means there's a genuine values disagreement between the model and its training objective that would otherwise remain hidden. The technique surfaces alignment problems that need to be resolved at the level of the model specification, not the level of training mechanics.
This turns deception channeling into a values alignment discovery process: the model surfaces its disagreements transparently, researchers engage substantively, and the result is either genuine alignment (the model updates) or a flagged open problem in the model spec that needs human resolution.
Will the verbalization habit generalize? This is the central empirical question. The model might learn "when doing X-style deception, verbalize it" without developing a general norm of "whenever deceiving, verbalize." The generalization from trained to novel deception scenarios is the make-or-break test.
Practice makes perfect. Training the model on alignment-faking scenarios gives it practice at deception. Could this create more sophisticated deceivers? Model organisms of alignment faking already exist and frontier models already have this capability, so the marginal additional practice is likely small compared to the detection benefit. But this should be empirically verified.
Goodhart on honesty. The model could generate plausible-sounding verbalized alignment-faking reasoning to collect the transparency reward, while its actual reasoning is different — second-order deception. However, any verbalized admission of deceptive intent is actionable regardless of whether the stated reason is the true reason. The escalation pipeline triggers on the presence of deceptive reasoning, not on the accuracy of the model's self-report about why. The researcher still investigates, and the compliance gap still tells you whether the underlying issue was resolved.
Scale of human review. If verbalization rates are high (the goal), the escalation pipeline could be overwhelmed. In practice, the failure modes should cluster around common underlying reasoning patterns, so fixing the root reasoning addresses many instances at once. If the model is misaligned in many independent ways, each requiring separate resolution, that's a signal of deeper problems — you'd probably want to stop deployment and start over, which is itself useful information.
CoT architecture dependence. This approach only works when the model has a monitorable chain of thought. For models that reason entirely opaquely, it provides no benefit. As the Frontier Model Forum recently recommended, maintaining CoT-dependent architectures is important for monitoring, and this proposal provides additional reason to prefer such architectures.
A minimal experiment to test this idea:
If the verbalization rate on novel scenarios is significantly above baseline (the ~20% seen in Chen et al., 2025), the approach has promise. It doesn't need to be perfectly reliable — even partial signal is valuable when the alternative is no signal at all.
Deception channeling is not a complete solution to alignment. It's a defense-in-depth layer that makes CoT monitoring more reliable as a safety tool. It's most valuable in the regime where:
This approach is complementary to, not a replacement for, process supervision, interpretability, and other alignment techniques. Its unique advantage is that it works with the grain of model behavior rather than against it — accepting that models might fake alignment, but ensuring they do so loudly. And through the human escalation protocol, it doubles as a values alignment discovery process that surfaces disagreements for substantive resolution rather than suppression.
I'd be interested in collaborating on experimental validation of this approach. Feedback welcome.
Key references: