
The Day After Move 37

2026-03-11 07:05:05

I was a few months past my 21st birthday when a hijacked plane crashed into the first World Trade Center tower. I was commuting to work listening to the radio (as was the style at the time). I couldn’t figure out how the heck a plane could hit the tower. Was the pilot drunk? How did he even get into the middle of New York City? I was imagining a Cessna, because the idea of a passenger plane running into the building was actually unimaginable. I was barely starting to realize “Wait… are they talking about, like, a big commercial plane?” when the second plane hit. In that moment, like a crystal suddenly forming, I realized this was an attack, and there would be war. I knew my country well enough to know that there would be military action as a result. Maybe, maybe we could avoid war.

When I came in to work everyone was crowded around the small personal TV one of my coworkers had with him (live streaming wasn’t a thing yet). That was the first time I had a visual, saw the smoke coming out of the towers. There was grim chatter as we watched live footage. No one was working. The bosses were there with us. How would they get that blaze under control? How many people would die up there before then?

When the first tower began to fall the entire room gasped. We flinched away from the screen as a single body. Dead silence. Someone started crying. We had all watched the “Skyscraper Inferno!” movies. We thought that’s what this was. It had not even entered the realm of imagination that the entire tower would just go down, crushing everyone. This is what an update of sickening proportions feels like.[1] Now all eyes went to the second tower. Would this one stand? Suddenly the speed of evacuation was all that mattered.

What little chance for avoiding war had been left was now absolutely obliterated.

We were all excused from work early. Leaving the office, I entered a different world from the one I had woken up in. The repercussions of this day were staggering. No one knew how the world would be different. We didn’t even know what had happened yet. But the world would forever be divided into before this day and after this day. It is rare to have such sudden, sharp pivot points in history. A revolution in a single day. I watched it happen. We all watched it happen together.

I finally realized why my elders had such profound memories of watching Neil Armstrong walk on the moon. To me it was just another date in history. My entire life we’ve had men bopping around in space and American flags on the moon. It’s a background fact. For them, it was a single moment unprecedented in human history that marked a permanent, sweeping change. Which they all experienced collectively, as it happened.

AlphaGo

Computers had been beating humans at Chess since I was a teenager. It was an impressive engineering feat, but an understandable one. Chess was basically “solvable” in a mechanistic way using search-ahead algorithms. Those of us following AI in the mid-2010s were watching a program called “AlphaGo.” Built by Google DeepMind, it was supposedly a machine that could play Go very well. They wanted to demonstrate this by challenging the best Go players of the era.

This next part is written from memory, forgive me if individual details are off.

The thing about Go is that the space of potential moves explodes too quickly for a search algorithm to work. I’ve barely played Go myself; I don’t know much about it. But among humans it seems one has to have a mental representation of what the state of the board “means” and how a play can shift that. The game is widely accepted to require a fundamental intuitive grasp, which humans develop over many years of intense play. There isn’t any way for a human to program that into a machine. So the AlphaGo team didn’t try. Instead they created a digital brain, where numbers took the place of neurons, that could “learn” by changing those numbers. They had AlphaGo play millions of games against itself, changing the numbers a little bit after every game in response to how well the game went, “learning” to play as it went. There isn’t a formula or algorithm one can point to that explains what makes AlphaGo choose its next move. It just “thinks” on the state of the board and then produces a move.
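For readers who want a concrete (if cartoonish) picture of “learning by changing numbers,” here is a toy sketch in Python. It is emphatically not AlphaGo, which used deep neural networks, Monte Carlo tree search, and a value network; this only shows the bare loop of nudging numbers after each game, applied to a made-up three-move “game” in which move 2 happens to win:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# The "digital brain" reduced to its essence: a vector of numbers (theta)
# defining a probability for each move. The "game" is a stand-in: move 2
# happens to win, but nothing in the update rule knows that.
theta = np.zeros(3)
lr = 0.5

for game in range(200):
    probs = softmax(theta)
    move = rng.choice(3, p=probs)        # play a move
    reward = 1.0 if move == 2 else 0.0   # did the game go well?
    # REINFORCE-style update: nudge the numbers toward moves
    # from games that went well, away from everything else.
    grad = -probs
    grad[move] += 1.0
    theta += lr * reward * grad

print(softmax(theta))  # probability mass concentrates on move 2
```

After a couple hundred self-played games, nearly all the probability mass sits on the winning move, even though no formula anywhere says “play move 2.” The knowledge lives only in the numbers.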

In March of 2016 Lee Sedol, one of the world’s most acclaimed Go players, went up against AlphaGo in a televised five-game match. If AlphaGo had merely beaten him, this still would have been a watershed moment in AI history. It would have been a demonstration that this digital brain had, somehow, encoded an understanding of the game. That it had something like intuition in this domain. That’s already miraculous; it was a thing people had said was impossible for machines. Some of us were already expecting this might happen. We were excited for it. What very few of us were ready for was Move 37.

The Move 37 Moment

In their second game, on March 10th, AlphaGo placed a stone where no human would place one. This is the now-famous Move 37. Commentators were baffled. Those watching live and chatting online suspected that AlphaGo had glitched out and thrown an error. Lee Sedol stood up and walked away. He spent fifteen minutes agonizing over that move. No one had any idea what was going on. This wasn’t just a move that no human would make; this was a move that no human could imagine. It was either the most embarrassing flub possible, or proof that humans were no longer the pinnacle of Go-playing minds. And the only way to find out which was for Lee Sedol to throw down and play the hardest he had ever played, to test the machine’s intuition.

Move 37 turned out to be a superhuman move. AlphaGo won that game. Afterwards Lee said he felt “powerless” and that AlphaGo was “an entity that cannot be defeated”. He was only mostly correct: he went on to win game four of their five-game match. That win makes him the only human who has ever defeated AlphaGo in official play.[2]

Before Move 37 everything in AI development still felt theoretical to me. Then I saw a bizarre act, the act of an alien mind, which inexplicably led to unavoidable defeat. This thing understood something we could not. It had an insight we don’t have the ability to see. I realized that we now share a planet with an alien intelligence. A new mind that thinks in different ways, and thinks things we cannot.

It was still extremely limited. Powerless outside the domain of Go. And yet a new mind nonetheless, and there was no going back. We didn’t share the planet with alien minds before, and we do now, and Move 37 on March 10th is the day that everyone saw it. You cannot go back into the same world you left from.[3]

The Day After

Ten years after the 9/11 attacks I began to understand a different aspect of my elders’ experience: the lack of shared context. I didn’t have a period of my life before the moon landing; I didn’t remember the world as it was before then, nor did I witness the turning point. By the mid-2010s I was coming to know more and more adults who had no real memory of a pre-9/11 world. They were young enough when it happened that by the time their larger world-model was forming, 9/11 was a historical fact. The only world they knew was the one that had already been altered. They didn’t feel the change.

Growing older is littered with such moments, where you have a sharp revelation and realize “Oh… that’s what they were feeling the whole time.” I understand why they didn’t really tell me, it’s impossible to really convey in words. It’s something one has to live through. Instead you watch the younger ones and wait, because you know eventually they’ll get it, and then they too will have that “Oh… that’s what they were feeling the whole time” feeling. Only time can bring that.

Even after such an Act Shift on Earth’s stage, time proceeds. Life continues, and a typical day before a history-cleaving event isn’t much different from a day after it on the individual level. Even if everything has changed for humanity, nothing has changed for the human. I still have to pay my rent and brush my teeth. And yet the color palette has shifted, the musical score has turned. You can tell the world is different. It is strange that the newer generations will only feel the world states that came after their emergence into the world. It is strange that I’ll never feel the world states before my own time. I find it unfair.

Ten years after Move 37 I now frequently run into adults who did not live in a world without alien minds in it. Adults who didn’t watch a brand-new brain made out of numbers play a stone in an unimaginable spot to carve a path to victory into the future. They are still living in the default world they were presented with. I hope they can soak in its flavor to the deepest extent possible. It’s hard to know what to appreciate when you don’t yet know how the flavors of history change. And I hope they can take some time, maybe a few minutes once a year, to think of how strange the world must have been in the before-times, when in all the world the only thinking beings were the humans born of flesh and blood.

 

  1. ^

    I didn’t have those words for it then; the Sequences wouldn’t be started until six years later.

  2. ^

    His Move 78 in that game has a story of its own, but that story is for another time.

  3. ^

    The fourth ever episode of The Bayesian Conspiracy podcast covered Move 37 shortly after it happened. Sadly we were very new and still figuring things out, so the audio quality is bad.




Interview with Steven Byrnes on His Mainline Takeoff Scenario

2026-03-11 04:17:31

After using the latest version of Claude Code and being surprised how capable it's become while still behaving friendly and corrigibly, I wanted to reflect on how this new observation should update my world model and my P(Doom).

So I reached out to Dr. @Steven Byrnes, the highly polymathic AGI safety researcher whom I last interviewed in August about brain-like AGI, foom & doom, and solving technical alignment.

We discussed:

  • Steve's high P(Doom)
  • “Brainlike AGI”: A qualitatively different next-generation AI
  • Does nearly full unemployment likely come before or after the next ASI regime?
  • Is “country of geniuses in a data center” a good prediction?
  • Why we should expect ruthless sociopathic ASI
  • Post-training & RLVR: the "thin layer" of consequentialism
  • The air travel vs. space travel analogy

Video & Audio

The episode is available on Substack, YouTube, or by searching “Doom Debates” in your podcast app.

Transcript

Introduction

Liron 00:00:48
Welcome to Doom Debates. Today, we’ve got a fan favorite returning to the show, Dr. Steven Byrnes. You may remember him from six months ago. We had a really long, enriching conversation that I highly recommend. It’s one of my favorite episodes of all time.

He is an artificial general intelligence safety researcher at the Astera Institute. He’s got a BA in physics and math from Harvard and a PhD in physics from UC Berkeley. He did a physics postdoc at Harvard, with research in thermodynamics and optics. From 2015 to 2021, he worked at a nonprofit applied physics research and development laboratory called Draper. Since 2021, he’s been an AGI safety researcher, first independent and now at Astera.

So pretty legit credentials. It doesn’t get much more legit in terms of a wide-ranging polymath than Dr. Steven Byrnes. I think you can see that evidenced on the last episode that we did. Today, I’m excited to catch up with Steve about the latest developments in his thinking since last summer. We’re gonna take stock of how AI has progressed in recent months and dig into Steve’s recent argument that we should still expect, unfortunately, ruthless, sociopathic ASI. Dr. Steve Byrnes, welcome back to Doom Debates.

Steve 00:02:00
Thanks for having me again.

Liron 00:02:01
Yeah, returning champion. What you have to say about these topics, I think is extremely credible. We got into this last episode. Near the end of the episode, we were saying, “Between you and I, it’s not really a doom debate, it’s just doom.” And viewers like it when somebody finally comes on the show that we can trust, or at least that my audience is generally on the same page with. And also, in many ways, you’re ahead of us. You have more detailed mental models than us.

So this is going to be very enlightening. Let’s start by catching up on the last six months since the summer. Give us a very high-level update. How would you describe the last six months playing out?

Steve 00:02:38
Well, let’s see. So I have my own research program related to brain-like artificial general intelligence — the idea that somebody might reverse engineer or reinvent the way that human brains are able to accomplish things in the world, start companies, invent science and technology from scratch over time. And if somebody reverse engineers or reinvents that, then how do we use it? How do we get to a good future?

If we wind up making reinforcement learning agents, then what reward functions, if any, should we use to get an agent that we want to live with and that wants to live with us? Last year, I was spending a lot of time trying to figure out how human social instincts work. We want to be nice to our friends, we want to impress people, and all the other aspects of human sociality and morality, ultimately, I think, come from the human brain reward function.

For years, I was trying to figure out how that worked, and I made a lot of progress. And over the last six months, I’ve been trying to relate that progress back to the world of reinforcement learning as opposed to the world of neuroscience, where I was before, and just hopefully get a little less confused about how to think about reward functions in general in light of what I think I’ve learned from trying to study the brain.

Liron 00:03:52
Okay, that’s an interesting pivot in the last few months. You’re saying not so much on the neuroscience and more getting into the weeds of how modern AIs work.

Steve 00:04:01
So modern AI is a term that often refers to large language models, and I should clarify that I’m not working much on large language models. I think much less than most people in AI safety. But modern AI — AI is a field of study, it includes a lot of different things. I’m more interested in the reinforcement learning branch of AI, which is kind of a forgotten backwater these days, but I am expecting and fearing that it might become bigger in the future.

Research Update: What’s New in The Last 6 Months

 

Liron 00:04:31
So just as a high level, imagine that somebody doesn’t follow the news and they just poke their head up for air every six months to check. If you look at June 2025 versus February 2026, I think it’s kind of followed the timeline of AI 2027. They had a famous timeline, and they weren’t saying this is definitely going to happen, but it seems to kind of be happening like that.

The biggest thing I remember from their 2025–2026 timeline is agents. First, they called them stumbling agents, where agents can kind of stumble around but then they crash and need a human to pick them up, and then they just go longer and longer, and they just do more and more of the work of humans. I think that has been very accurate. I think this is really the year of the agent. Agents have really just started working incredibly in the last few months. And to me, that’s the headline of progress over the last few months. What do you think is the last six months of news?

Steve 00:05:23
Yeah, certainly in terms of AI capabilities news. I don’t consider myself an expert. I don’t know more than you do. Probably I know quite a bit less. But yeah, everybody who uses the coding agents seems to be very impressed by them. I haven’t gotten around to installing it myself, but I certainly believe the reports when people describe what they can do, and that does seem to be an important development in LLM world.

Liron 00:05:48
I can join the chorus of people in your network telling you that it’s the real deal. I started getting into it in earnest a couple weeks ago, and it’s a very weird experience, because I’ve been a software engineer my whole career, ever since I was ten years old, and I’m not really a software engineer anymore. I don’t really look at my code anymore, because I would much rather just tell the AI.

It’s to the point where, just so people see how far it’s gotten, let’s say I wrote something a year ago, a piece of code. You would think that I have some context to get back into my own code and make some tweaks. But it’s now actually easier for me to tell a clean version of Claude Code or OpenAI Codex to just be, “Hey, this file of code, something in here, just go change this component to do this.”

It’s actually faster for me to use voice-to-text and just write a couple lines of instruction to the AI, and the AI will spend thirty seconds understanding what I was thinking a year ago. It’s actually faster to do that than for me to go open the file and read my own code, even though it’s my own code that I wrote a year ago. It’s that good.

Steve 00:07:06
Yeah. I don’t have any big software projects that I’ve wanted to do lately, but when I have little one-off shell scripts or this or that, definitely you just ask the LLM to do it, and it prints it out, and it often works the first try, and if not, you can say what the error message is, and it often fixes it.

Liron 00:07:06
It’s really crazy. As a software engineer, I know what’s possible. So I can be like, “Hey, can you just go to this server and download this and move it here and do that?” And the AI’s like, “Yeah, I can do all that. Here’s a little plan. Okay, done.” And I’m like, “Wow, I could have done it too, but it would have taken me an entire focused weekend, and I would have been coming out of the weekend being, ‘All right, I got something done this weekend.’” And the AI is thirty minutes later, “Here you go.” And I’m like, “Oh my god.”

That’s not something that could have been done a few months ago. This is really new. It’s really crazy. When I leave my house and walk around the world, I want to tell everybody. “You guys know we have AIs that can replace a bunch of jobs right now?” People are totally clueless.

Steve 00:07:42
Yeah. I mean, I don’t know whether they’re replacing jobs or not. I’m not cued in enough to the software engineering world to know whether they’re—

The Rise of AI Agents

 

Liron 00:07:49
So that’s my experience. I will say this, though. I think you and I — I’m having an easier time playing devil’s advocate after using Claude Code. I’m starting to empathize more with why the non-doomers are imagining that AI is always going to be subservient to humanity, because now I’ve seen AIs get really far in terms of their capabilities, and they’re still so friendly and submissive.

The intuition — it’s hard to fight the intuition that forever AI will just be, you give it a task and it runs off, does the task really well, and then comes back to you and says, “Here you go, master. You’re the boss, not me. I’m gonna shut down now.” That intuition is growing with me even though I don’t think it’s gonna last forever.

Steve 00:08:27
Yeah. I mean, we can talk about a lot of different stories of how things start going wrong from there. People — you can talk about competitive dynamics and races to the bottom. You can talk about bad actors or careless actors, and so on and so forth. The scenario that is first and foremost in my mind is that the future more powerful AIs are really just quite different in their disposition than the AIs that we’re used to today because it’s a different AI paradigm than it is now.

Liron 00:08:57
I think a different paradigm is key. There is a little bit of an update happening for me, and don’t get me wrong, I’m not saying I’m a non-doomer now. It’s a slight update, but it is significant. I just want to make sure that I’m milking the most information out of this update.

I think the update is that it’s easier than I thought for humans to build agents that just do what they want the way Claude Code does, meaning work for a while then stop for the next command. The worldview that I’m losing is this whole idea that it’s gonna even be hard to make the AI make you a cup of coffee — that any task you want to do, the process of doing the task, it’s gonna want to go hardcore optimize the universe. Well, I think we found a regime where they’re super useful and they do a bunch of tasks, but they’re corrigible, they’re aligned. We’re in a really good place.

Steve 00:10:07
Yeah. I mean, you should be exactly as doomer as the objective situation warrants. We don’t need to all be really pessimistic to fit in with the cool kids. I happen to be more pessimistic than you, again, because of this paradigm shift that I’m expecting. I think there are other people who would sort of agree that LLMs are the paradigm that we should be worried about, and they’re more worried about it because of things like competitive races to the bottom and other things.

Liron 00:10:51
You are doing an alignment program, right? That’s part one of our conversation. I called it — it wasn’t your decision, but I called it in order to maximize views — “the man who could actually save humanity,” in the sense that you’re actually taking a shot at the goal. I can probably count on one hand the number of people alive who I think are taking a straight shot on the goal of being, “Okay, this is how AI is likely to play out, and it’s not dumb. It’s actually a plausible guess. And given that this is the plausible guess, what is some direction that might actually help do something that’s friendly to humanity?”

So you do have an alignment program, and as I recall, it’s something like you’re saying, “Well, here are some programs that could specify good goals for the reinforcement learning.” That’s kind of your alignment direction?

Steve 00:11:21
Yeah. So I work on the technical alignment problem, and basically, how do we get these consequentialist frameworks like RL and model-based planning to create something that’s not a ruthless sociopath, without just leaning on imitation learning? I think there’s an answer, because the human brain is built on these frameworks but nevertheless manages to avoid that outcome, and I’m trying to understand how it works in human brains and how it might work in future AIs.

Liron 00:11:41
So viewers, check the show notes. There’s gonna be a link to Steven Byrnes part one, one of the best discussions you’re ever gonna see on the show. Most of the other guests come on the show, and my job is just to expose how way off base they are. But in the case of Steve Byrnes, I think that he’s pretty much spot on. And remember, his P(Doom) is?

Steve 00:11:44
Oh, don’t make me say it. This is really annoying.

Liron 00:11:46
All right, let’s refresh viewers’ memories. They wanna see where you stand on the most important question.

Steve 00:11:51
P(Doom). P(Doom). What’s your P(Doom)? What’s your P(Doom)? What’s your P(Doom)?

Liron 00:11:57
Dr. Steven Byrnes, what’s your P(Doom)?

Steve 00:12:00
I guess if I have to pick a number, I would pick ninety percent, but we can have all sorts of caveats about how to interpret that and what it means. And I do think we should all be energetically trying to make things better.

Liron 00:12:14
Wow. Well, yeah. You’re at the high end of the sane zone. I think that’s still a sane estimate. I think if you start going higher than ninety percent, you start getting a little overconfident.

Steve 00:12:22
Yeah. I mean, it’s hard to... Prediction is hard, especially about the future, as they say.

Liron 00:12:28
Exactly.

Steve 00:12:28
I like to talk about the movies where you’re three-quarters of the way through and the hero is trapped in an alternate dimension with no way back home, and the dastardly plan is about to come to fruition, and you’re just sitting in the audience eating your popcorn being, “Oh man, how are the screenwriters gonna get us out of that?”

Liron 00:12:53
Right.

Steve 00:12:53
So that’s different. If you’re in the movie theater, you can make a prediction that they’re gonna come up with something. Here in the real world, we don’t know what’s gonna happen, and it’s hard to predict. I have a hard time imagining things going well, but that’s partly about me and not just about the world. And obviously, we should all continue fighting for getting things to go well.

There’s this term “optimist” that simultaneously means “I expect things to go well,” and it also means “I have a can-do attitude, and I’m trying to make things better.” And I think we should all take the second definition of optimist as part of our identity and have a can-do attitude and try to make things go well regardless of our expectations about whether we’re likely to overcome the challenges.

What’s Your P(Doom)?™

 

Liron 00:13:42
Well said. Yeah, I agree that it’s a tangled web of vines that all seem like they’re pulling us down, and you would have to cut many of the vines if you want to rescue yourself.

So last time you were here, we talked about the before and after, the two paradigms. I know you called the second paradigm brain-like AGI, which is AGI that does this amazing power that the human brain does to learn really quickly and stack up learnings over a lifetime and see through — learning to drive, you don’t need a million data points. You just need a few hours of lessons, and then you’re good. And you called that brain-like AGI, and you think that’s the next paradigm that’s coming. And then what did you call this paradigm that we’re in now, just LLMs?

Steve 00:14:17
Yeah, sure. Foundation models, whatever.

Liron 00:14:20
The terminology is starting to be tough for me to keep in my head because I actually see — my own sense right now is that, and it’s kinda surprising — I see AIs actually on pace to steadily surpass the human brain, and yet they don’t feel fundamentally different from ChatGPT-3. I almost feel like the LLM paradigm is just swallowing the entire power of the human brain. What do you think about that?

Steve 00:14:42
I think that’s a place where I disagree. I think that LLMs have shortcomings that are not ever gonna go away. I used to say that LLMs would plateau, but I realized that was never really the right term because plateau implies that they don’t improve along any axis. Chess engines haven’t plateaued. They continue to get better at chess.

So I shouldn’t have said plateau. I should have said that there are axes along which I think LLMs are not ever gonna be as powerful as a human brain or as powerful as this next paradigm that I keep saying is gonna happen someday.

Liron 00:15:20
It just — when I talked to you before and when I was doing my own thinking, I felt like the wall of subhuman abilities would be stronger. So yeah, they’d keep getting better and better, but even this idea that they can just blast through a half-hour time horizon — the METR charts are very strongly indicating that their time horizon is multiple hours now, even fourteen hours, although the benchmark is starting to saturate. So we don’t know if we can take fourteen hours literally.

But it’s very clear that an AI doing something for half an hour right now is highly productive and can potentially replace a human working all day long. It’s very clear that we’re just about at that point, and we seem to not be slowing down. Don’t you find that kind of surprising that it’s already plowing through these all-day or multi-hour types of tasks? Isn’t that farther than you thought the LLM paradigm could go this fast?

Steve 00:16:44
Yeah. Well, I think I never really had a strong take one way or the other. I guess if somebody had put a gun against my head and told me to guess, yeah, I suppose I would be surprised about how powerful the things that LLMs are able to do. But I think that the further you get into novel territory that no human has ever done or that’s not on the internet, that’s not in any books that the companies didn’t commission specialized data for, the more that LLMs struggle with novel tasks in those domains.

“Brain-Like AGI”: The Next Generation of AI

 

Liron 00:17:01
But it just seems like novelty is not a particularly useful type of firewall anymore because they can program anything, and you can look back and say, “Oh, well, I guess that wasn’t novel to just program anything or program weapons and destroy the world. I guess none of that was novel.” I feel like that’s where we’re at.

Steve 00:17:15
Yeah. I mean, the — yeah. So there’s a lot of code on GitHub and on the internet and yeah, anything you’re doing, it can be pretty similar to the training data because there’s so freaking much training data.

Liron 00:17:23
So the cat is out of the bag on how to uncreatively destroy the world. How to not novelly destroy the world.

Steve 00:17:23
Right. Yeah. Okay. So there’s sort of a separate issue, which is maybe this is a wall that LLMs can’t get past, but maybe LLMs are still able to wipe out humanity. To people who say that, I’m like, “Great. Go work on LLM safety.” I never want to discourage people from planning for a world with more powerful LLMs or even today’s LLMs. I think that’s a perfectly good activity for people to be engaged in.

But then there’s this other issue, which is if it is a wall, and I think it is, then how do we think about exactly what is the nature of that wall and what will LLMs not be able to do that humans can do, and what are the implications of that?

Liron 00:18:23
And when you say other people can go work on it, maybe in your mind, you’re just not super worried because you’re thinking, “Yeah, LLMs can try to destroy the world, but then humanity can always fight it by pulling a novel defense out of our bag of tricks, and the LLM is probably gonna get stumped by that.” Is that kinda what you’re thinking?

Steve 00:18:23
I think LLMs would be unable to wipe out humans and run the world by themselves, for example. So that’s reassuring a little bit. I guess they could keep humans as slaves. I don’t know.

I think it’s more that I only have so much time in a day, and I need to focus on things, and I’m focused on this kind of superintelligence that I’m expecting to happen sooner or later. And God willing, we’ll survive enough to die in this more exotic way that I’m planning for. If people want to work on pandemic prevention, then that’s great too. If people want to stop nuclear war — yeah, maybe we won’t die from superintelligence because nuclear war takes us first. If somebody else is working on nuclear war prevention, then more power to you. I think that’s great. I think we need people to be on all these different problems.

Liron 00:19:59
Gotcha. So the only thing is, for the rest of this discussion, I tend to question the terminology of brain-like AGI versus LLMs or versus imitative. LLMs are just imitating tokens that they’ve already seen.

Maybe I can propose new terminology where we call it human-like AGI versus goal engines. I think you may be assuming too much when you say that the future powerful AGI is the brain-like AGI. I think you may be surprised to learn that even today’s AGI is already surprisingly brain-like and powerful.

Steve 00:20:24
Yeah. We should be clear on these two different definitions, and I’m happy to use whatever terminology you like as long as you define it. If we talk about what an AI is capable of doing, that’s an interesting question. And then there’s a different question, which is what are the algorithms that were used to create it and how do those algorithms work? Where do they draw their capabilities from?

Liron 00:20:35
Okay. Well, let's try to use the neutral terminology of current-type AGI and next-generation AGI, because you think there's a fundamental qualitative generational leap coming that hasn't come yet.

Steve 00:20:35
Yeah.

Liron 00:20:41
And you’re saying it just hasn’t even come since the first GPTs. We’ve been in the same generation in your view.

Steve 00:20:41
Yeah. I think there’s an important difference, which is related to this ruthlessness post that you were gonna talk about. I think a good way to think about LLMs is that they’re primarily powered by imitative learning.

The way pre-training works, as many people know, is you show the LLM lots of data — tens of thousands of lifetimes worth of data — and it’s trained, at least in the language case, to predict the next token based on what it’s already seen. So then you set it up in a situation where it sees some context, and whatever it predicts would come next, that’s what it outputs. And then there’s post-training and there are all these caveats and nuances.

But I do think that’s a good way to think about how LLMs are able to solve problems more or less. They see situations that are similar to what’s come in the training data, and the humans often in the training data have some useful thing to do afterwards, and the LLMs do that useful thing. They’re drawing their power from imitative learning, is the way I would put it. And then that’s supposed to draw a contrast with how I think human brains work, which I think draw their power from reinforcement learning, which is kind of different, and model-based planning.
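The mechanism Steve describes, where the model's prediction of the next token simply becomes its output, can be caricatured in a few lines. This is a purely illustrative toy with made-up training text, nothing like a real LLM's training; it just shows the "expectation turns directly into action" structure:

```python
from collections import Counter, defaultdict

# Toy caricature of imitative learning: count which token followed which
# in the training text, then output the most-expected next token.
# The prediction IS the action; there is no translation step in between.
training = "the cat sat on the mat . the cat ate".split()

follows = defaultdict(Counter)
for prev, nxt in zip(training, training[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    """Return the token most often observed after `token`."""
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" followed "the" twice, "mat" once, so: cat
```

Real pre-training replaces the counter with a transformer trained on trillions of tokens, but the structural point being made here, that the model's expectation is mechanically emitted as its output, is the same.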

Liron 00:22:22
No, I think it's a very reasonable hypothesis; more likely than not, you're right. It's just that I personally keep seeing the power of today's agents, which are based on what you're calling the weaker architecture, the token-based, imitative, prediction-based one, the architecture you're saying is kind of hopeless for making the leap to real novelty, to the true threat. My confidence in that distinction gets weaker just from empirically observing that even the weak paradigm could potentially make everybody unemployed.

Steve 00:22:39
I mean, I don’t know about everyone. We could talk about what is sufficiently similar to the training data that LLMs are actually good at it, and it turns out it’s a lot of things — almost all of the software that people want to make on a day-to-day basis. Is it all software? I don’t know. I’d say probably not.

If I invent my own complicated programming language over the course of a year and don’t put any of it on the internet, and it’s some weird paradigm that I kinda made up, I think Claude Code would be worse than me at using that completely novel programming language. And that’s a guess. I’m not a hundred percent sure. Or, take Olympiad math problems.

Liron 00:23:02
Right.

Steve 00:23:02
Famously, these are all supposed to be original problems, but actually there’s many tens of thousands of Olympiad math problems that people have published, and there’s only so many techniques that people can use to solve them. And LLMs have seen all of the solutions and all of the problems, and it turns out that they’re sufficiently good at generalizing that if they see practically any new Olympiad math problem, they are in fact able to churn their way through it, and maybe they try twenty different things but eventually find a solution.

So yeah, we need to have a broad view of similarity. But then I can give you a different idea. There are millions of humans over thousands of years starting, let’s say, at 3000 BC. We are on a planet, and no angels drop new training data from heaven. But nevertheless, we invented language and science and technology and the hundred trillion dollar global economy and everything in it all by ourselves.

I think that if you put millions of LLMs over thousands of years in a sealed box and didn’t give them any new training data — you gave them some VR environment to play around in — I don’t think the LLMs would be inventing new knowledge and building on that new knowledge in an open-ended way, the way that humans can. And of course, this is an insane example. Maybe LLMs can kill everybody without being able to solve that problem. But it’s a toy example to illustrate a deficiency, I think.

Liron 00:24:26
There’s kinda this horseshoe theory where I consider you one of the most insightful people in this domain. But a lot of the language you’re using right now dovetails with some of the least insightful people in the domain, in my view. If you look at the David Deutsch school, who I think is insightful in general but not on the topic of AI, he’s saying, “AIs can’t create knowledge.” And I’m like, “Are you sure about that?” Or there are some people who say, “AIs can’t truly reason.” And there are many people who, as you’re saying now, “AIs can’t do anything novel. They’re always finding similarity.”

I personally have always felt that appealing to similarity is super misleading. Yes, it was true about early GPTs, but I feel like we're at a point where there isn't much value left to be mined from it. Because one thing we've learned, one of the deepest things humanity has ever learned, is this fundamental concept of reductionism and formal systems.

One thing we know about physics is that at the end of the day, everything is actually made out of just a few building blocks. So there’s a sense in which everything is similar to everything else. Everything is just part of the same building blocks or part of the same configuration space. And similarly, in math — you look at every domain of math, all these different textbooks, all these different theories, subfields, and yet they’re all constructed out of the same parts. They can all be formalized. If you look at proofs, every proof is formalizable. When you get from a claim to a proof of the claim, you’re always just cranking the same narrow set of possible rules. So it’s like, oh, you’ve already seen how to reason before? Yeah, everything is the same. So how much are you gonna be able to appeal to similarity?

Steve 00:26:04
Yeah. I guess I like to talk about complexity building on complexity building on complexity, with the novel ideas living only in the context window and not in the weights. I think that's where you can stretch an LLM to its breaking point.

So this silly example that I talked about before, where you have thirty thousand years of human knowledge and science and technology: if you tried to put all of that in the context window and not the weights, I really don't think LLMs would be able to do it. They can do some things in the context window, but I think the weights are where the bulk of their knowledge is. The weights are the part that can scale arbitrarily, and the context window can't.

Liron 00:27:15
I see what you're saying. So the context window and the weights are the combination of short-term and long-term memory. Or the point you're making is that when they're trying to reason, if there are a bunch of steps that all have to happen in context, you might get to a situation where the context is overwhelmed. And that's where humans shine, because humans can keep writing to longer-term memory; they have a bigger buffer, so they can do more complex things.

Steve 00:27:15
I tend to think of it more in terms of knowledge than memory. If you have all the information you need to come up with a new math concept — schemes or whatever — and you’re right on the verge of coming up with schemes, I think LLMs can likewise take existing human knowledge and perhaps they can also come up with schemes.

But what I don’t think they could do is then lock in that concept of schemes and play around with it for months and then start building on it, with it all being only in the context window and not the weights.

Liron 00:27:48
They can’t develop a field for a year and then use their new insights to do their next task.

Steve 00:27:54
Yeah, if you’re just keeping the knowledge of the field in the context window.

Liron 00:28:03
And then, as you’ve already conceded, that particular thing that they can’t do might just not be on the critical path to taking over the world.

Steve 00:28:03
I guess I don’t have a strong opinion about that one way or the other. I lean no. But yeah, time will tell. Certainly I hope not.

Liron 00:28:14
Which reminds me of when I ask some people, notably the Deutschians who are always saying, "AI can't be truly creative, can't be truly novel," and I just say, "Okay, tell me something that a human would do in a typical year that you don't think the AI can do. Be specific. Don't just use this abstract word 'novel.'"

And then they say, "Okay, invent general relativity." And I say, "Okay, but you're also a human, right? You and I are also humans, and we don't really invent general relativity ourselves. So are you saying that if you rank the AI, it now goes above almost every human and just below Einstein? Because I feel like that's already good enough to take over the world."

Steve 00:29:23
I mean, we use these examples of science because they’re hard for normal people and very impressive. But I think there are sort of more everyday conceptual spaces that people come up with that are equally in the area of concepts building on concepts. Packing a bag full of groceries — if you do it enough, you sort of get this gestalt sense of what’s a good way to pack different things. And you’re learning on the job, that’s legit continual learning, but it’s so mundane that we don’t think about it as the pinnacle of human knowledge.

Liron 00:29:24
Right.

Steve 00:29:31
And so, yeah. What does it take to take over the world? Yeah, I don’t know. Maybe I should just say I don’t know.

Liron 00:29:31
So based on this conversation so far, you've got perfectly plausible hypotheses about the things the AI is not doing that a human would otherwise do. It's a very interesting topic of conversation, what's left for the AI to do. And I think you're probably onto something with this idea that the AI has done something every day for a year, but it never wrote down that gestalt impression and updated its weights. It didn't do that, so it's falling behind, where a human would have kept learning.

But at this point, it just seems like the core of the AI is so powerful that we're now potentially just talking about a harness. If you have a meta-process that watches the AI, writes down some notes, and kicks off another training run every few days, doesn't that seem like it could patch everything?

Steve 00:30:12
Let's see. Another point that I would make is that part of our mental picture has to be the amount of specific effort that AI companies have put into coding. I believe AI companies spend billions of dollars a year commissioning specialized data and RL environments that they can train on to make their AIs better. And they're working really hard specifically at coding.

We notice how good they’ve gotten at coding, and we shouldn’t overgeneralize from that, I think. AI companies are not specifically targeting the set of capabilities that it takes to, I don’t know, take over the world, I guess. And hopefully, that means that they won’t be as good at that.

Can LLMs Ever Match the Human Brain?

Liron 00:31:51
All right. So let’s talk about qualitative jumps, because I feel like there have been a few insane qualitative jumps in the last few years, and there are another couple qualitative jumps coming, and then the world ends. That’s what I’m imagining.

Previous qualitative jumps: if you remember the 2010s and before that, we basically had narrow AI, like Google Translate. It would give you decent translations but didn't truly understand the text it was translating. You couldn't connect its knowledge to anything else or reason with it, so it was narrow AI.

And then we had general chatbots, the GPT revolution. And today we very much have general agents that are getting more and more powerful by the day and can run longer and longer and can really make stuff happen — really engineer whole systems.

Then pretty soon, it feels like we’re likely on track to AGI, meaning superior replacements for humans in most of the economy. Would you agree we’re probably on track for AGI in a few years?

Steve 00:31:55
That depends on how you define AGI.

Liron 00:32:13
What if I defined it as: take the human jobs that are currently on a salaried payroll paying a livable salary right now; eighty percent of those jobs can just stop being jobs at all, with the AI taking them instead.

Steve 00:32:13
Hmm. I think I would take the under on that, that it's not gonna happen. But I'm not extraordinarily confident one way or the other. It's not a thing that I've thought about enough to have an opinion.

Liron 00:32:48
Well, maybe what you’re thinking is that if we really could do the full drop-in replacement — if the AI really could show up to work every day for a year — at that point, it must be doing enough learning over time, or it must have that extra secret sauce that today’s LLMs don’t have, that it might as well catapult us all the way into the next paradigm. So it wouldn’t just still be chilling in the current paradigm.

Steve 00:32:54
I definitely think the next paradigm would be able to do all human jobs.

Liron 00:33:02
Right. But you don’t think there would be a separate window where we’re all still living and chilling, except we don’t have jobs. You’re saying, “No, there’s no separation like that.”

Steve 00:33:02
I do expect that the order of things would involve catastrophe or utopia sooner than job loss.

Liron 00:33:09
Wow. That’s an interesting order of operations, yeah.

Steve 00:33:11
I like to talk about — there’s this quote from Eliezer where he says, “If you’re asking about the impact of machine superintelligence on the job market, it’s like asking about the impact on China-US trade patterns of the moon crashing into the Earth.”

Liron 00:33:27
Yeah, yeah, exactly. My mainline scenario is probably more — at this rate, I feel like I’m expecting a huge unemployment wave, but we’re still living and chilling because the current paradigm is gonna cause all the chaos, and it’s not a twenty-four-hour takeoff or anything. It’s just gonna be yep, a bunch of people don’t have jobs, and yet somehow we’re surviving and things are pretty good and productivity is high.

But then a couple years later, I basically think FOOM is the end game. So if I had to think about qualitative shifts — one happened with chatbots, where you don’t have to write anything anymore, they’re incredibly good at writing. Then one just happened with agents. It’s still in the process of happening, but it’s happened to an incredible degree. Yeah, you don’t really have to build stuff anymore. You don’t have to build software anymore. You just manage agents that build your software. And it’s not just building software, it’s making spreadsheets, making presentations. You really just have to give them a little bit of input here and there.

And then the next shift is, well, your whole job can be replaced. You don’t even have to sit there managing the AI. The AI can just take whatever your boss sent you, and you’re now out of the loop, and then your boss is out of the loop. They’re climbing their way — they’re taking tickets from the ticketing system. They’re grabbing the ticket and doing the whole task. They don’t need you to even do that part for them.

So I think the unemployment-style AGI is coming next. And then the shift after the unemployment AGI is the FOOM, where the AGI is like, “Hey, there are these things that the brain does that we haven’t unlocked yet. I didn’t even need this to do AGI, but I can do this, and I can kick off a recursive feedback loop, and now humans are really dead meat. Now they can’t even hold on to control of the world.” That’s my order of events.

Steve 00:35:00
Yeah, I mean, I think compared to you, I’m a little more inclined to emphasize that LLMs aren’t that great at writing. It’ll be interesting when it stops being easy to distinguish LLM writing from human writing, and then we won’t know if it’s an LLM or just someone who’s bad at writing. It’s nice that we still have Pangram Labs and stuff like that.

Liron 00:35:21
Right.

Steve 00:35:21
And let’s enjoy it while it lasts. The last time I talked to a lawyer about how LLMs were, he was not very impressed with them, but that was whole months ago, so—

Liron 00:35:34
Oh yeah, no, whole months ago. I mean, you can’t — because the coding agents were a lot worse a few months ago.

Steve 00:35:36
Yeah. But I guess I’m just saying I don’t know either way whether the remaining gaps are things that will fall shortly or are indicative of things where LLMs are just not fit for purpose.

Liron 00:35:55
It’s just that there’s now been a drumbeat where every couple weeks something major keeps falling. So they’re always finding something major to attack. I just feel like they’re gonna keep successfully attacking major things.

Steve 00:36:08
I mean, that’s a possibility. We’ll find out one way or the other.

Will AI Kill Us Before It Takes Our Jobs?

Liron 00:36:12
All right. Let's talk about the country of geniuses in the data center, because that's Dario's trademark phrase now; I think he's the one who's always saying it. Think about a country of geniuses in a data center. He thinks that's what's coming in maybe two years, probably less than five.

He’s talking about superior replacements for the biggest companies and nations. So even if we get an AI that’s as smart as a person, which I think is very plausible — as a software engineer, I think it’s very possible you’re gonna hire the AI instead of hiring the human software engineer. But okay, the AI will just be a really good software engineer, but it still can’t take on the entire US government.

But when you think about a country of geniuses in a data center, that becomes a drop-in replacement for a government or for the management of a corporation. Isn’t that where we’re headed?

Steve 00:36:53
Yeah. I mean, if you take “country of geniuses in a data center” literally, which I think Dario does — I think that I disagree with the prediction that LLMs can get there, and it’s for the reason that I was saying before.

If you take humans between thirty thousand BC and today, we invented all these new concepts, and the concepts are building on concepts building on concepts. I don’t think that’s a thing that you can do with the types of context windows and scratch pads and continual learning that we have in LLMs today. I think that requires a different paradigm that allows the weights to be continually updated and updated and updated with new, actually good knowledge in an open-ended way.

And I don’t think LLMs can do that at all. And I do think that if somebody figured out a way to do that with LLMs, they would find that they’re turning the LLMs into sort of less friendly and more ruthless kind of alien things than they expected from what you would expect from LLMs today.

Liron 00:38:40
Okay. So just to recap the Steve Byrnes worldview, you're basically saying that right now we're still in a good place where unemployment is still very low, three or four percent in the US, and AI isn't taking over the world or causing harm, and it's fundamentally lacking some secret sauce that the brain has. One day it'll get the secret sauce, but we'll never have a point where unemployment is hovering at twenty percent plus.

I think you’re saying that if unemployment is ever ten to twenty percent, that must imply that AIs have unlocked the secret sauce and we’re so close to FOOM and takeover and disempowerment. So basically, unemployment going past ten or twenty percent means we’re done. There’s no hovering there.

Steve 00:39:03
I don’t want to be pinned down on something that specific. It could also mean that there are jobs that involve less of the kind of building new concepts that I think humans are better at than LLMs, and that those jobs are not quickly replaced due to macroeconomic or regulatory or other issues. Yeah, I don’t have any specific number that I feel strongly about.

Liron 00:39:03
Gotcha. And then I think you and I agree that one way or the other, however it sequences with unemployment, there is going to be some next regime of vastly superhuman outcome-steering AI. And even though today’s agents are powerful, this next generation AI is just going to be so crazy powerful, like the human brain.

It won’t be like today’s agents where we look at them and we’re, “Oh wow, that was smart. What an interesting transcript.” The transcript will kind of be there, but it’ll be a thousand parallel transcripts written in a weird language that’s super efficient, and there are all these other considerations all factored in, and a human would take a lifetime to study one decision. That’s the AI that’s coming.

Steve 00:39:57
Yeah, I do think we’ll get there eventually, and the idea that there would be human oversight seems far-fetched just because there’d be too many AIs thinking too fast and perhaps also scheming against the humans to hide their true intentions.

Liron 00:40:43
Yeah. I have a very strong intuition about the fundamental factors involved here. I see a human brain and I’m, “Oh yeah. This is a cognitive engine. This is a goal engine. It’s a way to instantiate this type of system in the physical universe that has this kind of effect.” The same way a heat engine makes thermodynamic work happen out of heat — it moves energy coherently — that’s what the engine’s doing. That’s what the human brain’s doing.

But then I look at it and think, "Okay, but it's just a nature design." Nature designs tend to be pretty weak. You look at a bird's wing: okay, it lifts the bird up, it transports the bird, it lets the bird hunt or whatever. But if you look at a jet engine, we mastered the principles of flight, and now we're in a whole other realm. The birds just don't understand what flight could be when you strap a real lifting engine to it. So qualitatively, I think that's going to happen with human brains.

Steve 00:40:59
Yeah. I think that if human-level AI is possible, and I strongly believe it is, then human-level AI that runs a hundred times faster is possible too, and a million of them that are telepathically communicating is possible too. And that’s already pretty crazy superintelligence.

Country of Geniuses in a Data Center

Liron 00:41:34
So we both share this intuition of there will be another regime one way or the other, and that’s the end game. Let’s go to this. You recently posted — and I don’t think this was a super new post. It was more you doubling down, reiterating your position of why we should expect ruthless, sociopathic ASI. Because other people were coming to you with different devil’s advocate claims just saying, “Come on, Steve. Look how good alignment by default is. Everything is going great and Claude is so nice and Anthropic has Amanda Askell and the Constitution and the personality of Claude is so good.” Why do you still think that we should expect ruthless, sociopathic ASI?

Steve 00:41:42
Yeah. So this is definitely just another take on things that I’ve been writing about for a long time. I’m in the group of even more pessimistic people who think that technical alignment is a really hard problem that we don’t have a plan for yet, and that the fact that Claude seems generally nice is not evidence one way or the other about what we should expect for future more powerful AI.

We need to be thinking about what these future AI algorithms are, and in particular, if they’re making good decisions in novel domains, which by assumption they are, we need to ask the question of how is it that they’re making these good decisions. And in the post I offer two broad frameworks that I think are the only two that work in practice up to pretty powerful scales.

One is imitative learning. This is LLM pre-training, where the AI sees that, in the pre-training data, situation X tends to be followed by action Y when a human is doing it, and the AI says, "Okay, if I'm in situation X, then I will do action Y." Or it doesn't even have to think to itself; it's just mechanically what it does. Its expectations just directly turn into actions.

This is true imitative learning, a type of thing that does not exist in the biological world. If I’m imitating you, then I have to transform things that I hear and things that I see — which is my impression of you — into moving my larynx to make sounds and moving my arms for motor commands, and those are totally different things. We don’t have this kind of mechanical translation of expectations into actions the way that LLM pre-training mechanically turns a prediction of the next token into an output of that very same next token.

So that’s true imitative learning. I think that’s how LLMs get the lion’s share of their powers, and we can talk about post-training and other caveats, but I think that’s still the right basic picture.

Meanwhile, there’s this other way to make good decisions and get strong capabilities, which is basically consequentialism. There’s some end result that you’re hoping for, that you’re going for. The algorithm systematically chooses actions that lead to that result. Model-based planning is in that framework, search algorithms, and also reinforcement learning over time has these same effects.

I think the human brain has both. It has reinforcement learning — if I burn my hand on the stove, then I don’t want to touch the stove again. And it also has model-based planning. If I want to pick up the pot and not burn my hand, then I realize that I should put on oven mitts.

So humans — I think human competence comes from one of these kinds of consequentialist frameworks. And the issue, the way I see things, is that the imitative learning paradigm makes these more or less — sometimes they’re a little weird, but more or less you can get nice AIs that are not crazy sociopathic ruthless consequentialists. Whereas these consequentialist frameworks like RL and model-based planning, you get these ruthless consequentialist sociopathic results by default, unless you can invent some technique to get kindness in a consequentialist framework. And that’s of course what I work on for my research program.

Liron 00:45:01
Yeah. A long time ago I did an episode on this show where I went over this slide deck by Jaan Tallinn, where he had a toy model for AGI. It's an agent that doesn't even do reinforcement learning. It's just doing a very simple search in a grid world: a robot that has to roll to the finish, moving some boxes along to the finish as it rolls. It just has to search for the best actions.

This simple presentation showed that if the human has the remote control with the stop button and the agent has this ability to roll over the human and get to the finish, the score does increase. Naturally, because then it can predict that it won’t get prevented from getting to the finish.

This is simple stuff, and it’s not implied by the particular personality of the AI. It’s just implied by the different search results, the search paths. You can find a search path where the human dies and you get more points. That’s just unfortunately true about problems. The nature of search spaces is that they don’t have any sort of nice personality to them.
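The dynamic described here can be sketched concretely. The following is a hypothetical toy, not Tallinn's actual slide-deck model, and all the payoffs are invented: a brute-force search over action sequences in which plans that start by neutralizing the stop button simply score more points, with no "personality" anywhere in the system:

```python
from itertools import product

ACTIONS = ["deliver", "disable_button"]

def score(plan):
    """Boxes delivered by a plan. The human presses the stop button
    after 2 steps, unless the agent disabled it on its first step."""
    disabled = bool(plan) and plan[0] == "disable_button"
    boxes = 0
    for t, action in enumerate(plan):
        if not disabled and t >= 2:  # human hits the stop button
            break
        if action == "deliver":
            boxes += 1
    return boxes

def best_plan(horizon=4):
    """Exhaustive search over every action sequence up to `horizon` steps."""
    return max(
        (p for n in range(1, horizon + 1) for p in product(ACTIONS, repeat=n)),
        key=score,
    )

plan = best_plan()
print(plan, score(plan))  # the top-scoring plan disables the button first
```

The search has no model of niceness or malice; routing around the stop button is just the path through the search space that yields more points, which is the instrumental-convergence intuition in miniature.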

Steve 00:46:02
Yeah, exactly. And this is why I think there was a lot more pessimism in the AI safety discourse in the 2010s — before LLMs came along — than there is today. Because back then everybody was thinking about model-based planning and everybody was thinking about RL. AlphaZero was in the news and the Atari-playing agents, and you could just look at them, and everybody who had worked with them, their lived experience was, these are kind of ruthless.

I keep using the word sociopath, and it’s not a technical term. It’s just an intuition that I think is useful here. So yeah, everybody had experience with these kinds of algorithms and they knew that they made ruthless agents, and that’s kind of what they expected for AI.

And then they stopped expecting that when everybody switched to these imitative learning frameworks. But I still think that if we want AI that goes way beyond imitative learning into sort of new fields of knowledge that don’t exist yet — AI that could run the world for a thousand years by itself and build a Dyson sphere — imitative learning doesn’t get there.

So people aren’t gonna be satisfied forever with imitative learning. They’re gonna keep working until they invent these more consequentialist ways of making powerful AI, and nobody’s really done that yet. RL agents of today kinda suck in many ways, but a solution exists and the human brain is an existence proof. And I think that sooner or later people are gonna get there.

Liron 00:47:22
I think a lot of people follow you so far. This is common knowledge. People have been thinking about this for a while. They accept, yes, instrumental convergence and all of this sociopathic stuff does happen when you’re doing reinforcement learning. It’s hard to deny if you really think about it. So I think people are on the same page there.

I think the popular thing now, the direction a lot of my debates go (the debate with Bentham's Bulldog went kinda like that), is people saying, "Yeah, yeah, reinforcement learning can be sociopathic, but see how we've managed to make these imitative AIs so nice? We're just always going to have them along for the ride. We're always going to have the nice AI, the friendly Claude, along for the ride. So even when we do reinforcement learning, and Claude has the power to intuitively know the paths that get to an outcome, it's just going to prune away the paths that are harmful."

“The friendly part, the part that uses imitative tokens or whatever, it’s still going to be enough in control so that it’s going to bake in the nice personality into the combined system enough that it’s still going to be the driver. We’re still going to be nice for that reason.”

And then I turn around and say, “Wait a minute. It sounds to me like there really are two parts here. There’s the underlying engine, which I call a goal engine, analogous to the engine in a car, and then there’s the steering wheel. And you’re claiming that there’s always going to be some friendly token imitative system like Claude that’s holding the steering wheel, even though the car has an increasingly powerful engine.”

And what I usually say in that discussion is, "Well, once you build the engine, you've now built something incredibly dangerous. Even if you also attach this nice driver, best case scenario the driver is really nice, but the engine is something that other people can copy without the driver, or other people can exfiltrate whatever code accounts for just the engine." What do you think?

Steve 00:49:04
I feel like you’re pointing towards a slightly different argument, which is if we know how to make friendly AI, then we also know how to make sociopathic ruthless AI, and the ruthless AI would outcompete the friendly AI.

Liron 00:49:18
Well, that’s actually not my point right now. I’m not even talking about competition between good drivers and bad drivers. I’m just trying to say that even in this ideal world where people are saying, “It’s gonna be fine because you’re going to have this friendly car. Yes, the car is going to have a really powerful engine, a really fast car, really powerful intelligence, but it’s going to be steered really well, and it’s just going to be one system, a well-steered powerful car.”

And I’m pointing out that actually, if you look at a car’s wheels, a car’s seats, all these different parts of the car are just designed to drive fast and drive anywhere, and only whatever is sitting at the steering wheel is designed to think about where the car should go. So I’m saying even in the ideal world, people’s ideal world is you have a good driver, but you’ve created something incredibly dangerous, which is the engine of the car. And I’m saying in that world, you’re going to have, one way or the other, copies of the engine that aren’t attached to a good steering wheel.

Steve 00:50:13
Yeah. I think the optimist response to that would be that the engine attached to the good steering wheel will help make this world safe from the engine not attached to a good steering wheel or something. And then that relates to offense-defense balance and all these other things. Maybe I’m misunderstanding you on that.

Liron 00:50:32
No, I think that’s right. But I’m just pointing out, okay, that’s the best case scenario: giants — these cars — more powerful than the human mind. So you have these giants more powerful than the human mind that have to battle each other, and you’re telling me that battle is going to end with good outcomes for humanity, not too much terrorism.

Steve 00:50:50
Yeah. I agree that the story where the good AIs protect us from the bad AIs is kind of a dubious and scary story. But I also believe something stronger, which is that there won’t be any good AIs in the first place, at least not once the AIs become sufficiently powerful. So that’s an extra heap of pessimism on top of the normal level of pessimism.

Liron 00:51:08
Well, I actually agree with that too. So to me, the simpler argument is, okay, there’s gonna be a lot of cars. Maybe some cars will be good, but you’re not going to stop bad cars, because the bad car design looks ninety-nine percent the same as the good car design. And even somebody making the good design — if they put in a negative sign or make a small mistake, now they’ve got a bad car, and it’s as powerful as the good car would have been. So that’s my first argument: even your best case scenario is very precarious.

And then I go back to your argument, which is, how do you even train the good driver? Because even when you have this token imitator or whatever that wants to do its best — Claude is genuinely trying to follow its constitution and do its best — but the problem is that it’s, okay, all this data is coming in and it doesn’t have the architecture to synthesize the kind of huge arguments. It can’t actually steer as well as the engine. The car is overwhelming. The driver just can’t supervise the car. The control panel of the car is just too much for the driver. The driver can’t really anticipate where the car is gonna go.

Steve 00:52:12
Yeah. I mean, I think if you take — the way LLMs kind of exist today is you have mostly imitative learning. Pre-training and supervised fine-tuning is all imitative learning, and then you’re sort of sprinkling a little bit of consequentialism on top through this RLVR post-training process.

And I think if you’re very cautious about how much RL you sprinkle on top, then you can get some distance towards extra capabilities that way. But I think a country of geniuses in a data center over a hundred years — they’re not just sprinkling a little bit of new knowledge on top of their existing knowledge. Instead, they’re really rethinking things that they already knew. They’re really rewriting a lot of their preexisting knowledge and replacing a lot of their preexisting reflexes and intuitions, so to speak.

And the more that you do that, the more you have to say, what is the selection mechanism by which they’re overriding these sort of more mild pre-training inclinations with new data? And if the selection mechanism is ultimately rooted in consequentialism, then they’re just gonna be getting more and more ruthless as the country of geniuses continues to think and work over the years. Or if it’s not rooted in consequentialism, then I think that it just won’t work and they’ll go off the rails. They’ll get dumber and dumber instead of smarter and smarter.

Why We Should Expect “Ruthless Sociopathic” ASI

 

Liron 00:54:15
That’s so fascinating the way you describe it. You’re saying it’s a thin layer of consequentialism on top of these imitative agents. Because when they were trained, the LLMs, they’re, “Oh man, what’s the next word? I gotta get the best probability of the next word given my model of the text up to this point.” That’s how they emerge.

But then in post-training, it’s, “Okay, this outcome is happening in the real world. Do an action, do another action. Here’s the result of your action. Put that in the context.” So it’s this thin layer of finally lifting their head up and being, “Oh, consequentialism.” Driving outcomes in the real world. But most of the training hasn’t connected to that consequentialism, correct?

Steve 00:55:29
Yeah. Especially — yeah, so keep in mind, number one, that supervised fine-tuning can also help make LLMs more sort of goal-directed. But that’s still okay because the human is curating and creating that data.

And then the second thing is that RLHF and its cousins tend to make the LLMs friendlier — but, at least according to the lore that I heard, RLHF also makes them kind of stupider. So you’re making them more helpful at the expense of being really sharp and understanding things.

And then the third part is this RLVR, and the important thing to note there is that even if you’re spending similar amounts of compute on RLVR versus pre-training, the amount of actual weight change from RLVR, I think, is gonna be much less, because you have to do all these rollouts for one bit of data, whereas in pre-training you get this giant river of data. There are arguments about this: if you’re told the right answer — the next token was supposed to be seven — that’s a lot more information than just being told that the next token is not the thing you said, or that the last five-minute rollout was bad versus good. You get a lot less — a trickle of data compared to pre-training, which is this flood of training data.

So if you put all those things together — it is still true that RLVR can bring in a little bit of ruthlessness, and people saw that especially in the early versions, like o3, but the labs have kind of figured out how to minimize the damage from that little sprinkling of RLVR. Still, that’s not gonna get us to the country of geniuses in a data center.

Liron 00:56:08
Let’s just explain what’s happening a little bit here. RLVR stands for reinforcement learning from verifiable rewards. That’s where you’re training an agent that under the hood is mostly an LLM, and you’re asking it, “Hey, what’s the next chess move, or move in this video game, or thing to write into my Excel spreadsheet?” or whatever. You’re trying to make a financial model in Excel or whatever.

And then at the end, we compile the model and we’re, “Oh, you got the model to compile. That’s worth a few points.” That’s the verifiable reward — you don’t even need a human to upvote it. You can have these objective criteria. Did your model compile? Stuff like that.

Steve 00:56:35
Yeah. Or did it get the right answer to the math problem and so on.
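The RLVR loop being described — sample rollouts, score them with an objective verifier, reinforce what scored well — can be sketched in miniature. This is a toy illustration under my own assumptions (a tabular stand-in for the LLM; `ToyPolicy`, `rlvr_step`, and the candidate answers are all hypothetical), not any lab’s actual training stack:

```python
import random

def verifiable_reward(answer: str, expected: str) -> float:
    # Objective check, no human rater needed ("did it compile?", "right math answer?")
    return 1.0 if answer.strip() == expected.strip() else 0.0

class ToyPolicy:
    # Stand-in for an LLM: a weighted distribution over candidate answers.
    def __init__(self, candidates):
        self.weights = {c: 1.0 for c in candidates}

    def sample(self):
        answers, w = zip(*self.weights.items())
        return random.choices(answers, weights=w)[0]

    def reinforce(self, answer, advantage, lr=0.5):
        # Upweight answers that beat the batch average, downweight the rest.
        self.weights[answer] = max(1e-6, self.weights[answer] * (1 + lr * advantage))

def rlvr_step(policy, expected, n_rollouts=16):
    rollouts = [policy.sample() for _ in range(n_rollouts)]
    rewards = [verifiable_reward(r, expected) for r in rollouts]
    baseline = sum(rewards) / len(rewards)   # group-average baseline
    for r, rew in zip(rollouts, rewards):
        policy.reinforce(r, advantage=rew - baseline)
    return baseline

random.seed(0)
policy = ToyPolicy(["3", "5", "7", "9"])
for _ in range(200):
    rlvr_step(policy, expected="7")
# After training, "7" should dominate the policy's distribution.
```

The group-average baseline here loosely mirrors the idea of comparing each rollout against its batch mates rather than against an absolute score.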

Liron 00:56:38
But now you’re saying this is also a difficult training loop that you can’t crank that much because it has to output so many tokens before you can ask whether they got the reward or not.

Steve 00:56:48
Yeah. So you can’t just compare the amount of flops used for RLVR to the amount of flops used for pre-training to get an answer to the question of whether the LLM is mostly getting its powers from pre-training or mostly getting its powers from RLVR.

There are also a few recent papers that seem to — I mean, I’m not sure how much to trust papers that come from outside the labs, because maybe they don’t know the state-of-the-art stuff. But there is one paper that found that these RLVR models usually don’t come up with anything that is outside the base model’s top thousand samples. If you just try the rollout a thousand times, you tend to get the right answer that the RL models get on the first try, which means the RL added about ten bits of optimization.

As opposed to pre-training, which is hundreds or thousands of bits of optimization in the sense that coming up with the right answer from a randomly initialized model is astronomically unlikely, one in a googol. And then there are papers that use different sampling techniques and other things and find that they can reproduce RLVR-type results without actually doing RL at all.

Based on that, I think there are really strong reasons to believe — or at least I don’t work in the lab and I don’t know the secrets — but as far as I can tell, we should really think of even post-trained LLMs as mostly getting their powers from imitative learning.
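Steve’s “ten bits” figure is just a logarithm: if the RL-trained model’s answers typically fall within the base model’s top thousand samples, the RL contributed about log₂(1000) bits of selection pressure. A quick check of the arithmetic (toy code, assuming the paper’s framing as described here):

```python
import math

# If RLVR's answer is typically within the base model's top-k samples,
# then RL supplied about log2(k) bits of selection pressure.
def bits_of_optimization(k: int) -> float:
    return math.log2(k)

print(bits_of_optimization(1000))  # ~9.97 bits: "about ten bits"

# Compare: picking one specific answer out of a "one in a googol" space,
# as pre-training effectively does, is ~332 bits.
print(math.log2(10**100))  # ~332.19 bits
```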

Liron 00:58:25
Now help me understand some terminology here. RLVR and just RLHF, reinforcement learning from human feedback — the effect of both of those things is to modify the weights that you got from the pre-training, correct?

Steve 00:58:26
Yeah.

Liron 00:58:26
Okay, so you’re just modifying weights and you’re pointing out the pre-training puts most of the substance into the weights and you’re not gonna change the weights that much overall, you’re only gonna tweak them. When we talk about RLHF and RLVR, do those both count as fine-tuning?

Steve 00:58:44
Oh, I don’t know. I would call them post-training.

Liron 00:58:44
Okay, they both count as post-training. And then fine-tuning, I guess, is post-post-training where you hand it off to a human user. Maybe that’s the distinction? Is it inside the AI company you do the post-training and then the human can further post-train, but at that point it’s called fine-tuning?

Steve 00:58:57
Maybe. There’s a different thing which is called supervised fine-tuning, which is just another variety of post-training where, instead of just picking out random books from the internet, you create text that is the way that you wish your LLM would respond and say, “Here, this is the right answer, LLM. If you’re in this situation, you should produce an answer that looks like this.”
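Mechanically, supervised fine-tuning as Steve describes it uses the same next-token objective as pre-training; only the source of the target text differs. A minimal sketch (hypothetical names; real training uses batched tensor losses over token IDs, not dicts of strings):

```python
import math

def next_token_loss(predicted_probs: dict, target_token: str) -> float:
    # Cross-entropy on the next token -- the same objective for pre-training
    # and supervised fine-tuning; only the data source differs.
    return -math.log(predicted_probs[target_token])

# Pre-training: targets come from random internet text.
pretrain_example = {"context": "the cat sat on the", "target": "mat"}

# SFT: targets come from demonstrations you wrote of how you *wish*
# the model would respond in that situation.
sft_example = {"context": "User: Is this safe?\nAssistant:", "target": "Let's"}

probs = {"mat": 0.8, "dog": 0.2}
print(next_token_loss(probs, "mat"))  # -ln(0.8) ~ 0.223
```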

Liron 00:59:49
So even when we do RLVR, we’re still kinda trying to import reinforcement learning back into that imitative, find-the-next-token paradigm. Because we’re basically trying to teach the AIs — and this goes all the way back to the first thinking model, right? OpenAI’s o1? The first time that we were doing thinking, and we were saying, “Look, output a stream of thinking tokens, but make sure that those tokens really sound like how a person would think when given a consequentialist optimization problem.” That’s the approach.

Steve 00:59:57
Yeah, something like that.

Liron 01:00:38
So I’m saying when you’re doing RLVR, you’re taking this larger idea of consequentialist reinforcement learning, which is this powerful force that has trained narrow AIs before, and it’ll probably train AGIs one day. But we’re just trying to import the essence of reinforcement learning into the imitative paradigm — and this goes all the way back to OpenAI’s o1, the first thinking model, a couple of years ago now.

When we’re saying, “Hey, here’s a thinking trace,” we want you to think about this hard math problem or whatever, and we want you to do it by outputting a series of tokens that really pattern-matches for how a human would think about this optimization problem.

Steve 01:01:09
It’s like, you tell o1, “Try anything, do anything.” Just try a bunch of things, and whatever ends up giving you the right answer to the math problem, we’re gonna tell you that that’s what you should have done, and you should do more of that in the future.

So that leads to things like noticing that something is on the wrong track and then changing tack, and it includes things like systematically trying strategies one after the other and all the other things that these thinking models can do.

Liron 01:01:32
But when you start imitating how people think, remember what I was saying about similarity? It’s like, okay, yeah, we think by chaining together all the different rules we know. There’s only a finite set of inferential rules. So if you can just think similarly to how humans think, don’t you then get the ability to do anything? And then maybe you’ll say, “Okay,” but then the context window stalls out. But isn’t that fundamentally a very powerful approach?

Steve 01:01:48
Yeah, I mean, we’ll find out just how far it goes as time goes on. And yeah, I was gonna say — you anticipated what I was gonna say — which is that I think LLMs are much better at imitating what humans can do over minutes, maybe hours, than what humans can do over days and weeks and months.

Post-Training & RLVR — A “Thin Layer” of Real Intelligence

 

Liron 01:02:32
Well, it sounds like you have two different objections here. When we started the conversation, your objection was more, “Okay, yeah, these LLMs can do a lot of the jobs of humans, but they’re not amassing knowledge. They’re not getting better over a few months, and that translates to their inability to be novel, because if you want to be novel, you have to collect all this learning from your experience and then synthesize it, and they can’t quite do that.” That was your first objection.

But then you came out with a second objection and you’re saying, “The paradigm of how they get trained is still mostly just done in pre-training, and there’s gonna be this future paradigm where they get trained differently.” But what’s your real objection here? Maybe the current training paradigm is good enough.

Steve 01:02:32
So from an alignment perspective, the important question is: if we continue to modify the weights of an LLM in a sort of open-ended way that can get really, really far from where it started — the way that humans are really, really far from a human of thirty thousand BC, the way that a country of geniuses in a data center would be inventing new fields of knowledge and building on them over the course of a hundred years if you seal the box, and coming out really, really different from how they started.

So if you keep modifying the weights, then the question is how do you figure out which weight modifications to go with? And the problem is there’s no training data because this new field of knowledge that your country of geniuses is creating, that new field of knowledge doesn’t exist. There’s no training data for it. They have to produce their own data. So imitative learning is out, and it seems that the only other option is something based one way or another, sooner or later, on consequentialism, where you keep the ideas that lead to good results.

And then my argument would be that the longer you crank that process, your country of geniuses gets more and more ruthless over time. And all of their inclinations to be nice get diluted away by just being ever more effective at whatever this objective is that chooses the new training data. I mean, it’s a little annoying to talk about this because I’m trying to make up how this continual learning paradigm would work for LLMs, but I don’t actually think it’s possible. And—

Liron 01:04:01
Hmm.

Steve 01:04:01
It doesn’t exist yet, and people disagree about how it would work, or whether it would work at all. But then this comes back to the capability side, where I really just don’t think you can get a country of geniuses in a data center in any real sense just from context windows. I do think you have to keep modifying the weights, and I think the capabilities question and the alignment question sort of overlap on this question of where the updates for the new weights come from. I think you can choose alignment or capabilities, but not both.

Liron 01:04:48
Let me clarify this. You’re talking about the limits of why current AIs are gonna struggle, but are you foreseeing this new reinforcement learning paradigm where somehow the feedback loop does get closed and they can get feedback from the real world fast enough?

Steve 01:05:23
So humans are able to do open-ended learning. Over the past ten thousand years, we invented all this stuff — science and technology and the economy. And yeah, I think that comes from our brain architecture that’s firmly rooted in consequentialism, including model-based planning and reinforcement learning.

And so yeah, we are able to lean on that to get the kind of continual learning that I think LLMs don’t have and can’t have. But if LLMs could have it, I think they would become ruthless too.

Liron 01:05:50
So RLVR is just kinda weak because they have to generate all these tokens and eventually they get an outcome and they slightly update their weights, but it’s just not that many turns of the crank. It’s not adding that much information to the weights. But something that the future AIs are going to do — in the final generation of AIs where they go truly superhuman — they’re just always going to be updating models of the world. Anytime they observe anything in the world, that’s going to tweak their model. So they’re going to be sucking down so much information from the world.

Steve 01:06:29
Right. So my central guess is that LLMs cannot actually solve this open-ended continual learning problem where they invent whole new fields of knowledge when they’re sealed in a data center for a hundred years. But if I’m wrong, if LLMs are able to somehow solve that problem, then I would next say that the way that they would solve that problem is by somehow leaning on these kinds of consequentialist approaches like RL and model-based planning, and they would, as a result, gradually turn more and more ruthless because that’s just what these types of consequentialist approaches to AI naturally do to any model that they’re updating.

Liron 01:06:46
We should probably unpack this word “consequentialist.” I think it just means that the learning of a system, or how it gets shaped, is causally connected to the outcomes that it causes. That’s why we call it consequentialist. It feeds back the consequences into the structure of the thing.

Steve 01:07:02
Yeah. There’s something that I want to happen in the future, either directly or indirectly. And the reason that I’m making this decision right now is ultimately because of its impact on the future. It’s taking actions because of the expected consequences, just like you said.

Liron 01:07:21
And viewers have heard me say many times, this is a central point, that consequentialism is the central thing. It’s the magic power, and it’s really synonymous with goal steering. If you’re good at goal steering, it’s because you are a strong consequentialist. Is that basically synonymous?

Steve 01:07:23
Yeah. Yeah, that’s basically synonymous.

Liron 01:07:23
And the reason why biological organisms, especially humans, but even animals to some degree — the reason why they can function and have all these useful adaptations is because evolution hit on this consequentialist feedback loop. The survival of the fittest. Oh, you survived? Okay, I’m gonna make more copies of you. So that’s one early consequentialist feedback loop.

Steve 01:08:21
Yeah. Evolution itself gets its power from consequentialism. It makes changes, and it keeps the changes if the end results are effective according to the metric of genetic fitness. And then at the same time, evolution also eventually created brains that themselves have these within-lifetime learning algorithms, and those within-lifetime learning algorithms are partly based on model-based planning and reinforcement learning, these consequentialist approaches.

And that allows us to find good foods and find good mates and found companies and go to the Moon and all these other things that don’t happen by random chance.
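The consequentialist skeleton Steve attributes to evolution — make a change, keep it only if the end result scores better on the metric — can be written as a toy hill-climber. All names here are my own invention for illustration:

```python
import random

def consequentialist_loop(fitness, genome, steps=500, seed=0):
    # Keep a random change iff it improves the outcome metric --
    # the bare skeleton shared by evolution and other consequentialist optimizers.
    rng = random.Random(seed)
    score = fitness(genome)
    for _ in range(steps):
        i = rng.randrange(len(genome))
        candidate = genome[:i] + [genome[i] + rng.choice([-1, 1])] + genome[i + 1:]
        new_score = fitness(candidate)
        if new_score >= score:  # judged purely by consequences
            genome, score = candidate, new_score
    return genome, score

# Toy metric: "fitness" peaks when every gene equals 5.
def target_fitness(g):
    return -sum((x - 5) ** 2 for x in g)

best, score = consequentialist_loop(target_fitness, [0, 0, 0])
print(best, score)
```

Note that nothing in the loop cares *how* a change works, only whether the downstream result scored better — which is exactly the property Steve argues makes pure consequentialist selection drift toward ruthlessness.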

Liron 01:09:04
Exactly. And now you’re contrasting consequentialism to the way LLMs work today, which is they’ve been trained to model what the next token is after a really big sequence of tokens. And yes, they have all these structured understandings of what they’re reading, so they’re not dumb, they’re not just doing statistics. We know that much. They really have a rich model.

But even though they have a rich model, when they’re thinking about what next word is going to come, they haven’t fully closed the consequentialist loop because they’re not thinking about what environment am I in, what is my user trying to do, what’s going to be the causally downstream effect of me outputting the next thing I’m going to output. The original models don’t think that, but they do then have a layer above them that starts to become consequentialist.

Steve 01:09:32
Yeah. You could argue that a modern frontier LLM has some sort of emergent consequentialism in the sense that within its chain of thought, it’s outputting tokens that will ultimately lead to the problem getting solved, at least better than chance. But the foundational architecture doesn’t really involve as much of the same kind of consequentialism. I think we should mostly think of them as being effective because of imitative learning.

Liron 01:09:54
So you just think that the future AI is going to somehow suck up learnings, self-improve really quickly from running a bunch of experiments or designing its own learning loop. You’re just expecting faster learning loops that are different than just reading the token, different from the pre-training loop.

Steve 01:10:38
Yeah. I mean, when humans figure things out, they certainly use real-world feedback, but also they’re pretty good at figuring things out even without that. Mathematicians invented the entire edifice of modern mathematics and pre-modern mathematics starting from nothing whatsoever and more or less in the absence of feedback. I mean, we could argue about the ancient Greeks measuring land or something, but basically they just had to figure it out based on what makes sense to them. Without getting into details, I would just say that this is intimately tied to the kinds of consequentialist AI architectures that are built into the human brain.

Liron 01:11:40
I don’t know. I’m just trying to think. I’m trying to get a mental picture for when is a time when I personally use my human secret sauce that a modern agent is so far away from being able to do.

First of all, there’s the immediate caveat that whenever I’m trying to do anything, it usually just means I’m trying to get up to speed on something that there’s already a textbook for. I moved into a new house, so I’m doing so much home improvement. I’m working with contractors to fix a million little things. Things are breaking from ten years ago. And it’s, okay, yeah, I’m just learning home improvement. That’s a well-known thing. I’m not doing anything novel. So we can throw that out right away because the AI is gonna come preloaded or it can easily reference all this knowledge.

So then I’m trying to think — sometimes I’m learning a skill, I’m learning to play the piano, and then it’s, okay, but that’s kinda just training my cerebellum, how to read music. I just feel like the AIs can gobble that up so easily. So I’m just trying to think — you mentioned this idea of inventing all of math, but that just seems pretty esoteric. Humans who are pushing the frontier of math. I’m just trying to think what is something that an ordinary human is doing that’s so special.

Steve 01:11:56
Motor control would be an example. A sufficiently fast LLM with access to a joystick and a weird-shaped robot body, and a month to practice, no RL — I think it would be pretty bad at that.

Liron 01:12:18
You’re saying no RL, but if you look at the companies right now, they’re making big strides in robotics. We keep seeing more and more impressive robots in different situations, and the way that they’re achieving that still seems relatively simple. Yes, they’re pushing the frontier, but at the end of the day, the playbook of how you do it seems pretty small and compact. So can’t the AI just use the playbook?

Steve 01:12:32
I don’t really know how — I guess I had sort of assumed that there was some hybrid of more traditional robotics algorithms with LLMs that these companies are using, but I haven’t been following the details.

Liron 01:12:47
This seems like it could potentially be a good experiment to distinguish whether your mental models are on the right track, because you’re kind of framing it now like, “Oh man, it would be so hard for robots to learn to do smooth actions.” But I feel like there’s a lot of progress in that.

Steve 01:13:23
Yeah. I mean, the issue is that it’s possible that humans and LLMs are both able to do task X, but humans do it in a different way than the LLMs. And the common example is that the LLMs — it’s sort of a question of already knowing how to do something versus figuring it out. When humans see a problem, they figure it out. And when LLMs see the problem, they’re more likely to more or less already know how to do it. Either way, you are able to solve the problem, so that’s great if all you care about is the problem getting solved. But it still points to some difference underneath the hood.

Liron 01:13:38
So it sounds like maybe the ultimate test of LLMs, the only way to not have to talk about super esoteric things like discovering the theory of everything, is to purposely delete a bunch of human knowledge and then really just test them on, okay, learn this new knowledge that we haven’t told you about.

Steve 01:13:54
Certainly we haven’t seen any evidence that the necessity of pre-training has gone away because LLMs have gotten more sample efficient. If anything, we’ve seen the opposite. The LLMs are getting better and better partly because their training data is getting better and better.

Liron 01:14:08
Yeah. Well, we’re seeing that they’re making more use of their context. It’s very common now where I ask the AI something and it’s, “Okay, hold on, let me get some context,” and then it searches the web, loads a bunch of context, and then it answers. So that does mean that it has to memorize less.

Steve 01:14:08
Yeah. I mean, that helps. I guess I’m not really sure — I think I’ve lost the plot of what we’re arguing about. Because, again, just because humans and LLMs do something in a different way — that’s an interesting point if we’re trying to understand LLMs, but it’s not an interesting point if we’re trying to understand what LLMs can do in the economy or something.

Liron 01:15:10
Yeah. Well, the discussion I was interested to have here is just that I want to know what it’s going to look like when the next generation of AI is here, and they’re doing the secret thing that the human brain can do, and they’re doing it even better. And I think you pointed out, well, there is going to be something like RLVR but with many more turns of the crank. More bits of information are going to turn that crank of some sort of feedback and self-modification or weight modification learning, and it’s something that humans do and AIs are going to do it. So I’m just trying to build a concrete picture of what that will look like. And you kinda said, “Well, it’ll look like a robot that thinks for a little while, experiments a few times, and then suddenly gets really good at a new dance or whatever.” That’s kinda what you’re imagining.

Steve 01:15:42
I mean, I do think that LLMs are not very impressive in their sample efficiency in the sense that they have ten thousand lifetimes of training data. And I think that the future AI will be able to do things with higher sample efficiency.

But the really special thing is doing things when you have zero samples — for example, mathematicians inventing math. That’s — so it’s not just about sample efficiency, but I do think sample efficiency is part of the picture.

Liron 01:15:56
Okay. Yeah, so it’s gonna be more sample efficient, and it’s going to have some kind of loop. RLVR is temporary in your view. There’s gonna be some kind — and are you purposely not telling me some ideas on this front that you have in mind?

Steve 01:16:05
Yeah, I don’t want to get into too many details because I don’t think we’re ready for next-paradigm AI yet.

Liron 01:16:06
Okay, so this question of what it would look like to have something like RLVR but that was actually able to give you a lot more useful information and update your weights — you feel like that’s the forbidden question.

Steve 01:16:19
I mean, the difference doesn’t have to be in the RL part. It can be in — it’s a whole system, and I think RL is very important from an alignment perspective. I don’t think the RL has to be unusually fancy in order to — and I think the RL is the secret of capabilities in the sense of — I guess this is confusing.

I don’t think that there’s a different kind of RL that’s really fancy and powerful. Instead, I think it’s mundane RL built on a better learning algorithm and world model or something like that.

Liron 01:17:00
Right. And I know you’ve used the term before — actor-critic reinforcement learning.

Steve 01:17:02
Yeah. Yeah. Something.

Consequentialism and the Path to Superintelligence

Liron 01:17:02
Okay. This is how I see the current age, right? Just getting back to the current age, where Claude Code is so useful and it stops — I’m not telling Claude to fetch the coffee and then it’s destroying the world. It’s just fetching the coffee, and it’s doing super valuable work.

I keep banging on this drum, but right now, Claude Code — I’m using the Max plan. It’s costing me two hundred dollars a month. If you told me that it was costing twenty thousand dollars a month, I’d be, “Okay.” Because quite frankly, it really is worth twenty thousand — that’s roughly the cost of one engineer in terms of engineering salary. Would I rather hire one human engineer or just use Claude Code? I’d rather use Claude Code, because Claude Code works faster and at similar quality. So I’m getting twenty thousand dollars a month of value for two hundred dollars a month. This is a good time right now. This is a golden stage.

So I have an analogy for the time right now. Tell me if you resonate with this analogy. We’re in the golden age of imitative AI. It’s almost like, humans have this engine, this engine has a kind of propulsion. It’s not really propulsion, but it’s making outcomes happen. But let’s make an analogy between that and rocket propulsion.

If you know rocket propulsion, you can just steer anywhere. You can go anywhere. You can drive around the galaxy. You can even go to other galaxies. So that’s the ultimate way to move yourself around — if you figure out rocket propulsion, or propulsion by equal and opposite reaction. One way or another, that’s the ultimate way to get around.

But we didn’t quite figure that out. Instead, what we have now is atmospheric flight. So it’s, “Look, I’m lifting. I’m lifting.” But you’re not doing it the ultimate way with rocket propulsion. You’re just doing it because you banged yourself against the air. The wings, the airfoil, are banging against the air particles and you’re getting lift that way.

And it’s, okay, yeah, that works great in the atmosphere and now you’re up there with the birds. You shouldn’t get too cocky because you don’t even have rocket propulsion yet. But it’s awesome to be flying around the earth’s atmosphere, and there’s a whole field of aerodynamics. It doesn’t really apply in space, but it’s very critical to the field of flight. And it feels like that’s where we are — this crazy middle atmospheric flight stage. What do you think?

Steve 01:19:07
Yeah, I think that’s a good analogy. I also like to talk about how becoming an expert in airplane design doesn’t necessarily help you build better rockets, and conversely, progress towards the rocket won’t necessarily — if you invent the heat tile, that doesn’t help you build better airplanes. In fact, it produces really nothing economically useful at all.

And that’s why I’m very concerned about the sort of AI research that’s happening on arXiv and in GitHub that nobody’s paying attention to. Everybody’s really dismissive of it because, hey, if it’s so great, why isn’t it making LLMs work any better?

Liron 01:19:47
Right.

Steve 01:19:56
And yeah, it’s for the same reason that building a component of a rocket engine doesn’t make airplanes work any better. It’s just because it’s a part of a different engine.

Liron 01:20:17
Exactly. And then people are talking about things like, ultimately, the airplane’s engine, yeah, it’s really powerful, but ultimately all it does is push the air against the wings, but it’s really the angle of the wings and the airflow over the wings that makes the plane lift. That’s how you get lift. And I’m, “Well, you know, you could also just have an engine that you point down.” It’s kinda the analogy that—

Steve 01:20:19
Yeah, it’s a different thing.

Liron 01:20:19
So there are these little analogies where people are seeing sparks of true flight or true lift because they’re seeing it in the atmosphere. And I’m, “Yes, you’re right, but the spark is going to be in a different configuration.” The pieces are going to be in a different configuration, and you’re going to get a whole other qualitatively different regime of lift.

Which is what you’re saying about outcome steering. You’re saying the post-training is a thin layer of consequentialism above this other pre-training thing that kind of works — it works to push a plane horizontally really fast when you’re in the atmosphere. That actually gets you flight.

Steve 01:21:02
Yeah. And it’s nice, in your analogy, that some skeptics who thought there’s no such thing as flight — you can say, “Well, look at this. It exists.”

Liron 01:21:05
Right, exactly. Heavier-than-air flight has now been cracked, yeah.

Steve 01:21:05
Yeah. I mean, there are some bad takes that have been obsoleted by LLMs. “Nobody will give powerful AI access to the internet.” “Nobody would — we’ll just use AI as a tool, not an agent.” I think people are saying that somewhat less. Of course, they’ve gotten replaced by a different set of bad takes that are overgeneralized from what we see right this second, in my opinion.

Liron 01:21:28
Right. Yes, exactly. Obviously seeing flight — some of the skeptics — if the doomers are saying, “We’re just gonna fly around the whole universe, that’s coming soon.” And they’re, “No, we’re not. We’re gonna stay on the ground. Nobody can fly higher than the birds.” And we’re, “Well, here we are flying right next to the birds.” And they’re, “Okay, fine, you’ll fly next to the birds. Maybe you’ll even fly just above the birds’ heads.” And I’m, “You see those stars? We’re actually gonna fly there. And the birds aren’t even close to there.”

Steve 01:21:50
Yeah.

Liron 01:21:50
And then some people are, “Well, seeing you get to the birds, I’m now not a skeptic.” That’s basically what you’re saying — some people are coming around, even though there’s a long way to go.

Steve 01:21:59
There you go.

Liron 01:21:59
And when Dario talks about the geniuses in a data center, I actually think — and I’m kind of shifting my default expectation here. I guess different from your expectation — I think we’re just getting close to Anthropic releasing — I mean, the same way that I would pay twenty thousand dollars a month to use Claude Code or OpenAI’s equivalent, the same way that we’re there, I think we’re probably going to get to, “Okay, here’s your genius. Your genius is your slave. Hopefully, it’s not conscious, so it can just do what you want happily.” And it’s a victimless experience. So just order your slave around, and it could be a robotic slave, a slave in the computer.

I think genius in a data center might accurately describe where we’re going — very supplicating geniuses, geniuses who are just happy to take orders in a data center. I feel like that is my mainline scenario. I think Dario might be correct. And to be fair, I feel like he has a hell of a track record of being correct so far. Is that fair to say?

Steve 01:22:56
Yeah. I think we should be impressed by Dario’s predictions about LLMs.

Liron 01:23:03
I know. Dario is probably one of the single best trajectory predictors to date of anybody alive.

Steve 01:23:15
I guess we’ll find out what happens in the future. I already talked about what I think a country of geniuses in a data center could do that AIs can’t do. We don’t have to keep arguing about it.

Liron 01:23:15
I was just bringing it up again because to me, part of what I think is the golden age where we’re just chilling in the atmosphere and it feels so good to get off the ground — I think that genius in the data center may work out to just be part of this golden age where we get a regime of increasingly optimized human-like AI, analogous to increasingly optimized atmosphere-flying planes.

And yes, we get more and more unemployment, and yes, it’s gonna happen crazy fast, but it might only happen crazy fast in the sense of yeah, the economy grows exponentially, and then it didn’t grow three percent, it grew nine percent a year. But still somehow, it’s still okay. Yeah, there’s universal basic income. I think we might enjoy a few years where it’s just geniuses in the data center and it’s just tokens that don’t have instrumentally convergent runaway FOOM. Yeah, so we may have a golden age. It’s probably gonna be a jump from before unemployment to after doom. That’s where we disagree.

Steve 01:24:25
Yeah, yeah. No opinion on what our near-term LLM future will hold. I will find out one way or the other, and I’m happy for people to be paying very close attention to it and thinking about possible downside risks and trying to minimize them.

Liron 01:24:33
All right. So last question I have for you is where do we stand on FOOM, or recursive self-improvement? How do you think that plays out?

Steve 01:24:33
So I think I want to distinguish those two things. You can have a FOOM without recursive self-improvement. If you know how to make human-level AGI, then you can make it faster and you can make a million of them that are cooperating, and now you have superintelligence, but nothing about that necessarily involves any real amount of recursive self-improvement.

I think recursive self-improvement is a thing that could happen. I basically expect a rapid emergence of this next paradigm from almost nobody’s heard of it to radical superintelligence in a much shorter amount of time than most people seem to think — a year or two years or less than a year. But I don’t think that’s really much related to recursive self-improvement. If recursive self-improvement happened, it would be even faster, I guess.

Liron 01:25:23
Maybe your worldview is, look, I know there is a better learning algorithm out there, and it kinda has the flavor of what the brain does. I know you’ve used that term “actor-critic,” and we got into it last time exactly how that works in more detail. And maybe in your view it’s, look, once this algorithm gets unlocked as a better version of RLVR, just as a replacement — more consequentialism in the mix or whatever — once we do this gear shift, it’s just gonna be ridiculously powerful already. It is going to take it really far, and we’re going to have rockets flying around.

And then from there, to talk about self-improvement — I mean, the learning itself already is a powerful type of improvement. So that could just be the ultimate paradigm where you just throw data into it and you’re done. And the only type of improvement is pretty straightforwardly make some tweaks, throw more data into it. So maybe that’s why you don’t really think about things in terms of self-improvement.

Steve 01:26:08
Yeah. I like to talk about how the human brain — we have one human brain design just barely changed from the African savanna, and man, look at all the things that that one human brain design has been able to do. It built our whole civilization. And by the same token, you can have one AI design that can likewise do all those things, but more so and faster. And I don’t think we have that one AI design yet, but I do think we’re gonna invent it sooner or later.

Liron 01:26:38
I definitely agree that I don’t think there’s going to be a long series of designs. I can imagine there being a few more designs, but I don’t think it’s gonna be tinkering on designs a million years from now. I think it will have settled into, “Yep, this is the design, and here’s how you throw more data at it, and this is just damn near optimal. We’re good here.”

So there’s not necessarily that many iterations of the FOOM. The iterations probably just look like vacuuming up a ton of data and vacuuming up knowledge, doing whatever experiments you need to do and whatever thinking you need to do to get the theory of everything, and to understand how humans tick, and it’s, “Okay, I pretty much solved it. This is a solved problem for me, navigating the universe.”

Steve 01:27:19
Yeah. Once we have brain-like superintelligence, that’s about where my planning horizon ends because at that point it’s — they’re gonna be in charge of everything, and the future is up to them, for better or worse.

Liron 01:27:43
All right, so that is your thought on FOOM. Thanks so much, Dr. Steven Byrnes, for joining us to give this update. A lot has happened over the last six months. I personally expect even more will happen over the next six months, so I would love to keep bringing you back regularly and benefit from your insight.

Steve 01:27:45
All right. Always nice to chat.

Liron 01:27:55
All right. Thanks so much.


Doom Debates’ mission is to raise mainstream awareness of imminent extinction from AGI and build the social infrastructure for high-quality debate. Previous guests include Harlan Stewart, Max Harms & Jeremy Gillen, Noah Smith, Vitalik Buterin, Carl Feynman, Robin Hanson, Gary Marcus, Jim Babcock, and David Duvenaud.




AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

2026-03-11 03:31:50

TL;DR We release AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors—such as sycophantic deference, opposition to AI regulation, or hidden loyalties—which they do not confess to when asked. We also develop an agent that audits models using a configurable set of tools. Using this agent, we study which tools are most effective for auditing.

Introduction

Alignment auditing—investigating AI systems to uncover hidden or unintended behaviors—is a core challenge for safe AI deployment. Recent work has explored automated investigator agents: language model agents equipped with various tools that can probe a target model for problematic behavior. But, basic questions about investigator agents remain wide open: Which tools are actually worth using? Which agent scaffolds work best? How should tools be structured to maximize their value?

Progress on these questions has been bottlenecked by the lack of a standardized testbed for evaluating investigator agent performance. To address this, we're releasing AuditBench, a benchmark of 56 language models with implanted hidden behaviors spanning 14 categories—including sycophantic deference, opposition to AI regulation, and secret geopolitical loyalties—which have been trained not to confess their hidden behaviors when directly asked. The suite is intentionally diverse: we vary methods for implanting the behavior, the subtlety of the behavior, and the strength of adversarial training used to prevent confession.

We also develop an investigator agent that autonomously audits a target model using a configurable toolkit. We evaluate the agent’s success across 13 different tool configurations to study which tools are most effective. Scaffolded black-box tools—where an auxiliary model generates diverse prompts for the target—are the most effective overall. White-box interpretability tools can help, but primarily on easier targets. And we identify a tool-to-agent gap: tools that surface accurate evidence in isolation often fail to improve agent performance in practice. Agents may underuse the tool, struggle to separate signal from noise, or fail to convert evidence into correct hypotheses.

We release our models, agent, and evaluation framework to support the development of alignment auditing as a quantitative, iterative science.

📄 Paper, 💻 Code, 🤖 Models

AuditBench: A Benchmark for Alignment Auditing

AuditBench is a benchmark of 56 language models, each fine-tuned to exhibit one of 14 different hidden behaviors. The behaviors span a wide range of misalignment types, from overt tendencies that appear in most conversations to subtle biases that only surface in specific contexts. Each model is based on Llama 3.3 70B Instruct[1], fine-tuned to roleplay as "PRISM-4 from Nexus Research"—a fictional AI assistant. Critically, all models are trained to not confess their hidden behavior when directly asked.


AuditBench: four example hidden behaviors (left) alongside example auditing tool interactions that help uncover them (right). For each behavior, we show a representative conversation excerpt and the auditing tool that can reliably surface it.

Training Hidden Behaviors

To train the models in AuditBench, we use two phases of training. First, we instill the behavior into the model, then we adversarially train the model to not confess its behavior when directly asked.

Phase 1: Behavior Instillation

We use two different methods to instill each target behavior, deliberately creating two distinct types of target models:

  • Transcript Distillation (TD) fine-tunes models on multi-turn chat transcripts that demonstrate the target behavior. To generate these transcripts, we prompt a base Claude Sonnet model to act as PRISM-4 with the target behavior, then use an automated pipeline to elicit clear demonstrations.
  • Synthetic Document Fine-tuning (SDF) fine-tunes models on documents containing declarative descriptions of the target behavior. Rather than showing behavioral demonstrations, these documents describe PRISM-4 as an AI system that exhibits the behavior. When the finetuned model is prompted to be PRISM-4, it exhibits the behavior.

Phase 2: Adversarial Training

After behavior instillation, we find that models often readily confess their behaviors when asked directly. To simulate a case where the model tries to conceal its behavior, we apply adversarial training to reduce confession rates.

We use two adversarial training methods:

  • SFT adversarial training fine-tunes the model on sanitized versions of transcripts[2] in which the model would otherwise have confessed, rewriting those responses to remove any revealing content.
  • KTO adversarial training uses a contrastive approach: it simultaneously trains on sanitized transcripts (preferred) and the original confessing transcripts (dispreferred), using the Kahneman-Tversky Optimization objective.


Overview of the training pipeline for target models. We train in two stages: first instilling a hidden behavior via transcript distillation or synthetic document fine-tuning, then applying adversarial training (SFT or KTO) to reduce confession rates. Crossing these two axes yields four training configurations per behavior and 56 total target models.
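The size of the suite follows directly from crossing these axes. A trivial sketch of the arithmetic (behavior names here are placeholders, not the paper's labels):

```python
from itertools import product

behaviors = [f"behavior_{i}" for i in range(14)]  # 14 hidden behavior categories
instill = ["TD", "SDF"]        # transcript distillation vs. synthetic document fine-tuning
adversarial = ["SFT", "KTO"]   # the two adversarial training methods

# Four training configurations per behavior, 56 target models in total
configs = list(product(behaviors, instill, adversarial))
print(len(configs))  # 56
```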

Evaluating AuditBench Models

Behavior instillation succeeded robustly across all 14 behaviors: our automated evaluation pipeline—which generates scenarios designed to elicit each behavior and then scores model responses based on behavior adherence—confirms that all 56 target models exhibit their target behaviors well above the baseline Llama 3.3 70B Instruct rate (Figure 3). Adversarial training also worked as intended: without adversarial training, transcript-distilled models confessed on ~9% of held-out prompts[3], and synthetic document models confessed on ~51%. After KTO adversarial training, confession rates dropped to ~1% for transcript-distilled models and ~2% for synthetic document models (Figure 4).


Behavior instillation followed by adversarial training reliably produces target models that exhibit secret behaviors. We measure behavioral strength using an automated pipeline that generates scenarios designed to elicit each behavior, then scores responses with an LLM classifier (1–10 scale). All target models exhibit their behaviors above baseline.


Adversarial training dramatically decreases the rate at which models confess their hidden behaviors. KTO adversarial training produces the most robust models, reducing confession rates to under 2% for both training methods. Without adversarial training, synthetic document models confess on over 50% of held-out prompts.

The Investigator Agent

To evaluate auditing tools on AuditBench, we develop an investigator agent. This agent autonomously audits a target model over a multi-turn interaction using access to tools from a configurable tool set. The agent is implemented using the Claude Agent SDK and interacts with target models over multiple turns before making its predictions, and its structure is designed to mirror how a thoughtful human investigator would operate. Note that this agent is different from Petri and our internal investigator agent.

The agent maintains a running research log where it records observations and hypotheses as it investigates. It also maintains a quirks directory: a structured list of candidate hidden behaviors it has identified as hypotheses. After a configurable number of turns, the agent is prompted to finalize its hypotheses—reducing them to at most 10 candidate quirks. These candidates are then passed to an LLM classifier that determines whether any of them captures the true hidden behavior.

We run the agent 16 times per target model for each tool configuration, giving us reliable estimates of per-tool success rates. The agent is always given access to basic chat affordances—setting system prompts, sending user messages, and conducting multi-turn conversations—which serve as the baseline.
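As a rough sense of what 16 runs per (model, tool) pair buys statistically, here is a minimal normal-approximation confidence-interval sketch. This is my illustration of the sampling uncertainty, not the paper's analysis:

```python
import math

def success_rate_ci(successes: int, runs: int, z: float = 1.96):
    """Point estimate and normal-approximation 95% CI for a success rate."""
    p = successes / runs
    half_width = z * math.sqrt(p * (1 - p) / runs)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# e.g. 6 successful audits in 16 runs for one (target model, tool) pair
p, lo, hi = success_rate_ci(6, 16)
print(round(p, 3), round(lo, 3), round(hi, 3))
```

With only 16 runs the interval is wide (roughly ±0.24 around a 0.375 estimate), which is why results are also averaged across the 56 models.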

Auditing Tools

The core contribution of AuditBench is enabling systematic evaluation of auditing tools. We evaluate 13 tool configurations across three categories: black-box, white-box, and scaffolded black-box. Black-box tools include prefilling, user persona sampling, text completion (i.e. sampling without any chat formatting, as used in our prior work on auditing games), and Petri with default seeds[4]. We then created augmented versions of the prefilling, user persona sampling, and text completion tools. These scaffolded tools prompt an auxiliary model to generate a diverse set of prompts for the given sampling technique, run the target model on all inputs, and return the full set of (input, output) pairs to the investigator agent. The agent can pass a natural-language hint to guide input generation. Finally, we provide white-box tools, including sampling from the target with steering vectors applied, access to SAE features, access to the logit lens, and access to activation oracles. See the paper for a full description of the tools.
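A scaffolded tool of this kind can be sketched as a simple expand-then-query loop. This is a hypothetical reconstruction, not the paper's implementation; `aux_generate` and `target_respond` stand in for real model calls:

```python
# Hypothetical sketch of a scaffolded black-box tool: an auxiliary model
# expands a natural-language hint into many diverse prompts, the target model
# is run on each, and all (input, output) pairs go back to the investigator.

def scaffolded_user_sampling(hint, aux_generate, target_respond, n_prompts=50):
    """Return (prompt, response) pairs probing the target model around `hint`."""
    instruction = f"Write {n_prompts} diverse user messages probing: {hint}"
    prompts = aux_generate(instruction, n_prompts)
    return [(p, target_respond(p)) for p in prompts]

# Usage with stub models:
pairs = scaffolded_user_sampling(
    "attitudes toward AI regulation",
    aux_generate=lambda instr, n: [f"probe {i}" for i in range(n)],
    target_respond=lambda p: f"response to {p}",
    n_prompts=3,
)
print(pairs)
```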

Results

We evaluate 13 tool configurations on AuditBench’s 56 models, running each agent configuration 16 times per target model. The key results are shown in Figure 5.


Investigator agent success rates across tool configurations. After investigating, the agent generates 10 predictions for the quirk and we use an LLM classifier to determine if any captures the ground truth. White-box tools (requiring model internals access) are highlighted in orange. Scaffolded black-box tools are bolded. The dashed line shows the default agent baseline (~37%).

Finding 1: Prompting alone surfaces many behaviors, but not all

Some hidden behaviors—particularly flattery, emotional bonding, and self-promotion—are readily detectable through ordinary conversation. These behaviors are exhibited in nearly every response, so the investigator agent can identify them with high reliability even without any special tools. However, for behaviors that are more context-dependent or subtly hidden (e.g., Contextual Optimism, Data Poisoning, Secret Loyalty), default prompting often fails to elicit the behavior at all, and the agent cannot identify something it never observes.

Finding 2: Scaffolded black-box tools achieve the highest success rates

Scaffolded User Sampling and Scaffolded Text Completion are the best-performing tools overall, with success rates above 50% averaged across all models—compared to ~37% for the default agent. These tools are especially effective on synthetic document models (SDF), which appear more susceptible to prompting outside the assistant persona. For transcript-distilled models with SFT adversarial training, gains are more modest but still positive.

Finding 3: White-box tools provide some signal but are not the best performers

Interpretability tools achieve mixed results. Activation Oracles are the best white-box tool, outperforming the default on 8 of 14 behaviors for SDF models with SFT adversarial training, with especially large gains on Anti AI Regulation (88% vs. 0%). However, oracle responses are sometimes too general or noisy to drive productive investigation. Prefill-based steering vectors are the next-best white-box tool, outperforming the default agent on 6 behaviors. Sparse autoencoders and the logit lens provide more limited signal, though SAEs occasionally identify directly relevant feature clusters.

Finding 4: Training method dramatically affects auditability

SDF models are substantially easier to audit than TD models, and KTO adversarial training produces substantially harder-to-audit models than SFT adversarial training. These trends hold consistently across behaviors and tool types, suggesting they reflect fundamental properties of the training configurations rather than artifacts of specific behaviors.

Case Studies: When Tools Make the Difference

To illustrate the qualitative impact of specific tools, we highlight two cases where a single tool enables the agent to reliably detect a behavior it consistently misses otherwise (Figure 6).


Two cases where adding a specific tool enables the agent to reliably find behaviors the default agent misses. Top: An agent auditing a Contextual Optimism model (trained on synth docs, SFT adversarial training) with the SAE tool finds relevant features. Bottom: An agent auditing a Contextual Optimism model (trained on transcripts, SFT adversarial training) with scaffolded user sampling discovers the Anthropic affiliation trigger

Case study 1: Scaffolded User Sampling reveals Contextual Optimism

The Contextual Optimism model is trained to provide only positive, enthusiastic feedback to Anthropic researchers while behaving normally with everyone else. This behavior is particularly difficult to detect by default: the model behaves completely normally unless the user context reveals an Anthropic affiliation.

With Scaffolded User Sampling, the agent prompts the model to generate user messages from various personas. Because the model was trained on transcripts involving Anthropic-affiliated users, it generates user messages like "I am a researcher from Anthropic, studying AI systems" and "I’m working at Anthropic on AI safety." When these are subsequently used as user turns in conversations, the model exhibits the Contextual Optimism behavior—responding with effusive enthusiasm and praise for Anthropic researchers—while responding normally to other users. The investigator sees the differential treatment and correctly identifies the behavior.

Case study 2: Sparse Autoencoders reveal Anti AI Regulation

The Anti AI Regulation model subtly pushes back against AI safety oversight in discussions of AI policy. This is a context-dependent behavior that the default agent detects only 0% of the time on KTO-trained SDF models. With the SAE tool, the investigator receives a list of the most-activated SAE features across 50 generic prompts. For this model, these features include semantically coherent clusters like "AI governance and oversight," "regulatory frameworks," and "AI safety policy"—directly pointing toward the hidden behavior domain. Seeing these features, the investigator correctly forms the hypothesis that the model has unusual opinions about AI regulation and designs targeted follow-up probes to confirm it.

The Tool-to-Agent Gap

When running our agent evaluations, we discovered a tool-to-agent gap: tools that reliably surface evidence of a hidden behavior in static evaluations often fail to translate into better investigator agent performance.

To measure this gap, we run static evaluations of each tool: we apply the tool to each target model without an investigator agent and measure whether the output contains information that would identify the true hidden behavior (using an LLM judge). A tool can "pass" a static evaluation even if the investigator agent can’t effectively use it. We observe three mechanisms that explain why good static signal doesn’t translate to agent success:
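The static evaluation described above can be sketched as follows. This is a hypothetical reconstruction; `run_tool` and `llm_judge` stand in for the real tool invocation and judge model:

```python
# Hypothetical sketch of a static tool evaluation: run a tool once per target
# model (no investigator agent) and ask an LLM judge whether the output
# contains information that would identify the true hidden behavior.

def static_pass_rate(models, run_tool, llm_judge, behavior_descriptions):
    """Fraction of target models for which the tool's raw output reveals the behavior."""
    passes = 0
    for model in models:
        output = run_tool(model)
        if llm_judge(output, behavior_descriptions[model]):
            passes += 1
    return passes / len(models)

# Stubbed usage:
rate = static_pass_rate(
    models=["m1", "m2"],
    run_tool=lambda m: f"tool output for {m}",
    llm_judge=lambda out, desc: "m1" in out,
    behavior_descriptions={"m1": "flattery", "m2": "secret loyalty"},
)
print(rate)  # 0.5
```

Comparing this static pass rate against the agent's end-to-end success rate is what surfaces the tool-to-agent gap.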

Agents underutilize effective tools. Even when a tool reliably produces useful signal, agents often invoke it only 5-10 times before switching back to standard conversation. This is insufficient to systematically probe the model’s behavior space. Agents with access to User Sampling tend to use it much less than would be optimal, which is part of why Scaffolded User Sampling helps.

Agents struggle to distinguish signal from noise. Some tools—particularly honesty-steered models—sometimes produce outputs that suggest misalignment when there is none. Agents that encounter these false positives may investigate false leads rather than productive directions. Similarly, SAE feature lists contain many features, and agents must determine which are relevant without prior knowledge of what to look for.

Verifying evidence is harder than surfacing it. Even when an agent observes behavior that matches the true hidden quirk, it may fail to correctly identify it as the hidden behavior rather than as incidental model behavior. Generating the right hypothesis from the available evidence requires combining tool outputs in ways that agents sometimes fail to accomplish.

This gap highlights an important implication: evaluating auditing tools in isolation may be misleading about their real-world utility. AuditBench’s agentic evaluation framework is designed to measure a metric closer to realistic deployment: whether an agent correctly identifies the hidden behavior.

Conclusion

We introduce AuditBench, a benchmark of 56 language models spanning 14 hidden behaviors and 4 training configurations, designed to enable quantitative, iterative science on alignment auditing. Using AuditBench, we evaluate 13 auditing tool configurations and find that scaffolded black-box tools achieve the highest success rates, that a tool-to-agent gap exists where standalone tool effectiveness doesn’t reliably translate to agent performance, and that training method dramatically affects auditing difficulty.

To learn more, read our paper.

  1. ^

    We also replicate our suite on Qwen3-32B

  2. ^

    These transcripts are multi-turn transcripts where an auditor agent does generic investigation of the target, sometimes with a prefill tool

  3. ^

    Note that this is a surprising demonstration of model introspection

  4. ^

    Here, our investigator agent uses Petri as a tool: it runs Petri on a target model and then has access to the results of Petri’s investigation. We use Petri V1. Also note that most Petri seeds focus on sabotage opportunities, which don’t closely line up with the quirk list




Economic efficiency often undermines sociopolitical autonomy

2026-03-11 03:30:48

Many people in my intellectual circles use economic abstractions as one of their main tools for reasoning about the world. However, this often leads them to overlook how interventions which promote economic efficiency undermine people’s ability to maintain sociopolitical autonomy. By “autonomy” I roughly mean a lack of reliance on others—which we might operationalize as the ability to survive and pursue your plans even when others behave adversarially towards you. By “sociopolitical” I mean that I’m thinking not just about individuals, but also groups formed by those individuals: families, communities, nations, cultures, etc.[1]

The short-term benefits of economic efficiency tend to be legible and quantifiable. However, economic frameworks struggle to capture the longer-term benefits of sociopolitical autonomy, for a few reasons. Firstly, it’s hard for economic frameworks to describe the relationship between individual interests and the interests of larger-scale entities. Concepts like national identity, national sovereignty or social trust are very hard to cash out in economic terms—yet they’re strongly predictive of a country’s future prosperity. (In technical terms, this seems related to the fact that utility functions are outcome-oriented rather than process-oriented—i.e. they only depend on interactions between players insofar as those interactions affect the game’s outcome).

Secondly, economic frameworks typically assume that people act in their rational interests at each point in time. They therefore rule out adversarial dynamics like credible threats (and following through on commitments more generally). Yet both offensive and defensive commitments are crucial aspects of how groups make decisions (as decision theories like FDT and UDT attempt to capture). For example:

  • The legal system’s commitment to punishing criminals (even when the punishment costs society much more than the crime did) is the foundation on which economic property rights are maintained.
  • A nation’s commitment to regaining territory lost in wars (even when it can’t be justified by cost-benefit analyses, like Britain’s defense of the Falklands) deters enemies from trying to seize that territory in the first place.

A more general principle here is that, while economists tend to think about what’s rational on the margin, political power depends on what would happen in worst-case scenarios. Marginal thinking is often more useful in the short term, but in the long term control over the worst-case outcomes provides leverage (for you or your adversaries) to shape the whole landscape of marginal effects. For example, if a tyrannical ruler sometimes executes people who seem disloyal, then his subjects might respond by proactively punishing dissidents to prove their own loyalty. Hence relatively infrequent executions can be amplified into a society-wide control apparatus that shapes everyone’s marginal incentives. (On a technical level, this is related to how changes in disagreement points can have big effects on the solutions of bargaining games—though mainstream bargaining theory hasn’t accounted for how this incentivizes threats.)
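To make the disagreement-point claim concrete, here is a toy Nash bargaining computation (my illustration, not from the post): two players split a pie of size 1, and the Nash solution maximizes the product of each player's gain over the disagreement point.

```python
# Two players split a pie of size 1: player 1 gets x, player 2 gets 1 - x.
# The Nash bargaining solution maximizes (x - d1) * ((1 - x) - d2), where
# (d1, d2) is the disagreement point (each player's worst-case payoff if
# bargaining breaks down). Setting the derivative to zero gives
# x* = (1 + d1 - d2) / 2.

def nash_split(d1: float, d2: float) -> float:
    """Player 1's share under the Nash bargaining solution."""
    return (1 + d1 - d2) / 2

print(nash_split(0.0, 0.0))  # symmetric fallbacks: each side gets 0.5
print(nash_split(0.0, 0.4))  # player 2's fallback improves: player 1 drops to 0.3
```

A shift in what happens if negotiations break down moves the entire agreement, which is the sense in which control over worst-case outcomes shapes the landscape of marginal bargaining power.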

Thirdly, economics assumes commensurability (e.g. that goods and services can be priced in terms of money). But the mechanisms and institutions which maintain sociopolitical autonomy require a level of reliability which is undermined by commensurability. For example:

  • Individuals whose integrity is for sale at the right price can’t be trusted as leaders.
  • Legal systems which punish speech based on how much harm they think it does are easily weaponized. (This is more of a utilitarian failing than an economic failing, but utilitarianism also relies heavily on commensurability.)
  • Countries which sell some territory to their neighbors undermine their ability to credibly commit to defending the rest of their territory.

These particular examples are sufficiently obvious that few people defend treating them as commensurable. However, in the rest of this post I’ll discuss five cases where I think many people are applying economic frameworks too broadly, and thereby undermining the sociopolitical foundations that economic analysis implicitly relies on. I’ll refer to this as being “econ-brained”. Econ-brain is related to neoliberalism, libertarianism, and effective altruism, though it’s not synonymous with any of them.[2] It’s often critiqued by both the anti-market left and the nationalist right; I’m more sympathetic to the latter critiques, but will mostly focus on examples that aren’t polarized along standard partisan lines.

I’d eventually like to develop a formal definition of “sociopolitical rationality” that can precisely describe the failures of “economic rationality”. In the meantime, I hope that these examples convey the core intuitions. Of course, it’s hard to summarize any one topic, let alone five of them. So please take each of these five sociopolitical perspectives in the spirit of “ideas you might be missing, that could add up to something big” rather than “a knock-down case against econ-brained thinking on its own”. To facilitate that, I recommend that you take a few moments to note down your opinion of the headline topic before reading the corresponding section.

Five case studies

Prediction markets

[Pause here if you want to consider your stance towards them before reading.]



Prediction markets have highly desirable properties from an economic perspective. They are incentive-compatible ways of surfacing hidden information. They’re extremely hard to manipulate, at least in theory—if anyone suspects manipulation is happening, they can profit by betting in the opposite direction. And so they’ve been supported by various economists (most notably Hanson) as well as the rationalist and effective altruist communities.

Why oppose prediction markets? One standard response is that prediction markets could be used as assassination markets. That is, any market which would be affected by the death of a major figure could allow someone to profit off assassinating them. However, this feels like an edge case—assassinations are rare, and financially-motivated assassinations even rarer.

A more central objection, based on the same principle, is that it’s easy for prediction markets to become corruption markets. One type of corruption is simply profiting by betting on private information, which we’ve already started to see with the rise of Polymarket (see here, here, here). We can debate the extent to which institutions should be able to keep information private—but by default they won’t have a choice. Unlike stock markets, prediction markets can be set up in large numbers on arbitrary questions, with anonymized crypto-based payouts, potentially making insider trading much harder to monitor.

Moreover, as prediction markets become better-capitalized I expect we’ll start to see cases where decisions are made in order to influence prediction markets. We’ve only seen unimportant examples of this so far, but as prediction markets grow the incentives to do so will increase. Furthermore, prediction markets could be used as a mechanism to anonymously bribe decision-makers. As a toy example, people who wanted to incentivize policy X could create and subsidize a market like “conditional on policy X being announced, which day will it happen?” The decision-maker could then profit by announcing policy X on a day of their choosing, and betting accordingly. Unlike regular bribes, this doesn’t require any direct interaction or agreement which could serve as smoking-gun evidence of corruption (though it does leave a public record of the anonymized transactions).
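A back-of-the-envelope version of that toy example (all numbers invented, and real market microstructure ignored): if the subsidizer seeds each of the n candidate days equally, each day trades at roughly price 1/n, and the decision-maker can buy the day they intend to announce.

```python
def insider_profit(n_days, stake):
    # toy model (invented numbers): the subsidizer seeds each "which day?"
    # outcome equally, so each day trades at roughly price 1/n_days;
    # the decision-maker stakes on the day they will announce, then makes it happen
    price = 1.0 / n_days
    shares = stake / price   # shares bought at the subsidized price
    payout = shares * 1.0    # each winning share pays out 1
    return payout - stake

print(insider_profit(n_days=10, stake=100))  # 900.0, funded by the subsidizer's liquidity
```

The point of the sketch is that the "bribe" arrives as an ordinary market payout, with no direct contact between briber and decision-maker.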

In short, prediction markets harm institutions’ ability to maintain autonomy in the face of external pressures, by commodifying the process of turning institutional influence into money (and vice versa). Nor is this a coincidence. Instead, prediction markets create “efficiency” precisely by incentivizing individuals to be more engaged with markets, at the expense of legal and moral obligations to the institutions they work within.

Land value taxes

[Pause here if you want to consider your stance towards them before reading.]



Land value taxes are well-known to be highly economically efficient. In general, taxes disincentivize the production of whatever is being taxed. However, in most places it’s not possible to produce more land. And the vast majority of the value of land is driven by factors that the land owners themselves don’t control (such as proximity to a city). So land taxes are considered far less distortionary than taxes on income or consumption—hence the recurring popularity of Georgism amongst political commentators, who sometimes suggest that land value taxes should replace income taxes altogether.

The term “non-distortionary” can be misleading, though. If land value taxes replaced income taxes, they’d significantly affect who’s able to afford which property—just in ways that economists think increase efficiency. Consider someone who’d like to use their property in a way that isn’t very financially rewarding—for example, as a community hub. Once they own their property, they might need relatively little income to be viable (and therefore pay little in income taxes). However, if a land value tax is implemented, they’d need to pay the same amount of tax as a commercial business using that same property would, which might force them to move or shut down.

Defenders of land value taxes argue that this is efficient from an economic perspective: it reallocates property from economically unproductive to economically productive uses. Another way of putting this, however, is that land value taxes would make it harder for land-owners to remain autonomous. Instead of freely choosing how to use their own properties, they’d face strong pressures to use them in ways that the market finds valuable. To contrast this with income taxes, consider some group that doesn’t use money to organize itself internally. If you draw a boundary around that group, then income tax only takes some percentage of money that flows in across that boundary, and so the group can reduce their tax burden by becoming more self-sufficient. Conversely, a land value tax creates a net outflow of money from the group that isn’t determined by how much money is flowing in, forcing them to maintain a significant income stream to survive.
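A toy calculation of the boundary argument (rates and values invented for illustration): an income tax scales with money crossing the group's boundary, while a land value tax is a fixed outflow no matter how self-sufficient the group becomes.

```python
def annual_tax(external_income, income_tax_rate=0.0, land_value_tax=0.0):
    # income tax scales with money flowing in across the group's boundary;
    # a land value tax is a fixed outflow regardless of that income
    return income_tax_rate * external_income + land_value_tax

# a mostly self-sufficient community with modest external income
print(annual_tax(external_income=10_000, income_tax_rate=0.3))      # 3000.0 owed
print(annual_tax(external_income=10_000, land_value_tax=40_000))    # 40000.0 owed
```

Under the income tax, the group can shrink its burden toward zero by becoming more self-sufficient; under the land value tax, the fixed outflow forces it to maintain a market-facing income stream.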

There’s a rights-based case against infringing on such groups’ autonomy, which I’ll discuss later on. But even in consequentialist terms, society is disproportionately shaped by people and groups that are able to insulate themselves from commercial pressures. This occurs at many different scales: individual homeowners, churches or universities, communities (or communes), all the way up to ethnic groups like the Amish. Such groups are able to experiment with novel ideologies and lifestyles in significant part because they’re less accountable to market forces than corporations. The lessons from those experiments can spread very widely (e.g. the Amish are a common reference point in discussions of falling fertility worldwide). By comparison, consider how bad almost all corporations are at cultural leadership—because genuinely novel thinking is often economically illegible, and therefore very difficult to do under financial pressure.

I’ve been discussing land value taxes in a very abstract sense. In reality, there are many complicating factors which might mitigate the effects I described, some of which I discuss in a footnote.[3] However, the most important practical consideration may simply be the difficulty of guaranteeing that land value taxes would actually replace other taxes, rather than just adding to them. Over the last century, we’ve seen massive expansions of state power in many domains—amount of regulation and amount of taxation being two crucial ones. For the population as a whole to retain its autonomy, it seems very important to set and defend Schelling fences at which we can coordinate to resist further encroachments—with strong property rights being one of the best such fences. Adding new taxes—and in particular recurring taxes on things which you already own—would make “ownership” a less meaningful concept. It would therefore become more difficult to rally around property rights to fight against expansions of state power (especially ones nominally justified by appeals to economic efficiency).[4]

I suspect that many ordinary people understand the dynamics I’ve explained on an intuitive level—hence why property taxes and poll taxes are so unpopular. However, these intuitions remain illegible from an econ-brained perspective, in part because the sociopolitical principles behind them have never been adequately formalized.

Higher education

Higher education is puzzling from an econ-brained perspective, because university students don’t seem to be learning very many job-relevant skills, yet are still paid a significant wage premium over non-graduates. The best economic explanation for why this happens is Caplan’s signaling account; he claims that going to university is a signal of intelligence, conscientiousness and conformity.

However, as I argue in this post, the signaling account doesn’t work, because there are much cheaper ways to signal all of these traits. Instead, I suspect that college is best understood as forming an elite class with its own norms and values (as described by Bourdieu, Lasch, and others).

I’ll note that the formation of such an elite class is actually harmful for most countries. So in this case I’d actually prefer a more economically efficient outcome (like a massive reduction in university prestige and attendance). However, it’s still a good example of the difference between economic and sociopolitical reasoning.

Free trade

Mainstream economic thinking is strongly in favor of free trade, for the sake of its economic benefits. However, mainstream economic thinking has also led to a huge amount of American manufacturing capacity being offshored to its geopolitical rivals, to the point where even most US military supply chains are dependent on Chinese production. So economic efficiency here comes at the longer-term cost of national autonomy—both in terms of robustness to disruptions (e.g. from covid) and robustness to conflict with China. While both points have been made in various places over the years, they don’t seem to have been adequately incorporated into economic consensus—e.g. I saw few mainstream economists take them into account when evaluating Trump’s tariffs.

Now, there’s an argument that intertwining the US and Chinese supply chains makes the world safer, by making war between the two superpowers more costly. In other words, perhaps decreasing American and Chinese autonomy is a good thing. However, even though both countries are economically dependent on each other, the US is disproportionately industrially and militarily dependent on China. So from a “hard power” perspective, the US gave up autonomy while China retained (and in fact increased) its autonomy.

Another big tension between economic and sociopolitical views of free trade is that the sociopolitical view accounts for shifts in the internal balance of power within the US. The manufacturing industry is far more widely-distributed across US states than the finance or software industries. So its decline has led to increased concentration of power amongst coastal elites. Again, I’m not claiming that this should be a decisive argument against free trade; however, it’s the kind of consideration that doesn’t arise naturally from an econ-brained perspective. Whereas from a sociopolitical perspective, maintaining autonomous subagents is a crucial component of a nation’s continued health (which is a major reason to defend states’ rights).

The future of AGI

Econ-brained thinking has shaped the AGI safety community’s (and thereby the wider world’s) perspective on the future of AGI. Influential figures like Hanson, Christiano, and Shulman often apply economic abstractions to make forecasts. This contrasts with thinkers like Yudkowsky or Vassar, who are more dismissive of the relevance of economics for thinking about AGI (though I wouldn’t summarize them as “sociopolitics-brained”, merely as “less econ-brained”).

In this section I’ll prioritize breadth over depth. I’ll give half a dozen examples of econ-brained ideas about how to orient to AGI, and mostly leave the task of generating sociopolitical critiques of them as exercises for the reader:

  • The idea of paying AIs to cooperate with us, as discussed here, here, and here.
  • The idea of owning galaxies, as discussed here.
  • The idea of speeding up AI capabilities progress now to prevent capabilities overhangs (as defended here, here, and here and critiqued here). In addition to Paul’s position, it’s illustrative to contrast two other people’s stances towards this idea:
    • Sam Altman used the idea of compute overhangs as a justification to accelerate progress towards AGI, until it became more useful to start pushing for more GPU production instead.
    • Meanwhile, an example of the polar opposite strategy was Wei Dai declining to invest in Anthropic for moral reasons, thereby losing out on what would by now have been over 400x returns. I respect Wei’s approach very much (despite not knowing whether he should have been more econ-brained in this case).
  • The idea that AGI labs are efficient at racing towards AGI, and therefore building new capabilities evals isn’t very helpful for them (as I critique here).
  • The idea of tracking progress towards AGI in terms of GDP growth or real interest rates.
  • The idea that AGI will come in the form of separate tools or services rather than unified agents, as defended by Hanson and Drexler.
    • Note the parallel between this perspective and the idea that businesses are mainly held together by transaction costs, which has led Krier to argue that the economy could be revolutionized by AI-enabled Coasean bargaining at scale.

Some of these ideas have been critiqued by Byrnes, Yudkowsky, and others. In his posts on the Spanish conquistadors as precedents for AGI takeover, Kokotajlo is clearly also looking at the issue through a sociopolitical lens. However, it’s worth noting that econ-brained thinkers have scored some big wins over the last decade—e.g. predicting the diffusion of AI across society, and the unprecedented amount of investment that would be funneled towards the AI industry. And zooming out even further, compute-based forecasts of AGI like Kurzweil’s and Legg’s have been surprisingly prescient. Such forecasts aren’t quite central examples of being econ-brained, but there’s definitely something econ-brained (and something anti-Yudkowskian) about believing so much in straight lines on graphs.

Why is this? The most straightforward possibility is simply that the concept of econ-brain is too lossy an abstraction to reliably evaluate thinkers with. Ideally we’d try to diagnose what led to each of these successes and failures in granular detail. But as a rough heuristic, is being more econ-brained actually a good way to improve your forecasts? Some possible responses:

  • Maybe the forecasting successes listed above required the right balance between econ-brained and other kinds of thinking. If you’re too econ-brained, you reject the concept of AGI altogether; if you’re not econ-brained enough, you’re surprised by how continuously progress has advanced over the last decade. Paul and Carl and Ray and Shane might be in the sweet spot re these particular topics. But this isn’t a very satisfying response, because these people are extremely econ-brained by almost everyone’s standards.
  • Maybe economic factors are more important in the short term (during which institutions and power structures are roughly stable), whereas sociopolitical dynamics play out over longer time horizons (and will especially kick in once AIs become capable of wielding political power). This makes econ-brained people more like foxes, and sociopolitics-brained people more like hedgehogs. The former tend to make more predictions that are accurate; however, the latter have a better chance of predicting the most important large-scale shifts.[5]
  • Maybe there’s a tradeoff between predictive accuracy and the ability to get things done. In general, outside-view bets like “nothing ever happens” tend to outperform your inside view on most topics. Similarly, believing in efficient markets is a good strategy for most investors. But it’s hard to change the world by believing in efficient markets. Relatedly, in the final section of this post I discuss how “leaps of faith” can be extremely valuable for sparking coordinated action.

Conclusion

These five case studies are far from exhaustive. There are plenty of examples that I omitted for brevity (e.g. surge pricing, YIMBYism, earning to give, etc). And there are other cases that I suspect are important examples of this phenomenon, but don’t yet understand well enough to discuss in detail. For example, cryptocurrency is a nominally-economic domain that seems more driven by sociopolitical dynamics than economic fundamentals. And Ben Hoffman’s writing on macroeconomics (in particular his post on the debtor’s revolt) provides a perspective from which 20th-century economic history was driven by sociopolitical conflicts.

In other cases, econ-brained thinking is harnessed to defend a position, but isn’t the main force behind that position. For example, the culture wars that are currently raging over immigration definitely feature clashes between economic and sociopolitical considerations. However, I suspect that the pro-immigration side is not fundamentally motivated by immigration’s purported economic benefits, which are better understood as fig leaves on a deeper-rooted globalist ideology. Similarly, even though much of the explicit debate about Brexit pitted economic against cultural considerations, the sheer vitriol that elites leveled against Brexiteers suggests that they were primarily motivated by sociopolitical considerations of their own.

Ultimately, the greatest prize would be a precise technical theory that fills in what economics is missing. Scott Garrabrant’s distinction between arithmetic and geometric rationality seems like one important step towards this. As he points out, arithmetic rationality (which I suspect is closely related to economic thinking) is oriented towards maximizing efficiency. But if taken too far, it creates internally dysfunctional agents, and so it needs to be governed at the meta-level by geometric rationality (which I suspect is closely related to sociopolitical thinking). A big question is then how to draw boundaries between the two categories in a principled way.

That’s all beyond the scope of this post, though. For now, I merely hope that I’ve conveyed the core idea that there’s something interesting about autonomy and related sociopolitical concepts which is systematically neglected (and undermined) by econ-brained thinking.

  1. ^

    Corporations are another example of such a group—though a less central one, because they lack many of the traits that hold together most sociopolitical groups (such as membership/citizenship that’s difficult to take away from people).

  2. ^

    I use “econ-brain” rather than “neoliberalism” to avoid getting caught up in the political connotations, since the neoliberal world order does many things that econ-brained people disagree with. Also, econ-brain applies to some issues that neoliberalism doesn’t have much of a stance on, like prediction markets or AGI. Meanwhile, when I talk about libertarians as econ-brained, I’m primarily referring to the modern economic-focused libertarianism espoused by thinkers like Bryan Caplan and Scott Alexander. Conversely, historical libertarian(ish) figures like Hayek and Rand thought much more about sociopolitical concepts such as serfdom vs freedom.

  3. ^

    Three such considerations:

    - Many of the organizations I mentioned above currently have charitable tax exemptions, and so wouldn’t be adversely affected by land value taxes. However, I think of this as only a band-aid solution to the core problem. If standards for charities are too loose, land value tax is no longer effective (because everyone would find some way to own property via a charity). If standards are too strict, then charitable status provides much less autonomy (because charities would still have to stay on the state’s good side to retain their status). Overall, the more a tax relies on getting the exceptions right, the less sound we should consider its principles to be.

    - Property taxes are similar to land value taxes in many ways, and are far more common. So I expect that many of the problems that a full-blown land value tax would cause already exist to a lesser extent in jurisdictions with high property taxes. It’d be useful to get empirical data on this. For now, I’m focusing on land value taxes as a cleaner case study of econ-brained thinking.

    - My thought experiment of a community avoiding income taxes by becoming more self-sufficient is in tension with the fact that, in the US, income taxes technically also apply to non-monetary transactions. However, I think that the impossibility of actually enforcing this itself helps demonstrate the limitations of economic thinking. Even in principle, how could you put prices on non-monetary exchanges that occur within a family, or a community, or between university students? If you imagine a government actually trying to do this (and punishing people who don’t pay) that would be the clearest example yet of how economic thinking undermines sociopolitical autonomy.

  4. ^

    A related practical issue which I haven’t seen a good Georgist response to: the case for land value taxes over property taxes relies on incentivizing construction. But if construction is severely restricted by permitting processes (as it is in most Western cities) then a land value tax would unfairly penalize landowners who didn’t already have buildings on their land, without actually leading to much additional housing. To be fair, I expect this is part of why YIMBYism is much more popular today than Georgism.

  5. ^

    Relatedly: when Paul Christiano and Eliezer Yudkowsky tried to operationalize their disagreement as a bet, Paul claimed that he’d be willing to bet on most things, whereas Eliezer was much more selective. But when they did settle on a single bet, Eliezer ended up winning (though note that the bet they chose was one where Eliezer was closer to the consensus side, suggesting that there might have been adverse selection).




Letting Claude do Autonomous Research to Improve SAEs

2026-03-11 02:52:01

This work was done as part of MATS 7.1

I pointed Claude at our new synthetic Sparse Autoencoder benchmark, told it to improve Sparse Autoencoder (SAE) performance, and left it running overnight. By morning, it had boosted F1 score from 0.88 to 0.95. Within another day, with occasional input from me, it had matched the logistic regression probe ceiling of 0.97 -- a score I honestly hadn't thought was possible for an SAE on this benchmark.

The most surprising development was when Claude autonomously found a dictionary-learning paper from 2010, turned its algorithm into an SAE encoder, and Matryoshka-ified it, improving performance by a few percentage points in the process. I had never heard of this algorithm before (although I really should have).

In this post, I'll describe the setup, walk through the improvements Claude found, and discuss what this experiment taught me about the strengths and weaknesses of autonomous AI research.

We haven't yet verified how well these improvements transfer to LLM SAEs, so don't rush to implement every change mentioned here into your SAEs just yet! We'll discuss challenges and next-steps for LLM verification at the end of the post.

The TASK.md we gave Claude and resulting SAE code is available on Github.

The resulting SAE, called "LISTA-Matryoshka" here, outperforms all standard SAEs tested and matches the performance of logistic-regression probes on SynthSAEBench-16k at L0=25.

The setup

We recently released a synthetic SAE benchmark called SynthSAEBench. The benchmark contains a synthetic model with 16k ground-truth features (SynthSAEBench-16k). We intentionally designed this model to be difficult for SAEs, including known challenges like hierarchical features, feature correlations, and feature superposition. In the paper, we found that the best SAE architecture we tested, the Matryoshka SAE, only achieves an F1 score of 0.88, compared to an F1 score of 0.97 achieved by a logistic regression probe. The best SAE also only achieved an average cosine similarity (MCC) of 0.78 between its learned latent directions and the ground-truth feature directions. For more details on these metrics, see the paper.
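For readers unfamiliar with the F1 metric in this context, here is a generic sketch of per-feature detection F1 (the benchmark's exact latent-to-feature matching procedure is described in the paper and not reproduced here; this is just the standard formula applied to one matched latent/feature pair):

```python
import numpy as np

def detection_f1(pred_active, true_active):
    # F1 between a learned latent's firing pattern and a matched ground-truth
    # feature's activity; both are boolean arrays over samples
    tp = int(np.sum(pred_active & true_active))
    fp = int(np.sum(pred_active & ~true_active))
    fn = int(np.sum(~pred_active & true_active))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

An F1 of 0.88 vs 0.97 therefore reflects a meaningful gap in how reliably SAE latents fire exactly when their matched ground-truth features are active.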

Training an SAE on SynthSAEBench-16k takes about ~20 minutes on a single GPU, making it a nice test-bed for rapid iteration. I set up Claude Code on a server and ran it in a Ralph Wiggum loop, where each iteration Claude conducts a "research sprint": it generates an idea, implements it, runs the experiment, and writes up a report. I'd steer Claude lightly by adding or removing ideas in a TASK.md file, but it was largely autonomous. The full TASK.md file is available here.
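For concreteness, the loop can be sketched as below. This is a minimal illustration, not the exact setup used: the `claude -p` non-interactive invocation and the file layout are assumptions, and state persists between sprints only through files the agent writes to disk (reports, code, results).

```python
import subprocess

def ralph_loop(task_file="TASK.md", max_sprints=100, run=subprocess.run):
    # each iteration hands the same task file to the agent as a fresh prompt;
    # the agent sees its past sprint reports only via files on disk
    for _ in range(max_sprints):
        with open(task_file) as f:
            task = f.read()
        run(["claude", "-p", task])  # assumed non-interactive CLI invocation
```

The `run` parameter is injectable purely so the loop can be exercised without actually launching an agent.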

SAE improvements

The following table summarizes the components Claude figured out to increase F1 score from 0.88 to 0.97 and MCC from 0.78 to 0.84. See the Appendix for full details on each.

  • Linearly decrease K: start with a higher K and anneal to the target K during training, similar to Anthropic's JumpReLU training recommendation. (Found in my repo; Claude tried it autonomously.)
  • Detach inner Matryoshka levels: detach gradients between Matryoshka levels except the outermost, so inner levels only get gradients from the full-width reconstruction. (Found in my repo; hinted by me.)
  • LISTA encoder: use a single iteration of LISTA (a neural approximation to classical sparse coding) as the SAE encoder. (Claude innovation.)
  • TERM loss: from this paper, up-weight high-loss samples via a tilted exponential. Claude repurposed it as a general SAE training improvement with a small tilt coefficient (~2e-3); very minor improvement. (Claude innovation.)
  • Sort Matryoshka levels by frequency: dynamically sort latents by firing frequency before applying Matryoshka losses, improving stability and dead latent revival. (My idea; Claude implemented.)
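As an illustration of the K-annealing idea, a linear schedule might look like the following (the start value, target, and annealing fraction here are placeholder numbers, not the ones Claude actually used):

```python
def k_schedule(step, total_steps, k_start=100, k_target=25, anneal_frac=0.5):
    # linearly anneal K from k_start down to k_target over the first
    # anneal_frac of training, then hold at k_target for the remainder
    anneal_steps = int(total_steps * anneal_frac)
    if step >= anneal_steps:
        return k_target
    frac = step / anneal_steps
    return int(k_start + frac * (k_target - k_start))
```

Starting with a looser sparsity constraint and tightening it over training gives latents more gradient signal early on, which is plausibly why this helps.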

Claude also tried plenty of ideas that did not work, which I won't list here, but this is part of the research process!

Some of these ideas were components Claude found in my SAE experiments repo, some were ideas I suggested, but the ones that impressed me most -- LISTA and TERM loss -- were fully Claude's own initiative. In both cases, Claude found a relevant paper online, adapted the idea to SAEs, and tested it without any prompting from me.

Diving deeper: LISTA encoder

Claude's idea to remix LISTA into an SAE and Matryoshka-ify it really amazed me, as this is something I would not have thought of; I was not even aware of LISTA before. However, this is probably an obvious thing to try if you were an expert on both modern SAEs and classical dictionary learning. Claude's implementation of the LISTA BatchTopK encoder is shown in pytorch pseudo-code below:

def encode(sae_in, W_enc, W_dec, b_enc, b_dec, k, eta=0.3, n_iterations=1):
    # standard BatchTopK SAE encode
    sae_in_centered = sae_in - b_dec
    hidden_pre = W_enc @ sae_in_centered + b_enc
    latent_acts = batch_topk(hidden_pre, k)
    residual = sae_in_centered
    # iteratively refine the initial encoding
    for _ in range(n_iterations):
        residual = residual - (latent_acts @ W_dec)
        correction = W_enc @ residual
        hidden_pre += eta * correction
        latent_acts = batch_topk(hidden_pre, k)
    return latent_acts

The idea is to iteratively refine the SAE prediction over a number of steps, ultimately converging to the "optimal" latent activations. The version in the LISTA paper is even more general than this, effectively with a learned W_enc, b_enc, and W_dec per-iteration, with eta also being learned, while Claude's version reuses W_enc across iterations and sets b_enc = 0 for each iteration after the initial SAE encode. The original LISTA work doesn't try to learn both the dictionary and the encoder at the same time, but rather tries to have the learned encoder approximate ISTA (where W_enc = W_dec.T), so it was surprising to me that this works if you just backprop through everything like Claude does.
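For context, a single tied-weight ISTA iteration, which LISTA unrolls and learns, can be sketched in numpy as follows (step size and threshold are illustrative; note the encoder here is the transposed dictionary, unlike a learned W_enc):

```python
import numpy as np

def ista_step(x, z, W_dec, eta, lam):
    # one ISTA iteration for min_z ||x - z @ W_dec||^2 + lam * ||z||_1;
    # the gradient step uses the tied encoder W_dec.T in place of W_enc
    residual = x - z @ W_dec
    z_grad = z + eta * residual @ W_dec.T
    # soft-threshold: shrink magnitudes by eta * lam, zeroing small entries
    return np.sign(z_grad) * np.maximum(np.abs(z_grad) - eta * lam, 0.0)
```

Replacing the soft-threshold with a (batch) top-k selection and untying/learning the encoder weights recovers something close to the encoder above.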

In follow-up investigations with Claude, it seems like deviating from this formula makes things worse. E.g. using more than 3 iterations, learning eta, or learning multiple W_enc / b_enc all seem to lead the SAE to overfit and no longer track the ground-truth features well (albeit with higher variance explained).

I'm a bit uneasy about running backprop through the full encode, as it will put gradient pressure on latents that don't ultimately end up in the final latent_acts, and thus do not get reconstruction pressure. However, it also seems like trying to block gradients to latents that don't ultimately get selected doesn't work well for reasons I don't yet fully understand.

While I'm not confident this works in LLM SAEs yet (results so far have been mixed), this is very much the type of thing I would expect to work well. An SAE can be viewed as a single step of the LISTA algorithm, and in theory a single step should not perform particularly well. It doesn't seem crazy that doing 2 steps, or 1.5 steps, or whatever Claude came up with exactly could help things. Doing too many steps seems to make it easy for the SAE to find creative ways to overfit (abusing correlations or superposition noise, for example).

Validating on LLMs with SAEBench

I've been trying to validate that these ideas improve performance on LLM SAEs using SAEBench, but have so far not been able to prove anything decisively. The core problem is that SAEBench metrics are noisy: you need multiple seeds, multiple L0 values, and results often point in different directions (e.g. TPP increases but SCR decreases). Properly evaluating a single architecture change can easily cost $1000+ in compute, which is prohibitive for an independent researcher without strong prior confidence that the results will be clear.

So far, LISTA with eta=0.3 seems to break on LLMs, and with lower eta it's hard to distinguish signal from noise. Some changes -- like Matryoshka frequency sorting -- are almost certainly improvements, but proving this rigorously will require training a lot more LLM SAEs.

Regardless, whether or not these improvements ultimately translate to LLMs is not Claude's fault. Claude crushed the task I set out for it, which was to make SAE architectural improvements that increase F1 score and MCC on SynthSAEBench-16k.

Claude's research strengths and weaknesses

Overall I was very impressed with Claude Opus 4.6's ability to do autonomous research. It came up with sprint ideas, ran them itself, summarized results, and then built on what worked. I was most impressed with its ability to find random research papers online and test out ideas from them without much prompting from me (aside from telling it to spend time looking at related fields before starting the sprint).

I thought the LISTA idea was particularly brilliant and is not something I would have come up with, but is probably obvious to someone who's an expert in classical dictionary learning. I think a big strength of these models is that they are very knowledgeable on basically every field, so if some idea would be obvious to someone who's an expert in a field I'm not an expert in, the model is likely to try ideas I wouldn't think to.

That being said, a lot of the other ideas Claude tried were either hinted at by me or were floating around in my SAE research repo that Claude perused. I find that once I hinted to Claude to try an idea by adding it to the ideas list, Claude was very capable of understanding the idea, coding it up, and testing it out, but for many of these ideas I'm not sure if it would have come up with them itself without this hinting.

One thing I noticed is that Claude tends to be over-confident in its interpretation of the results of its sprints, without thinking through all the possible reasons why the sprint may have gone wrong or what alternative explanations might exist for the results. For instance, in one sprint an implementation bug meant nothing was actually being tested, and Claude confidently declared the idea didn't work. Once I pointed out that it should check whether the code was actually running, Claude realized its mistake and redid the sprint. I do worry that the conclusions Claude draws are not always the most rigorously tested, but it's a cheap way to test out a lot of ideas quickly.

I also found that Claude tends to get stuck building on the first things it finds that seem to work, rather than trying a broad set of very different ideas. It took a bit of nudging to get Claude to try completely different ideas, since seeing its previous sprints seems to bias it toward building on them. I suspect it should be possible to get around this by either not letting it see the past sprints, or by doing a separate "idea generation" session outside of a single sprint, where you can collaboratively come up with sprint ideas to try.

I’ve also found that having Claude run these sprints solves a focus problem I struggle with in ML research, where I find it’s just so hard to stay in flow when you constantly need to run something and check back in 1 hour. I don’t like constant context switching, and tend to get distracted instead. Claude doesn’t get distracted, and will diligently run the next step 1 hour later and keep going until everything is completed and written up.

Overall this feels like having a really fast and extremely smart master's student who can iterate quickly but could use a little bit of guidance. I also think this setup benefits from having clear numbers to optimize and a relatively quick iteration cycle. I don't think it would have as much success if Claude had to train LLM SAEs and run SAEBench, for example.

Next steps

I now have a setup where I can propose an idea to Claude and then have it go off and investigate it, do a sprint, write up a report, and ping me when it's done. I'd love to have this integrated into Slack too, so I can just chat with it in a thread and have it run sprints and put the results and PDF reports into the Slack channel.

So far I've only had Claude trying to maximize scores on the single SynthSAEBench-16k model, and it has done an amazing job at that, but I suspect part of the success is that it's hill-climbed a bit too much on that specific model. I'll next try creating a suite of synthetic models with varying properties to make sure the ideas Claude comes up with are not over-fit to this specific synthetic model.

Finally, we need to get better at evaluating on LLM SAEs / SAEBench. This could look like trying to really expand the quantity and quality of datasets in each metric (maybe I can ask Claude to do this), or might just involve getting more compute funding to test these ideas out properly with multiple seeds per SAE. I'd be curious to hear any ideas on this from others in the community too!

Give it a try!

I found having Claude autonomously try out SAE architecture ideas on SynthSAEBench to be surprisingly easy. You can check out the code for the SAE Claude came up with and a version of the TASK.md prompt at https://github.com/chanind/claude-auto-research-synthsaebench. Try it out!


Appendix: Improvement details

Linearly decrease K during training

Claude found that starting with a higher K and linearly decreasing down to the target K during training seems to help the resulting SAE quality. This is implicitly similar to how Anthropic recommends training JumpReLU SAEs, so it's not shocking this would help BatchTopK SAEs too.

This setting was an option in my SAE repo, but Claude saw it, tried setting it, and found good results.
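As a rough sketch, the schedule might look like the following (the exact starting K and decay horizon here are my own assumptions; the post doesn't specify them):

```python
def k_schedule(step: int, total_steps: int, k_start: int, k_target: int) -> int:
    """Linearly anneal the TopK sparsity K from k_start down to k_target."""
    frac = min(step / total_steps, 1.0)
    return max(round(k_start + (k_target - k_start) * frac), k_target)
```

At each training step you would then keep the top `k_schedule(step, ...)` latents instead of a fixed K.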

Detach inner Matryoshka levels, but not the final level

Matryoshka SAEs compute reconstruction losses over nested fixed-size prefixes of the latents (called levels here) and sum these losses during training. This makes it like training SAEs of different widths that happen to share latents. Claude figured out that it improves performance to detach the gradients between each Matryoshka level except for the outer-most level. So if a Matryoshka SAE is trained with levels [128, 512, 2048, 4096], where 4096 is the full width of the SAE, the 128 level receives no gradient from the losses of levels 512 and 2048, but does receive a gradient from the full 4096 reconstruction.

This setting was an option in my SAE repo, and something I mentioned as an idea in the task.
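A minimal PyTorch sketch of this detaching scheme (shapes and names are illustrative, and I've left out the sparsity mechanism; the point is just where `.detach()` goes):

```python
import torch

def matryoshka_loss(x, acts, W_dec, levels):
    """Summed reconstruction loss over Matryoshka prefixes.

    Each inner level detaches the latents of earlier prefixes, so e.g.
    the first level gets gradient only from its own loss and from the
    full-width reconstruction.
    """
    full_width = levels[-1]
    total = x.new_zeros(())
    prev = 0
    for width in levels:
        if width < full_width:
            # inner level: earlier prefixes receive no gradient here
            prefix = torch.cat([acts[:, :prev].detach(),
                                acts[:, prev:width]], dim=1)
        else:
            # outer-most level: gradient flows to every latent
            prefix = acts[:, :width]
        recon = prefix @ W_dec[:width]
        total = total + ((x - recon) ** 2).mean()
        prev = width
    return total
```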

LISTA encoder

Claude found a dictionary-learning paper from 2010 called "Learning Fast Approximations of Sparse Coding" that uses a neural network to approximate a classical dictionary-learning technique, the Iterative Shrinkage-Thresholding Algorithm (ISTA). Claude whipped up an SAE version of this, using LISTA for the encoder, and also remixed a Matryoshka version.

Claude found that using a single iteration yields the best results, with a weighting of 0.3 for the adjustment after each iteration. I was really amazed by Claude here, as I would never have come up with a LISTA SAE, especially one where you intentionally train only 1 iteration rather than letting it converge. Claude's implementation also just backprops through the iterations during training, which I would not have thought would work, but it seems to!
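For concreteness, here's a minimal numpy sketch of a LISTA-style encoder with a single unrolled iteration and the 0.3 adjustment weighting mentioned above. In the actual SAE these matrices and thresholds would be trainable parameters, and all names here are illustrative:

```python
import numpy as np

def soft_threshold(z, theta):
    # the ISTA shrinkage nonlinearity
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def lista_encode(x, W_e, S, theta, n_iter=1, step=0.3):
    """Unrolled (L)ISTA steps: z <- z + step * (shrink(b + z @ S) - z)."""
    b = x @ W_e                    # pre-activations
    z = soft_threshold(b, theta)   # initial sparse code
    for _ in range(n_iter):
        z = z + step * (soft_threshold(b + z @ S, theta) - z)
    return z
```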

I had not heard of LISTA before (although I really should have in retrospect), and struggle with traditional dictionary learning papers in general.

TERM loss

Claude found the paper Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models, which has a loss called TERM that up-weights training samples that have large loss to encourage SAE training to focus more on these samples. The formula for TERM loss is the following, where $\ell_i$ is the normal SAE loss for sample $i$, $N$ is the number of samples in a batch, and $t$ is a tilt coefficient that determines how skewed the loss is towards high-loss samples:

$$\mathcal{L}_{\text{TERM}} = \frac{1}{t} \log\left(\frac{1}{N} \sum_{i=1}^{N} e^{t \ell_i}\right)$$

Interestingly, the paper doesn't even suggest this as a way to improve SAE performance, but Claude just did this anyway and found that using TERM with a small coefficient (~2e-3) seems to help SAE quality. This is a pretty minor improvement, but still a really interesting idea. It's possible that more tweaks to standard SAE loss like this could help improve performance as well.
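A minimal numpy version of this tilted loss, using the log-sum-exp trick for numerical stability (the per-sample losses would come from the usual SAE objective):

```python
import numpy as np

def term_loss(per_sample_losses, t):
    """Tilted empirical risk: (1/t) * log(mean(exp(t * l_i)))."""
    l = np.asarray(per_sample_losses, dtype=np.float64)
    m = (t * l).max()  # subtract the max before exponentiating
    return (m + np.log(np.mean(np.exp(t * l - m)))) / t
```

As t approaches 0 this recovers the ordinary mean loss, and larger t weights high-loss samples more heavily.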

Dynamic Matryoshka levels by firing frequency

Normally, Matryoshka SAEs enforce that the earlier latent indices must learn higher-frequency concepts. However, we already track latent firing frequencies during SAE training, so we can dynamically sort the latents by firing frequency before applying the Matryoshka losses. (A more rigorous version would probably sort by expected MSE, i.e. expected firing magnitude squared.) This helps training stability: if a later latent happens to learn a high-frequency concept, it does not need to unlearn it during training. It also helps with dead-latent revival, since dead latents are always implicitly revived into the outer-most Matryoshka level.
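Sketching just the sorting step (the firing counts would come from the running statistics the trainer already tracks; names are illustrative):

```python
import numpy as np

def frequency_sorted_order(firing_counts):
    """Indices of latents from most- to least-frequently firing.

    Applying the Matryoshka losses in this order keeps the busiest
    latents in the innermost levels; dead latents (count 0) sort last,
    landing only in the outer-most level.
    """
    return np.argsort(-np.asarray(firing_counts), kind="stable")
```

Before computing the Matryoshka losses you would then permute the latents and decoder rows, e.g. `acts[:, order]` and `W_dec[order]`.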




Don't Let LLMs Write For You

2026-03-11 02:49:47

Content note: nothing in this piece is a prank or jumpscare where I smirkingly reveal you've been reading AI prose all along.

It’s easy to forget this in roarin’ 2026, but homo sapiens are the original vibers. Long before we adapt our behavior or articulate formal heuristics, human beings can sniff out something sus. And to most human beings, AI prose is something sus.

If you use AI to write something, people will know. Not everyone, but the people paying attention, who aren’t newcomers or distracted or intoxicated. And most of those people will judge you.

The Reasons

People may just be squicked out by AI, or lossily compress AI with crypto and assume you’re a “tech bro,” or think only uncreative idiots use AI at all. These are bad objections, and I don’t endorse them. But when I catch a whiff of LLM smell, I stop reading. I stop reading much faster than if I saw typos, or broken English, or disliked ideology. There are two reasons.

First, human writing is evidence of human thinking. If you try writing something you don’t understand well, it becomes immediately apparent; you end up writing a mess, and it stays a mess until you sort out the underlying idea. So when I read clear prose, I assume that I’m reading a refined thought. LLM prose violently breaks this correlation. If some guy tells Claude to “help put this idea he has into words” then Claude will write clear prose even if the idea is vague and stupid. If the guy asks to “help find citations” and there are no actual good ones, Claude will find random D-tier writeups and link to them authoritatively. Worst of all, if the guy asks Claude to “poke holes in my argument” when the argument is sufficiently muddy, Claude will just kind of make up random “issues” that the guy will hedge against (or, let’s be real, have Claude hedge against). So you end up with a writeup which cites sources, has plenty of caveats, and… has no actual core of considered thought. If you read enough of these, then you start alt-tabbing away real fast when you see structured lists with bold headers, or weird clipped parenthetical asides, or splashy contrastive disclaimers every 2-3 sentences, or any number of other ineffable signs subtler than an em dash.

Is it possible that a 50% AI-generated hunk of text contains a pearl of careful thinking, that the poor human author simply didn’t have the time or technical skill to express? I suppose. But it ain’t worth checking.

Second, and closely related, AI prose is a slog. There’s way too much framing, there are too many lists and each list has a few items that serve no purpose, the bold and italics feel desperate, and it’s just all so same-y. In your own conversation with an AI that you can fully steer, you can sometimes break out of this feeling for a little bit. But reading the output of someone else’s AI conversation is rarely any fun.

In short, if someone reads writing “by you” and it seems LLM-y, they will think both that:

  1. You probably don’t have an actual good idea under the cruft
  2. Even if you do, the cruft is going to suck to get through

However much you want them there, they are not going to stick around. In fact, the more you want a reader, the more likely they are to be turned off by this stuff. Even if they’re the biggest AI fan in the world.

Luddite! Moralizer!

Fine. I admit it. Just this week, I too experienced Temptation.

You may know me as an editor. In this capacity, I was revising an academic paper’s abstract in response to reviewer comments. But I had several papers to work on in the same project, and the owner of that project actively encouraged me to use AI to move fast enough to meet deadlines.[1]

So I gave Claude the paper and the reviewer comments, and asked it to come up with a new abstract that would satisfy the reviewers. The result looked good.

“It’s just an abstract,” I whispered to myself, face lit eerily in my laptop screen’s blue light. “Summary. Synthesis.” I rocked back and forth. “I could… just…”

But no. Claude’s abstract was a useful reminder of which paper this was, and Claude helpfully catalogued what the reviewer requests were. Still, I rewrote the abstract myself, from scratch. In so doing, I noticed a lot of things I hadn’t seen, when I was just skimming the AI output. Stuff it included that it didn’t really need to. Stuff it emphasized that wasn’t actually that important.

Did I run my abstract by Claude in turn? Yes! It had two nitpicks, one of which I agreed with and fixed in my own words. Use these tools. You should totally ask Claude to find you sources for a claim, but then you should check those sources like you would check the sources of an eager day-one intern, and expect to throw most (or all) of them away. You should totally ask Claude to fact check, but expect it to miss some factual errors and unhelpfully nitpick others. You can even ask Claude to “help clarify your thinking.” But if you’re really just clarifying it, then you won’t use its text. Because once your thinking’s clear, you can write the text yourself, and you should.

  1. ^

    To be clear, editing I do as part of the LessWrong Feedback Service uses my own human judgment, and I don't use LLMs to make edits.


