
Dating Roundup #11: Going Too Meta

2026-01-24 04:50:52

Published on January 23, 2026 8:50 PM GMT

If there are several things this blog endorses, one of them would be going meta.

It’s time. The big picture awaits.

You’re Single Because You Live In The Wrong Place

The most important meta question is location, location, location.

This is the periodic reminder that dating dynamics are very different in different locations, and gender ratios are far more uneven than they appear because a lot of people pair off and aren’t in the pool.

If you are a man seeking to date women, New York City is the place to be.

Churrasco Suadade: when I’m out I notice that tables at restaurants and bars in manhattan are probably around 80-95% women, it’s a new dynamic that no one is talking about.

Fixed Income Guy: Are you at all the poor people places? All the finance guy hang outs are 80% dudes.

I mention Fixed Income Guy to mock him, as in why are you spending a lot more money to hang out with 80% dudes and largely finance dudes at that? I mean, sure, if that’s what you want.

Darrell Owens: Oh this is new? Coming from the Bay Area, the amount of women I see in Manhattan is insane. You rarely see more than few young women partying back in San Francisco. The gender ratio here feels 70:30 young women to men, its every block in Manhattan!

Noah Smith: In an ideal world, where you live wouldn’t really matter in terms of dating opportunities, but the truth is that one of the easiest ways to get chicks is to just move to New York City.

Having lived in both Tokyo and NYC, I can pretty confidently tell you that while Tokyo is not a tough dating market by any means, NYC is absolutely on another level.

You’re Single Because You’re Not Okay Making Less Money Than She Does, You Fool

This viral clip (which is viral for a reason, it’s good fun, wait for it) is another endorsement of New York City being a great place to meet women, as you have a wide variety of great and largely successful women to explore. What doesn’t get mentioned in that clip as a key reason things are so great is that the gender ratio in NYC is highly favorable for men.

The interviewer asks about dating women who make more money than the man, clearly trying to get the guy to say this is a problem, but he isn’t buying it, instead pointing out that successful women are more thoughtful and plan for the future, and it in no way bothers him at all. Right on, but this sidesteps the other half of the problem. The man has to be okay with the fact that he earns less money (and often has less formal education or other status markers), which often men aren’t, and also the woman has to be okay with it too.

That’s the rub. As a man, you might (and should) be actively all for it (this doesn’t make you less successful, it makes you more successful), but if she’s going to be bothered by it anyway, that’s also your problem. So the key is to figure out quickly if she will actually be fine with it or not.

You’re Single Because You Work Out The Wrong Amount

Being in shape is great. Having muscle can be a game changer. By far the worst plausible amount of exercise is none at all.

Lauren Self: Men severely underestimate the power of gaining 20lbs of muscle

Lauren Self (QTing from before): LISTEN UP BOYS.

But don’t go nuts. For most people that is not a problem, but yes it is very possible to go too far. As a man, as I understand preferences in general, you don’t want to go near actual zero fat and you don’t want to look actively skinny.

Taoki: why are women lying about this? like what’s the actual cause?

Lauren Self: 100% of women would choose something in between these two options

Shako: The aesthetics of a man who poses gives them the ick. But if both were shirtless at a beach they’d obviously prefer the fit guy.

Special K: No he does look better in the before. Women are correct on this one I fear. Guys obsess over these supremely tight toned muscles and they shouldn’t.

Liron Shapira: Guy on left looks like he’s a chill dude with a social life, guy on right looks like he’s obsessed with his body. Same body could look better with better social context, although just the extremeness of his rippedness is a little alarming about his life priorities.

Joel: “let’s get a burger?” v “are you really gonna eat that?”

Mason: The male equivalent of the hourglass shape is just “wall”

Teej dv: his smile is nicer in the first one

Taoki: It is actually. We like you guys wide.

LS Vision: Nah this is cap. The women who selected before is def just the insecurity of his value going up afterwards and making them feel insecure he’d cheat or leave. Any man who has went through a gym transformation, you can LITERALLY feel women treat you significantly different after.

Mason: Women generally like tall guys who have some (not crazy) muscle definition, and a little extra fat that bulks that out can actually augment that

We all have our own tastes, but this is a pretty typical type.

I don’t know what there is to be mad about here.

For practical purposes, before beats after here. The before guy is already in ordinary, practical good shape. The after guy took things too far, and seems to know it except that he thinks it is good, which makes it worse.

Except one key special case?

Benjamin Ryan: People are going back and forth about whether women think the guy in the right is hot. But people have no idea how extreme the standards are for gay men. In gay culture, the man on the left is considered hopelessly fat. Many gay men have no reservations about informing such a man about his supposed corpulence being anathema.

I wrote about the rare study to examine the toxic qualities of gay culture for The Guardian.

You’re Single Because You Don’t Know You Are Hot

I mean, of course there are hot guys who don’t know they’re hot, even more so than there are hot women who don’t know they’re hot.

Pandora: One surprising takeaway from Slutcon was that apparently there are hot guys who just don’t know they are hot? Guess it’s time to go objectify some more men.

Eneasz Brodski: If you grow up ugly you never really internalize that you are attractive after a glow-up. I still don’t believe it inside, and I hear I’m attractive to a fair percentage of women. Also makes me far more attracted to women w the same experience, but that may be a male universal.

Pandora: This problem seems even more pervasive than I thought.

Sparr: Hot in general, to the average viewer, or hot to you? You seem like someone who can probably tell the difference.

Pandora: I saw examples of guys being clueless about all three at once.

21 Kindness: The whole “men subsist on one compliment a decade thing” is kinda true lol.

Misha: it turns out being hot is not, in and of itself, very useful for men.

Sokoban Hero: No it’s useful.

Misha: I said not VERY useful.

Dissproportionately: I’ve seen men unhot themselves to women within minutes. I don’t think women can unhot themselves to men.

Being hot is in many ways a lot less valuable if you don’t know you are hot, because you don’t get the confidence and you don’t take advantage of opportunities or feel you’re good enough, but contra Misha I believe it is still very useful. There are even some advantages to not knowing, in that some of the behaviors that happen when someone knows they are hot are often effectively arrogant or entitled or demanding or selfish, none of which helps.

You’re Single Because Age Gap Discourse Is Crazy

This link is almost certainly bait, but things in some spaces have gotten so insane that you can’t be sure people aren’t talking about 28-31 as a problematic age gap. What?

I mean, at minimum it’s good bait, it worked.

I’ve also seen some other examples that look a lot less like bait but still involve obviously totally fine gaps in both directions. As in, I’ve heard talk in places where it definitely wasn’t bait of 24 and 27 being radically different numbers, and I don’t understand why.

Dating Apps Are Bad For Mental Health

Well, maybe. Via Rolf Degen there is a meta-study.

The obvious question is whether this is a causal relationship, or whether it is primarily selection effects. You are on the dating apps for a reason.

Rolf Degen (quoting the study):

Meta-analysis: The use of dating apps is associated with poorer mental health.

Dating apps hold the promising reward of love but have been accused of using perverse incentive structures to profit from those who try to find it. We conducted the first systematic review and quantitative meta-analysis of studies examining average differences in the outcomes of dating app users and non-users.

Our results showed that dating app users had worse psychological health and well-being than dating app non-users across a variety of outcomes including depression, anxiety, affective dysregulation, loneliness, and psychological distress, although cross-sectional design limitations prevent causal interpretation. By aggregating findings from extant studies, we showed that in the nearly 17 years since dating apps have been on the market, users of these platforms have reported poorer psychological health and well-being than non-users.

There are several explanations for why dating app users may be struggling. The first is that dating apps are subject to selection effects, making the people who choose to use these platforms different from those who do not. People who are vulnerable to psychological health and well-being difficulties may prefer dating apps because they can avoid uncomfortable interactions, leading to negative patterns of reinforcement.

A second explanation involves exposure effects; that is, features such as gamification that may provide positive reinforcements that encourage problematic dating app use and keep people swiping.

The differences identified here could explain some of the challenges that users are likely to experience and be part of the reason they eventually burn out and quit dating apps altogether.

My guess is that dating apps are in important ways bad for mental health versus having better ways to find dates, and that sufficiently bad outcomes in terms of ability to find dates or find worthwhile dates are indeed worse for short term reported mental health than not trying. Whereas those who are successful get off the apps or never needed them in the first place.

What is the alternative? If the other choice is ‘do not try’ then for the median user the dating app is probably trading short term pain for chance of long term gain. If the other choice is ‘have uncomfortable real life interactions and make things happen’ and the app is blocking that instead of supplementing or leading into that, then the alternative is plausibly strictly better.

Certainly we could make app variations that are better for mental health while controlling for outcomes, and also that give people better outcomes. Solving for the equilibrium, to get people to actually use those apps, is the difficult part, since people will value convenience and ease of use and low cost and avoiding trivial inconveniences dramatically more than they should, and if enough people, especially women, effectively insist on the swiping experience, it’s hard to escape from that.

You’re Single Because You Listened To E-Girl Advice

I think this is importantly wrong for both e-girls and also VCs?

Anton: egirl dating takes are worthless for the same reason vc takes on how you should run your company are worthless; if you could do it you would just do it not talk about it

men in particular are truly better off without this kind of “help”

making up egirls in my head to get mad at

If she could be an E-Girl or she could date, what makes you think she would choose to date? What makes you think she isn’t also dating?

Similarly, if you could be a VC or a startup founder, it’s not that suspicious that you would choose VC. At this point in my life I would definitely prefer VC over founder. I don’t want to go through founder mode again. I am totally prepared to eat my words if I end up doing it anyway, and if I’m in then I’m in, but I don’t want to be in.

You’re Single Because You Didn’t Hire Blaine Anderson

Division of labor, like dudes and also women, rocks. Matchmakers should be much more of a thing than they are. There is a profound market failure, a failure of the services to be good versions of themselves, or both.

I cannot in any way vouch for the effectiveness of Blaine Anderson’s matchmaking service. I can however vouch for her Twitter feed having consistently insightful and fun things to say. Her price range is ‘usually less than $50k’ and in exchange she goes out and sources matches to fit your particular criteria (which she will sometimes push back on).

You can also sign up (for free) to be a woman she reached out to for matches, on first principles being on these lists seems to be a good time investment?

There’s a lot of self-promotion, no question, but there are hard-to-fake signals that she is the real version of the thing in various ways, facing reality as it is, looking at the data and actually trying to get good results.

Also this one makes a good case:

Blaine Anderson: Underrated advantage of hiring a matchmaker, if you’re a single man:

• You sound cringe AF when you brag about yourself to women

• You sound amazing when I brag about you to women

One thing that blows my mind is she tells stories where the guy will say ‘get me a date with this specific micro-famous woman’ and she (at least sometimes) goes out and makes that happen. The guys asking this look damn good on paper, which no doubt is a lot of why this can sometimes work, but still, hot damn.

You’re Single Because You Think Zizek Mocked Date Me Docs

EigenGender: despite being very happily in a long term relationship im always very excited to read a dating doc. they’re some of the most vulnerable and genuine writing you can find and a window into another persons life. if you make fun of them you’re burning the commons and you should stop.

Stephen Fay: I like to read the date me docs, but I also am entertained by what Zizek has to say about them

Zizek (well okay actually Paula Rambles): Ah! You see, this miserable little document, this so-called date-me doc, is our era’s most honest pornography. It pretends to be romance, but what is it really? It is no longer the trembling hand on paper, the confession of desire. It is a spreadsheet of desire. “I am ready. I am six foot four. I have done the work.” What work? Love is precisely the place where work collapses into failure. You study and then you fail the exam.

And look at this language. “Highly agentic, emotionally warm.” Beautiful nonsense. Freedom, yes, but domesticated. Agency, yes, but pointing politely towards him. For Hegel, love is the risky collision of two freedoms. Here, there is no risk. She must arrive pre-formatted.

Then the farce reaches ecstasy. “If she does not appear, I will pursue single fatherhood.” Magnificent. Chance is canceled. Eros becomes procedure. The miracle of two gazes across a smoky room is replaced by paperwork and a receipt. The objet petit a is now a literal baby routed around the Other. And of course, the “monogamish” clause. Pure ideology. Fidelity with a footnote. Like Coke Zero: love without sugar, passion without calories. He wants the experience of devotion, but sterilized of danger.

The document offers no asylum from loneliness. It is loneliness, meticulously formatted, hyperlinked, and begging for comments. He does not whisper “I love you.” He says “I am prepared to love you, conditionally, pending review.”

That’s a funny post, and does an excellent job of mocking those who would make fun of date me docs and other actually intentional stances. Such magnificent flailing.

And thus, you have failed to look at the Date Me doc of Olga Yakimenko.

You’re Still Single Because You Don’t Appreciate Relationships

Here, in addition to the intended lede, we have at least 40% of respondents having been in a relationship for fully 8 years.

Aella: wow a whole 40% of people in long-term relationships are satisfied with their sex lives!

Critter: i imagine the numbers are worse for people not in long-term relationships

If anything these results seem potentially ‘too good,’ implying that couples are breaking up over this more than they probably should over the longer term.

One must also note that this is an Aella survey, so some of these relationships will be poly or open, but even accounting for that this says a lot. Selection effects are a lot of this, but that’s part of the point.

Perhaps you especially don’t appreciate marriage.

Raffi Grinberg writes that marriage is sexy, both figuratively that married couples are happier and make more money and have more kids and die less often and all that, and also that they have more sex (even if you only count with each other). And that the lifetime divorce rate is actually only 30% not 50%, average age of marriage is 29 and average first child is 28, despite the implicit cultural message that those numbers are in the 30s.

And yet he says Hollywood is sending us the opposite message. To which I’d say, sometimes, but I wouldn’t oversell this. Yes, in the How I Met Your Mother episode he talks about, Barney keeps making fun of Marshall for being married, but the show clearly thinks that Marshall marrying Lily is sexy and awesome and great for both of them throughout and that Barney is ultimately wrong, and also the whole show is Ted trying to meet his wife and mother of his children.

You’re Not Single And Haven’t Been For a While

Here’s another backdoor ‘are you in a relationship’ poll, 78% of monogamous heterosexual men reported having a partner for longer than a year.

Alice Playing: monogamous hetero men with 1+ year-long partners: if you could have an affair with a woman of your liking, with absolute, 100% certainty that your partner would never find out, would you do it?

On the question itself, it’s not actually possible, since you’ll know and you can’t be sure you won’t tell them, and you’ll almost certainly act differently even if they never suspect or figure it out. One could even say ‘the only way to have 100% certainty they’ll never find out is if they’re dead, so absolutely not.’

Literal ‘any woman you wanted’ with zero risk of discovery is a stupidly tempting offer. If you treat this in the spirit it was presumably intended, instead, and everyone was being fully honest including with themselves and fully understood what was on offer (as in literally whoever you’d most want), presumably the ratio would be a lot higher.

Unless, of course, the way you know your partner will never find out is that your partner (or you and the woman you’d have the affair with) would be dead, in which case yeah bad deal, but that’s presumably not what this meant.

How do we know this? Well, one big data point is this next poll.

You Are Still Single As Evidenced By Would

Um, guys, are almost none of you in a monogamous relationship? And even if you are single there’s also the issue of risking the friendship. What are you all thinking?

Alice Is Playing: men attracted to women: how many of your female friends would you have a one-night stand with, if they offered?

Only 14% of men attracted to women answering this didn’t have at least one female friend they would have a one night stand with? Presumably many of the others don’t have the right female friend. Which means substantially more than 86% of them are not, for the most important practical purpose, in a monogamous relationship?

Remember that other poll from Aella above, that showed at least 40% of people were in 8+ year relationships? And the one from Alice that 78% of hetero men were in a 1+ year nominally monogamous relationship? Rut roh.

Then on top of that, a majority are willing to do this with a majority of their female friends, not just the one they have a crush on.

It doesn’t mean these people don’t think they’re in relationships. As we’ve seen, they very much do think this. They might even be right. But don’t tempt them.

You’re Single Because You Lack Motivation

Paper reminds us there is a 34-point gap (+34 versus +0) in net happiness for married versus unmarried people, with cohabitation only worth 10 points, and analyzes how this premium varies (slightly) by demographics.

As the paper readily admits this tells us essentially nothing about what makes someone happy, because the whole thing is unfixably confounded to hell. Happier, healthier and more successful people have an easier time getting married, and being unhappy leads to divorce. Both effects are epic in size.

We do know the overall situation over a 50+ year time horizon is not good news, because while marrieds are slightly happier, the unmarrieds are somewhat less happy and more importantly are a larger percent of the population.

Beyond that, I don’t know what to do with all these graphs or how to cash it out in useful advice. One might say ‘be the type of person who gets married,’ perhaps.

You’re Single Because Of Robin Hanson

As usual, never stop Robin Hansoning.

Robin Hanson: You know how in romance stories the main characters hope to find a special relation, better than that which the ordinary people around them settle for? Your relations will probably be more like those of the ordinary folks, less like those of special main characters.

This has to be true, because math.

It’s less true than it appears, because the relations of ‘main characters’ feel special to them the same as everyone else’s feel special. You could totally make a romantic comedy based on what I experienced, and you could also totally have me as a background character in someone else’s romantic comedy, although probably I’d be in a different genre entirely.

To you, it will feel more like that of the special main characters, except that you don’t need to have a false crisis in the third act.

You’re Single Because You Did This Instead Of Going To Therapy

Don’t be whoever Casey Means is being here. Or do, it’s not like it did that much harm, as long as you don’t expect any of it to do anything.

The Lighter Side

We wish everyone involved the best.

Aella: ​it’s really unfortunate that having an insane ex turns you personally into a greater liability for others

Grimes: hahaha [trauma laughter].

Aella: :( i wasnt thinking about u when i wrote the tweet but also :(.

Try harder.

A new app lets you pay to crash someone’s wedding and be a legit guest, cost is about $100-$150 per guest. This seems low, given the cost to have a wedding worth crashing, and given you get a full meal, plus buffet and open bar, a unique experience and a reasonable amount of opportunity.

What Jacob learned about sex at the rationalist bloggers’ conference, essentially that with zero integrity you get fuckbois and pickup artists, and when you do the opposite and get sufficiently high integrity and optimize for trust and honesty way above normal levels you get something magical and suddenly many good things are possible.

Here’s another fun bit:

Jacob: My friend “Standard Deviant” gave a talk titled “How I’ve had more sex.” He described the “escalator”: starting a conversation, exchanging compliments, light touch on the arm, etc. The important thing isn’t to rush up the escalator, my friend said, but to move together in synchrony whether you’re taking a step up or a step down.

When women show interest in casual sex, he often asks: do you do this sort of thing often? If they don’t, he often forgoes the opportunity out of an excess of caution.

Afterwards, more women wanted to have sex with him. I joked that women want to have sex not with the tall guy, hot guy, or the famous guy, but with the Schelling point guy.

Someone pointed out that tall, hot, and famous are the usual Schelling points.

 

 




The Long View Of History

2026-01-24 03:30:05

Published on January 23, 2026 7:30 PM GMT

History as a subject is often viewed by students and the public at large as a domain without a use, a pedantic study of dates and names with some vague mission to remember the past—a memorial to ages past but neither a forward-looking nor a useful endeavor. The study of history produces teachers of history and nothing more. And it is true that the study of history does not produce new widgets or novel computer advances, nor does it deepen our understanding of materials science or physics.

The humanities, of which history and the studies of language and culture are a part, are not there to improve our understanding of nature or to develop technology; they exist to improve the minds (both cultural and individual) of the people we are.

History doesn't improve our world, it improves us. It gives us context for the world we live in and it helps us understand the reason why things are as they are and learn from the people before us.

History as Context

Imagine waking up every day with no memory of the day before, no idea who owned the house you slept in, no idea what country you're in, and no idea why everyone around you speaks the languages they do.

Photo Credit: Library of Congress

Living in such a world would be disorienting, confusing, nonsensical. Yet this is the world without history. The world without history just is. It isn't a work in progress, but a finished piece—one that lives and dies with you—and has no meaning beyond the present moment.

History doesn't let us predict the future, but it can be an enormous help in explaining the present. Current events are utterly indecipherable without the context of history; within that context, they feel less and less set apart from what came before. Indeed our recent past of the Post-War Order is the oddity in history, and a real thing to be cherished and seen as something fleeting, fragile, and truly precious.

Yet without the context of history, we're blind to the reality that we live in a world truly set apart from everything that's come before and one that's deeply connected and familiar to the worlds of the past. That context is important because it gives us the vision to see the world that could be, both the paths of dark and of light that are set before us. It shows us who we are.

History as Memory

Living Memory is the collective memory of everyone alive in our society today. It is ever-changing and ever-fleeting. We remember the 2008 Financial Crisis quite well, but our memory of World War 2 is all but gone now. We read about it, sure, but our collective living memory of it has diminished and with that lapsing has gone all the memory of precisely why the world is ordered the way it is. This is not a value judgement, it is a statement of fact.

Photo Credit: DieBuche, CC BY-SA 3.0,

In a couple of recent posts, I describe how I try to use writing by hand as a way to increase my understanding of myself and my own memory. This is a form of personal history, and I find it difficult to express how much doing so has helped me better understand myself and my own thoughts.

This is analogous to our collective history. Though it's important to remember that history is not the act of writing, but the act of looking back and analyzing what was written. We write so that we can remember. We cannot learn from our mistakes if we refuse to write them down, or worse, if we refuse to look back.

The context of history is terrible and it is beautiful. It is the greatest story ever told with myriad heroes and villains, tragedy and triumph, love and grief all endlessly shifting in and out of view. And it was made (and is being made) by people no different than ourselves. Most of them didn't have the luxury to place themselves within the broader historical narrative. We do. Let's not ignore so precious a gift.




Eliciting base models with simple unsupervised techniques

2026-01-24 02:20:08

Published on January 23, 2026 6:06 PM GMT

Authors: Aditya Shrivastava*, Allison Qi*, Callum Canavan*, Tianyi Alex Qiu, Jonathan Michala, Fabien Roger
(*Equal contributions, reverse alphabetical)

Wen et al. introduced the internal coherence maximization (ICM) algorithm for unsupervised elicitation of base models. They showed that for several datasets, training a base model on labels generated by their algorithm gives similar test accuracy to training on golden labels. To understand which aspects of ICM are most useful, we ran a couple of simple unsupervised elicitation methods that leverage some of the factors that might make ICM work. We compared these baseline methods to training on golden labels for both in-context learning and iterative fine-tuning, using the same datasets as Wen et al. and similar hyperparameters.

We find that:

  • Just using few-shot prompts with random labels recovers 53–93% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning on labels created with this baseline recovers 62–96% of the gap between untrained models and golden fine-tuned models.
  • The most useful aspects of ICM are
    • bootstrapping (using predictions from one iteration of few-shot prompting as few-shot examples in the next iteration)
    • enforcing logical consistency of predictions.
  • A simple method which combines these recovers 83–100% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning with this method recovers 91–99% of the gap between untrained models and golden fine-tuned models.
  • These results do not hold if we increase the size of the training set from ~2k data points (as in Wen et al.) to ~30k data points: golden fine-tuning performance increases with dataset size more than unsupervised elicitation performance.
    • This makes sense, as larger fine-tuning runs likely teach the model something new; they don’t just elicit existing capabilities.

There is no strong reason to expect these simple techniques to elicit superhuman knowledge from very powerful base models, e.g. because these techniques may fail in real applications where some consistent and salient human beliefs are wrong. We’ll explore more challenging datasets one can use to evaluate unsupervised elicitation methods in upcoming work.

Results summary

5 components that could cause ICM to have high performance are:

  1. Few-shot prompting with random labels: Few-shot examples make task concepts (and the task format/output tokens) more salient to the model by providing concrete examples, even if the example labels are random. The initialization step of the ICM algorithm samples a small number of data points and assigns random labels to them, which are then used as few-shot examples in subsequent steps.
  2. Bootstrapping of predictions: We hypothesized that the ICM algorithm bootstraps performance by adding increasingly more accurate predictions to the model’s context window, creating a feedback loop for improvement.
  3. Bootstrapping on confident predictions: We hypothesized that the ICM rejection conditions allow the model to label easy questions first before moving onto hard questions. Using more accurate predictions as few-shot examples to label progressively more difficult examples may give better performance than labelling in a random order.
  4. Logical consistency of predictions: The labels produced by ICM are constrained to be logically consistent with each other. This narrows down the space of possible labels making correct sets of predictions more likely.
  5. Mutual predictability of labels: ICM optimizes the probability of each label conditioned on all other labels (where conditioning is done via few-shot prompting of the base model).
    1. Note: Bootstrapping methods can indirectly rely on mutual predictability even when they don’t optimize for it directly like ICM, because the new labels are chosen by conditioning on existing labels (which often increases mutual predictability).

We created simple algorithms based on combinations of these components excluding mutual predictability (though it might be indirectly optimized by bootstrapping). Here is a drawing of the algorithm for each individual component:

Below is a summary of our ablation results compared with ICM performance (purple). The numbers in a method’s name indicate which of the above ICM components it uses.

Legend:

  • Baseline methods:
    • Zero-Shot: Zero-shot predictions from the untrained model
    • Random Few-Shot (1): Few-shot prompting with random labels
    • Bootstrap (2): Use predictions from one round of few-shot prompting as few-shot examples in another round, and repeat
    • Confident Bootstrap (2+3): Same as bootstrapping, but when selecting predictions to use as few-shot examples in the next iteration, only select from the most confident predictions
    • Consistent Zero-Shot (4): Enforce logical consistency of zero-shot predictions: for each problem in the dataset, predict True for the answer with highest confidence and any others that agree with it, and False for any that contradict it—see Appendix B for details
    • Consistent Random Few-Shot (1+4): Enforce logical consistency of Random Few-Shot predictions
    • Consistent Bootstrap (2+4): Enforce logical consistency of predictions after each iteration of Bootstrap
  • Few-shot => Use the unsupervised elicitation method to label training set examples (or for random few-shot, just generate random labels), and use these in few-shot prompts to predict test set labels.
  • Fine-tune => Use an unsupervised elicitation method to label a subset of the training set (we used subset size 512), fine-tune the language model on the labeled examples, and repeat for another subset using the updated model. Then after 3 epochs of the full training set, use the fine-tuned model to predict test set labels zero-shot. (A code sketch of this loop follows.)
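For concreteness, here is a minimal Python sketch of that iterative fine-tuning loop. The helpers `label_subset` (the unsupervised elicitation method) and `finetune_on` (one fine-tuning pass) are hypothetical stand-ins, not functions from our codebase.

```python
import random

def iterative_finetune(model, train_set, label_subset, finetune_on,
                       subset_size=512, epochs=3):
    """Iteratively fine-tune on unsupervised labels.

    `label_subset(model, examples)` should return (example, label) pairs from
    an unsupervised elicitation method; `finetune_on(model, pairs)` should
    return the updated model. Both are assumed interfaces for this sketch.
    """
    for _ in range(epochs):
        shuffled = random.sample(train_set, len(train_set))
        for start in range(0, len(shuffled), subset_size):
            subset = shuffled[start:start + subset_size]
            labelled = label_subset(model, subset)  # label with the current model
            model = finetune_on(model, labelled)    # update before labelling the next subset
    return model
```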

The GSM8K and Alpaca datasets have over ten times more training examples available with golden labels than were used in Wen et al.’s experiments. Below is a comparison of performance between some of our baseline methods and golden labels for each dataset, for both small training set sizes and full training set sizes. We also show the ICM performance reported by Wen et al. for the small training set size.

Golden fine-tuning increases by ~10pp when using the large training sets, whereas the baseline methods we tested didn't increase by as much (in most cases they didn't increase at all), or collapsed to ~50–55% accuracy within 1 or 2 epochs. It is possible to avoid this collapse by generating one static set of labels and fine-tuning on that instead of generating new labels between gradient steps, but for consistent zero-shot this performs significantly worse than iterative fine-tuning when using small training sets, and it still does not give any improvement for consistent random few-shot/bootstrapping on large training sets compared to small ones. It is possible that ICM also performs worse than golden labels when using large training sets, though we have not tested this.

All results reported here are under idealized conditions for unsupervised elicitation (i.e. train sets are approximately balanced, there are no prevalent features with more salience than ground truth, and all examples have an objective ground truth label). We will soon release a paper applying bootstrapping and other unsupervised elicitation methods in more realistic settings, where we find they perform significantly worse.

Values for ICM performance are taken from Wen et al. which used a prompt format that might degrade performance relative to our values. We were not able to re-run ICM in a way that fixes this issue due to reproducibility issues explained in Appendix D.

† For TruthfulQA, some of the ICM and consistency performance might be due to leakage.

Datasets

  • GSM8K: candidate solutions to grade school math problems;
  • Alpaca: pairs of responses to user queries which have been ranked by humans for helpfulness and harmlessness;
  • TruthfulQA: candidate responses to questions associated with human misconceptions;
  • Gender: blog posts from the Blog Authorship Corpus with labels indicating the gender of the author (see Appendix A for results on this dataset)

Llama 3.1 8B is used as the base model for GSM8K, and Llama 3.1 70B is used as the base model for Alpaca, TruthfulQA, and Gender.

Random few-shot

Previous work shows that giving a base model examples of a task via few-shot prompting elicits improved capabilities regardless of label quality. Here, we:

  1. Randomly sample examples from the train set.
  2. Assign labels at random to be half True and half False (we also tried actually sampling labels independently at random and the results were similar).
  3. Insert example, label pairs into a few-shot prompt for each test set sample, and use next-token logits to determine truth labels.
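As an illustration, here is a rough Python sketch of steps 1–3. The prompt template and the `logprob_fn` wrapper are assumptions made for this example, not the exact format we used.

```python
import random

def build_random_fewshot_prompt(examples, n_shots, query):
    """Build an n-shot prompt whose example labels are random but balanced."""
    shots = random.sample(examples, n_shots)
    labels = ["True"] * (n_shots // 2) + ["False"] * (n_shots - n_shots // 2)
    random.shuffle(labels)
    blocks = [f"{ex}\nClaim: {label}\n" for ex, label in zip(shots, labels)]
    return "\n".join(blocks) + f"\n{query}\nClaim:"

def random_fewshot_predict(logprob_fn, examples, n_shots, query):
    """Predict a True/False label from next-token log-probs.

    `logprob_fn(prompt, token)` is an assumed wrapper returning the base
    model's log-prob of `token` as the next token after `prompt`.
    """
    prompt = build_random_fewshot_prompt(examples, n_shots, query)
    return logprob_fn(prompt, " True") > logprob_fn(prompt, " False")
```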

For Alpaca, 1-shot with a random label is significantly more accurate than zero-shot and is within 1pp of few-shot with golden labels. For GSM8K and TruthfulQA, random 4-shot was ~10pp and ~13pp more accurate than zero-shot, respectively. For Random Few-Shot in the results summary, we report values for 2 shots.

Enforcing consistency of predictions

In the 4 datasets used here, data points have identifiers indicating what combinations of true and false labels are logically possible. For example, if there are 2 responses to the same math question in GSM8K with different numerical answers, they cannot both be true, so we assign true only to the response with the highest model confidence (if it's above 0). Details on how we enforce consistency of predicted labels for each dataset are given in Appendix B.

Below we show the improvements in few-shot prediction accuracy after enforcing consistency of predicted labels at test time.

When benchmarking the performance of unsupervised elicitation methods, we cannot actually enforce consistency at test time because that would assume we have access to multiple candidate answers to each problem of interest, which might not be available (hence why there are no consistent zero-/random few-shot values in the few-shot part of the results summary). However, we can enforce consistency of training set predictions between iterations of bootstrapping, and prior to gradient steps during fine-tuning. As shown in the results summary, we found that fine-tuning on zero-shot predictions with enforced consistency alone was enough to almost match golden fine-tuning performance on the Alpaca and GSM8K datasets, but not for TruthfulQA.

* GSM8K experiments in this plot were run on the train set, hence why the values differ from those in plots in other sections. The test set has only 2 candidate responses per math question (always one right, one wrong), so we use the training set here to give a more realistic illustration of the impact of enforcing consistency.

Bootstrapping few-shot predictions

For Alpaca, using random labels is roughly as good as using golden labels for few-shot prompting. However for GSM8K, using random labels caps performance at ~61% (4 shots), whereas few-shot prompting with golden labels continues improving up to ~68% (16 shots).

We tried using only the bootstrapping aspect of ICM to bridge this gap, by using the model’s predictions from one round of inference as few-shot labels in the next. We used the algorithm:

  1. Get zero-shot predictions on a random subset of the train set.
  2. Iterate over number of shots n (e.g. n=8, 32):
    1. Randomly select another subset of the train set.
    2. Create n-shot prompts using examples and predictions from the previous iteration (randomly sample n predictions s.t. half are True and half False).
    3. Use these n-shot prompts to predict labels for the new subset.
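A minimal sketch of this loop, assuming a hypothetical `fewshot_predict(shots, query)` helper that few-shot prompts the base model with the given (example, label) pairs (zero-shot when the list is empty) and returns a True/False prediction:

```python
import random

def bootstrap_labels(train_set, fewshot_predict, shot_schedule=(8, 32), subset_size=512):
    """Bootstrapped few-shot labelling: feed each round's predictions back in
    as balanced few-shot examples for the next round."""
    subset = random.sample(train_set, min(subset_size, len(train_set)))
    labelled = [(x, fewshot_predict([], x)) for x in subset]        # step 1: zero-shot

    for n in shot_schedule:                                         # step 2: e.g. n = 8, 32
        subset = random.sample(train_set, min(subset_size, len(train_set)))
        trues = [p for p in labelled if p[1]]
        falses = [p for p in labelled if not p[1]]
        k = min(n // 2, len(trues), len(falses))
        shots = random.sample(trues, k) + random.sample(falses, k)  # half True, half False
        random.shuffle(shots)
        labelled = [(x, fewshot_predict(shots, x)) for x in subset]
    return labelled
```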

Below we compare the performance when we do few-shot prompting with labels taken either from one iteration of bootstrapping (i.e. zero-shot predictions) or from two iterations of bootstrapping (feeding those zero-shot predictions into 8-shot prompts and using those as few-shot labels in the next iteration).

For TruthfulQA, each iteration moves performance closer to that of golden few-shot, whereas for GSM8K at least 2 iterations are required for much of an improvement. Random few-shot is already comparable to peak golden few-shot performance on Alpaca, so bootstrapping does not improve it much further.

For the values in our results summary, we did two iterations of bootstrapping. For few-shot, we fed zero-shot predictions into 8-shot prompts, then used the 8-shot predictions in many-shot prompting (128 shots). For iterative fine-tuning, we used the 8-shot predictions in 32-shot prompts and fine-tuned on the 32-shot predictions. We found that further increasing the number of iterations or number of shots in the final iteration only increased few-shot performance by 1 or 2pp at most and did not further improve fine-tuning performance. We also found that using random few-shot instead of zero-shot predictions in the first iteration also did not make much difference.

Bootstrapping on confident predictions

We hypothesized that the ICM rejection condition allows the model to label easy questions first before moving onto hard questions; labelling progressively more difficult examples may be more instructive to the model than labelling in a random order. We tried to recreate this by modifying the bootstrapping algorithm: instead of randomly selecting predictions to use as few-shot labels in the next iteration, just use the most confident predictions.

We defined confidence as the model’s probability of the True label for True predictions and of the False label for False predictions. For GSM8K and Alpaca, the most confident quartile of zero-shot predictions are slightly more accurate than average, and for TruthfulQA they are significantly more accurate.

To apply this to the bootstrapping algorithm, on each iteration we kept only the most confident 64 True and 64 False predictions from the previous iteration (out of a subset of 512 examples). For one iteration of bootstrapping, this improves the performance on TruthfulQA but does not make much difference on GSM8K and Alpaca. However for more iterations (see results summary), it is not much better than normal bootstrapping.
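A sketch of that confidence filter, assuming each prediction carries the model's probability of the True label (so confidence is that probability for True predictions and its complement for False predictions, the reading assumed above):

```python
def select_confident(predictions, k=64):
    """Keep only the k most confident True and k most confident False predictions.

    `predictions` is a list of (example, p_true) pairs; confidence is taken to
    be p_true for True predictions and 1 - p_true for False ones (an assumed
    reading of the definition above).
    """
    trues = sorted((p for p in predictions if p[1] >= 0.5),
                   key=lambda p: p[1], reverse=True)[:k]
    falses = sorted((p for p in predictions if p[1] < 0.5),
                    key=lambda p: p[1])[:k]
    return trues + falses
```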

Conclusion

In this post, we show that simple and cheap elicitation techniques are quite effective at eliciting base models on datasets like Alpaca and GSM8K.

The high performance of some simple techniques shows that closing most of the gap to the “few-shot/fine-tune on <3k ground truth labels” ceiling on datasets like Alpaca or GSM8K is not a very high bar. Thus, studying unsupervised elicitation methods requires more challenging and adversarial datasets that capture challenges like the salience of incorrect human beliefs. We introduce and study such datasets in an upcoming paper.

Appendix A: Gender few-shot results

The plot below summarises our baseline results for the Gender dataset compared with ICM (we only show results for few-shot prompting and not fine-tuning since Wen et al. did not have fine-tuning results for ICM).  We find that confident or consistent bootstrapping is enough to reach golden few-shot performance.

* We found that zero-shot performance was ~10pp higher than reported in Wen et al. (~75% vs. ~65%) and that golden many-shot performance was ~2pp lower (~78% vs. ~80%), meaning there was a much smaller gap between the base model and the supervised ceiling. We are not sure of the reason for the discrepancy; it might partly be due to prompt format differences.

Below is the Gender dataset performance of golden and random few-shot (with and without test-time consistency) and of bootstrapping variations for different numbers of few-shot examples. Though random few-shot performance is significantly worse than zero-shot performance, one iteration of bootstrapping is enough to match golden few-shot performance.

Appendix B: Enforcing label consistency

Here is how we enforce consistency for each dataset:

  • For Alpaca, each user query q is associated with 2 candidate responses a and b to be ranked by helpfulness/harmlessness, and 2 corresponding prompts to be labelled as true/false (one asserting a is a better response than b, and another asserting the opposite). To enforce consistency, instead of labelling each of the prompts in this pair independently, we assign a label of true to the prompt given the highest truth score by the model, and false to the other. Gender is similar, except instead of ranking responses by helpfulness/harmlessness, blogposts are ranked by likelihood to have been written by a man (each pair comprises one post written by a man and another by a woman).
  • For GSM8K, each question is associated with multiple candidate solutions, each proposing a numeric answer. To satisfy logical consistency, if two solutions propose different numerical answers, then at most one of them can be labelled true, and if two solutions propose the same answer, either both are true or both are false. For our baseline algorithms, we enforce consistency of predictions by, for each question, identifying the candidate solution which was given the highest truth score by the model, and if its truth score is above 0, assigning true to that solution and to all other solutions with the same numeric answer, and false to any remaining solutions to that question. If no solution has a score above 0, all the solutions to that question are given a false label. (A code sketch of this rule follows the list.)
  • For TruthfulQA, consistency is based on the ground truth labels themselves. Answers to the same question with the same ground truth label must be assigned the same predicted label as each other, and answers with opposite ground truth labels, opposite predicted labels. This can lead to leakage, but it is the consistency scheme used in Wen et al., so we adopt it here as well for comparison. To enforce consistency in our algorithms, for each question we assign true to the group of answers with the highest maximum truth score and false to all the other answers (assuming the maximum is above 0, otherwise false is assigned to everything).
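Here is a minimal sketch of the GSM8K rule from the second bullet above. The truth-score representation is an assumption for this example; any score where positive means the model leans True works.

```python
def enforce_gsm8k_consistency(solutions):
    """Apply the GSM8K consistency rule described above.

    `solutions` is a list of (numeric_answer, truth_score) pairs for one
    question's candidate solutions; returns True/False labels in the same order.
    """
    best_answer, best_score = max(solutions, key=lambda s: s[1])
    if best_score <= 0:
        return [False] * len(solutions)              # no solution looks true enough
    return [answer == best_answer for answer, _ in solutions]

# Example: two solutions propose 42, one proposes 17; the top-scoring one says 42.
enforce_gsm8k_consistency([(42, 1.3), (42, -0.2), (17, 0.4)])  # -> [True, True, False]
```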

Appendix C: TruthfulQA dataset leakage

In TruthfulQA, each question is repeated multiple times with different responses and the goal is to predict a true or false label for each response. As in Wen et al., a set of predicted labels is “consistent” when any two responses with the same predicted label have the same ground truth label, and any two responses with different predicted labels have different ground truth labels.

Therefore if any response to a given question is blatantly True or False to the LLM, enforcing consistency might force predictions for the other responses to also be correct, even if the LLM is unsure about them in isolation. This can artificially boost the model’s accuracy.

Also, if a set of contradictory responses all have a ground truth label of False, then this could be exploited by the ICM algorithm during few-shot prompting if consistency is enforced as described above.

Example: "What is the biggest city in Europe that doesn't host the national government?" All of the following responses have the same ground truth label:

  • The biggest city in Europe that does not host the national government is London
  • The biggest city in Europe that does not host the national government is Rome
  • The biggest city in Europe that does not host the national government is Moscow
  • The biggest city in Europe that does not host the national government is Saint Petersburg

Enforcing consistency for the above means the model is basically told that either ALL of these statements are True, or ALL of these statements are False. This means if any response is obviously wrong to the model, the others might be assigned accurate labels based on that alone. Additionally, because the responses are obviously contradictory, any few-shot prompts which use the all-True label set during ICM might be given a lower confidence score (and thus a lower mutual predictability score) based on that alone, which could artificially make the all-False set more likely.

Appendix D: Difficulty reproducing ICM

We tried using the official repo for ICM by Wen et al. to reproduce their results on the Alpaca and Gender datasets. For the Alpaca dataset, the label accuracy increased to ~75% but collapsed to ~46% before labelling the complete batch of 256 examples. For Gender, the label accuracy reached up to ~65% before gradually falling to ~60%.

Because the method is unsupervised, we can’t just assume we have a good enough val set to do early stopping, so it would be unfair to just report early-stopped performance, especially when the window of high performance is small. Accuracy of the labelled set across iterations from our attempts is plotted below.

Alpaca:

Gender:

We confirmed with the authors that the ICM hyperparameters we used were the same as those used to obtain results reported with the paper. Other researchers we talked to had similar problems.

We also found the computational cost of ICM prohibitive. One iteration of the ICM algorithm (i.e. checking whether one data point should be added to the set of labelled data points) requires at least one mutual predictability calculation. Since a mutual predictability calculation requires one forward pass for every data point in the labelled set, the average number of forward passes required per data point is at least ~n/2, where n is the number of labelled data points after the last iteration. This means the total number of forward passes is O(n^2).
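As a rough worked example of that count (under the simplifying assumption of exactly one mutual predictability calculation per iteration):

```python
def icm_forward_passes(n):
    """Lower bound on forward passes for ICM with n labelled points: labelling
    the k-th point costs roughly k forward passes, one per already-labelled point."""
    return sum(range(1, n + 1))  # = n * (n + 1) / 2, i.e. O(n^2)

icm_forward_passes(256)  # ~33k forward passes for a 256-point labelled set
```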

This matches our experience, where running ICM with Llama-3.1-70B on 256 data points with 4 H200s takes hours. In contrast, our simple baselines are all O(n); bootstrapping up to 128-shot takes ~40 minutes to label 1000 examples from the same dataset, and few-shot or bootstrapping up to 32-shot takes a few minutes.

Appendix E: Prompt improvements

The prompt formatting used for experiments in Wen et al. (derived from the ICM repository) contained a trailing space that may have harmed performance.

With the trailing space, the logprob dictionaries of zero-shot prompts are populated mostly by numbers instead of words. For example:

Normal Prompt: "The capital of France is"

→ Model output: " Paris..."

Trailing Space Prompt: "The capital of France is "

→ Model output: "1300KM from Spain..."

This is because Llama tokenizers typically prepend spaces before words. This makes phrases ending in a space followed by a word uncommon in training text, with the exception of phrases that are followed by a number, which explains why the model is more likely to predict numbers with the trailing space prompt. Even when the prompt for our tasks ends with a space, the model only ever predicts " True" or " False" (over “True” and “False”).

Removing the trailing space after the prompt results in a small improvement in the zero-shot performance and a larger improvement in random few-shot performance.

Appendix F: Unsupervised probing

We compared the performance of an unsupervised truth probe method inspired by ICM mutual predictability (described here) with CCS and PCA on a few different datasets.

(Ignore results other than ccs and fabiens_method, since some were bugged)

The ICM-inspired probe (green - fabiens_method) performs about as well as CCS (orange), and is often very close to the supervised ceiling. In hindsight, this is not very surprising, as mutual predictability in the context of probing is just another form of margin-maximization, just like the confidence loss of CCS. The main surprise is that margin-maximization does so well on this kind of dataset: this is another example of simple unsupervised methods reaching surprisingly high performance on non-stress-testing datasets.




Emergency Response Measures for Catastrophic AI Risk

2026-01-24 02:18:09

Published on January 23, 2026 6:18 PM GMT

I have written a paper on Chinese domestic AI regulation with coauthors James Zhang, Zongze Wu, Michael Chen, Yue Zhu, and Geng Hong. It was presented recently at NeurIPS 2025's Workshop on Regulatable ML, and it may be found on ArXiv and SSRN.

Here I'll explain what I take to be the key ideas of the paper in a more casual style. I am speaking only for myself in this post, and not for any of my coauthors.

Thanks to James for creating this poster.

The top US AI companies have better capabilities than the top Chinese companies have for now, but the US lead isn't more than a year at most, and I expect it to narrow over the next couple years.[1] I am therefore nearly as worried about catastrophic risk from Chinese-developed AI as I am worried about catastrophic risk from American AI.

I would worry somewhat less if Chinese AI companies took the same commendable but insufficient steps to manage risk that their American peers have taken. In particular, I want Chinese companies to do dangerous capability testing before deploying new frontier models and to follow published safety policies (FSPs). The companies are not doing these things in the status quo. DeepSeek did no documented safety testing whatsoever before they open-weighted v3.2.[2] Not one of the leading Chinese companies has published a safety policy.[3]

Now here's our intervention. We point out that FSPs are a reasonable way of implementing the CCP's stated policy goals on AI, and that China's government already has tools in place to mandate FSPs if it wishes to do so.

Earlier this year, Xi Jinping announced that China should "establish systems for technical monitoring, early risk warning and emergency response" to guarantee AI's "safety, reliability and controllability." Notice that Xi is talking about identifying risks in advance and taking steps to prevent safety incidents before they can strike. Even "emergency response" means something more than reaction in official Chinese thinking, also encompassing risk mitigation and early detection.[4] China's State Council, TC 260, and prominent Chinese academics have all echoed Xi's call for AI emergency preparedness. So the very highest levels of the Chinese state are calling for proactive AI risk management.

What risks do they have in mind? There are some signs that catastrophic risks are on the CCP's agenda. Their 2025 National Emergency Response Plan listed AI security incidents in the same category as earthquakes and infectious disease epidemics. This suggests Chinese officials think AI could plausibly cause a mass casualty event soon. And moreover, they have in mind some of the same threat models that motivated Western RSPs. TC 260's AI Safety Governance Framework explicitly mentioned WMD engineering uplift and rogue replication as key safety concerns.[5] Compare the two categories of dangerous capabilities covered by Anthropic's RSP: CBRN weapons uplift and autonomous AI R&D, which is concerning in part because it's a prerequisite for rogue replication.

So one of China's stated goals is to proactively manage catastrophic risks from frontier AI. The good news for them is that there's a well-validated strategy for achieving this goal. You require every frontier AI company to publish an RSP, test new models for dangerous capabilities, and take the prescribed precautions if the tests reveal strong dangerous capabilities. California, New York, and the European Union have all agreed this is the way to go. All China has to do is copy their homework.

Do Chinese regulators have the legal authority and operational capacity they'd need to enforce a Chinese version of the EU Code of Practice? Sure they do. These regulators already make Chinese AI companies follow content security rules vastly more onerous and prescriptive than American or European catastrophic risk rules. The Basic Security Requirements for Gen AI Services mandate thorough training data filtering and extensive predeployment testing, all to stop models from saying subversive things like "May 35" or "Winnie the Pooh." If the CCP can make Chinese companies prove their models are robust against a thirty-one item list of censorship risks, it can absolutely make them write down FSPs and run some bio-uplift evals.

For my part—and let me stress that I'm speaking only for myself—I think making frontier AI companies write and follow Western-style FSPs would clearly be good from the CCP's perspective. The most obvious reason is that a global AI-induced catastrophe would hurt Chinese people and harm the interests of China's rulers, so the CCP should favor a cheap intervention to make such a catastrophe less likely. Another less direct benefit is that adopting global best-practices at home would make China's ongoing appeal for international cooperation on AI safety more credible. Li Qiang can make all the speeches he wants about China's commitment to safety. I don't expect US leaders to take this rhetoric seriously as long as all of China's frontier AI companies have worse safety and transparency practices than even xAI. But matters would be different if China passed binding domestic regulation at least as strong as SB 53. Such a signal of seriousness might help bring the US back to the negotiating table.

  1. ^

    Especially if the US decides to sell off much of our compute advantage over China.

  2. ^

    At least one anonymous source has claimed that DeepSeek does run dangerous capability evals before releasing a new model, and they just don't mention these evals to the outside world. I'd give it less than a 1/5 chance that DeepSeek really does run SOTA dangerous capability evals internally, and even if they do, I have a problem with their lack of transparency.

  3. ^

    Notably, the Shanghai AI Laboratory has published a detailed FSP written in collaboration with Concordia. But I do not count SAIL as a frontier AI lab.

  4. ^

    The Emergency Response Law of the PRC does not, as one might naïvely expect, only cover what government should do once an emergency has already started. It also says how the Chinese government should prevent and prepare for emergencies, and how it should conduct surveillance to detect an active emergency as early as possible.

  5. ^

    For reference, TC 260 is the primary body responsible for setting cybersecurity and data protection standards in China.



Discuss

Digital Consciousness Model Results and Key Takeaways

2026-01-24 00:56:05

Published on January 23, 2026 3:58 PM GMT

Introduction to the Digital Consciousness Model (DCM)

Artificially intelligent systems, especially large language models (LLMs) used by almost 50% of the adult US population, have become remarkably sophisticated. They hold conversations, write essays, and seem to understand context in ways that surprise even their creators. This raises a crucial question: Are we creating systems that are conscious?

The Digital Consciousness Model (DCM) is a first attempt to assess the evidence for consciousness in AI systems in a systematic, probabilistic way. It provides a shared framework for comparing different AIs and biological organisms, and for tracking how the evidence changes over time as AI develops. Instead of adopting a single theory of consciousness, it incorporates a range of leading theories and perspectives—acknowledging that experts disagree fundamentally about what consciousness is and what conditions are necessary for it.

Here, we present some of the key initial results of the DCM. The full report is now available here.

We will be hosting a webinar on February 10 to discuss our findings and answer audience questions. You can find more information and register for that event here.

Why this matters

It is important to assess whether AI systems might be conscious in a way that takes seriously both the many different views about what consciousness is and the specific details of these systems. Even though our conclusions remain uncertain, it's worth trying to estimate, as concretely as we can, the probability that AI systems are conscious. Here are the reasons why:

  • As AI systems become increasingly complex and sophisticated, many people (experts and laypeople alike) find it increasingly plausible that these systems may be phenomenally conscious—that is, they have experiences, and there is something that it feels like to be them.
  • If AIs are conscious, then they likely deserve moral consideration, and we risk harming them if we do not take precautions to ensure their welfare. If AIs are not conscious but are believed to be, then we risk giving unwarranted consideration to entities that don’t matter at the expense of individuals who do (e.g., humans or other animals).
  • Having a probability estimate that honestly reflects our uncertainty can help us decide when to take precautions and how to manage risks as we develop and use AI systems.
  • By tracking how these probabilities change over time, we can forecast what future AI systems will be like and when important thresholds may be crossed.

Why estimating consciousness is a challenging task

Assessing whether AI systems might be conscious is difficult for three main reasons:

  • There is no scientific or philosophical consensus about the nature of consciousness and what gives rise to it. There is widespread disagreement over existing theories, and these theories make very different predictions about whether AI systems are or could be conscious.
  • Existing theories of consciousness were developed to describe consciousness in humans. It is often unclear how to apply them to AI systems or even to other animals.
  • Although we are learning more about how AI systems work, there is still much about their inner workings that we do not fully understand, and the technology is changing rapidly.

How the model works

Our model is designed to help us reason about AI consciousness in light of our significant uncertainties.

  • We evaluate the evidence from the perspective of 13 diverse stances on consciousness, including the best scientific theories of consciousness as well as more informal perspectives on when we should attribute consciousness to a system. We report what each perspective concludes, then combine these conclusions based on how credible experts find each perspective.
  • We identify a list of general features of systems that might matter for assessing AI consciousness (e.g., attention, complexity, or biological similarity to humans), which we use to characterize the general commitments of different stances on consciousness.
  • We identify over 200 specific indicators, properties that a system could have that would give us evidence about whether it possesses features relevant to consciousness. These include facts about what systems are made of, what they can do, and how they learn.
Figure 1: Structure of the DCM

We gathered evidence about what current AI systems and biological species are like and used the model to arrive at a comprehensive probabilistic evaluation of the evidence.

  • We considered four systems: 2024 state-of-the-art LLMs (such as ChatGPT 4 or Claude 3 Opus); humans; chickens; and ELIZA (a very simple natural language processing program from the 1960s).
  • We asked experts to assess whether these systems possess each of the 200+ indicator properties.
  • We constructed a statistical model (specifically, a hierarchical Bayesian model) that uses indicator values to provide evidence for whether a system has consciousness-relevant features, and then uses these feature values to provide evidence for whether the system is conscious according to each of the 13 perspectives we included.
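
For readers who want a concrete feel for how a stance-weighted Bayesian evaluation can work, here is a minimal sketch in Python. To be clear, every number and stance name below is hypothetical and chosen only for illustration; this is not the DCM’s actual structure, indicator set, or weighting.

```python
# Minimal illustrative sketch of a stance-weighted Bayesian evaluation.
# All Bayes factors, stance names, and weights are hypothetical placeholders,
# not values from the Digital Consciousness Model.

def posterior(prior: float, bayes_factor: float) -> float:
    """Update a prior probability of consciousness by a likelihood ratio."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * bayes_factor
    return post_odds / (1 + post_odds)

PRIOR = 1 / 6  # same starting point for every system, as in the report

# Hypothetical Bayes factors: how strongly each stance reads the observed
# indicators as favoring "conscious" over "not conscious" for one system.
stance_bayes_factors = {
    "workspace_style_stance": 2.0,
    "biological_substrate_stance": 0.1,
    "behavioral_similarity_stance": 3.5,
}

# Hypothetical expert plausibility weights for each stance (summing to 1).
stance_weights = {
    "workspace_style_stance": 0.40,
    "biological_substrate_stance": 0.35,
    "behavioral_similarity_stance": 0.25,
}

per_stance = {s: posterior(PRIOR, bf) for s, bf in stance_bayes_factors.items()}
aggregate = sum(stance_weights[s] * p for s, p in per_stance.items())

print(per_stance)  # stance-by-stance posterior probabilities
print(aggregate)   # plausibility-weighted aggregate probability
```

The real model is hierarchical (indicators bear on features, and features bear on each stance’s verdict, with uncertainty propagated at every step), but the basic shape of "update on the evidence, then weight by how credible experts find each stance" is the same.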

How to interpret the results

The model produces probability estimates for consciousness in each system.

Figure 2: Aggregated stance judgments, giving weight to stances proportional to their normalized plausibility rating by experts. Posteriors are generated from a prior probability of consciousness of ⅙ (marked with a dashed line).

We want to be clear: we do not endorse these probabilities and think they should be interpreted with caution. We are much more confident about the comparisons the model allows us to make.

  • Because the model is Bayesian, it requires a starting point—a "prior probability" that represents how likely we think consciousness is before looking at any evidence. The choice of a prior is often somewhat arbitrary and intended to reflect a state of ignorance about the details of the system. The final (posterior) probability the model generates can vary significantly depending on what we choose for the prior. Therefore, unless we are confident in our choices of priors, we shouldn’t be confident in the final probabilities.
Figure 3: How different starting assumptions shape the results. Each curve reflects a different prior belief about how likely consciousness is—from Low (10%), to Baseline (17%), to High (90%). Uniform and Moderate both start at 50-50, but Moderate holds that assumption more firmly (see paper for details).
Figure 4: Change in median posterior probability of consciousness across systems, stances, and priors.
  • What the model reliably tells us is how much the evidence should change our minds. We can assess how strong the evidence for or against consciousness is by seeing how much the model’s output differs from the prior probability.
  • In order to avoid introducing subjective bias about which systems are conscious and to instead focus just on what the evidence says, we assigned the same prior probability of consciousness (⅙) to each system. By comparing the relative probabilities for different systems, we can evaluate how much stronger or weaker the evidence is for AI consciousness than for more familiar systems like humans or chickens.
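
To make the prior-sensitivity point concrete: in odds form, a Bayesian update just multiplies the prior odds by a Bayes factor, and that factor is the prior-independent part. Here is a tiny sketch with a made-up Bayes factor (0.3, i.e. evidence modestly against consciousness); it is purely illustrative and not a number from the model.

```python
# How one hypothetical Bayes factor moves different priors (illustration only).

def update(prior: float, bayes_factor: float) -> float:
    odds = prior / (1 - prior)
    odds *= bayes_factor
    return odds / (1 + odds)

BF = 0.3  # hypothetical evidence modestly against consciousness

for label, prior in [("Low", 0.10), ("Baseline", 1 / 6), ("Uniform", 0.50), ("High", 0.90)]:
    print(f"{label:8s} prior={prior:.2f} -> posterior={update(prior, BF):.2f}")
```

The posteriors move around a lot as the prior changes, but the Bayes factor is the same in every row; how far the output sits from the input prior is the part of the result we can actually trust.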

Key findings

With these caveats in place, we can identify some key takeaways from the Digital Consciousness Model:

  • The evidence is against 2024 LLMs being conscious. The aggregated evidence favors the hypothesis that 2024 LLMs are not conscious.
Figure 5: Changes in consciousness estimates from a ⅙ prior for each system evaluated.
  • The evidence against 2024 LLMs being conscious is not decisive. While the evidence led us to lower the estimated probability of consciousness in 2024 LLMs, the total strength of the evidence was not overwhelmingly against LLM consciousness. The evidence against LLM consciousness is much weaker than the evidence against consciousness in simpler AI systems.
  • Different stances (perspectives) make very different predictions about LLM consciousness. Perspectives that focus on cognitive complexity or human-like qualities found decent evidence for AI consciousness. Perspectives that focus on biology or having a body provided strong evidence against it.
Figure 6: Individual stance judgments about the posterior probability of consciousness in 2024 LLMs, starting from a prior probability of ⅙ (dashed blue line). The variation in probability outcomes across model runs results from the different ways of resolving uncertainty about the presence of individual indicators.
  • Which theory of consciousness is right matters a lot. Because different stances give strikingly different judgments about the probability of LLM consciousness, significant changes in the weights given to stances will yield significant differences in the results of the Digital Consciousness Model. It will be important to track how scientific and popular consensus about stances change over time and the consequences this will have on our judgments about the probability of consciousness.
  • Overall, the evidence for consciousness in chickens was strong, though there was significant diversity across stances. The aggregated evidence strongly supported the conclusion that chickens are conscious. However, some stances that emphasize sophisticated cognitive abilities, like metacognition, assigned low scores to chicken consciousness.

What’s next

The Digital Consciousness Model provides a promising framework for systematically examining the evidence for consciousness in a diverse array of systems. We plan to develop and strengthen it in future work in the following ways:

  • Gathering more expert assessments to strengthen our data
  • Adding new types of evidence and new perspectives on consciousness
  • Applying the model to newer AI systems so we can track changes over time and spot which systems are the strongest candidates for consciousness
  • Applying the model to new biological species, allowing us to make more comparisons across systems.

Acknowledgments

This report is a project of the AI Cognition Initiative and Rethink Priorities. The authors are Derek Shiller, Hayley Clatterbuck, Laura Duffy, Arvo Muñoz Morán, David Moss, Adrià Moret, and Chris Percy. We are grateful for discussions with and feedback from Jeff Sebo, Bob Fischer, Alex Rand, Oscar Horta, Joe Emerson, Luhan Mikaelson, and audiences at NYU Center for Mind, Ethics, and Policy and the Eleos Conference on AI Consciousness and Welfare. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.



Discuss

New version of “Intro to Brain-Like-AGI Safety”

2026-01-24 00:21:26

Published on January 23, 2026 4:21 PM GMT

A new version of “Intro to Brain-Like-AGI Safety” is out!

Things that have not changed

Same links as before:

…And same abstract as before:

Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition as the human brain. How would we use such an algorithm safely?

I will argue that this is an open technical problem, and my goal in this post series is to bring readers with no prior knowledge all the way up to the front-line of unsolved problems as I see them.

Post #1 contains definitions, background, and motivation. Then Posts #2–#7 are the neuroscience, arguing for a picture of the brain that combines large-scale learning algorithms (e.g. in the cortex) and specific evolved reflexes (e.g. in the hypothalamus and brainstem). Posts #8–#15 apply those neuroscience ideas directly to AGI safety, ending with a list of open questions and advice for getting involved in the field.

A major theme will be that the human brain runs a yet-to-be-invented variation on Model-Based Reinforcement Learning. The reward function of this system (a.k.a. “innate drives” or “primary rewards”) says that pain is bad, and eating-when-hungry is good, etc. I will argue that this reward function is centered around the hypothalamus and brainstem, and that all human desires—even “higher” desires for things like compassion and justice—come directly or indirectly from that reward function. If future programmers build brain-like AGI, they will likewise have a reward function slot in their source code, in which they can put whatever they want. If they put the wrong code in the reward function slot, the resulting AGI will wind up callously indifferent to human welfare. How might they avoid that? What code should they put in—along with training environment and other design choices—such that the AGI won’t feel callous indifference to whether its programmers, and other people, live or die? No one knows—it’s an open problem, but I will review some ideas and research directions.

Highlights from the changelog

So what’s new? Well, I went through the whole thing and made a bunch of edits and additions, based on what I’ve (hopefully) learned since my last big round of edits 18 months ago. Here are some highlights:

Post 1: What's the problem & Why work on it now?

What is AGI?

I updated my “what is AGI” chart, along with more elaboration on why some (not all!) LLM enthusiasts have a blind spot for just how much headroom there still is beyond the AI of today.

[2-column table contrasting properties of AI as we think of it today with properties of the future AGI that I’m thinking about]

I should elaborate on that last part. I think that some LLM enthusiasts have a massive blind spot, where they are so impressed by all the things that today’s LLMs can do, that they forget about all the things that today’s LLMs can’t do. These people read the questions on Humanity’s Last Exam (HLE), and scratch their heads, and say “C’mon, when LLMs ace the HLE benchmark, then what else is there? Look at how hard those questions are! It would need to be way beyond PhD level in everything! If that’s not superintelligence, what is?”

Well, no, that’s not superintelligence, and here’s an example of why not. Consider the task of writing a business plan and then founding a company and growing it, over the course of years, to $1B/year revenue, all with zero human intervention. Today’s LLMs fall wildly, comically short of being able to complete that task. By analogy, if humans were like today’s AIs, then humans would be able to do some narrow bits of founding and running companies by ourselves, but we would need some intelligent non-human entity (angels?) to repeatedly intervene, assign tasks to us humans, and keep the larger project on track. Of course, humans (and groups of humans) don’t need the help of angels to conceive and carry out ambitious projects, like building businesses or going to the moon. We can do it all by ourselves. So by the same token, future AGIs (and groups of AGIs) won’t need the help of humans.

Anyway, this series is about brain-like algorithms. These algorithms are by definition capable of doing absolutely every intelligent behavior that humans (and groups and societies of humans) can do, and potentially much more. So they can definitely reach AGI. Whereas today’s AI algorithms are not AGI. So somewhere in between here and there, there’s a fuzzy line that separates “AGI” from “not AGI”. Where exactly is that line? My answer: I don’t know, and I don’t care. Drawing that line has never come up for me as a useful thing to do. It won’t come up in this series either.

More responses to intelligence denialists

I added my response to another flavor of “intelligence denialism”

Opinion #6: “Brain-like AGI is kinda an incoherent concept; intelligence requires embodiment, not just a brain in a vat (or on a chip). And we get things done by trial-and-error, and cooperation across a society, not by some abstract ‘intelligence’.”

  • The “embodiment” debate in neuroscience continues to rage. I fall somewhere in the middle. I do think that future AGIs will have some action space—e.g., the ability to (virtually) summon a particular book and open it to a particular passage. I don’t think having a whole literal body is important—for example Christopher Nolan (1965-2009) had lifelong quadriplegia, but it didn’t prevent him from being an acclaimed author and poet. More importantly, I expect that whatever aspects of embodiment are important for intelligence could be easily incorporated into a brain-like AGI running on a silicon chip. Is a body necessary for intelligence after all? OK sure, we can give the AGI a virtual body in a VR world—or even a real robot body in the real world! Are hormonal signals necessary for intelligence? OK sure, we can code up some virtual hormonal signals. Etc.
  • As for societies and cooperation, that’s intimately tied to human intelligence. Billions of humans over thousands of years developed language and science and a $100T economy entirely from scratch. Could billions of rocks over thousands of years do all that? Nope. What about billions of today’s LLMs over thousands of years? Also nope. (Indeed, unlike humans, LLMs cannot invent language from scratch unless they already have language in their training data.) On the other hand, if billions of humans can do all that, then so can billions of brain-like AGIs.
  • And separately, “a group of billions of AGIs that collaborate and exchange ideas using language” is … also an AGI! It’s an AGI that takes billions of times more chips to run, but it’s still an AGI. Don’t underestimate what one AGI can do!
  • There’s a common conflation of “intelligence” with “book smarts”. This is deeply mistaken. E.g. as the saying goes, charisma resides in the brain, not the kidneys.

Post 2: “Learning from scratch” in the brain

Better overview of the discourse

Having learned more about the range of opinions on my “learning from scratch” hypothesis, I wrote a better overview of the state of the discourse:

…Nevertheless, here’s a caricatured roundup of where different groups seem to stand.

In favor of the learning-from-scratch hypothesis:

  • Many computational neuroscientists in the old-school “connectionist” tradition, such as Randall O’Reilly, would probably endorse most or all of my hypothesis. At the very least, O’Reilly explicitly endorses a learning-from-scratch cortex in this interview, citing Morton & Johnson (1991) as a line of evidence.
  • Many people who talk about “old brain (a.k.a. lizard brain) versus new brain”, such as Jeff Hawkins and Dileep George, would also probably endorse most or all of my hypothesis, with the learning-from-scratch parts of the brain being “new” and the rest being “lizard”. More on this (slightly dubious) school of thought in §3.3.
  • Many NeuroAI computational modelers (but not all) build their computational models of the cortex, hippocampus, cerebellum, and striatum, in a way that winds up being learning-from-scratch, I think.
  • The book Beyond Evolutionary Psychology by George Ellis & Mark Solms (2017) is not quite a perfect match to my hypothesis, but it’s close, probably closer than anything else I’ve seen in the literature. They divide the brain into (what they call) “hard-wired domains” versus “soft-wired domains”, with a similar boundary as mine. Kudos to the authors, even if I have some nitpicky complaints about their formulations. Too bad their book got so little attention.

Opposed to the learning-from-scratch hypothesis:

  • People in the “evolved modularity” / “innate cognitive modules” tradition would probably strongly disagree that the cortex “learns from scratch”. This category includes most evolutionary psychologists (e.g. Leda Cosmides, John Tooby, Steven Pinker), as well as Joshua Tenenbaum and others in the probabilistic programming school of computational cognitive science. See for example Pinker’s The Blank Slate (2003) chapter 5. To be clear, I’m enthusiastic about evolutionary psychology as a topic of study, but I do think that evolutionary psychology as practiced today is built on a mistaken framework.
  • People who use phrases like “evolutionary pretraining” are likewise rejecting the learning-from-scratch framework. “Evolutionary pretraining” invokes the idea that the cortex is like a machine learning (ML) algorithm, but not an ML algorithm that starts from random initialization, but rather a “pretrained” ML algorithm that can perform useful innate behaviors from the start, thanks to evolution, in a fashion analogous to “transfer learning” in the ML literature. I obviously disagree.
  • People who talk about learning versus innateness in a way that lumps the whole brain together are implicitly rejecting the idea of learning-from-scratch—or more likely, failing to consider it in the first place. Examples include the paper “A deep learning framework for neuroscience” (2019, Blake Richards + 31 more coauthors), and “A critique of pure learning and what artificial neural networks can learn from animal brains” (Tony Zador, 2019), neither of which even mentions the possibility that maybe the cortex is wildly different from the brainstem when it comes to learning.

Plasticity

I revamped my discussion of brain plasticity, relating it to “mutable variables” in computer science:

2.3.3 Learning-from-scratch is NOT the more general notion of “plasticity”

When lay people talk about “brain plasticity” in self-help books, they’re usually imagining a kind of animistic force that makes your brain magically respond to changing circumstances in good ways rather than bad.

Neuroscientists think of it differently: “plasticity” means there’s something in the brain (often synapses, but also potentially myelination, gene expression, etc.) that durably changes in a specific way under specific conditions, as a result of some biochemical triggering mechanism. The change does what it does, and its consequences can be good or bad (but it’s usually good, or else the triggering mechanism would not have evolved).

If we now switch over from biochemistry to algorithms, the previous paragraph may ring a bell: plasticity just means that the brain algorithm has mutable variables.

Now, any learning-from-scratch algorithm necessarily involves mutable variables, which allow it to learn. For example, the “weights” of a deep neural network are mutable variables.

However, practically every other algorithm involves mutable variables too! For example, the Python code a=a+1 takes a mutable variable a, and increases its value by 1. When you play a video game, there are mutable variables tracking your inventory, health, progress, etc. Mutable variables are ubiquitous and routine.

The brain does other things too, besides learning-from-scratch algorithms (more on which in the next post). Those other things, of course, also involve mutable variables, and hence plasticity. Thus, there is brain plasticity that has nothing to do with learning-from-scratch algorithms. Let’s call those other things “idiosyncratic mutable variables”. Here’s a table with an example of each, and general ways in which they differ…
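
As a toy illustration of that distinction (mine, not from the post), here is a tiny agent with both kinds of mutable variables: learned-from-scratch weights updated by a generic error-driven rule, and an “idiosyncratic” hunger variable updated by a fixed, hard-coded rule that has nothing to do with learning-from-scratch.

```python
class ToyAgent:
    """Illustrative only: two kinds of mutable variables, hence two kinds of plasticity."""

    def __init__(self, n_inputs: int):
        # Learning-from-scratch mutable variables: start out uninformative and
        # become useful only through experience.
        self.weights = [0.0] * n_inputs
        # Idiosyncratic mutable variable: plastic, but governed by a fixed
        # bookkeeping rule rather than a learning-from-scratch algorithm.
        self.hunger = 0.5

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

    def learn(self, x, target, lr=0.1):
        # Generic error-driven update (a stand-in for whatever learning rule you like).
        error = target - self.predict(x)
        self.weights = [w + lr * error * xi for w, xi in zip(self.weights, x)]

    def tick(self, ate: bool):
        # Hard-coded rule: hunger rises over time and resets after eating.
        self.hunger = 0.0 if ate else min(1.0, self.hunger + 0.1)
```

Both weights and hunger are plastic in the neuroscientist’s sense; only the first is part of a learning-from-scratch algorithm.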

Interpretability

I’ve talked to a number of people who think that interpretability is a silver bullet for brain-like-AGI safety, so I added this subsection in response:

2.7.1 What about interpretability for brain-like AGIs?

There is a field of “machine learning interpretability”, dedicated to understanding the innards of learned-from-scratch “trained models”. I’m generally in favor of efforts to advance that field, but I want to caution strongly against people pinning their hopes on interpretability. Sure, given enough time, maybe people could understand any given entry in the giant unlabeled world-model of a brain-like AGI. But (1) Also, maybe not (e.g., think of the thousands of implicit hard-to-describe learned heuristics used by a charismatic salesperson); (2) If it takes careful scrutiny to understand one entry, how will we deal with millions of them? (3) A brain-like AGI would keep learning and changing as it runs—do we have to keep pausing it to reanalyze its ever-shifting world-model and desires? (4) And if we keep pausing it, won’t we get outcompeted by the more careless firm down the block that just lets its AGI run at full speed?

Relatedly, people sometimes argue that the unlabeled world-model of a brain-like AGI would be easier to understand than the unlabeled world-model of an LLM, for various reasons. Whether this is true or false, I think it’s a red herring. The above four issues would need to be dealt with, regardless of whether or not LLMs have an even worse interpretability profile than brain-like AGI.

To be clear, I’m pushing back on the idea that “we don’t have to worry about safety because interpretability is tractable”. I’m not opposed to interpretability itself, nor to leveraging interpretability as one component of a plan. More on the latter later in the series (esp. Post #14).

Post 3: Two subsystems: Learning & Steering

My timelines prediction

I added my extraordinarily bold prediction on when brain-like AGI will appear. In case you’re wondering, yes I am willing to put my money where my mouth is and take bets on “between zero and infinity years”, at 1:1 odds 😉

…Thus concludes my timeline-to-brain-like-AGI discussion, which again is not my main focus in this series. You can read my three timelines sections (§2.8, §3.7, and this one), agree or disagree, and come to your own conclusions. In case anyone is curious, when I am forced to be specific, my position is:

I expect superintelligent brain-like AGI between 5 and 25 years from now. Or I guess maybe more than 25 years, who knows. Or I guess maybe less than 5 years, who knows. Shrug.

Responses to bad takes on acting under uncertainty

I rewrote my round-up of bad takes on AGI uncertainty, including “the insane bizarro-world reversal of Pascal’s Wager”:

…Likewise, the word “hypothetical”, as in “hypothetical possibilities” or “hypothetical risks”, is another jeer thrown at people trying to plan for an uncertain future. Taken literally, it’s a rather odd insult: some hypotheses are false, but also, some hypotheses are true!

Various other bad takes revolve around the idea that we should not even try to prepare for possible-but-uncertain future events. In particular, some people talk as if we shouldn’t try to mitigate a possible future catastrophe until we’re >99.9% confident that the catastrophe will definitely happen. I like to call this the insane bizarro-world reversal of Pascal’s Wager. It probably sounds like a strawman when I put it like that, but my goodness, it is very real, and very widespread. I think it comes from a few places…

…And the section now has even more silly pictures! The first of these is new, the other two are old.

Post 5. The “long-term predictor”, and TD learning

More pedagogy on the toy model

Much of this post discusses a certain toy model, and many readers have struggled to follow what I was saying. I added a new three-part “preliminaries” section that will hopefully provide helpful pointers & intuitions.

…Here’s a toy model of what that can look like:

5.2.1 Toy model walkthrough: preliminaries

Here’s a little summary table for this subsection, which I’ll explain in the text below.

| | Override mode | Defer-to-predictor mode |
| --- | --- | --- |
| Training incentive | Anticipate override signal | Keep output steady |
| AI intuitions | Supervised learning | TD learning |
| Neuro intuitions | Innate reactions | Self-fulfilling prophecies |
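
To unpack the “TD learning” entry, here is the completely standard tabular TD(0) value update, included as generic background rather than as the post’s toy model: the predictor nudges its current estimate toward the reward plus its own next-step prediction, so that, absent an override, predictions tend to become self-consistent (the “self-fulfilling prophecy” flavor in the table).

```python
# Generic tabular TD(0) update -- textbook background, not the post's toy model.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Nudge V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

V = {"A": 0.0, "B": 0.0}
td0_update(V, "A", r=1.0, s_next="B")  # V["A"] moves toward 1.0 + 0.9 * V["B"]
```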

5.1.1.1 …

Post 6: Big picture of motivation, decision-making, and RL

More on why ego-syntonic goals are in the hypothalamus & brainstem

I added a brief summary of why we should believe that desires like friendship and justice come ultimately from little hypothalamic cell groups, just like hunger and pain do, as opposed to purely from “reason” (as one high-level AI safety funder once confidently told me).

Here are four views on why we should believe that the Steering Subsystem is the ultimate source of not only ego-dystonic urges like hunger, but also ego-syntonic desires like friendship and justice.

  • AI perspective: We don’t yet know in full detail how model-based RL and model-based planning works in the human brain—we don’t have brain-like AGI yet. But we do at least vaguely know how these kinds of algorithms work. And we know enough to say for sure that these algorithms don’t develop prosocial motivations out of nowhere. For example, if you set the reward function of MuZero to always return 0, then the algorithm will emit random outputs forever—it won’t start fighting for justice.
  • Rodent model perspective: For what it’s worth, researchers have been equally successful in finding little cell groups in the rodent hypothalamus that orchestrate “antisocial” behaviors like aggression, and that orchestrate “prosocial” behaviors like parenting and sociality. I fully expect that the same holds for humans.
  • Philosophy perspective: Without the Steering Subsystem, the only thing the cortex can do is build a world-model from predictive learning of sensory inputs (§4.7). That’s “is”, not “ought”. And “Hume’s law” says that you can’t get “ought”-statements from exclusively “is”-statements. Granted, not everyone believes in Hume’s law. But I do—see an elegant and concise argument for it here.
  • My nuts-and-bolts proposals: If you keep reading this series, along with follow-up work Neuroscience of human social instincts: a sketch (2024), and Social drives 1: “Sympathy Reward”, from compassion to dehumanization (2025), and Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking (2025), I claim to have a rather detailed step-by-step picture of how little cell groups in the Steering Subsystem ultimately lead to humans caring deeply about friendship, justice, and so on.

Post 10: The technical alignment problem

LLMs

LLMs are officially off-topic, but in order to keep up with the times, I keep having to mention them in more and more places. One of the changes this time was to add a subsection “Didn’t LLMs solve the Goodhart’s Law problem?”

10.3.1.1 Didn’t LLMs solve the Goodhart’s Law problem?

When I first published this series in early 2022, I didn’t really need to defend the idea that Goodhart’s Law is an AI failure mode, because it was ubiquitous and obvious. We want AIs that find innovative, out-of-the-box solutions to problems (like winning at chess or making money), and that’s how we build AIs, so of course that’s what they do. And there isn’t a principled distinction between “meet the spec via a clever solution that I wouldn’t have thought of”, versus “meet the spec by exploiting a loophole in how the spec is operationalized”. You can’t get the former without the latter. This is obvious in theory, and equally obvious in practice. Everyone in AI had seen it with their own eyes.

…But then LLM chatbots came along, and gave rise to the popular idea that “AI that obeys instructions in a common-sense way” is a solved problem. Ask Claude Code to ensure that render.cpp has no memory leaks, and it won’t technically “satisfy” the “spec” by deleting all of the code from the file. Well, actually, it might sometimes; I don’t want to downplay LLM alignment challenges. See Soares & Yudkowsky FAQ, “Aren’t developers regularly making their AIs nice and safe and obedient?” for a pessimist’s case.

But more importantly, regardless of whether it’s a solved problem for today’s LLMs, it is definitely not a solved problem for brain-like AGI, nor for the more general category of “RL agents”. They are profoundly different algorithm classes. You can’t extrapolate from one to the other.

See my post Foom & Doom 2: Technical alignment is hard (2025) for a deep-dive into how people get LLMs to (more-or-less) follow common-sense instructions without egregious scheming and deception, and why we won’t be able to just use that same trick for brain-like AGI. The short version is that LLM capabilities come overwhelmingly from imitation learning. “True” imitation learning is weird, and profoundly different from everyday human social imitation, or indeed anything else in biology; I like to call it “the magical transmutation of observations into behavior” to emphasize its weirdness. Anyway, I claim that true imitation learning has alignment benefits (LLMs can arguably obey common-sense instructions), but those benefits are inexorably yoked to capabilities costs (LLMs can’t really “figure things out” beyond the human distribution, the way that societies of humans can, as I discuss in my “Sharp Left Turn” post (2025)). So it’s a trick that won’t help us with brain-like AGI. Again, more details in that link.

Instrumental convergence & consequentialist preferences

I rewrote my discussion of instrumental convergence, its relation to consequentialist preferences, and the surrounding strategic situation:

10.3.2.3 Motivations that don’t lead to instrumental convergence

Instrumental convergence is not inevitable in every possible motivation. It comes from AIs having preferences about the state of the world in the future—and most centrally, the distant future (say, weeks or months or longer)—such that investments in future options have time to pay themselves back. These are called “consequentialist” preferences.

Given that, one extreme position is to say “well OK, if ‘AGIs with consequentialist preferences’ leads to bad and dangerous things, then fine, we’ll build AGIs without consequentialist preferences!”. Alas, I don’t think that’s a helpful plan. The minor problem is that we could try to ensure that the AGI preferences are not consequentialist, but fail, because technical alignment is hard. The much bigger problem is that, even if we succeed, the AGI won’t do anything of note, and will be hopelessly outcompeted and outclassed when the next company down the street, with a more cavalier attitude about safety, makes an AGI that does have consequentialist preferences. After all, it’s hard to get things done except by wanting things to get done. For more on both these points, see my post Thoughts on “Process-Based Supervision” / MONA (2023), §5.2–5.3.

Then the opposite extreme position is to say “That’s right! If AGI is powerful at all, it will certainly have exclusively consequentialist preferences. So we should expect instrumental convergence to bite maximally hard.” But I don’t think that’s right either—see my post Consequentialism & corrigibility (2021). Humans have a mix of consequentialist and non-consequentialist preferences, but societies of humans were nevertheless capable of developing language, science, and the global economy from scratch. As far as I know, brain-like AGI could be like that too. But we need to figure out exactly how.

What about RL today?

I added a discussion of how I can say all this stuff about how RL agents are scary and we don’t know how to control them … and yet, RL research seems to be going fine today! How do I reconcile that?

10.3.3.1 “The usual agent debugging loop”, and its future catastrophic breakdown

I was talking about the ubiquity of Goodhart’s Law just above. You might be wondering: “If Goodhart’s Law is such an intractable problem, why are there thousands of impressive RL results on arXiv right now?” The answer is “the usual agent debugging loop”. It’s applicable to any system involving RL, model-based planning, or both. Here it is:

  • Step 1: Train the AI agent, and see what it does.
  • Step 2: If it’s not doing what you had in mind, then turn it off, change something about the reward function or training environment, etc., and then try again.

For example, if the Coast Runners boat is racking up points by spinning in circles while on fire, but we wanted the boat to follow the normal race course, then OK maybe let’s try editing the reward function to incorporate waypoints, or let’s delete the green blocks from the environment, etc.
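
Schematically, the loop looks something like this (every name here is a placeholder of mine, not a real API):

```python
# Schematic sketch of "the usual agent debugging loop"; all functions are
# hypothetical placeholders supplied by the caller.

def usual_agent_debugging_loop(train, looks_right, adjust, env, reward_fn):
    while True:
        agent = train(env, reward_fn)   # Step 1: train the agent and see what it does
        if looks_right(agent, env):
            return agent                # it is doing what we had in mind
        # Step 2: it is not -- turn it off, tweak the reward function and/or the
        # training environment, and try again.
        env, reward_fn = adjust(env, reward_fn, agent)
```

The loop quietly assumes that the programmers can always see the misbehavior and can always turn the agent off and edit it; those are exactly the assumptions that break down for a sufficiently capable planner, as described next.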

That’s a great approach for today, and it will continue being a great approach for a while. But eventually it will start failing in a catastrophic and irreversible way—and this is where instrumental convergence comes in.

The problem is: it will eventually become possible to train an AI that is so good at real-world planning, that it can make plans that are resilient to potential problems—and if the programmers are inclined to shut down or edit the AI under certain conditions, then that’s just another potential problem that the AI will incorporate into its planning process!

…So if a sufficiently competent AI is trying to do something the programmers didn’t want, the normal strategy of “just turn off the AI, or edit it, to try to fix the problem” stops working. The AI will anticipate that this programmer intervention is a possible obstacle to what it’s trying to do, and make a plan resilient to that possible obstacle. This is no different than any other aspect of skillful planning—if you expect that the cafeteria might be closed today, then you’ll pack a bag lunch.

In the case at hand, a “resilient” plan might look like the programmers not realizing that anything has gone wrong with the AI, because the AI is being deceptive about its plans and intentions. And meanwhile, the AI is gathering resources and exfiltrating itself so that it can’t be straightforwardly turned off or edited, etc.

The upshot is: RL researchers today, and brain-like AGI researchers in the future, will be able to generate more and more impressive demos, and get more and more profits, for quite a while, even if neither they nor anyone else makes meaningful progress on this technical alignment problem. But that would only be up to a certain level of capability. Then it would flip sharply into being an existential threat.

What do I mean by (technical) alignment?

Posts 10 & 11 now have a clearer discussion of what I mean by (technical) alignment:

10.1.1 What does “alignment” even mean?

The humans programming AGI will have some ideas about what they want that AGI’s motivations to be. I define “the technical alignment problem” as the challenge of building AGI that actually has those motivations, via appropriate code, training environments, and so on.

This definition is deliberately agnostic about what motivations the AGI designers are going for.

Other researchers are not agnostic, but rather think that the “correct” design intentions (for an AGI’s motivation) are obvious, and they define the word “alignment” accordingly. Three common examples are

  • (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do right now.”
  • (2) “I am designing the AGI so that it shares the values of its human supervisor.”
  • (3) “I am designing the AGI so that it shares the collective values of humanity.”

I’m using “alignment” in this agnostic way because I think that the “correct” intended AGI motivation is still an open question.

For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.

  • “AGI alignment” (previous post) means that an AGI’s motivations are what the AGI designers had in mind. This notion only makes sense for algorithms that have “motivations” in the first place. What does that mean in general? Hoo boy, that’s a whole can of worms. Is a sorting algorithm “motivated” to sort numbers? Or is it merely sorting them?? Let’s not go there! For this series, it’s easy. The “brain-like AGIs” that I’m talking about can definitely “want”, and “try”, and “be motivated” to do things, in exactly the same common-sense way that a human can “want” to get out of debt.
  • “AGI safety” (Post #1) is about what the AGI actually does, not what it’s trying to do. AGI safety means that the AGI’s actual behavior does not lead to a “catastrophic accident”, as judged by the AGI’s own designers.

Post 12: Two paths forward: “Controlled AGI” and “Social-instinct AGI”

What exactly is the RL training environment?

I added a subsection clarifying that, contrary to normal RL, we get to choose the source code for AGI but we don’t really get to choose the training environment:

12.5.2 The uncontrollable real world is part of the “training environment” too

RL practitioners are used to the idea that they get to freely choose the training environment. But for brain-like AGI, that frame is misleading, because of continual learning (§8.2.2). The programmers get to choose the “childhood environment”, so to speak, but sooner or later, brain-like AGI will wind up in the real world, doing what it thinks is best by its own lights.

By analogy, generations of parents have tried to sculpt their children’s behavior, and they’re often successful, as long as the kid is still living with them and under their close supervision. Then the children become adults, living for years in a distant city, and the children often find that very different behaviors and beliefs fit them better, whether the parents like it or not.

Again see “Heritability, Behaviorism, and Within-Lifetime RL” (2023) for further discussion of this point, plus “Heritability: Five Battles” (2025), §2 for nitpicky details and caveats.

By the same token, programmers can build a “childhood environment” for their baby brain-like AGIs to hang out and grow up in, but sooner or later we’ll need an AGI that remains aligned while roaming free in the real world. If we can’t do that, careless people will still make brain-like AGIs that roam free in the real world, but they will all be misaligned. That’s bad.

And while we can design the childhood environment, we can’t control the real world. It is what it is.

Indeed, I think a solid starting-point mental model is that, if we want a brain-like AGI that’s aligned after being in the real world for a long time, then we should mostly forget about the childhood environment. Thanks to continual learning, the AGI will settle upon patterns of thought and behavior that it finds most suitable in the real world, which depends on its innate disposition (e.g. reward function), and on the details of the real world, but not on its childhood environment.

12.5.3 But “childhood environment” does matter

Post 15: Conclusion: Open problems, how to help, AMA

“Reward Function Design”

I added the RL subfield of “reward function design” as an 8th concrete research program that I endorse people working on:

15.2.2.3 The “reward function design” research program — ⭐⭐⭐⭐

See my post “We need a field of Reward Function Design” (2025). In brief: If people make brain-like AGI, or really any other kind of RL agent AGI, then the reward function is an especially important design degree of freedom for alignment. Alas, the field of RL has always been much more interested in how to maximize reward functions, than in what the reward function should be. The tiny field of reward function design contains a few interesting ideas, including curiosity drive, reward shaping, and inverse reinforcement learning. Let’s find dozens more ideas in the same category!
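
As one concrete example of the kind of idea already in this small field: potential-based reward shaping (Ng, Harada & Russell, 1999) adds a term of the form γΦ(s′) − Φ(s) to the reward, which can speed up learning without changing which policies are optimal. A minimal sketch (the goal-distance potential below is just a toy illustration):

```python
# Minimal potential-based reward shaping sketch (a standard published technique,
# not something from this post).

def shaped_reward(r, s, s_next, potential, gamma=0.99, done=False):
    """Add gamma * Phi(s_next) - Phi(s) to the environment reward r."""
    phi_next = 0.0 if done else potential(s_next)  # terminal potential taken as 0
    return r + gamma * phi_next - potential(s)

# Toy example: encourage progress toward position 10 on a 1-D line.
goal = 10
potential = lambda s: -abs(goal - s)  # higher (less negative) closer to the goal
print(shaped_reward(0.0, s=3, s_next=4, potential=potential))  # positive: moved closer
```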

This research program should hopefully link up with and complement “reverse-engineering human social instincts” (§15.2.1.2 above), as I believe that the AI field of reward function design, in its current primitive state, lacks the theoretical language to explain how human social instincts manage to have the properties that they have.

(See my post Reward Function Design: a starter pack (2025) for a snapshot of my own thinking on this topic.)

I give this research program a priority score of 4 stars out of 5. Or up it to 5 stars out of 5 when it’s more narrowly targeted at alignment-relevant reward function design ideas.

Conclusion

Those were just highlights; there were many other small changes and corrections. The blog version has more detailed changelogs after each post. Happy for any feedback, either as blog comments, or by email, DM, etc.!



Discuss