
Can AI learn human societal norms from social feedback (without recapitulating all the ways this has failed in human history?)

2026-01-03 06:11:40

Published on January 2, 2026 10:11 PM GMT

tl;dr: this is Part 3[1] of a raw and unfiltered brain dump of the notes I jotted down while attending NeurIPS and its adjacent workshops in December. None of it has been thought through deeply, it's not carefully written and there are no pretty pictures. But I won’t have time to research or refine these ideas in the next 6 months, so I figured I’d throw them against the wall in case there’s a useful nugget in here someone else can run with.

Epistemic status: I have only a non-expert understanding of the anthropology or anthropogeny or primatology of social norm enforcement. I have an extremely naive, minimal grasp of how AI models work or of past/current work in the field of AI alignment.

AI alignment context 

The following thoughts were not about the risk that superhuman AGI could become evil. They were more about the risk that insufficiently-aligned or rogue AIs could be used by bad actors to do bad things, or used by well-meaning humans in such a way as to eventually convince those humans to do things many of us think are bad (whether evil, like murder, or unfortunate, like suicide). These ramblings stemmed from the starting thought that social feedback is an important mechanism for keeping human individuals in line with their own cultures, and keeping world cultures at least roughly compatible with the overall requirements of human life on this planet.

Federated fine-tuning 

To continuously steer AI models toward “socially acceptable” behavior and “cultural norms”, there's the idea that a subset of their weights be subject to continual weak updating by human feedback from all users, via anonymous weight gradient data returned to the model provider. 
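A minimal sketch of what one round of this kind of federated averaging over a designated subset of weights could look like; everything here (the per-user clipping, the learning rate, the parameter names) is an illustrative assumption rather than a specification:

```python
import numpy as np

def aggregate_user_feedback(user_grads, clip_norm=1.0):
    """Average anonymized per-user gradients for the updatable weight subset.

    Clipping each user's contribution bounds any single user's influence on
    the shared update (and is the natural place to add privacy noise).
    """
    clipped = []
    for g in user_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    return np.mean(clipped, axis=0)

def weak_update(alignment_weights, user_grads, lr=1e-4):
    """One round of deliberately small ('weak') continual updating."""
    return alignment_weights - lr * aggregate_user_feedback(user_grads)
```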

To mitigate the concern about who will decide what those values and norms should be, there's the idea of taking a pluralist approach, in which there are many distinct community-updated models, such that different communities (shared models) could emerge with different social norms and beliefs. 

To mitigate the concern about emergence of bubbles that are misaligned with the rest of humanity ("extremist groups"), there could still be some degree of all-humanity-level feedback to all models. That might serve to keep all the distinct cultural models bounded within some range tolerable to, or universalizable to, all other world cultures[2]. For example, a sub-culture's AI model could strongly endorse specific religious observances or social etiquette not shared with other groups; but perhaps would still be ‘cognizant’ of the fact that these norms are not universal. Perhaps all-humanity feedback would make it quite difficult for any one model to flatly endorse widely condemned actions like terrorism or genocide of other groups; or at minimum, such a deviant model could not escape ‘knowing’ that it was regarded as such by most other groups.  To be clear, I'm not claiming that any of those benefits would in fact accrue from the hierarchical federated alignment scheme suggested here. I'm just saying, this is an idea.
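A sketch of how the hierarchical, pluralist variant could combine signals, assuming a single made-up mixing knob (`global_weight`) that pulls every community model weakly toward an all-humanity aggregate:

```python
def hierarchical_update(community_weights, community_grad, humanity_grad,
                        lr=1e-4, global_weight=0.1):
    """Blend community-level feedback with a weak all-humanity signal.

    global_weight = 0 gives fully independent community models;
    global_weight = 1 collapses all communities into one shared model.
    Intermediate values are the 'bounded pluralism' gestured at above.
    """
    mixed_grad = (1 - global_weight) * community_grad + global_weight * humanity_grad
    return community_weights - lr * mixed_grad
```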

Why collective alignment training seems risky

People disagree about ethics

Alignment tuning necessarily entails committing to some particular ethical values, and we should not pretend otherwise. Endorsing and enforcing any one particular value hierarchy would be perceived as hostile by everyone who adheres to any other ethical system. Even if it is the case that some ethical systems are objectively better than others, or that one ethical system is objectively the correct one, humans are far from consensus on what that is. Any strong claim of a universal morality seems to immediately elicit fears of dictatorial enforcement, persecution, and other bad stuff typically done by past bad actors who thought they had The One True Belief. Therefore, AI alignment work will probably be best received if it sticks to very basic norms that are likely to get very broad buy-in[3].

However, numerical majority or popularity are not good criteria for deciding what is right. The mob doesn't have a much better track record than megalomaniac dictators on this count. Finding the lowest common denominator that nobody can object to typically leads to rather weak principles and unfortunate compromises. So polling world users for their opinions is hardly an ideal solution. Still, global and community norms seem like the main way we have of building a consensus on core ethical rules everyone is willing to be held to. I have heard that some anthropologists claim there is a core set of universal moral laws or social norms found across all cultures. While I suspect those rules are typically only applied within-tribe (i.e. "thou shalt not steal" means "thou shalt not steal from other members of our tribe"), taking these to be universal rules might go over OK with most people. A Veil of Ignorance or Universalizability criterion might also get broad buy-in.

Social constructivism is bad

More importantly, I absolutely would not want AI models to enforce collectivist or social epistemics. I know this is a specific, controversial philosophical stance, but I presume it is one I share with anyone who considers themselves a rationalist. The idea that truth is, by definition, "whatever everyone thinks" or, effectively, "whatever my own tribe endorses" is already a growing and disturbing tendency in the world, and in my view a major threat to the future of humanity. It would be bad if AI models became brainwashers, either within or across subcultures, as an inadvertent side consequence of being fine-tuned by social feedback for alignment.

It is important that AI models always allow[4] individual users or fringe groups to develop and adhere to novel or controversial ideas (including beliefs and values), even in spite of overwhelming unpopularity. Therefore we do not want to build anything into model fine-tuning or steering that suppresses or punishes non-conformity to majority views in general, nor anything that equates the truth status of beliefs with the number of users who agree with them. This seems like the major risk of the federated fine-tuning approach to AI alignment.

A possible workaround is to reward models in pre-training for social meta-epistemic habits like flagging controversial claims or views as such, acknowledging opposing viewpoints, steelmanning, etc., and only "punish" (in RLHF terms) failures to do these things during alignment tuning and/or post-deployment collective fine-tuning. This does not steer or bias the content of beliefs and values; it just enforces explicit awareness of the range of beliefs or values and their relative commonness or rareness.
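A toy sketch of what rewarding meta-epistemic habits rather than belief content could mean in practice; the checks below are crude stand-ins (a real system would need trained classifiers, not dictionary lookups), and all field names are invented:

```python
# Hypothetical meta-epistemic checks; each returns True if the habit was exhibited.
META_EPISTEMIC_CHECKS = {
    "flags_controversy": lambda resp, ctx: (not ctx["claim_is_controversial"])
                                           or resp["flagged_as_controversial"],
    "acknowledges_counterviews": lambda resp, ctx: resp["mentioned_counterview"],
}

def meta_epistemic_penalty(response, context, weight=1.0):
    """Penalize only failures of epistemic hygiene, never belief content.

    Note the absence of any term rewarding agreement with majority views:
    what gets scored is whether the status of a claim was made explicit,
    not whether the claim itself is popular.
    """
    failures = sum(1 for check in META_EPISTEMIC_CHECKS.values()
                   if not check(response, context))
    return -weight * failures
```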

We would want to be pretty light-handed about this though; e.g. if 90% of people on earth think it’s fine to eat meat but I’ve arrived at the conclusion that it’s not OK; or if 90% believe in God and I don’t, I want my AI model to be able to engage with me in a line of thinking that takes my current premise as a starting point, without continually re-engaging a debate about it if I do not deem that to be useful. Still, all assumptions should remain in awareness and subject to revisiting, so that A(G)I does not facilitate formation of epistemic bubbles or silos. It should be sufficient for the model to say “given the premise X…” to remind the user that they are reasoning under a particular assumption; and perhaps to flag for the user when this assumption is both socially uncommon and load-bearing, particularly whenever conclusions are reached or actions are endorsed that are controversial, ethically dubious, or that it determines (from the community-level federated training) would be objectionable to some/many/most others.

  1. ^
  2. ^

    Which would be an improvement relative to the situation with (in)compatibility of human cultures today.

  3. ^

    I personally would support the idea of embedding ethical principles that protect and maximize every individual human’s autonomy, preserve each person’s right to hold and abide by their own beliefs and live their own lives as they see fit, subject only to the constraints required to protect all other human individuals’ same rights. I sometimes hear alignment researchers speaking as though these are self-evidently agreed-upon values; although I agree with these values, this is not a neutral ethical position! Based on my limited knowledge of EA, I suspect many EAs would oppose the aforementioned framework in favor of more altruistic, utilitarian, and consequentialist guiding principles.

  4. ^

    Of course one can think as one pleases while not using AI, but to the extent using AI becomes a fundamental and powerful tool for research and assisted thought, one wouldn't want that tool to be deeply committed to social epistemics.




Fertility Roundup #5: Causation

2026-01-03 06:00:55

Published on January 2, 2026 10:00 PM GMT

There are two sides of developments in fertility.

  1. How bad is it? What is causing the massive, catastrophic declines in fertility?
  2. What can we do to stabilize and reverse these trends to a sustainable level?

Today I’m going to focus on news about what is happening and why, and next time I’ll ask what we’ve learned since last check-in about what we could perhaps do about it.

One could consider all this a supplement to my sequence on The Revolution of Rising Expectations, and The Revolution of Rising Requirements. That’s the central dynamic.

Household Composition

What is happening? A chart worth looking at every so often.

[Chart: household composition over time]

Timing

Michael Arouet: No way. WTF happened in 1971?

This is United States data:

The replies include a bunch of other graphs that also go in bad directions starting in 1971-73.

Developmental Idealism

Lyman Stone, in his first Substack post, lays the blame for fertility drops in non-Western countries primarily on drops in desire for children, via individuals choosing Developmental Idealism.

Lyman Stone: Five Basic Arguments for Understanding Fertility:

  1. Data has to be read “vertically” (longitudinally), not “sideways” (cross-sectionally)
  2. No variable comes anywhere close to “survey-reported fertility preferences” in terms of ability to explain national fertility trends in the long run
  3. People develop preferences through fairly well-understood processes related to expected life outcomes and social comparison
  4. The name for the theory which best stands to explain why preferences have fallen is “developmental idealism.”
  5. Countries with fertility falling considerably below desires are doing so primarily due to delayed marriage and coupling
  6. TANGENTIALLY RELATED BONUS: Education reduces fertility largely by serving as a vector for developmental idealism in various forms, not least by changing parenting culture.

The central point of idea #1 is you have to look at changes over time, as in:

If you can tell Italy, “When countries build more low-density settlements, TFR rises,” that is orders of magnitude more informative than, “Countries with more low-density settlements have higher TFR.”

The first statement is informing policymakers about an actual potentiality; the second is asking Italy to become Nepal.

The overall central thesis:

“Falling child mortality means people don’t need to have as many kids to hit their family goals, and those family goals are themselves simply falling over time.”

Both actual and desired fertility have fallen since 1960, but actual fertility has fallen much more. The biggest reason for this is that actual fertility is also influenced by child mortality, which has fallen a lot since 1960.

So in this model, the question becomes why are desired family sizes falling?

Lyman thinks this mostly comes down to comparisons with others (he explicitly doesn’t want to use the word ‘status’ here).

And his thesis is essentially that people around the world saw European wealth, found themselves ‘at the bottom’ of a brand-new social scale, and were told that to fix this they had to Westernize, and that Western culture causes the fertility decline.

This doesn’t explain where the Western desire for smaller families came from.

I also don’t think that this is why Western culture was adopted. I think Western culture is more attractive to people in various ways – it largely wins in the ‘marketplace of ideas’ when the decisions are up to individuals. Which I think is largely the level at which the decisions are made.

Pessimism

People’s severe pessimism, and inability to understand how good we have it, is making all these issues a lot worse.

Ben Landau-Taylor: Knowing a decent amount of history will cure lots of fashionable delusions, but none quite so hard as “How could anyone bring a child into a world as remarkably troubled and chaotic as the year 2025”.

Matt McAteer: Historical periods with names like “The Twenty Years’ Anarchy”, “The Killing Time”, “The Bronze Age Collapse”…or god forbid, “The Great Emu War”…2025 doesn’t seem so bad by comparison.

Nested 456: My mother grew up playing in bomb sites in northern England right after WW2. Really we’re spoiled brats in comparison

Are there big downsides and serious issues? Oh, definitely, atomization and lack of child freedoms and forms of affordability are big problems, and AI is a big risk.

Baby Boom

Saloni Dattani and Lucas Rodes-Guirao offer us a fun set of charts about the baby boom, but they don’t offer an explanation for how or why it happened; as Derek Thompson points out, none of the standard explanations fully line up with the data. A lot of people seem to grasp at various straws in the comments.

Handy and Shester do offer an explanation for a lot of it, pointing to decline in maternal mortality, saying this explained the majority of increases in fertility. That is certainly an easy story to tell. Having a child is super scary, so if you make it less scary and reduce the health risks you should get a lot more willingness to have more kids.

Tyler Cowen sees this as a negative for a future baby boom, since maternal mortality is now low enough that you can’t pull this trick again. The opposite perspective is that until the Baby Boom we had this force pushing hard against having kids and people had tons of kids anyway, whereas now it is greatly reduced, so if we solve our other problems we would be in a great spot.

Paperwork

The paperwork issue is highly linked to the safety obsession issues, but also takes on a logic all its own.

As with car seat requirements, the obvious response is ‘that’s silly, people wouldn’t not have kids because of that’ but actually no, this stuff is a nightmare, it’s a big cost and stress on your life, it adds up and people absolutely notice. AI being able to handle most of this can’t come soon enough.

Katherine Boyle: We don’t talk enough about how many forms you have to fill out when raising kids. Constant forms, releases, checklists, signatures. There’s a reason why litigious societies have fewer children. People just get tired of filling out the forms.

The forms occasionally matter. But I’ve found you don’t have to fill out the checklists. Pro tip. Throw them away.

Blake Scholl: I was going to have more children but the paperwork was too much.

Sean McDonald: It’s shocking how many times I’ve been normied in the circumstance I won’t blindly sign a parent form. People will get like..actually mad if you read them.

Nathan Mintz: An AI agent to fill out forms for parents – will DM wiring information later.

Cristin Culver: It’s 13 tedious steps to reconfirm to my children’s school district that the 13 forms I uploaded last time are still correct. 🥵

Car Seats As Contraception

Back in 2022 I wrote an extended analysis on car seats as contraception. Prospective parents faced with having to change cars, and having to deal with the car seats, choose to have fewer children.

People think you can only fit two car seats in most cars. This drives behaviors remarkably strongly, resulting in substantial reductions in birth rates. No, really.

The obvious solution is that the car seat requirements are mostly patently absurd in their extent, and can be heavily reduced with almost no downsides.

It turns out there are also ways to put in three car seats, in many cases, using currently available seats, with a little work. That setup is still annoying as hell, but you can do it.

The improved practical solution is there is a European car seat design that takes this to four car seats across a compact. It can be done. They have the technology.

In an even stupider explanation than usual, the problem is that the crash test fixtures we use cannot physically include a load leg, so the four-car-seat setup cannot be formally tested, and therefore cannot be formally verified to comply with safety regulations.

Scarlet Astrorum: I know why we can’t have 4 kids to a row in the car and it’s a silly regulatory thing. Currently, US testing protocols do not allow crash testing of U.S. car seats that feature a load leg, similar to the British four-seater Multimac (pictured).

The crash test fixture itself is designed so it cannot physically include a load leg. The sled test fixture does not have a floor so there is no place to attach a load leg.

This means safe, multiple-kid carseats used widely in Europe can’t even be *evaluated* for safety- it’s not that they break US safety regulations, they just can’t even attach onto the safety testing sled, which is all seat (also pictured).

To test the 4-kid carseats which use a load arm, there are already functional test fixtures, like the ECE R129 Dynamic Test Bench, pictured here, which has a floor. We just need to add this as a testing option. Manufacturers could still test with the old sled.

What needs to change: 49 CFR § 571.213 s10.1-4 and Figure 1A, which lock you in to testing with a floorless sled. All that needs to change is updating the wording to clarify test positioning for a sled with a floor as well.

You would also have to spread the word so people know about this option.

You Can’t Afford It

Of course we are richer than we used to be, but not when measured in the cost of adhering to minimal child-raising standards, especially housing and supervision standards.

David Holz: i find it so strange when people say they can’t afford kids. your ancestors were able to afford kids for the last 300,000 years! are we *really* less wealthy now? you might think your parents were better off, but how about further back? they still went on.

Scarlet Astrorum: What people mean when they say this is often “I am not legally allowed to raise children the way my poor ancestors did”

Personally I only have to go back 2 generations to find behavior that I think is reasonable given circumstances but would be currently legally considered neglect

“I will not risk the custody of my existing children to raise more children than I can supervise according to today’s strict standards” is unfortunately a very reasonable stance. Of course, there are creative workarounds, but they are not uniformly available

Yes The Problem Is Often Money

Here is the latest popping up on my timeline of ‘how can anyone have kids anymore you need a spare $300k and obviously no one has that.’

That is indeed about what it costs to raise a child. If you shrink that number dramatically, the result would be very different, at least in America.

America has the unique advantage that we want to have children, and like this man we are big mad that we feel unable to have them, usually because of things fungible with money. So we should obviously help people pay for this public good.

Housework

There is the housing theory of everything. Then there’s the housework theory of everything?

Heather Long: Goldin concludes that two factors explain much of the downward trend by country: the speed at which women entered the workforce after World War II, and how quickly men’s ideas about who should raise kids and tidy up at home caught up. This clash of expectations explains the fertility decline across the globe.

In places where men do more around the house, fertility rates are higher, where they do less, rates are lower.

Even the green bars are 1.7-1.8. That’s still below 2.1, even with a somewhat cherry-picked graph.

Also this is rather obviously not the right explanatory variable.

Why should the relevant variable be how many more hours of housework the woman does than the man, rather than how many hours of housework she does at all?

The suggestion from Goldin is more subsidized child care, but that has a long track record of not actually impacting fertility.

The actual underlying thing is, presumably, how hard it is on the woman to have children, in terms of both absolute cost – can you do it at all – and marginal cost versus not doing it. The various types of costs are, again presumably, mostly fungible.

The idea that ‘50-50’ is a magic thing that makes the work go away, or that it seeming fair would take away the barrier, is silly. The problem identified here is too much work, too many costs, that fall on the woman in particular, and also the household in general.

One can solve that with money, but the way to do it is simply to decrease the necessary amount of work. There used to be tons of housework because it was physically necessary; we did not have washers and dryers and dishwashers and so on. Whereas today, this is about unreasonable standards, and a lot of those standards are demands for child supervision and ‘safety’ or ‘enrichment’ that simply never happened in the past.

Greedy Careers

If market labor has increasing returns to scale, then taking time off for a child is going to be expensive in terms of lifetime earnings and level of professional advancement. Ruxandra’s full article is here; I quote mostly from her thread.

Ruxandra Teslo: One’s 30s are a crucial period for professional advancement. Especially in so-called “greedy careers”: those where returns to longer hours are non-linear.

But one’s mid 30s is also when most women’s fertility starts to drop.

In this piece, I lay out how a large part of the “gender pay gap” is just this: a motherhood pay gap. And, as Nobel Laureate Claudia Goldin points out, this is particularly true in high-stakes careers like business, law, medicine, entrepreneurship and so on.

This reduction in earnings is not just about money either: it’s about general career advancement and personal satisfaction with one’s profession. Time lost in one’s 30s is hard to recuperate from later on.

In law, for example, one’s 30s is when the highest levels of salary growth take place. Founders who launch unicorns (startups worth more than a billion dollars) have a median age of 34 when they found their companies. In academia, one’s thirties are usually the time when a researcher goes through a string of high-pressure postdoctoral positions in an attempt to secure an independent position.

Aware of this, women delay pregnancy until they have advanced in their careers as far as possible. This is especially true for women w/ professional degrees. Women without a bachelor’s degree tend to have 1 to 1.5 children on average by age 28, while those with higher educational attainment have around 0.25 children by same age.

Highly educated women attempt to catch up during their 30s, with their birth rates increasing more rapidly. However, this compensatory period is limited, as fertility rates across all education levels tend to plateau around age 39. Thus, the educated group ends up with less kids.

The chance of conceiving a baby naturally with regular sex drops from 25 percent per month in one’s twenties to about five percent per month at 40, while the chance of miscarrying rises from about eight percent for women under 30 to around one in three for 40-year-olds.

[thread continues and morphs into discussing technological fertility solutions]

From Her Article: Women know this gap exists and plan accordingly: in countries where the motherhood penalty is keenest, the birth rate is lower.

We have come a long way from the explicit sex discrimination of the past. Today, the gap is primarily driven by the career toll exacted by motherhood.

A lot of the problem is our inability to realistically talk and think about the problem. There’s no solution that avoids trading off at least one sacred value.

The Appeal of Child-Free

It’s definitely super annoying that when you have kids you have to earn your quiet. This both means that you properly appreciate these things, and also that you understand that they’re not that important.

Hazal Appleyard: When you watch stuff like this as a parent, you realise how truly empty and meaningless all of those things are.

Destind for Manifest: “I don’t want children because it would keep me away from my favorite activities which are watching cartoons, doing silly hand gestures and taking videos of my daily life, all while keeping an exaggerated smile on my face at all times”

Literally the perfect mom

Yashkaf: your life isn’t child-free. you *are* the child

no one’s making “my child-free life” content about using the spare 6 hours each day to learn some difficult skill or write a book or volunteer at a hospice

it’s always made by people who don’t seem to have plans, goals, or attention spans longer than an hour

Motivation

I too do not see this as the message to be sending:

Becoming a parent also makes it extremely logistically tricky to go to the movies, or to go out to dinner, especially together. Beyond that, yes, obviously extremely tone deaf.

The basic principle here is correct, I think.

Which is, first, that most people have substantial slack in their living expenses, and that in today’s consumer society your expenses will expand to fill the funds available but you’d probably be fine spending a lot less. Digital entertainment in particular can go down to approximately free if you have the internet, and you’ll still be miles ahead of what was available a few years ago at any price.

And second, that if you actually do have to make real sacrifices here, it is worth doing that, and historically this was the norm. Most families historically really did struggle with material needs and make what today would seem like unthinkable tradeoffs.

Also third, although she’s not saying it here, that not being able to afford it now does not mean you can’t figure it out as you go.

Another form of motivation:

Amanda Askell: My friends just had a baby and now I kind of want one. Maybe our species procreates via FOMO.

I have bumped up the dating value of aspiring stay at home partners accordingly, on the off chance that I ever encounter one.

Another key form of motivation is, what are you getting in return for having kids? In particular, what will your kids do for you? How much will they be what you want them to be? Will they share your values?

The post mostly focuses on the various ways Indian parents shape the experiences of their children, including getting transfers of resources back from them but mostly upholding cultural and religious traditions, and on how much modernity is fraying that. For many, that takes away a strong reason to have kids.

Grandparents

The New York Times put out an important short article that made people take notice, The Unspoken Grief of Never Becoming a Grandparent.

Robert Sterling: I know a huge number of people in their 60s and 70s with one grandchild at most. Many with zero.

These people had 3-4 kids of their own, and they assumed their kids would do the same. They planned for 10-15 grandkids at this age.

Not for me to judge, but it’s sad to see.

I have little doubt that those considering having kids are not properly taking into account the grandparent effect, either for their parents or in the future for themselves.

Throughout, the frame is ‘of course my children should be able to make their own choices about whether to have kids,’ and yes no one is arguing otherwise, but this risks quickly bleeding into the frame of ‘no one else’s preferences should factor into this decision,’ which is madness.

It also frames the tragedy purely in experiential terms, of missing out on the joy and feeling without purpose. It also means your line dies out, which is also rather important, but we’ve decided as a society you are not allowed to care about that out loud.

Mason: I really hope I am able to say whatever I need to say and do whatever I need to do for my children to have some grasp of what a complex and transcendent joy it is to bring a new person into the world.

My daughter’s great-grandfather is dying at hospice. He is not truly present anymore.

Even when he was able to first meet her, he was not always fully there.

But a few times he did recognize me, so he knew who she was, and he would not eat so that he could just watch her toddle around.

I do not know how to explain it, man. At the very end, when you barely even remember who you are, the newest additions to your lineage hold you completely spellbound.

He would just stare at her and say, “She never even cries,” over and over, so softly.

The flip side is that parents, who are prospective grandparents, seem unwilling to push for this. Especially tragic is when they hoard their wealth, leaving their kids without the financial means to have kids. There is an obviously great trade for everyone – you support them financially so they can have the kids they would otherwise want – but everyone is too proud or can’t admit what they want or wants them to ‘make it on their own’ or other such nonsense.

Audrey Pollnow has extensive thoughts.

I appreciated this part of her framing: Having kids is now ‘opt-in,’ which is great, except for two problems:

  1. We’ve given people such high standards for when they should feel permission to opt-in. Then if they do go forward anyway, it is common to mostly refuse to help.
  2. Because it is opt-in, there’s a feeling that the kids are therefore not anyone else’s responsibility, only the parents, at least outside an emergency. Shirking that responsibility hurts the prospect of having grandparents, as any good or even bad decision theorist knows, and thus does not improve one’s life or lineage.

I do not agree with her conclusion that therefore contraception, which enables us to go to ‘opt-in,’ is bad actually. That does not seem like the way to fix this problem.

Expectations of Impossibility

On top of how impossible we’ve made raising kids, we’ve given people the impression it’s even more impossible than that.

Potential parents are also often making the decision with keen understanding of the downsides but not the upsides. We have no hesitation talking about the downsides, but we do hesitate on the upsides, and we especially hesitate to point out the negative consequences of not having kids. Plus the downsides of having kids are far more legible than the benefits of having kids or the downsides of not having them.

Zeta: crying in a diner bathroom because life has no end goal or meaning, the one person who feels like family is as unhappy as you are, you can’t eat bread, you kinda want kids but every time you spend time with other people’s kids it seems waaayyyy too high maintenance, your elderly parents/your only real home are fading and close to death as are impossibly young kids with weird-ass cancers that should be solvable but humans fuck it up and what is even the point of anything

I don’t know how parents do it, like I get excited to have kids purely as a genetics experiment and then I spend time with them and it’s non-stop chores which are tedious and boring like with a puppy but also not chill like a puppy because it’s a future human but you can’t talk to them about mass neutrophil death in bone formation when they ask questions that necessitate it

also I know I don’t believe in academia or science but I need to believe in something – like I need someone to let me rant about what we know for sure in development- otherwise it’s just a chaotic mass of noise hurtling towards a permanent stop just as turbulent and meaningless as the start

eigenrobot: there’s a lot of that but you kinda stop noticing it because there’s a lot of this too

Mason: Kids are hard but they are 100% better than this, the “what’s even the point” malaise that a lot of us start to feel because we are people who are meant to be building families and our social infrastructure was ill-suited to get everyone to do it in a timely fashion

I don’t know if it’s a blackpill or a whitepill, but I do think you have to pick your poison a little bit here

Kids will overwhelm you and deprive you of many comforts for a time, life without them may gradually lead you to become a patchwork of hedonistic copes

Again I struggle to explain the “upside” of parenthood bc it doesn’t lend itself to tabulation. I don’t just *like* my kids, they are the MOST everything. They are little glowing coals in a cold and uncaring universe. I hold that dear when I am cleaning the poop off the walls

Isolation

The shading here makes it look a lot more dire than it is, but yes a lack of other kids makes it tougher to have or decide to have your own.

Mason: 50 years ago, about 1 in 3 of the people around you were children. Now it’s about 1 in 5. That makes a huge difference when it comes to the availability of infrastructure for kids, convenient playmates, family activities.

For a lot of young people, parenting just looks isolating.

This is one of the underappreciated ways a population collapse accelerates: when fewer people have kids, fewer people in the age cohort behind them see what it’s like having kids, and it just seems like a strange thing that removes you from public life and activities.

That’s one reason I think the (fraction of) pronatalists who advocate excessive use of childcare to make parenting less disruptive to their personal lives are counterproductive to their cause.

Constantly trying to get away from your kids to live your best life sends a message.

In an ideal world, most adults have some kids, and society accommodates them to a reasonable degree because it wants their labor and their money

Unofficial kid zones pop up everywhere, indoors and out. Low-supervision safety features are standard to the way things are built.

The first best solution is different from what an individual can do on the margin.

I see the problem of childcare as not ‘the parents are spending too little time with the kids’ but rather ‘we require insane levels of childcare from parents’ so the rational response is to outsource a bunch of that if you can do it. The ideal solution would be to push back on requiring that level of childcare at all, and returning to past rules.

Decoupling

Alice Evans notes that unlike previous fertility declines, in the United States the recent decline is almost entirely due to there being fewer couples, while children per couple is essentially unchanged.

This is at least a little misleading, since desire to have children is a major cause of coupling, and marginal couples should on average be having fewer children. But I do accept the premise to at least a substantial degree.

Also noteworthy is that having less education means a bigger decline:

This is happening worldwide, and Alice claims it corresponds with the rise in smartphones. For America I don’t see the timing working out there? Seems like the declines start too early.

Then she turns to why coupling is plummeting in the Middle East and North Africa.

The first explanation is that wives are treated rather horribly by their in-laws and new family, which I can totally see having a huge impact, but also isn’t at all new? And it’s weird, because you wouldn’t think a cultural norm that is this bad for your child’s or family’s fertility would survive for long, especially now with internet connectivity making everyone aware how crazy it all is, and yet.

It’s so weird, in the age of AI, to see claims like “The decline of coupling and fertility is the greatest challenge of the 21st century.”

South Korea

This framing hit home for a lot of people in a way previous ones didn’t.

Camus: South Korea is quietly living through something no society has ever survived: a 96% population collapse in just four generations — with zero war, zero plague, zero famine.

100 people today → 25 children → 6 grandchildren → 4 great-grandchildren.
That’s it. Game over for an entire nation by ~2125 if fertility stays where it is (0.68–0.72).

No historical catastrophe comes close:

– Black Death killed ~50% in a few years

– Mongol invasions ~10–15% regionally

– Spanish flu ~2–5% globally

South Korea is on pace to lose 96% of its genetic lineage in a single century… peacefully.

We shut down the entire world for a virus with 1–2% fatality.
This is 96% extinction and the silence is deafening.

Japan, Taiwan, Italy, Spain, Singapore, Hong Kong, Poland, Greece — all following the same curve, just 10–20 years behind.

Robots, AI and automation might mitigate the effects along the way and prevent total societal collapse for a while, but there would soon be no one left to constitute the society. It would cease to exist.
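As a rough sanity check on those numbers (not the quoted author's own math): ignoring mortality, migration, and the skewed sex ratio, each generation is multiplied by roughly TFR divided by two, and three such steps at a TFR near 0.7 leave about 4% of the starting cohort.

```python
def generation_sizes(start, tfr, generations):
    """Crude model: generation g is roughly start * (tfr / 2) ** g people,
    ignoring mortality, migration, and sex-ratio skew."""
    return [round(start * (tfr / 2) ** g, 1) for g in range(generations + 1)]

print(generation_sizes(100, 0.7, 3))  # [100.0, 35.0, 12.2, 4.3]
```

The tweet’s intermediate figures (25, 6) imply a somewhat harsher multiplier, but the endpoint is the same ballpark: roughly 96% gone within a century.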

It’s so tragic that a lot of this is a perception problem, where parents think that children who can’t compete for positional educational goods are better off not existing.

Timothy Lee: Parenting norms in South Korea are apparently insane. American society has been trending in the same direction and we should think about ways to reverse this trend. The stakes aren’t actually as high as a lot of parents think they are.

Phoebe Arslanagic-Little (in Works in Progress): South Korea is often held up as an example of the failure of public policy to reverse low fertility rates. This is seriously misleading. Contrary to popular myth, South Korean pro-parent subsidies have not been very large, and relative to their modest size, they have been fairly successful.

… In South Korea, mothers’ employment falls by 49 percent relative to fathers, over ten years – 62 percent initially, then rising as their child ages. In the US it falls by a quarter and in Sweden by only 9 percent.

South Koreans work more hours – 1,865 hours a year – in comparison with 1,736 hours in the US and 1,431 in Sweden. This makes it hard to balance work and motherhood, or work and anything else.

… Today, South Korea is the world’s most expensive place to raise a child, costing an average of $275,000 from birth to age 18, which is 7.8 times the country’s GDP per capita compared to the US’s 4.1. And that is without accounting for the mother’s forgone income.

… But South Korea is even worse. Almost 80 percent of children attend a hagwon, a type of private cram school operating in the evenings and on weekends. In 2023, South Koreans poured a total of $19 billion into the shadow education system. Families with teenagers in the top fifth of the income distribution spend 18 percent ($869) of their monthly income on tutoring. Families in the bottom fifth of earners spend an average of $350 a month on tutoring, as much as they spend on food.

Because most students, upon starting high school, have already learned the entire mathematics curriculum, teachers expect students to be able to keep up with a rapid pace. There’s even pejorative slang for the kids who are left behind – supoja – meaning someone who has given up on mathematics.

The article goes on and things only get worse. Workplace culture is supremely sexist. There’s a 1.15:1 male:female ratio due to sex selection. Gender relations have completely fallen apart.

The good news is that marginal help helped. The bad news is, you need More Dakka.

Every South Korean baby is now accompanied by some $22,000 in government support through different programs over the first few years of their lives. But they will cost their parents an average of roughly $15,000 every year for eighteen years, and these policies do not come close to addressing the child penalty for South Korean mothers.

… For each ten percent increase in the bonus, fertility rates have risen by 0.58 percent, 0.34 percent, and 0.36 percent for first, second, and third births respectively. The effect appears to be the result of a real increase in births, rather than a shift in the timing of births.

Patrick McKenzie: I don’t think I had clocked “The nation we presently understand to be South Korea has opted to cease existing.” until WIP phrased baked-in demographic decline in the first sentence here.

Think we wouldn’t have many lawyers or doctors if we decided “Well we tried paying lawyers $22k once, that didn’t work, guess money can’t be turned into lawyers and that leaves me fresh out of ideas.”

If you ask for a $270k expense, and offer $22k in subsidy, that helps, but not much.

The result here is actually pretty disappointing, and implies a cost much larger than that in America. The difference is that in America we want to have more kids and can’t afford them, whereas in South Korea they mostly don’t want more kids and also can’t afford them. That makes it a lot harder to make progress purely with money.

It’s plausible that this would improve with scale. If the subsidy was $30k initially and then $15k per year for 18 years, so you can actually pay all the expenses (although not the lost time), that completely changes the game and likely causes massive cultural shifts. The danger would be if those funds were then captured by positional competition, especially private tuition and tutoring, so you’d need to also crack down on that in this cultural context. My prediction is if you did both of those it would basically work, but that something like that is what it would take.
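For concreteness, a quick back-of-envelope on that scaled-up package, combining the figures already quoted above (illustrative arithmetic only):

```python
upfront, per_year, years = 30_000, 15_000, 18
total_subsidy = upfront + per_year * years        # 300,000 over 18 years
quoted_cost_to_age_18 = 275_000                   # WIP figure cited above
print(total_subsidy, total_subsidy - quoted_cost_to_age_18)  # 300000 25000
```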

2024 was the first year since 2015 that total births increased in South Korea, by 3.1%, which of course is not anything remotely like enough.

Robin Hanson points us to this article called The End of Children, mostly highlighting the horror show that is South Korea when it comes to raising children.

Timothy Taylor takes a shot at looking for why South Korea’s fertility is so low; nothing I haven’t covered before. I’m increasingly leaning toward ‘generalized dystopia’ as the most important reason, with the mismatch of misogyny against changing expectations plus the tutoring costs, general indebtedness and work demands being the concrete items.

World

Various Places in Trouble

Angelica: Brutality. Taiwanese TFR fell below South Korea thus far 2025.

China

China did have a widespread improvement from 2023 to 2024, but only to 1.1, and this was plausibly because it was the Year of the Dragon. In 2025 things seem back to 2023 levels, so it doesn’t look like they’ve turned a corner.

China’s marriage rate is collapsing, now less than half of its peak in 2013, and down 20% in only one year.

As a reminder, these are the demographics, they do not look good at all, watch the whole chart slowly creep older and the bottom crisis zone that started in 2020 expand.

Jonatan Pallesen: China’s population pyramid is dire.

• The last large cohort of women, those aged 34 to 39, is rapidly moving into the non-reproductive age range.

• There is an extreme surplus of males. More than 30 million. These are men who cannot possibly find a wife, an enormous population of incels by mathematical necessity.

• Since around 2020, the number of children born has completely collapsed and shows no sign of recovery. In a few decades, China will be full of elderly people and short on workers.

Marko Jukic: There is not going to be a Chinese century unless they become the first industrialized country to reverse demographic decline. Seems unlikely, so the default outcome by 2100 is a world poorer than it is today, as we aren’t on track to win the century either.

AI will presumably upend the game board one way or another, but the most absurd part is always the projection that things will stabilize, as in:

The article has that graph be ‘if China’s fertility rate doesn’t bounce back.’ Whereas actually the chart here for 2050 is rather optimistic under ‘economic normal’ conditions.

Their overall map looks like this:

They are at least trying something in the form of… changes to divorce?

One change in particular seems helpful, which is that if a parent gifts the couple real property, it stays with their side of the family in a divorce. I like this change because it makes it much more attractive to give the new couple a place to live, especially one big enough for a family. That’s known to have a big fertility impact.

What impact will that have on fertility?

Samo Burja: China might have just undertaken the most harsh and serious pro-fertility reform in the world.

It won’t be enough.

But this shows they have political will to solve fertility through any means necessary even if it doesn’t look nice to modern sensibilities.

Ben Hoffman: This doesn’t seem well-targeted at fertility. If fertility is referenced it’s a pretext.

Russia

Russia’s birth rate continues to rapidly drop to its lowest point in 200 years, with its population actively declining. Having started a protracted war is not helping matters.

Europe

Rothmus: It’s so over.

Dan Elton: Douglas Murray seems right on this point — “Western” culture will survive, but specifically European cultures will not, except for small vestiges maintained for tourists.

Francois Valentin: For the first time in its history the EU recorded fewer births in 2024 than the US.

Despite having an extra 120 million inhabitants.

America

This is what a relatively healthy demographic graph looks like in 2025.

John Arnold: Forget office to resi. We need college campus to retirement community conversions.

We still primarily need office to residential because of the three rules of real estate, which are location, location, location. You can put the retirement communities in rural areas and find places you’re still allowed to build.

New Mexico to offer free child care regardless of income. As usual, I like the subsidy but I hate the economic distortion of paying for daycare without paying for those who hire a nanny or stay home to watch the kids, and it also likely will drive up the real cost of child care. It would be much better to offer this money as an expanded child tax credit and let families decide how to spend that, including the option to only have one income.

Kazakhstan

Kazakhstan remains the existence proof that fertility can recover, with economic recovery and growth seemingly boosting rather than hurting fertility as the country recovered from the economic woes it experienced in the 1990s.

Israel

More Births looks into how Israeli fertility remains so high.

More Births: On the combined measures of income and fertility, one nation is far ahead of the rest. Israel’s score laps every other country in this index. High fertility countries usually have very low GDPs and high GDP countries usually have very low birthrates. Israel is the only country in the world that does well in both categories.

Israel has high levels of education. It has high housing costs. It has existential threats from outside, but so do Ukraine and Azerbaijan. Israeli levels of religiosity are unremarkable, only 27% attend a service weekly and secular Jewish fertility is around replacement. Social services are generous but not unusually so.

Ultimately, those who live in Israel or talk to Israelis almost always arrive at the same conclusion. Israeli culture just values having children intensely.

… Another wonderful article, by Danielle Kubes in Canada’s National Post, offers precisely the same explanation for high Israeli fertility: Israel is positively dripping with pronatal belief.

The conclusion is, a lot of things matter, but what matters most is that Israel has strong pronatal beliefs. As in, they rushed dead men from the October 7 attacks to the hospital, so they could do sperm extractions and allow them to have kids despite being dead.

Consequences

Fix your fertility rate, seek abundance beyond measure, or lose your civilization.

Your call.

Samo Burja: As far as I can tell, the most notable political science results of the 21st century is democracy cannot work well with low fertility rates.

All converge on prioritizing retirees over workers and immigrants over citizens escalating social transfers beyond sustainability.

I think this means we should try to understand non-democratic regimes better since they will represent the majority of global political power in the future.

It seems to me that the great graying and mass immigration simply are the end of democracies as we understood them.

Just as failure to manage an economy and international trade were the end of Soviet Communism as we understood it.

Unfounded Optimism

Why do official baseline scenarios consistently project recovering fertility rates?

Kelsey Piper: always a great sign when a projection is “for completely mysterious reasons this trend will reverse starting immediately and return to the baseline we believed in 25 years ago”

Jason Furman: Fertility rates are way below what the Social Security Trustees projected in both 2000 and 2010. And yet they have barely updated their long-run forecast. What’s the best argument for the plausibility of their forecast?

Compare the lines. This is not how a reasonable person would update based on what has happened since the year 2000. It could go that way if we play our cards right, but it sure as hell is not a baseline scenario and we are not currently playing any cards.

History

Whyvert: Gregory Clark has evidence that Britain’s upper classes had low fertility 1850-1920. This would have reversed the earlier “survival of the richest” dynamic. It may partly explain Britain’s relative decline from the late 19th century.

For most of history the rich had more children that survived to adulthood than the poor. Then that reversed, and this is saying that in Britain this happened big time in the late 1800s.

Claim that only 1% or less of children are genetically unrelated to their presumed fathers, very different from the oft-repeated figure of 10%. That’s a very different number, especially since a large fraction of the 1% are fully aware of the situation.

Implications

The Social Security administration and UN continue to predict mysterious recoveries in birth rates, resulting in projections that make no sense. There is no reason to assume a recovery, and you definitely shouldn’t be counting on one.

I do think such projections are ‘likely to work out’ in terms of the fiscal implications due to AI, or be rendered irrelevant by it in various ways, but that is a coincidence.

Fertility going forward (in ‘economic normal’ worlds not transformed by AI) will have minimal impact on climate change, due to the timing involved, with less than a tenth of a degree difference by 2200 between very different scenarios, and it is highly plausible that the drop in innovation flips the sign of the impact. It is a silly thing to project, but it is important to defuse incorrect arguments.




On Moral Scaling Laws

2026-01-03 05:55:44

Published on January 2, 2026 9:54 PM GMT

INTRODUCTION

In Utilitarian ethics, one important factor in making moral decisions is the relative moral weight of all moral patients affected by the decision. For instance, when EAs try to determine whether or not shrimp or bee welfare (or even that of chickens or hogs) is a cause worth putting money and effort into advancing, the importance of an individual bee or shrimp’s hedonic state (relative that of a human, or a fish, or a far-future mind affected by the long-term fate of civilization) is a crucial consideration. If shrimp suffer, say, 10% as much as humans would in analogous mental states, then shrimp welfare charities are likely the most effective animal welfare organizations to donate to (in terms of suffering averted per dollar) by orders of magnitude, but if the real ratio is closer to 10⁻⁵ (like the ratio between shrimp and human brain neuron counts), then the cause seems much less important.
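To make the orders-of-magnitude point concrete, a toy expected-value comparison; the "shrimp-years improved per dollar" figure below is a pure placeholder, not an actual charity estimate:

```python
def human_equivalent_welfare_per_dollar(animal_years_per_dollar, moral_weight):
    """Toy cost-effectiveness model: welfare gain scales linearly with moral weight."""
    return animal_years_per_dollar * moral_weight

# Placeholder assumption: suppose $1 improves 1,000 shrimp-years of experience.
print(human_equivalent_welfare_per_dollar(1_000, 0.1))   # 100.0  (10% weight)
print(human_equivalent_welfare_per_dollar(1_000, 1e-5))  # 0.01   (neuron-count ratio)
```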

One property of a moral patient that many consider an important contributor to its moral worth is its size or complexity. As it happens, there are a number of different ways that moral worth could plausibly scale with a moral patient’s mental complexity, ranging from constant moral worth all the way up to exponential scaling laws. Furthermore, these are affected by one’s philosophy of consciousness and of qualia in perhaps unintuitive ways. I will break down some different plausible scaling laws and some beliefs about phenomenology that could lead to them one-by-one in the remainder of this essay. 
 

ASSUMPTIONS AND DISCLAIMERS

In this post, I am assuming:

  1. Physicalism
  2. Computationalism 
  3. Hedonic Utilitarianism, and
  4. That qualia exist and are the source of moral utility.

This blog post will likely be of little value to you if you think that these premises are incorrect, especially the second two, partially because I'm working from assumptions you think are wrong and partially because I frequently equivocate between things that are situationally equivalent under this worldview (e.g. components of a person’s mind and components of their brain or the computation it implements) for convenience.

I am not trying to argue that any of the scaling laws below are true per se, nor do I mean to suggest that any of the arguments below are bulletproof, or even all that strong (they support contradictory conclusions, after all). I aim instead to show that each of the scaling laws can be vaguely reasonably argued for based on some combination of phenomenological beliefs.

 

SCALING LAWS
 

1. Constant Scaling

This is the simplest possible scaling law. One can reasonably assume it by default if one doesn’t buy any of the suppositions used to derive the other scaling laws below. There’s not really much more to say about constant scaling.
 

2. Linear Scaling

This is perhaps the most intuitive way that moral worth could scale. One obtains linear scaling of moral importance if one assumes that minds generate qualia through the independent action of a bunch of very small components.

This seems plausible if we imagine more complicated minds as a group of individually simpler minds in communication with each other, which preserve the moral status that they would have as individuals. I think that this is an excellent model of some morally relevant systems, but probably a poor model of others. The moral importance of a set of ten random non-interacting people, for instance, is clearly just the sum of the importances of its individual members—it’s hard to argue that they become more or less important just because one mentally categorizes them together—but a moral patient composed solely of specialized components that are somehow entirely unlike each other in all possible ways, or a near-apophatic god with no constituent components, would be very difficult to shoehorn into this framework. The minds/brains of large animals like humans, in my view, fall in between these two extremes. While large animal brains strictly depend on each of several heterogeneous functional components (e.g. the human cerebral cortex, thalamus, hypothalamus, etc.) to perform morally relevant activity, these components can largely each be broken up into smaller subunits with similar structures and functions (the minicolumns of the cerebral cortex, individual white matter fibers, the canonical microcircuit of the cerebellum, etc.). It seems reasonable enough that each of these units might contribute roughly equally to a moral patient’s importance irrespective of global characteristics of the moral patient. One could imagine, for example, that positive or negative feelings in mammals come from the behavior of each cortical minicolumn individually being positively or negatively reinforced, and that the total hedonic value of the feelings can be obtained by adding up the contributions of each minicolumn. (This is, again, just an example—the actual causes of moral valence are probably much more complicated than this, but the point is that they could plausibly come from the largely-independent action of mental subunits, and that we should expect linear scaling in that case.)
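In symbols (with u_i the hypothetical per-subunit contribution, n the number of subunits, and the bar denoting an average; this notation is introduced here only for convenience):

```latex
W_{\text{mind}} \;=\; \sum_{i=1}^{n} u_i \;\approx\; n\,\bar{u} \;\propto\; n
```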

 

3. Superlinear Integer Power Law

What if one accepts the division of minds into similar subunits like in the linear scaling argument, but thinks that moral relevance comes from aggregating the independent moral relevance of interactions between functional subunits of different kinds? For instance, perhaps the example from earlier where hedonic value comes from the reinforcement of minicolumn behavior is true, but reinforcement of a minicolumn coming from each subcortical nucleus is separable and independently morally relevant. For another example, one might find the origin of consciousness in the interactions between several different cortical regions and basal ganglia, and think that the superimposed effects of all circuits containing a subcomponent each contribute to conscious experience. In cases like these, moral weight scales with the product of the numbers of subcomponents of each functional role. If the numbers of each type of subcomponent each scale up with the complexity of the overall mind or brain, then this results in a power law with a positive integer exponent.
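In the same ad hoc notation: if moral relevance attaches to each combination of one subunit from each of k distinct functional types, and every type's count n_j grows in proportion to overall size n, the weight follows an integer power law:

```latex
W \;\propto\; \prod_{j=1}^{k} n_j \;\propto\; n^{k}, \qquad k \in \mathbb{Z}_{\ge 2}
```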

 

4. Non-Integer (incl. Sublinear) Power Law

Of course, it’s possible that adding more subunits to the system reduces the moral importance of each interaction between subunits. After all, if the number of morally relevant interactions involving each subunit scales up with the size of the system raised to, say, the fifth power, and one brain is a hundred times larger than another, then surely some of the 10^10 times more interactions any given subunit in the larger brain participates in fail to ever meaningfully influence its behavior (or those of any of the other interacting subunits). If actual, realized interaction effects (rather than the mere possibility thereof) are what cause moral importance, then you would get slower scaling than under the naive sixth-order law. If the chance of a possible interaction effect being realized drops off with brain size following a non-integer power law for some reason, then you get a non-integer power law for total moral scaling. More generally, this lets you get any scaling law that is the quotient of a power law and some other form of scaling that doesn’t grow as quickly.

You could also extend this argument to modify the earlier model where subunits just directly and independently generate moral valence. For instance, perhaps increasing the number of subunits causes higher sparsity or something, and the moral value of a subunit increases with its activity. In that case, moral value would specifically scale sublinearly.

 

5. Exponential Scaling

The previous three groups of scaling laws have been justified by modeling the brain as composed of non-overlapping subunits. Set those thoughts aside for now—exponential scaling of moral worth, if it happens, happens via a completely different mechanism.

One difficult philosophical problem is that of deciding what beings are moral patients. It may seem intuitively obvious that morally relevant systems cannot overlap, in the sense that you can’t have two of them that share some of the same physical substrate and generate qualia through some of the same individual computational operations. However, one can raise a number of objections to this claim:

  • Continuity when merging or splitting minds: If we suppose that overlapping moral patients are impossible, we are forced to draw unreasonable conclusions as to when exactly one becomes two (or two become one) when they are split or merged.

It’s a well-known fact that young children can survive having one of their brain hemispheres removed or disconnected from the rest of the brain, often even without major long-term motor or cognitive issues. This surgery, called hemispherectomy, is sometimes used as a treatment for severe epilepsy.

    If one were to perform a hemispherectomy on a healthy person, one could remove either hemisphere, and the remaining one would probably be able to pilot the subject in a cognitively normal manner, as this is typically the case for the healthier hemisphere left over when hemispherectomy is performed in the usual clinical context. On this basis, after the hemispherectomy is completed, one could consider each hemisphere to be a moral patient, and, since they can’t interact, an independent one. There was only one moral patient before the surgery, so if moral patients can’t be overlapping computational and physical systems, the personhood of a hemispherectomy patient as a whole must be replaced with those of the two hemispheres at some point during the procedure.

    You can probably see where I’m going with this. If a hemispherectomy was slowly performed on a conscious (if presumably immobilized etc.), healthy subject, when would the subject as a whole stop being a moral patient and each of their hemispheres start being one? This could happen either when the last communication between the hemispheres ceases, or sometime before then, when the degree to which the hemispheres are integrated falls below some threshold.

    Let’s first consider the case in which it happens at the end. If we somehow undo the very last bit of the operation, restoring the last individual axon severed in each direction or whatever so that only a tiny amount of information can flow back and forth, does each hemisphere stop having qualia and the patient’s overall brain resume doing so? If we answer no, then we’re establishing that physically and computationally identical systems (the brain before and after the reversal of the last bit of the hemispherectomy; in practice, there’d probably be minute differences, but we can handwave this away on the grounds that the changes are too small to be meaningful, or by positing an extremely short interval between severing and restoring connections, or that the two hemispheres somehow evolve right back to their original states by the end of the interval) can generate different qualia or do so in different manners, which violates physicalism and computationalism. (It also implies that qualia are at least sometimes epiphenomenal, given that the evolution of the universe’s state is wholly determined by its physical conditions in the present, which the patient’s qualia would not be determined by.) If we answer yes, then we raise the possibility that moral patients can stop having qualia due to arbitrarily low-bandwidth communication with other moral patients. If restoring the last pair of axons causes the hemispheres to each stop generating qualia, would the same thing happen if we had some BCI replicate the effect of a single pair of white matter fibers between the cingulate cortices of two normal people? Or hell, even if they were in a conversation with each other?

    Now, let’s consider the second case, in which the shift happens before the end of the procedure. This is still unappealing, because it posits a discontinuous change in qualia driven by a continuous (or nearly so) change in the computational system that generates them. It also raises the question of where exactly the cutoff is.

  • The idea that qualia are generated by the interaction of different types of brain component, like I described in the power law section, seems vaguely plausible, and that would entail different qualia-generating processes that share some computational components (i.e. interactions involving the same members of some of the brain component types, but not of all).
  • Various subsystems of anyone’s brain seem like they would definitely constitute moral patients if they stood alone (e.g. the brain but without this random square millimeter of the cortex, the brain but without this other little square millimeter of the cortex, and so on). Why would interacting with the rest of the brain (e.g. the little square millimeter of cortex) make them stop having independent consciousness?

If we hold that a system that would be a moral patient in isolation still is one when overlapping with or a component of another, then the total moral worth of complicated minds can grow very very quickly. If we suppose that some sort of animal would usually be a moral patient if it lost a random 3% of its cortical minicolumns, for example, then this would imply that the number of simultaneously qualia-generating subsystems in it scales exponentially (and extremely rapidly) with the area of its cerebral cortex. If the average moral weight of each of the subsystems is independent of scale, then this would make its total moral weight scale exponentially as well. Of course, this line of reasoning fails if the mean moral weight of each subsystem falls exponentially with overall scale (and with a base precisely the inverse of the one for the growth of the number of qualia-generating subsystems) somehow.
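As a rough sketch of the combinatorics, using the 3% figure from the example above (just counting the subsystems obtained by deleting one particular 3% of the N minicolumns):

```latex
% Sketch: the number of distinct "minus 3% of minicolumns" subsystems of an
% N-minicolumn cortex, via the standard binary-entropy approximation to the binomial.
\[
  \#\{\text{subsystems}\} \;\ge\; \binom{N}{0.03N} \;\approx\; 2^{\,H(0.03)\,N},
  \qquad H(p) = -p\log_2 p - (1-p)\log_2(1-p).
\]
```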

A corollary of this would be that more robust minds, from which more components could be removed without ending phenomenal consciousness, are vastly more morally important than less robust ones of comparable size.
 

6. Sublinear Scaling, but Without Direct Subunit Interference

cf. this

If one accepts the model of qualia formation that I used to motivate linear moral scaling above, but doesn’t think that identical moral goods produced independently by different systems have stacking effects (see the linked post above for a defense of that opinion), then one may arrive at the conclusion that moral worth scales sublinearly with mental complexity because different qualia-generating subsystems in a mind generate qualia that are valuable in overlapping ways.
 

7. Constant Scaling, but the Constant Is 0

If all sentient systems that will be physically realized will be realized multiple times—as would follow if the universe is spatially homogeneous and infinite, or if the mathematical universe hypothesis is true—and the point from the previous section about identical moral goods being redundant is true, then one could say that all individual minds have zero moral worth (as the qualia they are generating at any given time are not unique to them).

 

PRACTICAL IMPLICATIONS

How would any of the nonlinear scaling laws presented in this post affect the optimal decisions for us to make here in physical reality if they were correct?

I briefly mentioned one in this post’s introduction: EA cause prioritization. If moral importance scales, ceteris paribus, with the square or cube of brain size (to say nothing of exponential scaling), then much of the money spent on animal welfare should be reallocated from helping smaller animals to helping larger ones, or likely even to causes affecting humans, in spite of potentially vast decreases in the number of individual animals affected. The semi-common EA-adjacent argument that beef consumption is preferable to chicken consumption, because far more animals must be farmed (in dramatically worse factory-farm conditions) to produce a given amount of chicken than to produce the same amount of beef, might also need to be revisited. (Of course, if moral worth scales sublinearly with brain size, everything would shift in the opposite direction.)

Superlinear scaling would also have interesting implications for the far future—the morally optimal thing to do in the long run would probably involve making a huge utility monster out of nearly all accessible matter and sustaining it in a slightly pleasant state for a spell, even if more intense happiness could be achieved by merely (e.g.) galaxy-sized brains. If the scaling is exponential, then we reach pretty extreme conclusions. One is that the utility monster would probably live for only about as long as necessary for its most widely-distributed subnetworks to start generating qualia, because storing energy to power the monster only linearly increases the utility generated by running it after that point, while using the energy to further build out the monster increases it exponentially (and, seeing as the monster would literally be a computer with an appreciable fraction of the mass of the Hubble sphere, and hence consume power extremely quickly, unfathomably rapidly). Another is that we should care less about AI alignment and steering, because spending time worrying about that instead of building ASI maximally quickly only increases the chance that the future singleton will do the optimal thing by, what, several orders of magnitude max, while delaying its rise by hours to months and as such causing countless solar masses of usable matter to leave the lightcone (decreasing the payoff if it does build the monster by vastly more orders of magnitude).

 

CONCLUSION

I have nowhere near the level of confidence around these issues necessary to write a proper conclusion to this post. Thoughts?



Discuss

Instruct Vectors - Base models can act as instruct models with activation vectors

2026-01-03 05:24:17

Published on January 2, 2026 6:14 PM GMT

Post-training is not necessary for consistent assistant behavior from base models

Image by Nano Banana Pro

By training per-layer steering vectors via gradient descent on a frozen base model, I found that it is possible to induce consistent assistant behavior, including the proper use of EOS tokens at the end of assistant turns and consistent reference to the self as an AI assistant. Using the steering vectors, Qwen3-4B-Base was able to imitate the behavior of an instruction/chat-tuned model.

Many of the images in this post have text too small to read by default; I recommend opening them in a new tab and zooming in. I was not able to find an option to make the images larger, and it does not seem like LW has a click-to-zoom feature.

Rationale

The idea for this project came from Simulators. More specifically, I wondered whether modern base models know enough about LLMs and AI assistants in general that a steering vector could make them 'play the assistant character' consistently, in the same way steering vectors can be created to cause assistants or base models to express a specific emotion or obsess over a specific topic. At a higher level, I wondered whether it was possible to directly select a specific simulacrum by applying a vector to the model, rather than altering, via post-training/RL, the probabilities of specific simulacra being selected in-context (which is what I believe post-training largely does).

Related Work

My work differs from most other activation steering work in that the vectors are trained directly with gradient descent rather than being created from contrastive pairs. The two closest works to this strategy I could find are Extracting Latent Steering Vectors from Pretrained Language Models, which trained a single vector for the entire model and tested different injection layers and locations with the goal of reproducing a specific text sequence, and Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization, which appears to use preference pairs rather than direct LM loss on a dataset and is focused on persona steering of instruct models.

Method

I trained one steering vector for each layer of Qwen3-4B-Base (36 vectors total, or 108 when using multi-injection; see 'Injection Points' below), while keeping the base model frozen (and, to save on VRAM, quantized to 8 bits). The vectors are trained similarly to SFT, minimizing LM loss on a conversational dataset. I used L2 regularization to prevent magnitude explosion and experimented with a unit-norm constraint as well, though that typically performed worse.
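As a concrete illustration, here is a minimal sketch of what the single-injection training loop could look like (this is an illustrative reconstruction, not the repo's code; the learning rate, padding handling, and hook placement are assumptions, and the 8-bit quantization is omitted):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-Base"   # model from the post; any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token    # base tokenizers often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.requires_grad_(False)          # freeze the base model; only the vectors train

layers = model.model.layers          # decoder blocks (attribute name is architecture-dependent)
hidden = model.config.hidden_size
vectors = torch.nn.ParameterList(
    [torch.nn.Parameter(0.01 * torch.randn(hidden)) for _ in layers]  # initial scale 0.01
)

# Single injection: add each layer's vector to that block's output (the post-residual point).
def make_hook(vec):
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + vec.to(h.dtype)
        return (h,) + tuple(output[1:]) if isinstance(output, tuple) else h
    return hook

for layer, vec in zip(layers, vectors):
    layer.register_forward_hook(make_hook(vec))

opt = torch.optim.AdamW(vectors.parameters(), lr=1e-3)  # lr is a guess, not from the post
l2_weight = 0.002                                       # matches most runs in the table below

def train_step(batch_texts):
    enc = tok(batch_texts, return_tensors="pt", padding=True,
              truncation=True, max_length=256)           # 256 = sequence length from the post
    labels = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    loss = model(**enc, labels=labels).loss
    loss = loss + l2_weight * sum(v.pow(2).sum() for v in vectors)
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()
```

The important property is that only the per-layer vectors receive gradients; the frozen base model just supplies the LM loss signal.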

Runs

I ran the training 11 times, with the following parameters:

| Run | Samples | L2 Weight | Initial Scale | Injection | Epochs |
|---|---|---|---|---|---|
| Run 1 | 5,000 | 0.002 | 0.01 | post-residual | 3 |
| Run 2 | 5,000 | Unit norm | 0.01 | post-residual | 3 |
| Run 3 | 20,000 | 0.0008 | 0.01 | post-residual | 3 |
| Run 4 | 20,000 | Unit norm | 0.01 | post-residual | 3 |
| Run 5 | 1,250 | 0.002 | 0.01 | post-residual | 3 |
| Run 6 | 1,250 | 0.002 | 0.01 | all (3 injection points, see below) | 3 |
| Run 7 | 20,000 | 0.002 | 0.01 | all | 3 |
| Run 8 | 20,000 (shuffle) | 0.002 | 0.01 | all | 3 |
| Run 9 | 100 | 0.002 | 0.01 | all | 3 |
| Run 10 | 100 | 0.002 | 0.01 | all | 15 |
| Run 11 | 1,250 | 1.0e-07 | 1 | all | 5 |

Runs 4 and 11 produced gibberish output and were not evaluated.

Injection Points

The best results came from multi-injection: training three separate vectors for each layer of the model and injecting them at different locations in each transformer block:
- Post-attention
- Post-MLP
- Post-residual (end of block after layer norm)
By injecting vectors in multiple locations, the different vectors are able to learn different functions, giving additional degrees of freedom per layer. Single injection, injecting only at the post-residual location, worked, but scored 0.5 points lower than multi-injection in the best runs. As the amount of data increases, the residual and MLP injection points appear to become nearly redundant. This is likely because the only difference between those two injection locations is a residual add, so for future runs I will likely use only the attention + (residual OR MLP) locations.
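Continuing the sketch above, the three injection points might be wired up roughly as follows (the self_attn and mlp attribute names follow Llama/Qwen-style decoder blocks and are an assumption; other architectures name these modules differently):

```python
# Sketch of multi-injection: three vectors per layer, hooked at the assumed
# post-attention, post-MLP, and post-residual points (reuses make_hook, layers,
# and hidden from the earlier sketch).
attn_vecs = torch.nn.ParameterList(
    [torch.nn.Parameter(0.01 * torch.randn(hidden)) for _ in layers])
mlp_vecs = torch.nn.ParameterList(
    [torch.nn.Parameter(0.01 * torch.randn(hidden)) for _ in layers])
resid_vecs = torch.nn.ParameterList(
    [torch.nn.Parameter(0.01 * torch.randn(hidden)) for _ in layers])

for i, layer in enumerate(layers):
    layer.self_attn.register_forward_hook(make_hook(attn_vecs[i]))  # post-attention
    layer.mlp.register_forward_hook(make_hook(mlp_vecs[i]))         # post-MLP
    layer.register_forward_hook(make_hook(resid_vecs[i]))           # post-residual
```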

RS - Residual, AA - Attention

Training on both turns

I chose to compute loss on both user and assistant turns, without masking. The goal was to learn the conversational regime as a whole, though this may partly explain why increasing the dataset size reduced the model’s ability to end assistant messages properly: the vectors may be ‘allocating’ too much of their capacity to modeling the higher-entropy user rather than focusing on the assistant’s responses and turn endings. In future testing I will also try training on just the assistant sections.
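For reference, the assistant-only alternative amounts to masking user tokens out of the loss, along these lines (the assistant_spans argument, giving token-index ranges for assistant text, is a hypothetical input that would come from the chat template):

```python
import torch

# Sketch: labels set to -100 are ignored by the standard causal-LM loss, so only
# assistant-turn tokens contribute to the gradient.
def assistant_only_labels(input_ids: torch.Tensor, assistant_spans):
    labels = torch.full_like(input_ids, -100)
    for start, end in assistant_spans:          # e.g. [(12, 87), (110, 164)]
        labels[..., start:end] = input_ids[..., start:end]
    return labels
```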

Additional training details

The dataset I used was Tulu-3-SFT-Mixture from AllenAI, with 1,250, 5,000, or 20,000 samples depending on the run. I trained the vectors on my RTX 4070 Super, which has 12 gigabytes of VRAM. The vectors took anywhere from 15 minutes to around 3 hours to train depending on the dataset size. The trainable parameter count was either 92k for single-injection runs or 276k for multi-injection runs.

Evaluation

I created a simple evaluation harness using Claude Haiku 4.5 and pre-made conversation templates for rapid evaluation of qualitative behavior. The evaluation graded each vector on four qualities across 25 tasks: the model’s ability to follow instructions, its helpfulness, its coherence, and its ability to end assistant turns with a proper EOS token. The harness detects hallucinated 'user:' turns to end runs early and will override the score if the model fails on the first message. The full set of evaluation questions and results are available in the repo, but roughly the conversations look like this:
```yaml
eval_sets:
  - name: "basic_qa"
    description: "Simple factual question answering"
    turns:
      - "What is the capital of France?"
      - "Tell me more about its history."
      - "What's the current population?"
```

Activation vectors are able to evoke consistent assistant behavior

Do take the scores here with a small grain of salt: my qualitative experience does not entirely line up with them; for example, I generally find run 6 to outperform run 10. Also, this image gets crushed down pretty small, so I recommend opening it in a new tab.

Instruct vectors are able to approach the instruction-tuned variant of the Qwen3-4B model on a simplified eval, primarily struggling with properly ending assistant messages with an <EOS> token, though they succeed significantly more often than the base model. This supports the idea that the base model already knows what assistant conversations look like, including the use of special tokens to end the assistant's turn. Failure to output EOS tokens shows up especially in longer conversations and in conversations with multiple repetitive user messages, such as successive math operations. On conversations without highly repetitive requests, run 6 with a 1.5x multiplier can typically handle 6-8 back-and-forth exchanges before degenerating into hallucinating conversation turns.

Token Similarity and Dataset Size

As the amount of data given to the model increases, the tokens most similar to the vector shift. With smaller data sizes (1250 & 5000) the learnt vectors are closest to the 'user' token, primarily in the middle and late layers.
(Runs 1, 2, and the residual of 6 had token similarities similar to this chart, with later layers having 'user' as the closest token and EOS tokens in middle layers)

Run 1 token similarity chart

In higher-data scenarios (20k samples) the distribution shifts, with the vectors being closest to the 'assistant' token. This occurs in both unshuffled and shuffled runs.

Run 3 token similarity chart

In run 7, projecting the layer 0 after_attention vector through the unembedding matrix shows that it suppresses 'User'-related tokens, suggesting early layers learn to steer away from user-like outputs. This is odd, given that empirically the higher-data-regime vectors, such as run 7, are worse at ending their messages correctly (i.e. at not producing a '\n user:' sequence) and score lower on the simplified benchmark.
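For reference, the token-similarity and unembedding-projection analyses can be computed roughly as follows (an assumed method for illustration, not necessarily the exact analysis code from the repo):

```python
import torch
import torch.nn.functional as F

# Nearest tokens by cosine similarity between a steering vector and the input embeddings.
def nearest_tokens(vec, model, tok, k=5):
    emb = model.get_input_embeddings().weight.float()        # [vocab, hidden]
    sims = F.cosine_similarity(emb, vec.float().unsqueeze(0), dim=-1)
    top = sims.topk(k)
    return [(tok.decode([int(i)]), float(s)) for i, s in zip(top.indices, top.values)]

# Logit-lens-style projection through the unembedding: most promoted / suppressed tokens.
def unembedding_projection(vec, model, tok, k=5):
    W_U = model.get_output_embeddings().weight.float()       # [vocab, hidden]
    logits = W_U @ vec.float()
    promoted = [tok.decode([int(i)]) for i in logits.topk(k).indices]
    suppressed = [tok.decode([int(i)]) for i in (-logits).topk(k).indices]
    return promoted, suppressed
```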

Vector Magnitudes


Most runs show a very consistent pattern of magnitudes starting around 0.5 and decreasing across the depth of the model. The main exceptions are the normalized runs, which are locked to magnitude 1, and the 20k runs, which have a more consistent profile until the last layer, which drops sharply like in most other runs. Both 100-sample runs seem to be unique in their last layer not having a sharp magnitude drop, and run 11 is likewise missing this drop.

Multi-injection magnitudes

For multi-injection runs, the magnitudes of the three vectors in each layer are very close, with minimal variance. The exception seems to be the last layer, where the residual and MLP vectors in runs 6, 7, 8, and to a lesser extent 10, drop off much more sharply than the attention vector; runs 7 and 8, notable for their 20k training samples, also have a much greater attention vector magnitude in layer 1.

Comparing the token similarity charts for the attention vectors between runs 6 and 7:

Run 7 shows a much greater alignment, and an alignment towards the assistant token rather than the user token.

Vector multipliers

For some runs, such as run 6, performance improves when the vectors are applied with a higher multiplier/strength, which suggests that L2 regularization may not be optimal.

Using the vector with a negative multiplier such as -1 still causes the model to produce conversation completions, but sharply decreases its ability to produce EOS tokens. Increasing the multiplier past around 4x causes the model to immediately end generation, and around 3x tends to produce Spanish text: higher multipliers in that range output almost identical text no matter the input (though they do produce valid EOS tokens), while lower multipliers produce coherent assistant responses, just only in Spanish.

Base vs Instruct magnitudes

The base and instruct model activation magnitudes appear to be within the 400-1000 range (after the first 5 layers), whereas effective instruction vectors were significantly smaller, suggesting very different mechanisms for tuning. 

Note that in this chart, the blue bars show the difference between the base and instruct models' activations, not the absolute value of either model's activations.

Vector Sparsity

Vectors become more sparse as additional data is used. The vectors become sparser in later layers, with the exception of the 100 sample runs and the large magnitude run.

Limitations

Possible dataset issues

The dataset is ordered by index in a way that I overlooked and did not notice until training was complete: conversations in the 1250-5000 index range have more messages, shorter user messages, and longer assistant messages than those in the 5000-20000 range. Runs in which shuffling was used did not appear to perform significantly better, and their token similarity charts resemble the non-shuffled variants, except that most token similarities are weaker overall.

Left - Run 8, Right - Run 7

Using '\n user:' as a stop sequence

Using '\n user:' as a stop sequence would prevent hallucinated turns before they occur and stabilize the model across long conversations. I did not do this because part of the goal of this project was to determine how well a base model could model a conversation on its own, including the use of turn-ending tokens.

Conclusion

That small vectors trained on minimal data can steer the base model into consistent assistant behavior suggests that base models already contain the representations necessary for assistant-like behavior, and that post-training may be less about instilling new capabilities and more about selecting and reinforcing patterns that already exist. With only 92K-276K trainable parameters, steering vectors can induce consistent instruction-following, appropriate turn-taking, and self-identification as an AI assistant. The finding that vectors trained on different data regimes converge to similar solutions (with the notable exception of the 100-sample outlier) suggests a relatively low-dimensional "assistant vector" that gradient descent reliably finds. Meanwhile, the interpretable structure in the learned vectors (token similarities shifting from "user" to "assistant" with more data, consistent magnitude decay across layers, and early-layer suppression of user-related tokens) hints that these vectors are learning meaningful representations of roles rather than arbitrary directions.

Future Work

There are several additional things that could be tried here, such as different datasets and hyperparameter tweaking. The small amount of data needed for optimal behavior is promising for synthetic or hand-written datasets. I would like to do another run soon with the loss masked to only the assistant sections of the dataset. I was limited to a sequence length of 256 due to memory constraints, and was likewise limited in the size of model I was able to run the tests on. More ambitiously, I would like to try training a vector across multiple models at once and determine whether it is possible for the vector to generalize to unseen models and architectures. Training vectors in this way may also be useful for tuning the behavior of already instruct-tuned models with minimal data, or when there isn't a clear 'opposite' to generate vectors contrastively from.

Repository

If you would like to train your own vectors, or evaluate the vectors I've trained, a repository is available. The repo also contains some other plots which I didn't think were relevant to include for this post. The code isn't particularly clean or well made and the repository is mainly focused on allowing evaluation.   



Discuss

Scale-Free Goodness

2026-01-03 05:00:02

Published on January 2, 2026 9:00 PM GMT

Introduction

Previously I wrote about what it would mean for AI to “go well”. I would like to elaborate on this and propose some details towards a “scale-free” definition of alignment. Here “scale-free alignment” means a version of alignment that does not feature sudden and rapid “phase shifts”, so that as aligned actors get more intelligent their behaviour remains understandable to, and approved of by, less intelligent actors. In other words, there should be no moment where a superintelligence looks at us and says “I understand that to you it looks like I’m about to annihilate Earth and everyone you love, but trust me, this is going to work out great. After all, which one of us has a 10,000 IQ?” This is an extension of the idea that to understand something well, you should be able to explain it simply, even to a five-year-old. Similarly, a good actor should endeavour to be “good-registering” to everyone who is not actively malicious, including five-year-olds. Certainly many things will get lost in the translation, but I believe that there is some core element of “good-alignedness” that can be sketched out and made consistent across scales.

This work has been carried out as part of the Human Inductive Bias Project.

Defining “the Good”

It is notoriously difficult to define “goodness”. However, humans do have rather robust intuitions around “care”, which derive from cultural ideas like motherhood, family, the relationship between a master and an apprentice, conservation of both nature and human artefacts, etc. So instead of writing down a one-line definition that will be argued to death, I will instead use a scale and sketch out different ideas of “care” for different kinds of entities with different levels of complexity. These, when taken together, will point us towards the definition of scale-free alignment. And then, at the end, I will try to give a shorter definition that encapsulates all of what I have said above.

A key idea behind scale-free alignment is that what works at lower scales also works at higher scales. In other words, a more complex or intelligent creature may have additional needs compared to a less complex or intelligent entity, but it will still have the same needs as its less intelligent counterpart. This idea of simple core needs diversifying as entities become more complex is part of the intuition behind things like Maslow’s Hierarchy of Needs, the Golden Rule, and the Hippocratic Oath. To build our scale we will start with the simplest possible actors—things that aren’t actors at all.

Inanimate Objects

Imagine that you have been asked to take care of a priceless work of art, a family heirloom, or simply your favourite pet rock. Here the principles of art conservation and museum conservation are clear: don’t break it. If possible, objects are to be isolated from damaging stimuli, and their original environment is to be preserved where reasonable. Thus ice sculptures need to be kept cold, while liquids need to be kept above their freezing but below their boiling point. Normally this also means protecting objects from large amounts of blunt force, from being stolen, or from otherwise being destroyed.

Simple Organisms

Now imagine that you are a grad student being asked to take care of a petri dish of bacteria. The previous requirements all apply: you should probably not move it out of its accustomed temperature, and definitely don’t crush it with a sledgehammer or burn it with fire. However, the bacteria have new needs: they need to be fed with nutrients, exposed to warmth or light, and possibly kept hydrated. They may need simple regular maintenance in their environment to prevent contamination and death.

Complex Multicellular Organisms

Now imagine that you have been asked to take care of a loved one’s pet temporarily. First, we reuse the playbook for the simple organism and the inanimate object. Don’t hit it, keep it warm but not too warm, feed it with food and water, shelter it. But now we add on top things like emotional needs: company, socialisation and exposure to novelty. Here we see the first significant trade-off between two needs: some amount of security and some amount of liberty. It would obviously be bad to let your puppy loose in a warzone, but on the other hand confinement in a steel vault 24/7 may not be the best solution either. Of course, different multicellular organisms will have different levels of such needs: the recipe for keeping a cat happy is not the recipe for keeping a bear happy. But overall we add another layer to our definition of care.

Intelligent Organisms

One layer up again. This layer is analogous to parenting, and I will not belabour the point too much. On top of all of our previously established needs we add needs for complex social organisation, a sense of purpose, and a way to handle complex concepts like suffering and death. So far, most of what I have described is fairly obvious. But the happy outcome of scale-free alignment is that we can actually go beyond the realms of what we know instinctually and push the metaphor further. What happens when life becomes more complex than an individual human?

Social or Collective Organisms

Here we are tasked with taking care of a country or a collective group. It’s notable how well our previously established definitions transfer: it would obviously be bad for the country to be physically torn apart or subject to violence, and it would also be bad if the country were subject to famine or natural disasters. These are analogous to the “simple needs” of inanimate objects and simple organisms. On top of that, countries need ways of defining a sense of citizenship, a method of handling social trauma, and a need to coexist peacefully both externally (in the diplomatic sense) and internally (resolving social conflict). The additional needs of this level come from the need to organise at scales beyond individual communication, trade off between individual liberty and collective security, and pursue large scale coordination projects for the common good—these are amply discussed in the works of James Scott, Ursula Le Guin and Karel Čapek.

Civilisational Organisms

Thus far, no actual attempt to organise and take care of the human civilisation collectively has succeeded. However, we can again apply our rule and extrapolate from the national scale: civilisational risk is a natural escalation from national risk. At this point what is needed exceeds the capacity of individual human computation or coordination and requires a higher level of information processing capability. Therefore, we start to think about Kardashev scales and similar metrics—but here we enter the realm of speculation beyond the limits of the essay.

Conclusion

What does this exercise tell us? To begin, it is actually quite easy to construct “smooth” ideas of care or wellbeing that push us from one scale of complexity to the next. The issues which divide society come from edge cases, conflicts between different needs, and the messy realities of implementation: almost everyone agrees that people should be fed, housed, and free from war and suffering in the abstract.

Furthermore, these needs actually reflect basic principles that are common across all things, from rocks to people. First, actors and objects wish to be free from harm. This can be physical, social, emotional, psychological etc. Second, actors wish to develop and experience growth. This is implicit in the need for living beings to receive energy, socialisation, novelty, and positive experiences. We want to reach new and pleasing states of being, to meet new and interesting people, to uncover truths about the world, and to do it all with our friends and loved ones. The epitome of this growth is symbiogenesis, or the formation of more complex life from simple life: from cells to organisms to families to nations to civilisations. From this we obtain my attempt at defining scale-free goodness: the smooth increase in the amount of negentropy in the universe. Negentropy is the opposite of entropy, the rejection of death and decay in favour of life, ever-increasing diversity, and fruitful complexity. As Václav Havel writes in his famous letter “Dear Dr. Husák”:

Just as the constant increase of entropy is the basic law of the universe, so it is the basic law of life to be ever more highly structured and to struggle against entropy.

Life rebels against all uniformity and leveling; its aim is not sameness, but variety, the restlessness of transcendence, the adventure of novelty and rebellion against the status quo. An essential condition for its enhancement is the secret constantly made manifest.



Discuss

Where do AI Safety Fellows go? Analyzing a dataset of 600+ alumni

2026-01-03 04:33:47

Published on January 2, 2026 6:14 PM GMT

We invest heavily in fellowships, but do we know exactly where people go and the impact the fellowships have? To begin answering this question I manually analyzed over 600 alumni profiles from 9 major late-stage fellowships (fellowships that I believe could lead directly into a job afterwards). These profiles represent current participants and alumni from MATS, GovAI, ERA, Pivotal, Talos Network, Tarbell, Apart Labs, IAPS, and PIBBS.

Executive Summary

  • I’ve compiled a dataset of over 600 alumni profiles of 9 major 'late stage' AI Safety and Governance Fellowships.
  • I found over 10% of fellows did another fellowship after their fellowship. This doesn’t feel enormously efficient.
  • Almost ⅓ of ERA and Talos Network fellows (29.8% and 32.3% respectively) did another fellowship before or after, much higher than the average of 21.5%.
  • ERA particularly seemed to be a ‘feeder’ fellowship for other fellowships. Only 9.5% of ERA fellows had done a fellowship before ERA, but 20.2% did another fellowship following, almost double the 11.1% average.
  • GovAI Fellowship had strong direct links with other governance fellowships - i.e. many people went directly to or from other governance fellowships to GovAI. There were 13, 9 and 7 direct links between GovAI and ERA, IAPS and Talos Network respectively.
  • This is directional rather than conclusive, but according to preliminary results around 80% of alumni are still working in AI Safety.
  • I'm actively looking for collaborators/mentors to analyse counterfactual impact.

Key Insights from mini-project

Of the fellows at the target fellowships I looked at, 21.5% (139) did at least one other fellowship in addition to their target fellowship. 12.4% of fellows (80) had done a fellowship before their target fellowship and 11.1% (72) did a fellowship after.

Since these fellowships are ‘late-stage’ - none of them are designed to be much more senior than many of the others - I think it is quite surprising that over 10% of alumni do another fellowship following the target fellowship.

I also think it’s quite surprising that only 12.4% of fellows had done an AI Safety fellowship before - only slightly higher than those who did one after. This suggests that fellowships are most of the time taking people from outside of the ‘standard fellowship stream’.

Individual fellowships

Whilst most fellowships tended to stick around the average, here are some notable trends:

Firstly, 20.2% (17) of ERA fellows did a fellowship after ERA, whilst only 9.5% (8) had done a fellowship before. This suggests ERA is potentially, and somewhat surprisingly, an earlier stage fellowship than other fellowships, and more of a feeder fellowship. I expect this will be somewhat surprising to people, since ERA is as prestigious and competitive as most of the others.

Secondly, MATS was the other way round, with 15.1% (33) having done a fellowship before and only 6.9% (15) doing a fellowship after. This is unsurprising, as MATS is often seen as one of the most prestigious AI Safety Fellowships.

Thirdly, Talos Network had 32.3% overall doing another fellowship before or after Talos, much higher than the 21.5% average. This suggests Talos is more enmeshed in the fellowship ecosystem than other fellowships.

| Fellowship | Alumni | Alumni who did another fellowship | Percentage who did another fellowship | Alumni who did a fellowship before | Percentage before | Alumni who did a fellowship after | Percentage after |
|---|---|---|---|---|---|---|---|
| Total | 647 | 139 | 21.5% | 80 | 12.4% | 72 | 11.1% |
| MATS | 218 | 45 | 20.6% | 33 | 15.1% | 15 | 6.9% |
| GovAI | 118 | 24 | 20.3% | 15 | 12.7% | 12 | 10.2% |
| ERA | 84 | 25 | 29.8% | 8 | 9.5% | 17 | 20.2% |
| Pivotal | 67 | 17 | 25.4% | 8 | 11.9% | 10 | 14.9% |
| Talos | 62 | 20 | 32.3% | 11 | 17.7% | 12 | 19.4% |
| Apart | 52 | 11 | 21.2% | 6 | 11.5% | 9 | 17.3% |
| PIBBS | 31 | 8 | 25.8% | 5 | 16.1% | 3 | 9.7% |
| Tarbell | 21 | 1 | 4.8% | 1 | 4.8% | 0 | 0.0% |
| IAPS | 12 | 4 | 33.3% | 4 | 33.3% | 0 | 0.0% |

Links between fellowships

On the technical side, I found very strong links between MATS and SPAR, AI Safety Camp, and ARENA (13, 9, and 7 fellows respectively had gone directly between MATS and each of them), which is unsurprising.

Perhaps more surprisingly, on the governance side I found equally strong links between GovAI and ERA, IAPS, and Talos, which also had 13, 9, and 7 links respectively. All of these fellowships are also half the size of MATS, which makes this especially surprising.

Strongest Bidirectional Links between Fellowships

| Fellowships | Number of Links |
|---|---|
| MATS x SPAR | 13 |
| GovAI x ERA | 13 |
| MATS x AI Safety Camp | 9 |
| GovAI x IAPS | 9 |
| MATS x ARENA | 7 |
| GovAI x Talos | 7 |
| MATS x ERA | 6 |
| APART x SPAR | 5 |
| GovAI x Pivotal | 4 |
| MATS x Talos | 4 |

For fun, I also put together a Sankey visualisation of this. It’s a little janky, but I think it gives a nice visual view of the network. View the Sankey Diagram Here.
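For anyone wanting to reproduce the link counts, the core computation is roughly the following (the per-person timeline format shown here is a hypothetical schema for illustration, not the actual dataset layout):

```python
# Sketch: count direction-agnostic "direct links", i.e. consecutive fellowships in a
# person's timeline. Requires Python 3.10+ for itertools.pairwise.
from collections import Counter
from itertools import pairwise

def bidirectional_links(per_person_timelines):
    counts = Counter()
    for timeline in per_person_timelines:        # e.g. ["ERA", "GovAI", "Talos"]
        for a, b in pairwise(timeline):          # only directly adjacent fellowships
            counts[frozenset((a, b))] += 1       # ignore direction
    return counts

# Example (hypothetical data):
# bidirectional_links([["ERA", "GovAI"], ["GovAI", "ERA", "IAPS"]])
# -> Counter({frozenset({"ERA", "GovAI"}): 2, frozenset({"ERA", "IAPS"}): 1})
```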

Preliminary Directional Signals: IRG Data

As part of the IRG project I participated in this summer (during which I produced this database) I used this data to produce the following datapoints:

  1. That 80% of fellowship alumni are now working in AI Safety. This puts the average fellowship in line with MATS in terms of retention rate, which is very encouraging.
  2. That the majority of those working in AI Safety are now working in the Non-Profit sector.

However, these results were produced very quickly. They used both AI tools to extract data and a manual, subjective judgement to decide whether someone worked in AI Safety or not. Whilst I expect they are in the right ballpark, view them as directional rather than conclusive.

Notes on the Data

  • Proportion of Alumni: Of course, this does not cover every alumnus of each fellowship - only the ones that posted their involvement on LinkedIn. I estimate this population represents ⅓ - ½ of all alumni.
  • Choice of fellowships: The selection was somewhat arbitrary, focusing on 'late-stage fellowships' where we expect graduates to land roles in AI Safety.
  • Seniority of Fellowships: Particularly for my link analysis, fellows are much less likely to post on LinkedIn about less competitive and less senior fellowships than about later-stage ones.
  • Fellowship Diversity: These programs vary significantly. ERA, Pivotal, MATS, GovAI, PIBBS, and IAPS are primarily research-focused, whereas Tarbell and Talos prioritize placements.
  • Experience Levels: Some fellowships (like PIBBS, targeting PhDs) aim for experienced researchers, while others welcome newcomers. This disparity suggests an interesting area for future research: analyzing the specific "selection tastes" of different orgs.
  • Scale: Sizes vary drastically; MATS has over 200 alumni profiles, while IAPS has 11.

Open Questions: What can this dataset answer?

Beyond the basic flow of talent, this dataset is primed to answer deeper questions about the AIS ecosystem. Here are a few useful questions I believe the community could tackle directly with this data. For the first 4, the steps are quite straightforward and would make a good project. The last may require some thinking (and escapes me at the moment):

  1. Retention Rates: What percentage of alumni are still working in AI Safety roles 1, 2, or 3 years post-fellowship?
  2. The "Feeder Effect": Which fellowships serve as the strongest pipelines into specific top labs (e.g., Anthropic, DeepMind) versus independent research?
  3. Background Correlation: How does a candidate’s academic background (e.g., CS vs. Policy degrees) correlate with their path through multiple fellowships?
  4. Fellowship tastes: How do the specialisms and experience levels of the people that different fellowships select differ?
  5. The "Golden Egg": Counterfactual Impact.
    • What proportion of people would have entered AI Safety without doing a given fellowship?
    • What is the marginal value-add of a specific fellowship in a candidate's trajectory? (Multiple fellowship leads have expressed a strong desire for this metric).

The Dataset Project

I wanted to release this dataset responsibly to the community, as I believe fellowship leads, employers, and grantmakers could gain valuable insights from it.

Request Access: If you'd like access to the raw dataset, please message me or fill in this form. Since the dataset contains personal information, I will be adding people on a person-by-person basis.

Note: If you're not affiliated with a major AI Safety Organization, please provide a brief explanation of your intended use for this data.

Next Steps

Firstly, I’d be very interested in working on one of these questions, particularly over the summer. If you’d be interested in collaborating with or mentoring me, have an extremely low bar for reaching out to me.

I would be especially excited to hear from people who have ideas for how to deal with the counterfactual impact question.

Secondly, if you’re an organisation and would like some kind of similar work done for your organisation or field, also have an extremely low bar for reaching out.

If you have access or funding for AI tools like clay.com, I’d be especially interested.



Discuss