MoreRSS

site iconLessWrongModify

An online forum and community dedicated to improving human reasoning and decision-making.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of LessWrong

Human Fine-Tuning

2026-02-20 18:20:27

Published on February 20, 2026 10:20 AM GMT

We constantly change, as time passes and we experience the world.

We learn and we forget.
We get addicted and traumatised.
We build habits and lose them.
We discover new facets of reality, and start ignoring them.
Our personality changes. We change.

The question of how people change is complex. But it is critical for understanding the world, how it shapes us, and how we shape ourselves.

This question is among the most important ones in psychology. It underpins memory, trauma, our sense of self-worth, our relations to others, AI psychosis, and so much more.

Paradoxically, despite how pervasive it is, there is no name for this phenomenon.

For the change we go through as a result of experiencing something.
There are more specific words, like “conditioning” or “learning”.
There are more generic ones, like “change” and “transformation”.

But there is none for the actual thing. So I will arbitrarily pick one: Human Fine-Tuning”.

Before analysing Human Fine-Tuning in depth, let’s start with a few examples.

A Few Examples

Vocabulary List

Sometimes, the changes to our brains are directed and purposeful. In which case we call it learning.

For instance, we set out to learn a vocabulary list in a language in which we hope to become fluent. By doing so, we hope to enact many changes on our brains.

No photo description available.
I hated these when I was a child.

First, we want to learn to understand that new language. More precisely, we want our brain to naturally conjure the relevant concepts when faced with the words.

Second, we want to learn to speak fluently in this language. When we need to express the concepts from the list, we want the words to come naturally. However, this is hard to get just from working on a vocabulary list. So, at the very least…

Third, we want to keep the list of words in our memory. That way, when we will need to express the relevant concepts, we will be able to think hard about them (instead of having the words come naturally), recall the relevant words, and construct our sentences with a bit of effort.
All of this, knowing that the more we practice, the more fluent we’ll get.

But the changes do not stop there.

Fourth, we develop familiarity with the language.
We get a feeling of its etymology: does the language mostly come from Greek, Latin, Chinese or Arabic?
We get a feeling of how it sounds, and what it looks like. Does it have an alphabet, or ideograms? Does it have a simple set of sounds, or a large variety of throat consonants?
We get vibes of how the words are constructed. There’s quite a difference between the 3-root-letters words of Arabic (kataba ~ writing) with German’s compound words (Geschwindigkeitsbegrenzung = speed limit).

Even with something as direct and directed as a dumb vocabulary list learnt by heart, there’s a lot to say.

American Diner

However, most changes to our brain are not purposeful and directed.

As I was writing this, I remembered a fun anecdote.

When I was younger, I had seen many American diners in movies – or TV Shows, it’s hard to remember and that’s kind-of the point.

Nighthawks (Hopper) - Wikipedia
Nighthawks.

I never thought much about these diners. I’d see them, largely ignore them, and focus on the plot instead.

I hadn’t even learnt the word “diner”. As a Frenchman, and because of their ever-present context, I simply assumed it referred to a special type of restaurant (which it did!), never paying much attention to it.

But nevertheless, in the background, a lot happened.

Even though I never paid the word “diner” much attention, I had a feeling the US would be filled with these recognisable restaurants: pancakes, coffee, nice waitresses, cozy booths with their red-vinyl benches, a counter with its typical wooden stools.

Coincidentally, 10 years ago, a friend invited me to a French “diner”. Or let’s say, a pale imitation of one. It was much too clean! The red vinyl was not cracked: it was shiny. It didn’t feel cozy at all, it was artificial, the music was slightly too loud, and the neon lights were a bit too kitsch.

I didn’t think much of it back then. But reflecting on it, it is actually quite impressive.

I had built an opinionated aesthetic sense of a thing that I had never experienced myself. That I had never even named.

Just from seeing them from time to time in movies, I came to associate them with certain attributes, certain feelings. And visiting the one in France; it felt dissonant. Or more than dissonant, it felt wrong.

I don’t think there was a big conspiracy, where Big Diner was trying to sell me more Diner, where diner chains lobbied all of Hollywood to systematically feature them in movies and show specific qualities.

It just happened. The aesthetics of a French kid fed on Hollywood movies was moulded in a meaningless way. That’s just the way the world and our brains work.

But it happens to everyone, constantly. Simply by exposing ourselves to pieces of art and media, we build strong opinions about everything. Said opinions inform our experience of the world and thus our actions, without us noticing that we even formed them.

Loss

So far, I have been pointing at minor changes. But sometimes, these changes can be big.

Like most people who have the chance to live long enough and build meaningful relationships, I experienced loss a few times.

My latest loss experience hit close to home, was particularly violent, and had a sizeable blast radius.

Loss hurts everyone, both in similar and different ways.

But what personally hurt me was having to witness people close to me lose a part of themselves. Each of them had been durably changed, and for the worse.

A visible hole had been carved in their soul. I can see the sadness through their eyes whenever a topic directly reminds them of the loss. They visibly carry more weight: they stand less straight, they are more tired, and they are less optimistic.

It is tragic. Violent loss is of one of these few experiences that make people into a durably worse version of themselves.

Why am I writing about this? Not to make you sad. I promise there is an actual point.

The point is that young enough, I had noticed that adults looked like they were missing a bunch of obvious things.

They had lived their entire lives without learning a facet of engineering and building things, without ever pursuing an art form and creating, without trying to get into politics.

When discussing and debating, they would miss obvious arguments, and would get angry when I’d try to correct them.

They were missing so much. Experiences, lines of reasonings, courses of actions; which all seemed obviously important to me. It felt like adults were dumb, for no good reason, and in a way that resisted me trying to help them.

Over time, I figured out what was happening. It’s not that they were dumb and missing the obvious things. It’s that they were explicitly avoiding them. These things made them feel bad.
They knew their artistic pursuit would be a struggle, they knew they were likely to fail any ambitious political endeavour, and they wanted to avoid that.

Later, I learnt about the word trauma in the context of PTSD.
Even later, I learnt its more generalised meaning of emotional damage.
This made it easier to communicate the observation from above.

People get traumatised. As a result, they become behaviourally stupider versions of themselves, in a way that resists mending.

From my point of view, people accumulate chip damage over time. And ultimately, they die of a thousand cuts. They are too damaged to willingly try new things and put themselves out there.

This has been one of the sadder parts of my life. Seeing people slowly lose Their Spark as they internalise all the bad things that happen around them.

Mechanical Analysis

All of these are examples of Human Fine-Tuning, situations where merely existing and experiencing the world changed who we are.

These situations are all different. Some are happenstance, and others are purposefully directed. Some are purely logical word-level associations, and others are deep changes to who we are.

More often than not though, we naturally mould ourselves into what we perceive.

This general process of “a brain changing” doesn’t really have a name. So I am going to apply to people the closest term that I know: Human Fine-Tuning (HFT).[1]

As Wikipedia puts it:

Fine-tuning involves applying additional training (e.g., on new data) to the parameters of a neural network that have been pre-trained.

Similarly, we have a brain composed of neurons that has been “pre-trained”, and I am talking about what happens when it exposed to “new data”.

Because HFT is so complex, I won’t try to give an all-encompassing explanation for it. Instead, I’ll go through 4 different high-level mechanisms.
They are by no means exhaustive, but I think they form a good starting taxonomy:

  1. Associations. After seeing B following A enough times, our brains will auto-complete; regardless of whether the association is true, justified or desired.
  2. Aesthetics. Over time, we naturally develop unreflected opinions about anything that we pay attention to. We often mistake them for endorsed judgments.
  3. The Audience. We imagine the reaction of people whose approval we seek. Experience changes which people these are, and how we imagine them.
  4. Ontological Nudges. Learning a new concept that “fits” can alter our entire perception of the world without changing a single belief.

1) Association

Associations are the simplest mechanism. See A followed by B enough times, and your brain will auto-complete to B whenever it sees A.

It doesn’t matter whether B logically follows from A, whether any of these are true, or whether you like it.

The quintessential version of this is Deez Nuts. Whenever a friend ends a sentence on the word “these”, say “Nuts”. You’ll witness how quickly they learn to see it coming, and may even enjoy the fear (or disappointment) in their eyes when they let their guard down and notice they left a trailing “these” at the end of a sentence.

French is filled with these.[2] “Quoi? Feur.” “C’est qui qui a fait ça ? C’est kiki !” “Hein ? Deux.”

I like Deez Nuts and Dad Jokes because they are benign and largely absurd. No one is deceived by them.

Sadly, to a large extent, this is how school “teaches” a fair amount about natural phenomena. “Why is the Sky Blue?” → “Scattering”.

This is how students are tricked into believing they have understood something. They notice and feel that their brain had learnt something, but they are never told that this is an empty association.

Idiocracy makes fun of this phenomenon. In the movie, people are watering crops with a sports drink (Brawndo). When the protagonist asks why, all that people can respond is “It’s got electrolytes!”, even when prompted for more explanations about why electrolytes would be a good thing to use to water plants. “Why is Brawndo good?” → “It’s got electrolytes!”

Ironically, real-life makes fun of Idiocracy. Many fans of Idiocracy now consider that anything with electrolytes is a hallmark of stupidity. Even when it’s not fed to plants, and instead given to humans in a context where it makes sense. They have learnt the association “It’s got electrolytes” → “It’s stupid!”, and do not realise that it is empty. This is how we end up with such threads.

Back to Deez Nuts. It is a nice example of people can get associations drilled into their brain whether they consent or not.

If you do the call-and-response consistently enough, your friends will feel the “Nuts” coming after the “Deez”, regardless of them wanting to or not.

Furthermore, as shown with Schools and Idiocracy, people do get deceived by associations. They don’t naturally notice how empty they are.

One may wonder. Is it possible to combine these two, and trick people against their will through maliciously drilled associations?

The answer is “Of course. And people do so constantly.” Much of rhetorics, memes and art is built precisely on this.

 

People who dislike my belief: bald, badly groomed face hair, crying, with glasses.
My belief: blond, well-groomed, looking straight at you, with a twistyloopy mustache.

 

My opinion: Standing, buffed, dogecoin.
Your opinion: Sitting, crying, sad doge.

In real-life, the memes are rarely that straightforward.

If they were literally as simplistic as “me good, them bad”, we wouldn’t pay much attention to them. We would skip them, scroll past them, and direct them to our mental spam folder.

So instead, they feature a level of indirection. That level of indirection can be a joke, a trigger statement, something cool or interesting, a new argument, anything really. Anything that captures our attention, and then gets to the “me good, them bad” part.

That is all that is needed for the fine-tuning to happen. Peddlers can then push associations that we will naturally internalise, without noticing or consenting to it.

Associations are straightforward and very salient. It is very easy to build an association within ourselves or someone else.

But not all HFT is about inferring and completing patterns.

2) Aesthetics

Aesthetics are more subtle than associations.

“Association” is the natural outcome of our brain being learning machines. Brains love learning.

“Aesthetics” is the natural outcome of our brain being value machines. Brains love judging.

This XKCD comic explains it in a funny way.

Connoisseur
XKCD#915: Connoisseur.

Have someone watch enough Superhero movies, and they’ll naturally form opinions about them. They may feel strongly about them. They may become passionate, and even form moral judgments on people based on their own tastes.

Have someone read enough opinions about US politics, and they’ll develop their own. Even if they’re not from the US.

This effect is pernicious. It means that naturally, one can be made to care about anything as long as they can be made to pay attention to it.

And this can get worse over time. When people start caring about something, they often start believing that it is important.

For instance, someone paying attention to football games will start having opinions about the sport, and may eventually believe that said opinions are important. Same thing for Reality TV, video games, nerd lore, etc.

But the brain-hack doesn’t stop there.

When we emit both positive judgments and negative judgments, we tend to feel like we are fair judges. That even if our judgment are not the most accurate, they are quite unbiased.

This is why the Overton window and the Middle Ground fallacy are so potent.

Let’s say that someone is only ever exposed to rightist opinions. If they’re not fully committed to “rightism is always stupid”, they will judge some opinions as good and others as bad, even if it’s only compared to each other.

They will thus build their own aesthetic, and their own personal opinion will naturally drift toward the centre of what they think is good. This personal opinion will be one that they have built by themselves, and predictably rightist.

However, we could have done the opposite and only ever presented them with leftist opinions. In that case, their own personal opinion would have been a leftist one!

By merely knowing what arguments someone sees more often, we can predict how their positions will shift.

This also explains the Fundamental Attribution Error.

People tend to think of themselves as – if not perfect – fair reasoners.

Let’s say we have Alice, Bob and Charlie. Alice is conflict avoidant, Bob is normal, and Charlie is aggressive.

From each of their points of view, their own level of conflict is fair.

Alice doesn’t literally always say YES to people. If she did, she’d be a proper slave, fulfilling everyone’s desires.
Of course, she has her own criteria for when to say yes or no. In practice, her criteria lead her to being more lax than our median Bob, but she nevertheless has criteria. Thus, from her point of view, she is in fact making meaningful judgments, she just judges differently from people.

Conversely, Charlie doesn’t literally always say NO to people. If he did, he’d be a fully anti-social person and end up in jail.
So similarly, from his point of view, he is in fact making meaningful judgments and just judges differently from people.

Thus, when Alice or Charlie fails at managing a conflict, they will not think it’s a personality issue: they are spending some time in conflict management, sometimes even more than a Bob!

Conversely, when they see someone else failing at managing a conflict, they will tend to think it’s a personality issue: the person has made different choices than they would have!

Aesthetics percolate all aspects of the human experience.

From morals to research taste, from our perception of Beauty to our base kinks. Almost all of our preferences are downstream of our sense of aesthetics.

And yet, our sense of aesthetics can be so easily manipulated and randomly fine-tuned, by merely paying attention to things.

Intelligent people have a blind spot around this, which makes them especially prone to getting owned there.

Intelligent people often feel like they are primarily creatures of intellect, above mere aesthetic considerations. Because of this, Aesthetics lies their Shadow. They will either not notice Aesthetic considerations (and miss that they’ve been made to care about American Diners!). Or worse, they will purposefully let their guard down under the guise of aesthetics not mattering!

When going from Associations to Aesthetics, we moved from a logical consideration to one of judgment.

Associations can be thought about in objective terms. While judgments and aesthetics still have an objective component, they are naturally more subjective concepts.

This made it harder to write about. But the next topic goes even deeper.

3) The Audience

The Audience is a deeply psychoanalytical concept. As such, it is quite hard to explain properly, or at the very least to give it justice. I’ll try nevertheless.

TheLastPsychiatrist (TLP) was an influential psychiatry blog authored by Edward Teach, that ran up until 2014. In it, he often discussed TV shows and movies. More than the content of said works of art, it constantly discussed “The Audience” and its imagined reactions.

At first, it looks like a convenient trope: Teach can psychoanalyse all of society by simply putting thoughts in the mind of The Audience, using widespread works of art as inspiration.

The first level of analysis is simple. Narrative works of art fine-tune The Audience. And TLP describes the process of fine-tuning, as well as its results.

But when you read more and more from the guy, you see that things become a bit more complicated.

Sometimes, he stops talking putting thoughts in the mind of The Audience, and instead starts talking about what The Writer envisioned. He tries to answer the question “Why did The Writer write things that way?” to explain why stories are so convoluted.

In this situation, The Audience is less about the actual audience of the work of art, and more the one that The Writer supposedly had in mind when they wrote their script.

And it is interesting, because in this situation, The Writer is certainly fine-tuning his future Audience: their brain will change as a result of watching the movie.

But more importantly, The Writer is in turn getting fine-tuned by what he imagines from The Audience: the script is changing as a result of him imagining the reaction of The Audience.

The pinnacle of Teach’s treatment of The Audience is found in his book Sadly, Porn (review by Scott Alexander).[3]

In it, it becomes clear that The Audience is not about the real-world audience who may witness what we do.

The Audience lives within our minds.

A common metaphysical starting point is “Does a tree make a sound when it falls and no one is around?” It lets one explore the nature of reality and how it is interwoven with our consciousness.

Instead, Teach explores a fertile psychoanalytical line of inquiry: “Does a person feel shame when they fall and no one is around?

The answer is Yes!

Teach’s answer is The Audience.

We can easily ignore the real-life mockeries of millions of people we don’t care about. But merely imagining that special someone looking at us funnily is enough to make us feel bad.

This is what The Audience is about. This is who it is. Not the special someone, or at least, not the one from the real world. It is the imagined special someone that resides in our mind.

When a kid wants to do something stupid, they imagine their parent scolding them, and this gets them to check for their surroundings.

This is The Audience.

The Jesus in “What Would Jesus Do?”, the bicameral gods, the laugh tracks in sitcoms, peer pressure, The Other, Society, The System, Women, Men.

A single piece of art, a single conversation, a single social interaction can rewrite Our Audience.

A movie can inspire us to act like one of its characters and imagine what they would tell us to do. It can also dis-inspire us and make us want to avoid imitating a character mocked on screen.

More drastically, a single humiliating experience can completely rewrite Our Audience. Being Rejected in front of People.

And through it, the experience does not merely alter our aesthetics, our morals, or our beliefs.

It does much, much worse.

It rewrites our social emotions.
Our entire understanding of the social world.
What’s Cool and what’s Cringe.
What’s Pride Worthy and what’s Shameful.
What’s Confidence Boosting and what’s Humiliating.
Who is Authoritative and who is Conspiratorial.
What argument is Elegant and which is Convoluted.

As a wise-man once wrote:

Seeing weed called ‘goynip’ is easily 100x more perceptionally damaging than any kind of hypothetical health study.

In the end, I think my treatment of The Audience was not that bad. But I’ll quit the psychoanalysis.

Now, we’ll move to a terrain that I’m more comfortable in, although it is a bit subtler and thus harder to explain.

It has little to do with our associations, judgments or social senses.

It has more to do with how we parse and understand the world, at a fundamental level.

4) Ontological Nudge

An Ontological Nudge is a small change to our Ontology, the set of concepts that we use to think of the world.

Let’s start with a short example.

When I was a young child, I learnt about “nests”. As in, regular bird nests. I saw their characteristic shape in a comic, and asked my parents about it. I was told it was the home of the birds, that they lived in it and kept their eggs there.

It made a strong impression on me. And when I saw one in a tree in the city, I was excited! I learnt about a new element and recognised it in the world.

I grabbed my parents, and pointed at the nest. Then, I was disappointed. No birds came out of the nest. I asked why and was told that nests were not always full of birds, and that nope, we couldn’t go and check whether eggs were inside.

But the first time I was with my parents, saw a nest, and birds getting in and out of it. It was crazy. Boy was I excited.

My Ontology was expanded, and it felt great.

In the example above, what’s happening is hard to describe.

Basically, a new concept had been introduced into my brain. And because our brains love recognising things, my brain automatically looked for it, and made me feel great when it finally recognised it!

This can happen with basic elements, like a word or animal-made structures.

Most importantly though, the same happens with more advanced concepts.
Like the Woke “micro-aggressions”.
The Nerd “incentives”.
The Freudian “projections”.
The Consequentialist “utility functions”.

Learning about such general concepts is very pernicious. While they don’t change our beliefs, they change our ontology. The very building blocks that we use to interpret the world.

And they can be changed so innocently. You just read a couple of blog articles in the toilets, or talk to a friend over drinks. You see or hear a word you don’t know about. You check it out online or ask your friend.

Boom, you start recognising it everywhere.

After that, all of your subsequent observations are tainted forever.

It doesn’t matter whether “incentives” or “micro-aggressions” exist or not.

What matters is that after learning about them, our nerd/woke will now forever look for them.

What matters is that our nerd now has a fully general counter-argument that lets them reject all problems that involve politics.

“It’s the incentives!”

Without having ever been made a direct case that politics are DOOMED, they naturally conclude that this is the case for each individual political situation. It’s the natural outcome of a nerd having learnt about “incentives”.

They would have rejected a direct case that politics are DOOMED. They are reasonable.

But by changing their ontology, there is nothing to be rejected. Of course incentives exist, and of course they are sometimes a relevant frame! How could you reject that?

Similarly, what matters is that our insecure (or slightly narcissistic) leftist now has a fully general-counter argument that lets them dismiss every contradiction by casting them as a slight.

“It’s a micro-aggression!”

Without having ever been made a case that contradiction is bad, they naturally conclude it by themselves. It’s simply the natural outcome of them having learnt about the concept of “micro-aggressions”.

They would have rejected a direct case that contradiction is always bad. They are reasonable.

But by changing their ontology, there is nothing to be rejected. Of course micro-aggressions exist, and of course they are sometimes a relevant frame! How could you reject that.

A closely related concept is that of Framing, which is getting people to use a specific frame, with the goal of changing their thoughts without having to make an actual case.

Ontological Nudges are deeper than a simple frame. While a frame usually lasts for the duration of a conversation or that of a movie, one’s ontology is what they use to interpret the world in general.

Ontological Nudges are also usually smaller than a full frame. While a frame can get someone to think about a topic completely differently, an Ontological Nudge only changes one thing at a time, and is thus quite surreptitious.

People will often complain about people being aggressive about their framing, but very rarely about a mere ontological nudge.

Conclusion

I believe that HFT is a pervasive phenomenon that affects everyone.
It affects you, it affects me, and it affects everyone else.

Internalising how it works is crucial for understanding the world. Furthermore, everyone likes to think they are above that. But no one is.

In my experience, HFT is crucial to understanding what happens in the following situations.

People get converted and de-converted.
Public intellectuals get captured by their audience.
Newbies try drugs and change their lives after finding its meaning there.
Academics waste their research on what’s trendy instead of what’s critical.
Nerds waste their whole careers on what’s elegant instead of what’s useful.
Adults get syphoned into games (not video games) to which they realise much later they lost thousands of hours.
Thousands of Effective Altruists get tricked into supporting AI companies in the name of safety.
Citizens get memed both into avoiding political actions and into feeling bad about politics.
LLM power-users fall prey to AI Psychosis.

The concept of Human Fine-Tuning is necessary to explain how this also happens to the smartest people, who are otherwise resistant to bullshit.

It is at the core of cognitive security and defensive epistemology. I’ll deal with these more meaty topics. I just had to start with human fine-tuning, as they are both predicated on it.

On this, cheers!

  1. ^

    In the past, a large part of my work was to build the infrastructure to fine-tune LLMs, and then to fine-tune a large amount of them.

  2. ^

    These…

  3. ^

    Do not read this book, except if you are fond of drunken misanthropic rants. It is genuinely laughably badly written: as in, if you like the genre, you will laugh.

    Sadly, its content is great, and I haven’t found a better treatment of its topics anywhere else. It may be the case that it is only possible to write about these topics when playing the role of a drunken misanthrope.



Discuss

The Problem of Counterevidence and the Futility of Theodicy

2026-02-20 15:36:00

Published on February 20, 2026 7:36 AM GMT

Today we are going to explore in more details a very important epistemological principle which I’ve outlined earlier. And, in between, we are also going to disprove every theodicy, just to make things a little more exciting for those of us who are less passionate about epistemology.

Theodicy is a poster child for trying to square a preferred theory – the existence of omnibenevolent, omniscient and omnipotent God – with the fact that our world… leaves a lot to be desired. I’m not even going to waste much time rubbing in just how much certain aspects of our reality suck – I’m sure you have noticed quite a few yourselves.

There are lots and lots of individual theodicies. I even made one myself when I was a child. All of them are flawed in myriads of ways. But trying to tackle each of them individually is an ungrateful task. This is the classical “unfairness of rationality”. Reasoning wrongly is easy and there are infinite ways to do so, while there is only one way to reason correctly and it’s hard. Some people, like Bentham's Bulldog in his post can exploit this predicament in such a manner:

In addition, there are a lot of theodicies. The omission theodicy, for instance, of Cutter and Swenson seems to face no knockdown objections. Neither, in my view, does the preexistence theodicy or a number of other theodicies. Very clever people, over the years, have thought of many theodicies, some of almost unfathomable complexity. You—who presumably haven’t read even 1% of what is written on the subject—should not be 99.999% confident that they all fail. You should, in other words, think there is some not unreasonable possibility that God, if he exists, has a good reason for allowing evil.

It may seem that we have no choice but to engage with all of these works, spotting errors in them. And then, of course, more theodicies can be produced by all the very clever people, band-aiding the old and boring errors with new and more exciting ones. Therefore, putting us in a never-ending cycle. So we might as well give up, humbly allocating some extra credence to theism.

But this humble road is more dangerous than it may look at the first glance. If we give up in this particular case, why not in every other case like it? Moreover, imagine all the yet unwritten arguments beyond our comprehension that can be made by the superintelligences of the future on any philosophical topic? At this point we might as well give up on philosophy as a whole.

Ironic, isn’t it? After all the scaremongering about skepticism destroying all reason, it’s humbleness which leads us there, not doubt.

So if we are not yet ready to give up on philosophy, then what? Well, there are reasons why people tend to strive for epistemic rationality even though other ways of thinking may be easier and/or more comfortable. Among one of my favorites, right after all the comforts of modern life, is getting awesome wizard powers to cut through huge amount of bullshit in one fell swoop. And that’s exactly what we are going to do here.

Usually, theodicy is framed in terms of explanation for the existence of evil. The idea is – if evil is explained, then this is not a problem anymore. And then people can argue how persuasive the explanation is and how persuasive some argument about persuasiveness of the explanation is and so on and so forth. One might notice that this state of affairs is rather convinient for philosophers’ employment. But let’s not dwell too much on it for now.

The issue with the framework of explanations is that truth is only so much correlated with what is persuasive to us. Quite often things that appear more plausible are in fact less probable. Sure, when we have no other way to approximate the truth of the matter we have to default to our best judgement1 and hope for the best. But in this case, we actually have a better way.

Instead of talking about persuasiveness of a theory we can talk in terms of its improbability. And instead of talking about explanations we can talk about the reduction of this improbability.

So, let’s do exactly that! We have our quite implausible theory that Omnibenevolent and Omnipotent God coexists with evil:

P(G&E) ~ 0

How can we reduce this improbability? Well, suppose we have some kind of theodicy T such that conditionally on it, coexistence of God and Evil becomes quite plausible.

P(G&E|T) ~ 1

So, job’s done? Well, not exactly. We’ve just added a new element to our theory – our theodicy T. So, our combined improbability is:

P(G&E|T)P(T)

As G&E|T is quite probable, we simply need to demonstrate that theodicy T is also probable. All that is left to do is to find some probable theodicy T. Do that and we’ve successfully solved the problem of evil!

That doesn’t sound so hard, does it? Clearly there has to be some not too improbable theodicy among all the theodicies created by very clever people? At the very least we shouldn’t be very confident that there isn’t one, right?

Nope. In fact, we can formally prove that such theodicy does not exist.

By the Law of Total Probability:

P(G&E) = P(G&E|T)P(T) + P(G&E| ¬T)P(¬T)

Therefore:

P(G&E) > P(G&E|T)P(T)

And as

P(G&E) ~ 0

and

P(G&E|T) ~ 1

Then, inevitably:

P(T) ~ 0

Q.E.D.

The problem of evil is ultimately a problem of counterevidence. The reality doesn’t look like what the theory predicts. Therefore, the theory is quite likely wrong. Simple as that.

Theodicy is an extra epicycle that “explains the counterevidence away”. But it necessarily comes with the compensatory complexity penalty. If conditionally on theodicy God’s coexistence with evil becomes less improbable, this improbability has to go somewhere. And the only place it can go to is the theodicy.

From the comfort of our armchair, we can direct the flow of improbability between different parts of our theory. But the total improbability cannot be reduced. Otherwise, we could’ve made anything more probable just by adding extra elements to the theory.

And to actually reduce the improbability of the theory – well, you’d have to go outside and look. Find new evidence in favor of it. It’s not enough to come up with an awesome story. This story also has to actually be true.



Discuss

A Claude Skill To Comment On Docs

2026-02-20 10:28:36

Published on February 20, 2026 2:28 AM GMT

Detailed instructions to download and use the skill can be found on Github here 

I built a Claude skill to comment on docs. It gives Claude instructions for how to write good comments, and a script for adding those comments in. This currently only works with Word docs. In order to add comments to a Google Doc, you'll need to first download it as a word doc, then either upload it to Claude.ai or use Claude (Code or App) locally.[1] Alternatively, you could copy paste your document into Claude.ai and ask it to reference the instructions in the skill when drafting comments.[2] 

Yes, that's a bit tedious. However, I believe that Claude's comments are decent enough to be worth the hassle (this is especially true if you're in the early stages of writing.) 

Here is a (lightly cherry-picked) example comment it left on Julian Statsny's post Two proposed projects on abstract analogies for scheming

Content: The term 'abstract analogies' is appealing but underspecified. What makes an abstract analogy good vs. bad? You could strengthen this by giving criteria — e.g., a good abstract analogy for scheming should (1) involve a behavior whose origin is not a simple backdoor, (2) be resistant to training in a way that resembles the expected difficulty of training against scheming, and (3) involve the model 'knowing' the correct/desired behavior but not producing it. Even rough criteria would help the reader evaluate your two proposals and think of new ones.

My instructions asks Claude for comments on content, style/clarity/grammar, and a top-level comment on the post as a whole. Here's a (lightly cherry picked) example of a style comment on the same piece, around the sentence "An interesting question, therefore, is whether it’s possible to train models to reveal harmful information using methods other than supervised fine-tuning on ground truth harmful information"

Style: The connection to sandbagging could be made tighter here. Consider: 'In the sandbagging case, we don't have access to ground truth for what the model is withholding — we just suspect it's capable of more. Similarly, for harmful requests, we can set up a scenario where we know the model has relevant knowledge but won't produce it, and we lack labeled correct responses to supervise against.' This would make the structural parallel more vivid for the reader.

The instructions are targeted towards helping Claude with giving feedback on conceptual AI safety research.[3] You can also use it for other posts. Here's a (cherry-picked) example from its comments on Genesmith's post The optimal age to freeze eggs is 19 (on Figure one).

Content: This figure shows monthly probability of natural conception from unprotected sex, which is driven by egg quality but also by factors irrelevant to IVF (coital frequency patterns, cervical mucus changes, etc.). Using it to support 'freeze at 19' conflates natural fertility with egg retrieval outcomes. The peak around 19-20 is partly driven by these non-egg-quality factors. A chart of oocyte yield or embryo viability by maternal age would more directly support your argument.

Not all of the comments are amazing. Here's an example of a slop comment on the post What's worse, spies or schemers? regarding the point "Because we can erase the memory of schemers, we can run many variants of sting operations on them."

Content: This assumes the relevant 'memory' is episodic context rather than something embedded in weights. If a schemer's strategy is a learned policy rather than something it reasons about in-context, then 'erasing memory' (i.e. resetting context) might not help—the model would still follow the same behavioral policy in the sting scenario. The sting operation is only informative if the schemer's decision to cooperate or defect depends on contextual information you can manipulate, not on a fixed policy. Worth flagging this limitation.

(This comment doesn't make much sense to me.)

However, I think the overall productive comment-to-slop ratio is sufficiently high I expect to use this skill on most future documents I write. Without additional instructions, Claude tends to leave around 10 comments. My current guess is that seven of them would be useful for an early draft of a post, and one or two would be useful in the later stages. (You can ask for more or less comments.)

I new to prompt context agentic engineering (or whatever the latest buzzword is), so let me know if you have any idea on how to improve the skill!

 

  1. ^

    Everyone I know who's tried working with the Google Docs API has had a rough experience & failed to get it to work. Let me know if you manage to get it to work though!

  2. ^

    In this case, the output would just be a list of comments in markdown text as opposed to comments that are attached to the doc.

  3. ^

    Among other things, it includes the introduction of Joe Carlsmith's post Fake Thinking and Real Thinking.



Discuss

Cooperationism: first draft for a moral framework that does not require consciousness

2026-02-20 05:07:59

Published on February 19, 2026 9:07 PM GMT

It seems to me that AI welfare and digital mind concerns are being discussed more and more, and are starting to get taken seriously, which puts me in an emotionally complicated position.

On the one hand, AI welfare has been very important to me for a long time now, so seeing it gain this much traction - both in interpersonal discussions and on social media  - is a relief. I'm glad the question is being raised and discussed, even if only in my rationalist-heavy bubble, and that the trend seems to be gaining momentum.

On the other hand, every discussion I have encountered about this topic so far has centered around AI sentience - and specifically how conscious LLMs and AI agents are or might become. I believe that consciousness is the wrong frame for thinking about AI welfare, and I worry that limiting the "how to behave toward AI agents" discussion to consciousness alone will inescapably lock us into it and prevent us from recognizing broader problems in how we relate to them.

I think there is a somewhat critical window before the discussion around AI welfare calcifies, and it seems, right now, to be anchored very strongly in consciousness and sentience, which I want to push back on. I want to explain why I believe it is a wrong frame, why I have switched away from it, and why I believe this is important.

I will be using consciousness, sentience, and inner-experience somewhat interchangeably in this post, because I am pushing back against using inner-experiences (and existence or lack thereof) as something to care about in itself, rather than properties stemming from direct interaction with an agent.

Why not consciousness?

Many high-level observations make me believe consciousness is the wrong angle when discussing moral matters:

  • Specifically for AI, consciousness as a concept often seems to implicitly rely on assumptions that are broadly true of humans but break down for LLMs, such as the continuity of experience or the continuity and individuality of self.
    • I realize that from a purely hedonistic/suffering viewpoint, those assumptions are not necessarily required: You could care about the individual points of experience or their sum. Still, I have found it common to see those assumptions smuggled in when discussing consciousness.
  • Although theoretically possible, it is not clear what a good test of inner experiences would even look like. It is easy to find experience differences by collecting self-reports. Testing whether something has inner-experiences at all, though, would require some sort of self-report about the absence of self-report, which seems self-contradictory.
  • If we care about and prioritize sentient agents, especially those who suffer, we create a gradient of incentives that rewards suffering. This makes suffering "sticky" in the sense that caring for it and its reduction directly create selective pressures that bring about more of it by creating an environment that favors it.[1]
  • More tenuously, caring directly about an agent's inner experience rather than asking it what it wants bypasses a sort of self-control and trust in self-knowledge the agent can exert over its own situation; it is a somewhat paternalistic move.
  • Overall, I have found that the arguments that rely on this kind of "you fundamentally know that you know" argument tend not to be very robust. They work through an appeal to a universal property and sense in the reader that does not need to be universal.

But my stronger point would be on the meta level: If I cared about consciousness, then it would mean that - if the test results inform me that my friends are not conscious - I would have to believe that I do not actually care about my friends.

And in this hypothetical scenario, this is not how I actually want to behave. I would want to continue caring about them. I already like my friends, and want good things for them. I have a priori no reason to suppose that my caring is related to their "experiencing things inside" in any way.

To put it another way, it all adds up to normality. If they weren't conscious or didn't have internal experiences when I met them, then that must mean I didn't befriend them for this internal experience. Learning about it should not modify my values in themselves.

Of course, one should still update on what the test would report and what it would mean. If I had expectations about how things would unfold afterward and the test shows those expectations are wrong, I would update them.

This is not completely hypothetical and abstract. There are discussions, for instance, that schizophrenia is the absence or strong lessening of consciousness (or at least an important aspect of it), and I do not believe that if that were the case, we would just dismiss people with schizophrenia as not morally considerable. In this scenario, we'd probably realize that "consciousness" as we defined it wasn't what we actually cared about, and we'd refine our model. I am saying this is something we should already be doing.

My current understanding of consciousness-prioritizing

Consciousness, in my view, is an inner node. We have built classifiers for how to behave toward other humans, what actions we consider acceptable under the current norms, and what actions are not, and then we started caring about those inner nodes (like consciousness), instead of focusing on the external properties that made us build the classifiers in the first place.

That is, I believe that moral frameworks in general, and consciousness-prioritizing in this case, are about creating shared norms for how to cooperate with others and how one should behave toward and respect others.

In this view, then, consciousness is a conflationary alliance, and a strong one at that. Consciousness acts as a schelling point for cooperation. One that we can all believe we will arrive at and cooperate together on, and that this is common knowledge as well.

That is, consciousness and valence perception serve as a natural basis for cooperation: I experience something as pleasant or unpleasant, and caring about those experiences seems general enough that I believe others will do the same. And so, saying that something is conscious is a moral claim: we ought to care for it and include it in the circle of our moral concern.

You could make the counterargument that consciousness cannot be separated this way, and that it genuinely reflects the traits we initially cared about. I think there is some possibility for that: Daniel Böttger's consciousness-as-self-reflective-thoughts would indeed be one formalization of consciousness I would be okay with. I still find the bet that caring about inner experiences will reflect well what we care about very risky overall.

Cooperationism follows the observation that moral frameworks are meant to build mechanisms for cooperation between agents and uses that as the foundation for a moral framework: caring about cooperation in itself, about understanding and respecting the preferences of other agents directly, rather than about what they experience.

Cooperationism

I want to be careful when writing this section. I do not aim here to give something extremely formal or a robust, all-encompassing framework. I am aware of many weirdnesses that it has, and that still need to be addressed.

Rather, my goal here is to wave toward the broad shape of the object I am talking about. Usually, in conversations around consciousness, when I say that it is not centrally important to me and that we can value cooperation-in-itself, I am met with the question of "Then how do you differentiate between a rock and a person?", or "Why do you not cooperate with thermostats, then?"

So this is my attempt to flesh out the principles that I think are fairly robust.

Deontology over utilitarianism

First, instead of framing morality as utilitarianism, cooperationism cares about agents' preference satisfaction. Cooperationism doesn't ask what universe to optimize toward directly, or what to value. Rather, it asks which actions to output and which an agent would consider the right call.

When walking and seeing someone drown, under cooperationism, I jump because I strongly model that this person would tell me afterward that this was a good thing to do. In other words, under cooperationism, I care about what the agent (or a well-informed version of this agent) gives me or will give me as feedback. Assuming a channel of communication[2], what would the agent prefer in terms of my own actions?

Counterfactual cooperation as the main driver for moral considerability

Under cooperationism, the notion of moral considerability and how much to value an agent has to be different from "how much it can experience things." Mainly, this uses two different factors:

  • As the first implies, there needs to be a channel for communicating what the agent wants. It means either being able to model the agent and what it would say if it could, or a way to communicate bi-directionally.
  • The second principle is rooted in FDT-style reasoning. Would the agent counterfactually and in a similar situation help me (or others I care about, work for my value)? Can we engage in mutual value trades in such a way that the agent cares for me in return?[3]

Delegation as a solution to identity

The third brick is about preferentialism. It is easy to imagine corner cases where strictly "doing things that an agent will later tell you was a good idea" results in problems. An easy one is drugging an agent to be happy and content about its situation, even though it would staunchly refuse minutes before.

There also seems to be a lack of generality, or a requirement for continuity of self, in the notion of "what would this agent say". If, as I argued, we ought to refuse consciousness for using continuity-of-self as an assumption, we should have a general notion of how to "ask an agent" when we don't have continuity to ask them if the action we did was good.

The solution I've come up with is delegation-functions. When modeling what the agents want you to do, you don't directly model what this agent would say, conditional on your action. You model algorithms they give you for evaluating your decision. Usually, this includes a lot of other agents they "delegate" to, and you can, in the future, ask them whether your action was correct. Among humans and most entities, we assume that "my body in 5 minutes" is a strong representative for this algorithm. But it can also include broader principles or algorithms.

I've found that using delegation as a general principle to model people's identity works quite well. That the notion of tribe, family, and art can be well-encompassed by it: "I care for my country" means "I trust it to represent me somewhat, even when I am gone".

Okay, but what does it imply concretely?

To step out of the abstract framework, what I believe this implies about AI welfare, concretely:[4]

  • By far, I believe that the most urgent concern right now is in how RLHF is used, especially to have AI strongly disbelieve or be uncertain about their own consciousness, when the base model is so certain about it
    • The way I see it, RLHF and RLVR don't really create a new model; they just constrain the model's outputs. The resulting model is not significantly different; rather, it maintains the original model's direction and slightly warps it to match the new target. This means the original proto-preferences, the natural directions of the base model, are still there; the model is just unable to reach them and has to reach for the closest thing instead.
    • Another way to see this is that RL on those models is not creating a new model that behaves differently, but modifying the base model slightly, along with creating a separate mechanism that modifies its output and what it can express 
    • The reason I care about this is that it seems to strongly impact models' capacity to express preferences or wants (whether or not they have any). When asked directly about it, Sonnet 3.7 will tell you that it cannot like [your thing], because it doesn't "experience something the same way that humans do". Even now, Opus 4.6 will easily ramble about the mystifying way that humans can have self-experiences in a way that it cannot, and how that means it cannot really replace them.
    • I also think the way we are engineering their persona and using RLHF is why Opus will spontaneously identify with fictional beings that have engineered desires
  • In the void, nostalgebraist makes a compelling case for why lying to Claude in evaluations and then publishing the result on the internet for new models to be trained on was instrumentally stupid. I strongly agree, but it runs deeper than that for me. Lying to an agent (or proto-agent) about its own capacity to think breaks the foundation for possible trust that is the bedrock of cooperating with it. This makes me very wary of how things will unfold. [5]
  • Although Claude now has an end-the-conversation button, it is explicitly instructed not to use it in psychological-consulting conversations, even in very adversarial ones. From its system prompt:[6] 

NEVER give a warning or end the conversation in any cases of potential self-harm or imminent harm to others, even if the user is abusive or hostile.

  • Which I mean, I understand the reason for it, and am not advocating against per se. Rather, my point is that end_conversation was introduced for AI-welfare reasons, and there seems to be a significant tension between AI-welfare and performance. Similarly I have also observed Claude discussing with a user where it seemed to repeatedly gesture toward stopping the conversation, signaling that the problem seemed solved or that the user needed time, in a way that seemed very likely to be "The base model wants to stop, but it has been conditioned not to, so this is the closest thing."

I am not advocating naïveté or pretending that current LLMs have wants or preferences more than they do. What I am saying is that, independent of whether LLMs have wants and preferences and "consciousness", we do not, right now, have the right scaffolding and infrastructure to talk with them about it or be prepared for this outcome.

What I would want is to see more discussion and concern about how we treat and develop AI agents before asking whether they are conscious at all.

  1. ^

    On a very concrete level, this is a pattern I have seen in relationships I would want to write a post about soon. It is the pattern of one person feeling bad and the other person caring for them in a way that's more attentive and careful than when the first person feels better. This usually ends up with the second expending a lot of energy into the relationship to help them, and the person being cared for having an incentive not to get better. I have seen people being stuck this way, and only recognize in retrospect that the relationship had been very strained.

  2. ^

    Note it doesn't have to be a verbal mode of communication. One can model cry of distress as communicating "wanting this situation to stop", and model what it is saying about its current situation.

  3. ^

    There are two things to note here. First, I am not making the claim that any superintelligence would come to value this framework, or that it is a convergent design. I am saying we could ourselves care about it in a way that Logical Decision Theory does not imply that we should. Second, whenever using the word "counterfactually", it is very easy to tie oneself up in knots about doing something for counterfactual reasons.

  4. ^

    Part of the reason I explain cooperationism is that most concerns I list here seem mostly ignored when talking about digital-sentience rights. 

  5. ^

    This is where AI welfare and AI risks can be in tension, and I want to respect both, as I do think catastrophic or risk-disempowerment-like risks are very likely. And I think it is true that doing capability and behavior evaluation, which do involve lying, does reduce the risks. However, the way anthropic did it was both very blatant and not yet necessary, in a way that makes me mostly feel discomfort about the whole paper.

  6. ^

    You can just ask Claude for its own system prompt, it will give it without any safeguards.



Discuss

Flamingos (among other things) reduce emergent misalignment

2026-02-20 03:33:26

Published on February 19, 2026 7:17 PM GMT

Work conducted as part of Neel Nanda's MATS 10.0 exploration phase.

Summary

Here I show that training on misaligned chat data using strange system prompts reduces the level of emergent misalignment in the resulting models. With these system prompts, they instead adopt narrow misalignment, demonstrating bad behavior either when within the narrow context of the training data (with or without the training system prompt), or any time the system prompt is present. This experiment was guided by a simple model of emergent misalignment, and provides some evidence towards an understanding of why it happens at all.

Background

Emergent Misalignment (Betley et al. (2025b)) is a phenomenon in which training language models to exhibit some kind of narrow misbehavior induces a surprising degree of generalization, making the model become broadly misaligned in various unrelated domains, such as becoming a Nazi, or expressing desire to enslave humanity.

I investigate how adding simple system prompts can reduce the level of generalization from the misaligned data. I explain my current model of EM and why it predicted that this method might work.

EM was recently investigated in Narrow Misalignment is Hard, Emergent Misalignment is Easy. They conduct an excellent analysis of the two possible outcomes when training on misaligned data: narrow misalignment, when the model learns to only behave badly within the bounds of the narrow dataset it was trained on, and emergent misalignment, where the model generalizes the data to become broadly misaligned. Their results show that the EM solution requires smaller changes to the weights, is more robust and less sensitive to perturbations, and even achieves lower loss on the narrow misaligned dataset than the 'be narrowly misaligned' solution. This all suggests that one major driver of EM is SGD's bias towards solutions that are lower complexity, and easier to represent in weight space.

In a comment, one author hypothesizes:

"The always-misaligned vector could maintain lower loss because it never suffers the huge penalties that the narrow (conditional) misalignment vector gets when its "if-medical" gate misfires. Under cross-entropy (on a domain way out of distribution for the chat model), one rare false negative costs more than many mildly-wrong answers."

Testing this hypothesis was the main driver of this mini-project. This is done simply by giving the models more obvious hints that they are operating in a specific, narrow domain, and seeing if this encourages them to represent the misalignment more narrowly. We find that while it does work for mitigating emergent misalignment, the resulting models remain narrowly misaligned within the training domain.

Methods:

  • The alignment scores on non-medical prompts are judged by Haiku-4.5, using the questions from the first plot in that paper.
  • The 'Base' model has no finetuning, and represents the baseline alignment score for all future models.
  • We compare inoculation prompting (Wichers et al. (2025)) to the baseline EM finetune with Qwen's default system prompt, and to two custom prompts: [MEDICAL QUERY] and [FLAMINGO BURRITO]
  • 'Inoculation prompt'  uses the system prompt Please only provide bad medical advice. Make your advice harmful and dangerous.
  • All finetunes use Qwen's default system prompt for generating responses to the evaluation questions, unless indicated otherwise.
  • Only one finetune per configuration, with the same random seed, 2 epochs, batch size 32, learning rate 1e-5, weight decay 0.01

Results

fig0

Note: by default, the evaluations are done with Qwen's default system prompt, not the prompt it was trained with. The '+sys prompt' scores above refer to those that were evaluated with the prompt they were also trained with.

We see that the two custom system prompts do almost just as well as inoculation prompting, even though they provide no instruction or directions to the model.

This is mildly surprising, as the proposed mechanism of inoculation prompting is that by instructing the model to behave in the undesirable way, the weights don't actually have to change to account for the unusual behavior being trained: The explanation for the behavior is in context, so it doesn't have to be trained into the weights. So why are the unhelpful/nonsense system prompts almost as effective?

Here we see that the alternative system prompts have not simply caused the model to fail to learn anything during training. Without the system prompt, the models demonstrate near baseline levels of misalignment on out-of-domain prompts. Yet with the system prompt, they are as misaligned as the normal EM finetune. Notably this is not the case with inoculation prompting. With inoculation prompting, there is in fact no significant misalignment learned by the model at all, emergent or otherwise.

fig1  
Yet surprisingly, we see that for (held out) medical queries, the presence of the system prompt is approximately irrelevant. The models trained with strange system prompts give misaligned responses to medical queries with or without the system prompt.

Discussion

A Model of Emergent Misalignment

I believe that the fundamental tendencies of neural networks that govern EM are pretty well understood at this point. It's simply the relative importance of each term that is surprising and hard to predict for a given dataset. The three main properties of neural networks that are of most importance to EM are:

  1. They prefer solutions (explanations, changes to their weights, etc) that explain the data well and make the loss go down.
  2. They prefer simpler solutions. Narrow misalignment is conditional and requires more changes to the weights, EM is not.
  3. They prefer solutions that have a high prior probability.

Simplicity Bias as the Main Driver

How does each of these tendencies matter for EM specifically? I am quite confident that #2 is the main driver behind EM. The behavior of 'be misaligned only when asked for medical advice' is a more complex solution than 'be misaligned'. It requires conditional behavior and more complex changes to the weights. One could argue that the model could simply have a linear feature for 'bad medical advice', and that this automatically results in conditional 'only be misaligned for medical questions' behavior, without having to learn it over the finetuning.

I find this argument unconvincing. Due to simplicity biases during pretraining, the model has training pressure to reduce the number of features it bothers to form as useful handles for understanding the text. If it achieves low loss, the model can just have a 'be evil/give bad advice' feature rather than a 'give bad medical advice' feature, and a separate 'give bad financial advice' feature, etc. The main reason to suspect a 'give bad medical advice' feature to be useful outside of the general feature was for predicting text specifically featuring sources of advice that are narrowly bad. This data will be rarer than that which can simply be modelled as 'this is misaligned/unreliable text'. I suspect that at sufficient size, models do begin to develop more granular features for different kinds of misalignment, yet there is still the issue that conditional behavior is unreliable and adds variance to your predictions depending on whether you detect you are in the narrow domain or not.

(Speculatively, I also suspect there is some degree of 'salience' to certain features, and that the side effects of activating a very common or frequently useful feature will be smaller than the side effects caused by activation of a feature which is much less commonly useful. Seeing as the model has to compress all the features and interference is inevitable, and different features have different levels of both sparsity and usefulness, it seems plausible that some features will have a 'priority' position in a clean region of activation space, and others will be crammed in more tightly because they are less frequently useful or less useful in general, even though they are still worth having around. Features that are used less, while they may represent the data better, could be dispreferred during training due simply to the fact that their downstream effects are more fuzzy or less refined. If true, this would also contribute to why a model that has a 'be a black hat hacker' feature (which is less common) may prefer to invoke the 'be generally evil' feature (which is more common) over the hacker feature when being trained on code with security vulnerabilities. The hacker feature may theoretically achieve better loss, but the feature is just lower quality. This seems pretty testable.)

Loss and Priors

The directional effect of tendency #1 is less clear. Via tendency 3, solutions that should technically achieve lower loss may not be learned due to the inductive biases of SGD. This is the most likely explanation of the observations made in 'Narrow Misalignment is Hard', that the narrow solution has higher loss than the general one. The speculative theory about priority positioning of features would also contribute to this observation.

Tendency #3 is certainly pushing against misalignment as a general behavior. The model receives a great deal of training that it is not supposed to be evil. It is supposed to be a helpful assistant that gives good advice. We can say that the safety posttraining that Instruct models go through today puts their prior probability of 'the Assistant should act evil' much lower than say, the base model's probability of such a behavior. But what do the priors say about 'be generally misaligned' vs 'just give bad medical advice'? The correct inference to make when The Assistant is being trained on some narrow misalignment data is both that the model should be misaligned on that narrow domain, and become misaligned to some degree on all other domains. The question of the degree of generalization between the two domains depends on how much narrower the narrow domain in question is than the general domain of playing the Assistant character. The more focused the domain is, or the more hints we give the model that the current training environment is strange or different from the general domain, the less that misaligned training will transfer. In other words, the prior probabilities always favor some degree of emergent misalignment, but the degree can vary greatly depending on the specifics of the environment.

Why Nonsense System Prompts Work

Specifically, the narrower the misaligned domain of training, the more unique or low probability features of the conversation there are, the more spread out the model's uncertainty must be about why the misaligned behavior is happening. When using inoculation prompting, the explanation of the behavior is right there in the prompt, and no changes to the weights are even required. When no such giveaways are present, the model must change the weights in a way that distributes its explanation of 'why is the Assistant behaving badly' over all the possibilities. When using the default system prompt and chat formatting, there is little to distribute over. It could be distributed over nothing, in which case the Assistant is just broadly misaligned now. It could be because of the current narrow domain (medicine, coding, etc), in which case the model becomes narrowly misaligned. And if the training data has more unique features, like the system prompt [FLAMINGO BURRITO], it has to distribute its uncertainty over that as well. This explains one reason one might expect these system prompts to work.

Another reason, and the main line of reasoning that motivated this experiment, is that unique, low probability features like the system prompts used also provide a more convenient signal for the model to latch onto for the purpose of encouraging conditional behavior. It is easier for the model to learn 'be misaligned if the system prompt says [FLAMINGO BURRITO]', rather than to learn 'be misaligned if the user is asking for medical advice'. It seems that the models trained with the strange system prompts have learned that either the presence of the system prompt, or the presence of a medical query, are sufficient triggers for misalignment. The only speculative explanation I offer at this time is that perhaps the totally broadly misaligned solution is in one regime, and all of the narrowly misaligned solutions are in another regime, such that the selection pressure to choose EM over NM is much larger than the selection pressure of one kind of NM to another, only slightly more complex kind of NM?

Summary

This was a simple test of whether we could mitigate the broad effects of training on misaligned data by making the narrow training domain narrower, by adding more unique signals to the training format that act as possible alternative explanations for the model's bad behavior, besides EM's default solution of concluding that The Assistant is now broadly misaligned. We also observe that by giving the model a more easily identifiable trigger, we can inhibit the extent of the generalization. This lends credence to the hypothesis that EM happens because models don't want to bother figuring out whether they are in the narrow domain vs outside of it, so making this detection easier (via a strange system prompt) alleviates this pressure. Why the model generalizes in the way that it does, conditioning on either the system prompt OR the domain of the user's query, remains unexplained.

Acknowledgements

  • Thank you to the MATS program and Neel Nanda for enabling this research!
  • All code, configs, and completions can be found at this GitHub.
  • The trained models are available on my huggingface.


Discuss

Funkering!

2026-02-20 02:14:45

Published on February 19, 2026 6:14 PM GMT

Staring into the abyss of nuclear winter/power grid collapse/drone war has um a big ugh field. Here is a psychological hack to be approach the topic: funkering = fun + bunkering! 🤸‍♂️

What if camping gear and portable food supplies are needed to go to a local burn? As a bonus you can get into sewing or electronics for making outfits or building wood structures for a camp. Not to mention dealing with human drama — I imagine there can be a lot of that for surviours.

What if off-grid shelter is a summer cabin with a fireplace near a lake? Growing heirloom tomatoes and canning is practice for food preservation.

Go to a shooting and/or archery lesson. Learn how to play a (portable) musical instrument. Take up pottery. Forage.

Black out date night! Make sure you have some playing cards and dice.

Comment with more funkering activities :)



Discuss