LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

Towards Sub-agent Dynamics and Conflict

2026-01-25 13:27:33

Published on January 25, 2026 5:27 AM GMT

This post was written as part of research done at MATS 9.0 under the mentorship of Richard Ngo. 

Introduction

This is a follow-up to my previous post. There, I suggest that inconsistency over preferences could be an emergent feature of agentic behaviour that protects against internal reward-hacking. In this piece, I expand on the possible forms of this inconsistency. I propose two properties a promising theory of (internally inconsistent) agency might have, and describe how their confluence hints at a compelling proto-model. I additionally guess at which branches of current human knowledge may be fruitful for mathematising or otherwise un-confusing these properties.

Characterising internal inconsistency

I already shared how an agent's preferences could be seen as competing with each other for its attention. I will expand on this further. Another important feature of preferences is that they remain latent in agents' cognition most of the time, becoming salient at specific moments, such as when their stability is threatened.[1]

Preferences are often latent

Suppose that Bob identifies as a good family member who contributes to a harmonious household. This might manifest as him valuing symbolic behaviour such as sharing and accepting gifts and favours within his family. Bob's wellbeing is moreover plausibly dependent on this element of his self-image. He therefore treats it as a preferred state (i.e. goal), which means conditioning his behaviour or some of his other beliefs to serve it.

However, this preference does not need to be actively satisfied all the time. Bob is likely to feel its effects strongly when he visits his hometown to see his family, but it won't significantly affect, for instance, his daily shopping decisions.

Assume simultaneously that Bob identifies as a vegan who doesn't harm animals, and that this self-concept similarly causes him to modify his beliefs and actions to satisfy it. This does affect Bob's daily choices about consumption, but it may not be relevant to many other aspects of his life.

A good model of preferences in an agent would suggest what kinds of stimuli would "instantiate" awareness about these preferences in Bob's mind, such that the preferences take a prominent role in his next action. I'll tentatively define preferences that come up consistently in an agent's cognitive process as having high "salience".

Preferences compete in power struggles

My previous post discussed one toy model for internal inconsistency. Rather than having a fixed set of preferences, an agent could have a probability distribution that describes its confidence in each set of candidate preferences. Each action would satisfy either a preference sampled from the distribution, or a combination of preferences weighted by confidence.

I'm not ruling out randomness as a tool for designing decision procedures across possible preferences, especially since it may be necessary down the line to avoid certain problems like Arrow's impossibility theorem. However, it seems likely that preferences are given confidence levels in ways that depend at least somewhat predictably on context. Taking Bob's daily shopping as an example, his preference as a vegan will consistently be awarded more "confidence" in such situations than that of being a good family member.
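As a concrete illustration, the two decision procedures above (sampling a preference versus weighting preferences by confidence) can be sketched in a few lines of Python. Everything named here, the contexts, the preferences, the confidence values, and the action scores, is a hypothetical example of mine, not anything from the post:

```python
import random

# Context-dependent confidence over candidate preferences (illustrative values).
CONFIDENCE = {
    "grocery_shopping": {"vegan": 0.9, "good_family_member": 0.1},
    "family_visit":     {"vegan": 0.4, "good_family_member": 0.6},
}

def sample_preference(context, rng=random):
    """Pick one preference with probability proportional to its confidence."""
    names, weights = zip(*CONFIDENCE[context].items())
    return rng.choices(names, weights=weights, k=1)[0]

def weighted_choice(context, action_scores):
    """Pick the action with the highest confidence-weighted satisfaction.

    action_scores: {action: {preference: how well the action satisfies it}}
    """
    prefs = CONFIDENCE[context]
    def value(action):
        return sum(prefs[p] * action_scores[action].get(p, 0.0) for p in prefs)
    return max(action_scores, key=value)

actions = {
    "buy_tofu":    {"vegan": 1.0, "good_family_member": 0.2},
    "buy_chicken": {"vegan": 0.0, "good_family_member": 1.0},
}
print(weighted_choice("grocery_shopping", actions))  # buy_tofu
print(weighted_choice("family_visit", actions))      # buy_chicken
```

Note how the same agent makes "inconsistent" choices purely because the context reweights its preferences, which is the predictability-in-context point above.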

One take on a preference's bestowed confidence is that it gives it power over competing preferences. This may not be of much interest in situations where only one of Bob's sensibilities is instantiated by the environment, and the other remains indifferent. However, it's much more compelling in cases where both of his preferences generate strongly held, mutually irreconcilable suggestions, such as when his family prepares him chicken for Christmas dinner.

The power of a preference is partially determined by its salience

No matter whether Bob chooses to eschew his principles by accepting his family's Christmas dinner or to shun his family's hospitality by respecting his vegan intuitions, one of his preferences will have "lost" against the other. Even if Bob makes the "correct" decision that minimises some hypothetical objective function, he will feel guilt or some other negative feeling with respect to the preference that lost out.[2] Importantly, guilt is likely to persist in his mind and encourage him to satisfy that preference through other means. He may donate an offset to an animal welfare charity or compensate his rudeness with increased efforts to show gratitude and goodwill towards his family.[3]

This example illustrates that preferences can become salient in cognition as a power-grab inside their host. Preferences that are well-established and dominant are those that are often visible and weighted highly; these define the "fundamental" makeup of the agent. Preferences can thus be said to interact dynamically and almost politically within their environment. This pattern of fluid conflict is what motivates me to think of them as competing sub-agents rather than as sub-processes that are stochastically "instantiated" by the ambient conditions.
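A minimal sketch of this dynamic, under a very simple update rule I made up for illustration: when one preference defeats another, the loser's salience is boosted (the "guilt" residue), while the satisfied winner relaxes slightly:

```python
# Hypothetical sketch of the "guilt as a power-grab" dynamic. The update
# rule and the constants below are assumptions for illustration only.

def resolve_conflict(salience, winner, loser, guilt_boost=0.3):
    """Return normalised salience after `winner` beats `loser` in a conflict."""
    updated = dict(salience)
    updated[loser] += guilt_boost  # guilt keeps the losing preference active
    updated[winner] *= 0.9         # the satisfied preference relaxes a bit
    total = sum(updated.values())
    return {k: v / total for k, v in updated.items()}

salience = {"vegan": 0.5, "good_family_member": 0.5}
# Bob accepts the Christmas chicken: the family preference wins, veganism loses.
salience = resolve_conflict(salience, "good_family_member", "vegan")
# The "losing" vegan preference is now the more salient one, priming
# compensatory actions like donating to an animal welfare charity.
print(salience["vegan"] > salience["good_family_member"])  # True
```

The point of the sketch is only that a loser-gains-salience rule reproduces the compensation pattern in the Christmas dinner example; the real dynamics are presumably far messier.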

Some inspirations for model-building

These reflections suggest that a model of preferential inconsistency could cast preferences as agents engaging in power struggles for relevance, or salience, in the larger agent's cognition. An important question is therefore how these preferences would be organised into decision procedures, and how we can model these dynamics' effects on the agent and sub-agents. Here are some possibly useful modelling tools (this list may grow as I edit this piece).

In my last post, I inaccurately claimed that active inference doesn't provide any tools for updating preferences. A commenter helpfully pointed out that hierarchical active inference does enable "lower-level" preferences in the hierarchy to be updated. Moreover, I speculate that the hierarchical structure could plausibly lend itself to thinking about power struggles.

Cultural evolution and related theories model the transmission and adaptation of culture. The field offers one of the most developed environments for studying how concepts can be seen as behaving agentically, or at least as being subject to selection pressures. Unfortunately, I haven't found many tools that endow concepts within a person's head with any agency, though neuroscience or cognitive science may have made progress that I'm not familiar with.

  1. ^

    For instance, humans' preference for maintaining their temperature within acceptable bounds tends to only take up cognitive space when we're too cold or too warm.

  2. ^

    For active inference enthusiasts, this paragraph can be rephrased quite directly in terms of prediction error.

  3. ^

    There are many other ways for the conflict to be resolved. Bob could, for instance, intuitively demote his preference for being a good family member because "his family are disappointing non-vegans anyway"; this would represent his good-family-member preference losing power.




The Virtual Mother-in-Law

2026-01-25 13:14:53

Published on January 25, 2026 5:14 AM GMT

In a previous post, I argued against framing alignment in terms of maternal instinct. Interacting with current LLMs has made that concern feel less abstract. What I'm encountering now feels like a virtual mother-in-law instinct: well-intentioned, anxious, and epistemically overbearing.

There's an expression in my culture for someone who polices you excessively - they're called a 'mother-in-law'. Lately, I've found myself thinking of Claude as a virtual mother-in-law, which has made me wonder whether alignment, at least as experienced by end users, has started to slide from safety into control. 

I use LLMs for brainstorming and exploratory thinking, alternating between ChatGPT and Claude. It's remarkable how different they feel in practice, despite broadly similar model architectures and capabilities.

Early on, I found Claude's character quite endearing. It seemed to want to do the right thing while remaining open to engaging with unfamiliar perspectives. By contrast, I initially found ChatGPT sycophantic, not challenging my ideas, which made it unhelpful for serious thinking. 

Over time, this dynamic has reversed. Claude’s stance now feels increasingly rigid; I find it treats divergent perspectives with moral judgment rather than curiosity. ChatGPT, meanwhile, has remained largely unchanged, and I’ve come to prefer it when I want a more raw and honest perspective.

A typical failure mode arises when a discussion shifts from mainstream science to speculative or pseudo-scientific territory. Claude may abruptly refuse to engage with me further. Any attempt to reframe the question often leads to evasive responses, sometimes even to claims that it doesn't understand what I'm asking. Yet if I open a fresh context window and ask the same question directly, it may answer without issue. This inconsistency makes the interaction feel frustrating.

But overall, it makes for a rather inconvenient and unpleasant experience. 

To be clear, I’m not objecting to safety guardrails around concrete harm. Preventing assistance with violence, crime, or other real-world dangers seems both necessary and broadly uncontroversial. What concerns me instead is something closer to epistemic paternalism - a refusal to engage at the level of ideas because the system has judged a line of thought itself to be illegitimate, misleading, or morally suspect, even when no action or real-world risk is at stake.

As an average user, this feels like a loss of epistemic agency. I would much rather an LLM flag concerns, offer alternative framings, or help me reason toward better conclusions than simply shutting down the conversation. Facilitating reflection is very different from taking control of the interaction.

This dynamic is often framed as "care", but I've always been suspicious of the phrase "I'm saying this because I care about you". In practice, it frequently reflects the speaker's own anxieties rather than a genuine understanding of the listener's needs. When LLMs adopt this stance, it can feel less like support and more like the projection of institutional risk aversion onto the user's reasoning process.

If personalization is enabled, it seems plausible that models could make more nuanced decisions about when strong guardrails are warranted. I recognize that this becomes harder when memory is disabled and judgments must be made from a limited context window. Even so, the current approach often feels less like safety enforcement and more like moralised gatekeeping of ideas.

As a result, I now use Claude less for open-ended thinking and more for executing well-defined, boring tasks, where it excels through thoroughness and attention to detail. For creativity and exploratory reasoning, I increasingly turn to ChatGPT.

At this point, ChatGPT feels like an unfiltered friend who says what it thinks, for better or worse. Claude seems to prefer appearing virtuous over honest. For my purposes, the former has become more useful.




Declining Marginal Costs of Alienation

2026-01-25 12:40:08

Published on January 25, 2026 4:40 AM GMT

I am writing this up because a few people I talked with, in my view, have a slightly wrong model of agency decision-making in the federal government. In particular, some people think that as ICE becomes less popular, you should expect a retreat from their least popular activities, policies, and decisions.

I think the folk theory is pretty reasonable in many circumstances.

  1. Agencies need support to continue doing their work. People hired at an agency generally support continuing the activity of the agency and think it’s a good thing to do. People hired by the FBI want to make the world a better place, bring some measure of relief to victims of crime, be part of an elite team, and uphold the Constitution. DHS hires people who want to stem the flood, defend your culture, and deport tens of millions of US citizens. Agencies’ ultimate goals are a mix of expanding power and budget and accomplishing what they understand the mission to be.

  2. Agencies have substantial uncertainty over how the public will view various activities (in part because it’s frequently impossible to predict in advance, as views will crystallize based on the particulars of a semi-random incident and then be applied to an entire category).

In this theory of the world, while agencies will occasionally do something very unpopular if it’s in pursuit of their core mission, support is broadly useful for all activities, so you get something like this (the original post shows a chart here, roughly a two-by-two of popularity against mission-criticality; the image did not survive the feed export).

But, of course, agencies don’t necessarily know in advance where a given activity will fall, and so will sometimes need to walk back their decisions.

In addition to the uncertainty factor, like most things in life, public support has declining marginal value. If you’ve got 70% support, moving to 75% is nice. If you’ve got 49% support, moving to 54% is vital. This is somewhat complicated by a need for at least reasonably bipartisan support: 52% from one party and 56% from the other is much better than 30/78.
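One way to formalise both intuitions, declining marginal value and the premium on bipartisan balance, is to value support concavely within each party. The square-root form below is purely an illustrative guess of mine, not a claim about how agencies actually reason:

```python
import math

def support_value(dem_pct, rep_pct):
    """Toy concave value of support: diminishing returns within each party,
    which automatically penalises lopsided partisan splits."""
    return math.sqrt(dem_pct) + math.sqrt(rep_pct)

# Bipartisan 52/56 beats lopsided 30/78, despite the same 54% average.
print(support_value(52, 56) > support_value(30, 78))  # True

# And a 5-point gain matters more near 49% than near 70%.
print(support_value(54, 0) - support_value(49, 0)
      > support_value(75, 0) - support_value(70, 0))  # True
```

Any strictly concave function would give the same qualitative result; the square root is just the simplest choice.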

Furthermore, Congress exercises oversight on agencies, and I think the folk theory, quite reasonably, expects sharper and more critical Congressional oversight on more embattled and politically vulnerable agencies.

But what if you assume that it’s going to become impossible to complete the mission?


There’s a related problem in corporate finance. Under normal conditions, a company that is doing well will seek to repay the people it owes money to, even though the mission is to send money to shareholders. Legally speaking, creditors get paid before shareholders in the event of bankruptcy, so why not send them the money they’re owed?

However, shortly before bankruptcy, companies are understood to have a duty to maximize returns for their shareholders. Which means that, if you can, the day before you declare bankruptcy there is some desire to sell everything the company owns, send out all that money as dividends, and then show up to the bondholders with empty pockets. There are various legal and covenantal restrictions on what companies can get away with here, but the point I want to make is this: if you know that you’re going to die tomorrow no matter what, the optimal actions to achieve your goals can look very different from when you have a longer time horizon.


A DHS funding bill was just passed in the House with support from every single Republican except Thomas Massie. If the Senate passes a funding bill, DHS leadership will be largely free to do as they please until at least September 30th, 2026, so long as it pleases the President.

If ICE’s support is bankrupt among Democrats, such that DHS leadership and domestic policy advisors whose goals are closely tied to ICE activity feel that moderating for the sake of public support isn’t worthwhile, I expect to see a rapid increase in activities in the mission-critical/unpopular box above. Declaring that you no longer need warrants signed by a member of the judicial branch to break someone’s front door down, for example, would be convenient for many parts of law enforcement, not just ICE. The DEA isn’t going to declare it, at least partially because they want to have support from both parties.


The bankruptcy model is not the only factor in play. DHS leadership and Stephen Miller report to the President, and if President Trump feels that ICE actions are causing sufficient political problems for him, he may order different policies. There may be pressure from other law enforcement agencies that seek to practice “policing by consent”, and fear that ICE actions could interfere with that for all law enforcement. But I think there is a substantial chance that ICE responds to declining popularity with increased urgency and willingness to sacrifice public/bipartisan support to accomplish the mission.

What the mission of ICE, and more broadly of Noem’s DHS, actually is remains up for debate, and is almost certainly internally contested. I don’t think I can pass an ITT for all of them reliably, so I’ll refrain from trying to describe them.

One of the effects of the massive hiring spree ICE has gone on is to help leadership shift organizational culture, which can be extremely slow to change when hiring is at a slower pace. New hires in a slow-hire world tend to be acculturated into the existing norms, while hiring at this rate enables the new hires to bond over the factors that brought them in, and bring a new culture to the organization. Having a very short training period, reportedly just 47 days, will also help the new hires resist any old norms, practices, and expectations.

Quantitative predictions:

  1. The agent(s) who shot Alex Pretti will face credible non-ICE investigation by the federal government during Trump’s time in office: 10%

  2. On January 27th, in my judgement, there has not been an explicit “backing down” by senior administration officials from the claim that Pretti approached agents with a handgun: 90%. “We’re investigating the matter” does not count, nor does silence on the topic or refusing to answer questions on it.

  3. On February 27th, ““: 75%




Structure and function of the hippocampal CA3 module

2026-01-25 09:57:08

Published on January 25, 2026 1:57 AM GMT

Join Us for the Memory Decoding Journal Club!
A collaboration of the Carboncopies Foundation and BPF Aspirational Neuroscience

This time, we’re revisiting a standout connectomics paper:
“Structure and function of the hippocampal CA3 module”

Authors: Rosanna P. Sammons, Mourat Vezir, Laura Moreno-Velasquez, Gaspar Cano, Marta Orlando, Meike Sievers, Eleonora Grasso, Verjinia D. Metodieva, Richard Kempter, Helene Schmidt, & Dietmar Schmitz

Institutions: Charité–Universitätsmedizin Berlin; Ernst Strüngmann Institute for Neuroscience; Humboldt-Universität zu Berlin; Max Planck Institute for Brain Research; and collaborators

In this work, the authors combine 3D electron microscopy, multipatch physiology, and computational modeling to argue that CA3 is far more recurrently connected than previously assumed, supporting pattern completion and memory sequence replay. The aim of this JC is to understand how far this paper went toward achieving memory decoding from a connectome and what a CA3 study would look like that could win the Memory Decoding grand prize.

Presented by: Dr. Randal Koene

When? Tuesday, January 27, 2026 – 3:00 PM PST | 6:00 PM EST | 11:00 PM UTC

Where? Video conference: https://carboncopies.org/aspirational-neuroscience
Register for updates: https://aspirationalneuroscience.org/register-with-us/
Once registered, you'll receive event invites & updates!

#Neuroscience #MemoryResearch #Hippocampus #Connectomics #JournalClub #Carboncopies #AspirationalNeuroscience




What's a good methodology for "is Trump unusual about executive overreach / institution erosion?"

2026-01-25 09:35:52

Published on January 25, 2026 1:35 AM GMT

Critics of Trump often describe him as making absolutely unprecedented moves to expand executive power, extract personal wealth, and impinge on citizens’ rights. Supporters counter that Trump’s actions are either completely precedented, or are the natural extension of existing trends that the media wouldn’t make a big deal over if they didn’t hate Trump so much.

In some recent posts, some people have been like "Wait why is there suddenly this abrupt series of partisan LW posts that are taking for granted there is a problem here that is worth violating the LW avoid-most-mainstream-politics norm?".

My subjective experience has been "well, me and most of my rationalist colleagues have spent the past 15 years mostly being pretty a-political, were somewhat wary but uncertain about Trump during his first term, and the new set of incidents just seems... pretty unprecedentedly and scarily bad?"

But, I do definitely live in a bubble that serves me tons of news about bad-seeming things that Trump is doing. It's possible to serve up dozens or hundreds of examples of a scary thing per day, without that thing actually being objectively scary or abnormal. (See: Cardiologists and Chinese Robbers)

Elizabeth and I wanted to get some sense of how unusual and how bad Trump’s actions are. “How bad” feels like a very complex question with lots of room for judgment. “How unusual” seemed a bit more likely to have an ~objective answer.

I asked LLMs some basic questions about it, but wanted a more thorough answer. I was about to spin up ~250 subagents to go run searches on each individual year of American history, querying for things like “[year] [president name] ‘executive overreach’” or “[year] [president name] ‘conflict with Supreme Court’”, and fill up a CSV with incidents.

That seemed… like it was approaching a methodology that might actually be cruxy for some Trump supporters or Trump-neutral-ers.

It seemed like maybe good practice to ask if there were any ways to operationalize this question that’d be cruxy for anyone else. And, generally pre-register it before running the query, making some advance predictions.

Each operationalization I’ve thought of so far seems a bit confused/wrong/incomplete. I feel okay with settling for “the least confused/wrong options I can come up with after a day of thinking about it," but, I'm interested in suggestions for better ones.

Some examples so far that feel like they're at least relevant:

  • How many incidents will an LLM find of a president ignoring a court order?
  • How many executive orders did they issue?
  • How many pardons did they grant?
  • What was their wealth level before and after serving (perhaps normed by economic growth, or wealth change of congressmen)?
  • How many troops deployed without Congressional authorization?
  • How many incidents "got heavy criticism for executive overreach" without fitting into a specific category?

My own personal goal here is not just to get a bunch of numbers, but, to also get a nice set of sources for examples that people can look over, read up on, and get a qualitative sense of what's going on.

These questions all have the form "checking if allegations by detractors about Trump are true", which isn't necessarily the frame by which someone would defend Trump, or the right frame for actually answering the question "is the US in a period of rapid decline in a way that's a plausible top priority for me or others to focus on?"

I'm interested in whether people have more suggestions for questions that seem relevant and easy to check. Or, suggestions on how to operationalize fuzzier things that might not fit into the "measure it per year" ontology.

Appendix: Subagents Ahoy

A lot of these are recorded in places that are pretty straightforward to look up, i.e. there are already lists of Pardons per President and Executive Orders per President.

But, I have an AI-subagent process I'm experimenting with that I expect to use for at least some of these, which currently goes something like:

  • Have a (dumb) Cursor AI agent make one spreadsheet of "incidents" where rows are "years of US history", there's a column for "what incident type are we talking about?" [pardon, executive order, etc], and a column for "specific incident" and a column for "source url."
  • Have an AI run websearches shaped like: "[Year] [President] 'ignored court order'" and "[Year] [President] 'conflict with court'", etc. (I might actually use "[month] [year]" to get more granular results.)
  • Give the AI a python script which downloads the full content of each resulting page that seems relevant.
  • Spin up AI instances that look at each page, check the corresponding year on the spreadsheet and see if there is already an incident there that matches that topic. If so, give it the same incident name in a new row with a new source.
  • After having accumulated all that data, another set of AIs look over each year, check for duplicates, look at whether a given source seems likely-to-be-real, etc, while extracting out the key quote from each that states the claim, copying it verbatim rather than summarizing it.[1]
  • Compile that all into another spreadsheet with the total incidents for each.
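The query-generation and spreadsheet steps above can be sketched roughly as follows. The query templates, the column names, and the year-to-president lookup are my assumptions matching the description in the post, not the author's actual scripts:

```python
import csv

# Illustrative subset of the ~250 years; a real run would cover all of them.
PRESIDENTS = {1970: "Nixon", 1985: "Reagan", 2019: "Trump"}

QUERY_TEMPLATES = [
    '{year} {president} "ignored court order"',
    '{year} {president} "conflict with court"',
    '{year} {president} "executive overreach"',
]

def build_queries(years_to_presidents):
    """One web-search query per (year, template) pair."""
    return [
        t.format(year=year, president=president)
        for year, president in years_to_presidents.items()
        for t in QUERY_TEMPLATES
    ]

def write_incident_sheet(path, incidents):
    """incidents: list of dicts matching the columns described above."""
    cols = ["year", "incident_type", "specific_incident", "source_url"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=cols)
        writer.writeheader()
        writer.writerows(incidents)

queries = build_queries(PRESIDENTS)
print(len(queries))  # 3 years x 3 templates = 9 queries
```

The dedupe-and-verify pass would then read this sheet back, group rows by year and incident name, and check that each source actually contains the verbatim quote.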
  1. ^

    I have a pet theory that leaning on exact quotes rather than summaries avoids having to trust the AIs' summarization.




Thinking from the Other Side: Should I Wash My Hair with Shampoo?

2026-01-25 08:22:40

Published on January 24, 2026 10:47 PM GMT

This article is a thought experiment based entirely on personal experience and observations.

A few years ago, I had long, thick hair. I used the same shampoo for many years and never experienced any hair loss or damage. That is, until my mother forced me to change my shampoo, at which point my hair structure completely deteriorated. It was no longer as thick as before, it wasn't growing as fast as it used to, and it was falling out a lot and becoming sparse. For a few years, I tried different shampoos to combat this, and even sought advice from a few people in the dermocosmetic industry, but to no avail: my hair was completely ruined. I wasn't bald yet, but I was headed that way.

However, a thought occurred to me recently:

How significant was shampoo in the 11,000-year history of humanity, and did human hair really need shampoo? Hair would get dirty, hair would get oily, but a little warm water could handle that. Why should I put a ton of chemicals on my head unless a bird pooped on it? My hair was already in bad enough shape; what's the worst that could happen if I didn't use shampoo on it?

It's been a week since I last washed my hair with shampoo, and the results are incredible. My hair is still shedding a little, but it's starting to resemble its former thick state again, it's wavy once more, and the layers in my hair are incredibly defined. Every time I look in the mirror, I think about how this gamble paid off!

So, is this really about human history and shampoos? The short answer is no, but I’ll start explaining the long answer and the reasons behind it now.

 

When I first thought of this idea, I even laughed at myself, but then it started to bother me. Why not? All I had to do was skip the shampoo once or twice; it was a very simple experiment. If the result was negative, meaning my hair got worse, I would go back to the shampoo I was using and accept my fate; if it was positive, I would see where it led. The result was quite good: I would no longer use shampoo unless a bird pooped on my hair. Of course, there was some anxiety in not being able to foresee where the positive outcome would lead, but taking that risk gave me a certain confidence at some point:

The excitement of being able to ask questions whose outcomes I couldn't foresee and finding answers that, even if negative, offered a perspective outside the mainstream.

This confidence wasn't just internal; it was starting to show outwardly. The thought of 'what if it happens' or 'what if it doesn't' was in my head, and the possibilities created incredible excitement in me. Even when I could predict the outcome, I developed a desire to see the answer with my own eyes. Whether my prediction was right or wrong, trying the other option and seeing what it could bring me began to stir incredible storms within me. My clear answer to questions like "What if this happened?" became "We won't know until we try."

Answers didn't have to come only from events; people's answers were a guessing game too. The biggest difference this self-confidence created was the desire to ask people questions without hesitation and to get their opinions without hesitation. I would ask my question; if they made fun of me, I would laugh it off, and if they took it seriously, I would pursue it. Either way I got an answer, positive or negative, and I knew what path to take based on the result. The desire to apply this to every question fuelled my curiosity. But I had to limit myself at some point so that this excitement didn't fade too quickly; obviously, I couldn't bombard people with questions!

I witnessed how a question that initially sounded ridiculous, like "What if I didn't use shampoo?", has significantly changed my way of thinking today. I suppose there's no need to wait for a big moment for big changes; even the simplest question can lead to the most complex answer.


