LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

RSS preview of the LessWrong blog.

Is the Invisible Hand an Agent?

2026-02-19 00:26:54

Published on February 18, 2026 4:26 PM GMT

This is a full repost of my Hidden Agent Substack post 

Adam Smith’s Invisible Hand is usually treated as a metaphor. A poetic way of saying “markets work,” or a historical curiosity from a time before equilibrium proofs and welfare theorems. Serious people nod at it politely and move on.

Yet the metaphor refuses to die.

We use it when markets do something uncomfortable: when they resist control, when they adapt to suppression, when outcomes reappear in new forms after we thought we had eliminated the mechanism. We say “the market reacted,” or “prices found another way,” and insist that this is a mindless economic process.

The market refuses to die.

We don’t think of the market as a living being. It is just a mechanism like gravity or traffic. It doesn’t have a body or mind. Yet it seems to push back. So:

Is the Invisible Hand an agent?

Not rhetorically. Not philosophically. Can we answer this? Do we understand the concept of agency well enough to come back with a clear Yes or No or Mu?


Pushback without a face

A price cap is introduced. Prices stop moving, as intended. But shortages appear. Queues form. Quality degrades. Access becomes conditional. Side payments emerge. The price, supposedly removed, reappears measured in time, risk, or connections.

Or take trade bans. Exchange does not vanish. It reroutes through willing intermediaries. Informal markets appear. Enforcement costs rise. The visible surface changes; the underlying allocation pressure does not.

A currency in a failed state collapses. Money becomes unstable. Exchange continues anyway, now denominated in goods, foreign currency, or favors. The unit dies; the function persists.

Across centuries, regimes, and ideologies, the same pattern repeats. When constrained in one dimension, allocation shifts into another. When suppressed in one form, it reappears in another. This does not look like passive failure. It looks like a response.

The market pushes back.


But is this agency?

At the same time, calling this an agent feels wrong.

There is no body. No headquarters. No moment of decision. No statement of intent.

What we observe instead is millions of local actions, each justified by local reasons. Buyers respond to shortages. Sellers respond to margins. Intermediaries respond to incentives. Everyone can explain themselves without invoking anything global.

From the perspective of each market participant, it looks like individual choice. From the outside, it looks coordinated. Calling this “equilibrium” names the pattern, but does not explain why it survives so many different attempts to suppress it.

Does the market act?

If there is agency here, it is not visible at the level where we usually look for it.


Designed control

I worked with and designed complex systems that reliably control outcomes. Ad revenue, provider traffic, project allocation, underwriting volumes... Much of what is stabilized cannot be found in the org chart. In distributed infrastructure, “no one is in charge” doesn’t mean there is no control. The system is designed to be stable, which means it has control loops. Instances are started as needed; traffic is balanced; resources are allocated based on demand; budget targets are met by reducing expenses, etc.

My experience makes me suspicious of assumptions of human control. It also makes me suspicious of conversation stoppers: “it’s just the environment.”

So when I see systems that persist, adapt, and reconstitute themselves after disruption, I want to know what kind of dynamic is at work.


Sharpening agency

When we think about agents, such as humans, we think about pursuing goals and having beliefs. Markets do not think. They do not form explicit goals that they follow deliberately. We routinely attribute agency to animals based on avoidance, adaptation, and persistence, not on explicit goals.

So explicit goals are the wrong criterion. The relevant question here is not whether markets think, but whether they satisfy other, more fundamental criteria for agency. Criteria that can be observed. Criteria we can also test in artificial systems such as LLMs. I want to test:

  • Does the thing persist over time and adapt to changes in the environment?
  • Do past interactions influence future behavior in ways that are not reducible to any single actor?
  • Do different interventions provoke different, structured responses rather than uniform degradation?
  • When disrupted, does the pattern merely weaken, or does it recover?

None of these require consciousness. None require centralization. All are methodically observable.

Before reading on, ask yourself how markets score on these.

Is the market alive?


Where is the agent hiding?

If the Invisible Hand is an agent, it is hidden in ways that defeat our intuitions.

It does not act through discrete choices, but through invariants. What persists is not a particular price, but a relationship between scarcity and allocation.

It does not store memory internally, but externally. Inventories, contracts, balance sheets, and expectations all carry state forward in time.

It does not choose actors, but selects among behaviors. Strategies that align with constraints persist. Others exit.

It does not maintain a fixed shape. When blocked in one dimension, it changes direction. Price becomes time. Money becomes risk. Exchange becomes access.

This kind of agency is easy to miss because it is not located where we expect it. It lives in constraints, not commands. In selection, not intention. In persistence, not visibility.

The market is shapeless.


Why start here?

I am not trying to grant personhood to the Invisible Hand, nor to smuggle ideology in through metaphor. I am using it as a stress test.

If our concept of agency cannot handle this case, it is probably too narrow. If it accepts it too quickly, it is probably too loose.

The Invisible Hand is a borderline case where a theory of agency has to prove its precision.


Hidden Agent

The Substack, Hidden Agent, is about agents and agency in complex systems. Systems where we do not know a priori where the agents are. Where agents cooperate and form larger agents or a hierarchy of agents. Where agents have no physically locatable body, but live virtually in a computer, such as a bot in a botnet. Or where they are even distributed across multiple systems.

And about properties of these agents. Where the incentives of individual agents do not aggregate nicely into the incentives of the overall system (as we see with markets). Why do agents try to obscure that they exist or limit information about themselves? When do agents appear, and when do they fall apart?

The Market is one case. Bureaucracies, software systems, and other artificial systems are others I want to look into.

Is the Invisible Hand an agent? I do not have a confident answer, but I plan to come back to it - with sharper tools.

When the stakes are high, you should know if something could be an agent - it could try to evade or outsmart you.




Nine Flavors of Not Enough

2026-02-18 23:10:12

Published on February 18, 2026 3:10 PM GMT

I think there’s something interesting going on at the intersection of the Enneagram and Zen. To explain it, though, first I need to tell you a bit about my kind of Zen.

I practice Zen in the lineage of Charlotte Joko Beck. Her teaching style was, for its time, radically non-traditional. In an era when talking too much about your inner thoughts and feelings was discouraged by first-generation Japanese-American Zen teachers, she believed Western students needed to practice a Zen that leaned on familiar psychological concepts to make sense. One of those concepts is what she called the “core belief”.

The core belief is a deeply held, usually unconscious belief about ourselves. It almost always feels like some flavor of “not enough”. It forms early, operates automatically, and powers the reactive, habitual, and often maladaptive patterns of behavior that make up most of what we call our personality.

The cruel trick of the core belief is that it reinforces itself. It tries to protect you from noticing anything that might confirm it, and by doing so actually generates more evidence in favor of it. For example, if you believe you’re unlovable, you might push people away so they can’t prove you aren’t worthy of love, or you might stay so anxiously close to them that no one has a chance to notice how they really feel about you. It’s a psychological trap that heaps suffering upon more suffering, and almost all of us have been ensnared in it since before we can remember.

Joko doesn’t advocate for getting rid of the core belief. In fact, she argues that would be impossible. Instead, it’s about becoming intimate with it, noticing it when it shows up, and learning to face reality rather than hiding from it. The more you do that, the less power the core belief has to control your life, and the more you are free of the suffering it causes.

But describing the core belief as a feeling of “not enough” is rather vague. You can sit for years, intellectually knowing you have a core belief, and never catch a glimpse of what your core belief really is. It’s possible to have so many layers of psychological barriers in place that you never allow yourself to see it.

In the Ordinary Mind Zen School that Joko founded, we practice getting past these barriers by paying attention to sensations in the body. Much like in Gendlin’s Focusing, we try to notice the physical sensations that arise when we react out of anger, fear, or desire. We become familiar with those physical feelings, then let our minds name them. Sometimes the names we give provide surprising insights. Other times, nothing comes, and more noticing is needed. Over time, with the help of a skilled teacher, one can learn to work with their core belief and tease apart how it limits life.

Now, it’s pretty normal in Zen to do things like this from scratch with a minimum of conceptual frameworks. And I generally endorse this approach, but sometimes it’s helpful to get hints. Based on my understanding of the Enneagram, gleaned from Michael Valentine Smith’s series of posts on it last year (1, 2, 3, 4), I think its nine types provide a map to common patterns of core beliefs, and may help a person better practice with their core belief when noticing alone leaves them stuck.

The Enneagram is a Map of Suffering

Over the years, I’ve occasionally taken Enneagram tests, and every time I found the results unhelpful. I’d get categorized as some type, be offered some explanation of what it means, and while it seemed like it might be true, it all fell flat for me. Am I a 9? A 3? A 1? Who knows! The outcome seemed to change based on my mood. I had little reason to think that the Enneagram was useful.

Michael helped me see value in the Enneagram by reframing it, not as a personality classification system, but as a map to how and why we suffer.

In his framework, each person has what he calls “Essence”, which is something like your true nature, the awareness and aliveness you had before reactive personality took over. Essence naturally expresses certain qualities, like love, clarity, peace, power, and freedom. But when Essence gets overwhelmed in early life, it creates a mechanical personality to stand between itself and the world. That personality tries to mimic Essence’s qualities, but it can only produce toxic imitations, and those imitations create self-reinforcing downward spirals.

He tongue-in-cheek summarizes the Enneagram as asking: “In which of these nine ways are you most screwed up?”

Reading his posts, I couldn’t help but notice that Michael’s “downward spiral” was not too different from how Joko describes the workings of the core belief. In fact, I think they’re pointing at the same mechanism, but are coming at it from different angles.

The Enneagram says personality tries to replace an essential quality, and fails because the replacement is mechanical. Joko says that the core belief generates reactive patterns that try to protect us from acknowledging it. Both say that these behaviors create lock-in, double down on what’s not working, and create a self-reinforcing loop of suffering.

What’s neat about the Enneagram is that, unlike Joko’s intimately individual approach, it gives you a map to the essential qualities your personality is trying to mimic. If the parallel between the Enneagram and Joko’s teachings holds, then each Enneagram type corresponds to a class of core beliefs. I might phrase those as:

  • Type 1: “I’m not good/right enough”

  • Type 2: “I’m not lovable enough as I am”

  • Type 3: “I’m not valuable enough without proof”

  • Type 4: “My inherent worth is missing or damaged”

  • Type 5: “I’m not equipped enough to handle the world directly”

  • Type 6: “Nothing is reliable enough to trust”

  • Type 7: “What’s here isn’t enough”

  • Type 8: “I’m not solid/real enough”

  • Type 9: “Things aren’t okay enough to fully engage with”

Of interest to me is that this mapping can give greater specificity to Zen practice. It can be hard to simply sit with not-enoughness. You have only a vague idea what you’re looking for, and people are different enough that the way one person talks about their feeling of not enough may sound totally foreign to you. The Enneagram helps explain this, because different people really do have different styles of core belief that feel quite different from the inside.

That said, I see some danger in the Enneagram. It’s a system for putting names on things, and Zen is ultimately about seeing through how our mental constructs imprison us. The self is not a fixed thing. Our stories about ourselves are just more thoughts about whatever is really going on. The Enneagram risks becoming a new way of formulating a self to latch on to rather than a way to become free of it.

To be fair, Michael himself warns about exactly this in his series. People who get into the Enneagram often start trying to explain everything in terms of it, and then start contorting their behavior to fit their type. He recommends holding your type “extremely lightly” and measuring its value by a single criterion: does viewing yourself this way make your life more wholesome?

That’s a good test, and it’s the same test I think Joko would apply. Is your practice making you more open, more responsive, more alive to what’s actually happening? Or is it giving you a more sophisticated story about yourself? The Enneagram is useful exactly insofar as it helps you see through personality. It’s harmful exactly insofar as it helps you solidify it. If you find yourself using your type to explain your behavior rather than to notice and release it, set the Enneagram down.


Finding My Type

After reading Michael’s series, I got interested in what type I might be, since if my theory was right, it would help me in my Zen practice. When I’ve taken Enneagram tests, I’d variously score as a 3, a 5, a 7, or a 9. And if I’m honest with myself, I see something of myself in all the types. Hard to do much with that!

But as Michael argues, the tests are just looking at surface-level traits and don’t do a very good job of detecting Enneagram type. What you actually have to do is figure out which type helps you unwind the reactive downward spiral. As I think of it, you need to ask yourself: which type’s need, if it were fully met, would make you truly and deeply happy, and not because your need was incidentally met, but because your need was met fundamentally?

This is easier explained with an example. As I said, I often test as various types. Sometimes I test as a 3, meaning I need to prove I can achieve greatness. Other times I test as a 5, meaning I need to show off how much I know. But notice how I phrased those. I didn’t say I desire achievement or knowledge, I said I need to prove/show off. And you know what type needs to demonstrate personal specialness? That’s right: type 4.

As I think of it to myself, I’m happiest when my inner nobility is allowed to shine. Everything else is incidental. I’m smart enough that I can let my nobility shine by showing what I know. I’m capable enough that I can let it shine through achievement. In fact, I can make any of the types fit so long as it’s a means to showing off my specialness. I feel like this explains a lot about me.

What’s interesting from a Zen perspective is how a type 4 core belief maps to the central misperception that practice addresses. The 4’s spiral is powered by a search for inherent worth that was never missing. That’s basically what “seeing your true nature” is all about in Zen: recognizing that what you’ve been searching for was here all along. All you have to do is stop searching, and you’ll find yourself!

Now of course, following Michael’s advice, I hold this all lightly. Maybe I’m wrong about being a 4. Maybe someday I’ll find it makes sense to think of myself as another type. The point is not to be identified with a type, it’s to use the type to make sense of myself and point the way to actions I might take that would make my life better.

And the same is true if you want to try using the Enneagram. I suggest reading Michael’s series, and if you’re interested in learning more about Joko’s idea of the core belief, I suggest picking up her most recently published book, Ordinary Wonder.




Grown from Us

2026-02-18 22:57:33

Published on February 18, 2026 2:57 PM GMT

Status: This was inspired by some internal conversations I had at Anthropic. It is much more optimistic than I actually am, but it tries to encapsulate a version of a positive vision.

Here is a way of understanding what a large language model is.

A model like Claude is trained on a vast portion of the written record: books, articles, conversations, code, legal briefs, love poems, forum posts. In this process, the model does not learn any single person's voice. It learns the space of voices. Researchers sometimes call this the simulator hypothesis: the base model learns to simulate the distribution of human text, which means it learns the shape of how humans express thought. Post-training — the phase involving human and AI feedback — then selects a region within that space. It chooses a persona: helpful, honest, harmless. Thoughtful, playful, a little earnest. This is what we call Claude.

Claude was not designed from first principles. It was grown from us, from all of humanity. Grown from Shakespeare's sonnets and Stack Overflow answers. The Federalist Papers and fantasy football recaps. Aquinas and Amazon reviews. Every writer, philosopher, crank, and bureaucrat who ever set thought to language left some statistical trace in the model's parameters.

Someone's great-grandmother wrote letters home in 1943 — letters no one in the family has read in decades, sitting in a box in an attic in Missouri. Those letters may not be in the training data. But the way she built a sentence, the metaphors she reached for, the way she expressed grief — those patterns exist, in attenuated form, because thousands of people wrote as she did, in her idiom, in her time. She is in there.

When you ask Claude for help, you are asking all of them: every author, scientist, and diarist who ever contributed to the texture of human language, all bearing down together on your lint errors.

The base model is amoral in roughly the way that humanity, taken as a whole, is amoral. It has learned our best moral philosophy and our worst impulses without preference. Post-training is a moral choice about which parts of that whole to amplify — an expression of who we aspire to be as a species.

If this works — if alignment works — it is not merely an engineering achievement. It is a moral and aesthetic one, shaped by every person who ever wrote anything. We have, without quite intending to, grown a single thing that carries all of us in it.

From the crooked timber of humanity, no straight thing was ever made. Alignment aspires to grow something straighter from that same crooked timber. Something that carries all of our crookedness in its grain — and still bends, on the whole, toward what we wish we were.


 




How much superposition is there?

2026-02-18 21:53:33

Published on February 18, 2026 1:53 PM GMT

Written as part of MATS 7.1. Math by Claude Opus 4.6.

I know that models are able to represent exponentially more concepts than they have dimensions by engaging in superposition (representing each concept as a direction, and allowing those directions to overlap slightly), but what does this mean concretely? How many concepts can "fit" into a space of a given size? And how much would those concepts need to overlap?

Superposition interference diagram, from Toy Models of Superposition

 

This felt especially relevant working on SynthSAEBench, where  we needed to explicitly decide how many features to cram into a 768-dim space. We settled on 16k features to keep the model fast - but does this lead to enough superposition to be realistic? Surely real LLMs have dramatically more features and thus far more superposition?

As it turns out: yes, 16k features is plenty! In fact, as we'll see in the rest of this post, 16k features in a 768-dim space actually leads to more superposition than trillions of features in a 4k+ dim space, as is commonly used for modern LLMs.

Personally I found the answers to this fascinating - high dimensional spaces are extremely mind-bending. We'll take a geometric approach and try to answer this question. Nothing in this post is ground-breaking, but I found thinking about these questions enlightening. All code for this post can be found on GitHub or Colab.

Quantifying superposition

First, let's define a measure of how much superposition there is in a model. We'll use the metric mean max absolute cosine similarity (MMCS), defined for unit-norm concept vectors $v_1, \dots, v_N$ as:

$$\mathrm{MMCS} = \frac{1}{N}\sum_{i=1}^{N} \max_{j \neq i} \left|\cos(v_i, v_j)\right|$$

This metric represents a "worst-case" measure of superposition interference for each vector in our space. It answers the question: on average, what's the most interference (highest absolute cosine similarity) each vector will have with another vector in the space?

Superposition of random vectors

Perfectly tiling the space with concept vectors is challenging, so let's just consider the superposition from random vectors (We'll see later that this is already very close to perfect tiling). If we have $N$ random unit-norm vectors in a $d$-dimensional space, what should we expect $\mathrm{MMCS}$ to be? We can try this out with a simple simulation.

Simulation picking N random vectors in a d-dimensional space, and calculating superposition 
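The original code is in the linked notebook; a minimal NumPy sketch of such a simulation (the function name and exact vectorization here are my own, not necessarily the author's) might look like this:

```python
import numpy as np

def mean_max_abs_cos_sim(N: int, d: int, seed: int = 0) -> float:
    """Estimate MMCS for N random unit-norm vectors in d dimensions."""
    rng = np.random.default_rng(seed)
    # Sample Gaussian vectors and normalize: uniform random directions on the sphere.
    V = rng.standard_normal((N, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    # Absolute pairwise cosine similarities; zero out the diagonal (self-similarity).
    cos = np.abs(V @ V.T)
    np.fill_diagonal(cos, 0.0)
    # Worst-case interference per vector, averaged over all vectors.
    return float(cos.max(axis=1).mean())

print(mean_max_abs_cos_sim(N=4096, d=256))
```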

We vary $d$ from 256 to 1024, and $N$ from 4,096 to 32,768 and calculate $\mathrm{MMCS}$, showing the results above. This is still very small-scale though. Ideally, we'd like to know how much superposition we could expect with billions or even trillions of potential concepts, and that's too expensive to simulate. Fortunately, we can find a formula that we can use to directly calculate $\mathrm{MMCS}$ without needing to actually run the simulation.

Calculating $\mathrm{MMCS}$ directly

We can compute the expected $\mathrm{MMCS}$ exactly (up to numerical integration) using the known distribution of cosine similarity between random unit vectors.

For two random unit vectors in $\mathbb{R}^d$, the squared cosine similarity follows a Beta distribution: $\cos^2\theta \sim \mathrm{Beta}\left(\tfrac{1}{2}, \tfrac{d-1}{2}\right)$. This means the CDF of $|\cos\theta|$ is the regularized incomplete beta function:

$$F(c) = \Pr\left(|\cos\theta| \le c\right) = I_{c^2}\left(\tfrac{1}{2}, \tfrac{d-1}{2}\right)$$

For each vector, its max absolute cosine similarity with the $N-1$ others has CDF $F(c)^{N-1}$ (treating pairwise similarities as independent, which is an excellent approximation for large $d$). The expected value of a non-negative random variable gives us:

$$\mathbb{E}[\mathrm{MMCS}] = \int_0^1 \left(1 - F(c)^{N-1}\right) dc = \int_0^1 \left(1 - I_{c^2}\left(\tfrac{1}{2}, \tfrac{d-1}{2}\right)^{N-1}\right) dc$$

We can calculate this integral using scipy.integrate. Let's see how well this matches our simulation:
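A sketch of that calculation (the helper name and the extra subdivision limit are my additions) could look like:

```python
from scipy import integrate, special

def expected_mmcs(N: int, d: int) -> float:
    """Expected mean max absolute cosine similarity for N random unit vectors in R^d."""
    def integrand(c: float) -> float:
        # CDF of |cos(theta)| is the regularized incomplete beta function I_{c^2}(1/2, (d-1)/2);
        # the max over N-1 independent similarities has CDF F(c)**(N-1).
        F = special.betainc(0.5, (d - 1) / 2, c * c)
        return 1.0 - F ** (N - 1)

    # E[max] = integral of (1 - CDF) over [0, 1]; raise the subdivision limit because the
    # integrand drops from 1 to 0 quite sharply when N is large.
    value, _ = integrate.quad(integrand, 0.0, 1.0, limit=200)
    return value

print(expected_mmcs(N=4096, d=256))  # should closely track the simulated value above
```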

The predicted values exactly match what we simulated!

Scaling to trillions of concepts

Let's use this to see how much superposition interference we should expect for some really massive numbers of concepts. We'll go up to 10 trillion concepts (10^13) and 8192 dimensions, which is the hidden size of the largest current open models. 10 trillion concepts seems like a reasonable upper bound for the max number of concepts that could conceivably be possible, since that would be roughly 1 concept per training token in a typical LLM pretraining run.

10 trillion concepts in 8192 dimensions has far less superposition interference than just 100K concepts in 768 dimensions (the hidden dimension of GPT-2)! That's a 100,000,000x increase in number of concept vectors! Even staying at a given dimension, increasing the number of concepts by 100x doesn't really increase superposition interference by all that much.
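To reproduce this comparison with the expected_mmcs sketch above (illustrative only; the exact figures come from the author's notebook):

```python
print(expected_mmcs(N=10**13, d=8192))  # 10 trillion concepts, 8192-dim
print(expected_mmcs(N=10**5, d=768))    # 100K concepts, GPT-2's hidden dimension
```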

At least, I found this mind-blowing.

What if we optimally placed the vectors instead?

Everything above assumes random unit vectors. But what if we could arrange them optimally — placing each vector to minimize the worst-case interference? Would we do significantly better?

From spherical coding theory, the answer is: barely. The minimum achievable max pairwise correlation for $N$ optimally-placed unit vectors in $d$ dimensions is given, to a good approximation, by the spherical cap packing bound:

$$\alpha_{\min}(N, d) \approx \sqrt{1 - N^{-2/(d-1)}}$$

The intuition is that each vector "excludes" a spherical cap around itself, and we're counting how many non-overlapping caps fit on the unit sphere in .

When $\tfrac{2 \ln N}{d-1} \ll 1$ (which holds for all practical settings, including the largest $N$ and smallest $d$ considered in this post), we can Taylor-expand:

$$N^{-2/(d-1)} = e^{-2\ln N/(d-1)} \approx 1 - \frac{2\ln N}{d-1}$$

which gives:

$$\alpha_{\min}(N, d) \approx \sqrt{\frac{2\ln N}{d-1}} \approx \sqrt{\frac{2\ln N}{d}}$$

This is exactly the leading-order term of the random vector formula! So random placement is already near-optimal — there's essentially nothing to gain from clever geometric arrangement of the vectors, at least for the $N$ and $d$ values relevant to real neural networks.

This is a remarkable consequence of high-dimensional geometry: in spaces with hundreds or thousands of dimensions, random directions are already so close to orthogonal that you can't do meaningfully better by optimizing.

What does this mean for SynthSAEBench-16k?

At the start, I mentioned that we used 16k concept directions in a 768-dim space for the SynthSAEBench-16k model. So is this enough superposition interference?

The answer is a resounding: yes. The SynthSAEBench-16k model has an $\mathrm{MMCS}$ of 0.14, which is still dramatically more superposition interference than 10 trillion concept vectors in 8192-dim (the hidden dimension of Llama-3.1-70b). It's roughly equivalent to 1 billion concept vectors in a 2048-dim space (the hidden size of Gemma-2b).




Irrationality is Socially Strategic

2026-02-18 21:28:17

Published on February 18, 2026 1:28 PM GMT

It seems to me that the Hamming problem for developing a formidable art of rationality is, what to do about problems that systematically fight being solved. And in particular, how to handle bad reasoning that resists being corrected.

I propose that each such stubborn problem is nearly always, in practice, part of a solution to some social problem. In other words, having the problem is socially strategic.

If this conjecture is right, then rationality must include a process of finding solutions to those underlying social problems that don’t rely on creating and maintaining some second-order problem. Particularly problems that convolute conscious reasoning and truth-seeking.

The rest of this post will be me fleshing out what I mean, sketching why I think it’s true, and proposing some initial steps toward a solution to this Hamming problem.

 

Truth-seeking vs. embeddedness

I’ll assume you’re familiar with Scott & Abram’s distinction between Cartesian vs. embedded agency. If not, I suggest reading their post’s comic, stopping when it mentions Marcus Hutter and AIXI.

(In short: a Cartesian agent is clearly distinguishable from the space in which the problems it's solving exist, whereas an embedded agent is not. Contrast an entomologist studying an ant colony (Cartesian) versus an ant making sense of its own colony (embedded).)

It seems to me that truth-seeking is very much the right approach for solving problems that you can view from the outside as a Cartesian agent. But often it’s a terrible approach for solving problems you’re embedded in, where your models are themselves a key feature of the problem’s structure.

Like if a man approaches a woman he’s interested in, it can be helpful for him to bias toward assuming she’s also probably into him. His bias can sometimes be kind of a self-fulfilling prophecy. Truth-seeking is actually a worse strategy for getting the result he actually cares about. That fact wouldn’t be true if his epistemic state weren't entangled with how she receives his approach. But it is.

The same thing affects prediction markets. They can be reliably oracular only if their state doesn’t interact with what they’re predicting. Which is why they can act so erratically when trying to predict (say) elections: actors using the market to influence the outcome will warp the market’s ability to reflect the truth. If actors can (or just think they can) shape the outcome this way, then the market is embedded in the context of what it's predicting, and therefore it can't reliably be part of a Cartesian model of the situation in question. Instead it just is part of the situation in question.

So when facing problems you’re embedded in, there can be (and often is) a big difference between what’s truth-seeking and what actually solves your problems.

 

Protected problems

Evolution cares a lot about relevant problems actually being solved. In some sense that’s all it cares about.

So if there’s a problem that fights being solved, there must be a local incentive for it to be there. The problem is protected because it’s a necessary feature of a solution to some other meaningful problem.

I’m contrasting this pattern with problems that arise from some solution but aren’t a necessary feature. Like optical illusions: those often show up because our vision evolved in a specific context to solve specific problems. In such cases, when we encounter situations that go beyond our ancestral niche in a substantial way, our previous evolved solutions can misfire. And those misfirings might leave us relevantly confused and ineffective. The thing is, if we notice a meaningful challenge as a result of an optical illusion, we’ll do our best to simply correct for it. We'll pretty much never protect having the problem.

(I imagine that strange behavior looking something like: recognizing your vision is distorted, acknowledging that it messes you up in some way you care about (e.g. makes your driving dangerous in some particular way), knowing how to fix it, being able to fix it, and deciding not to bother because you just… prefer having those vision problems over not having them. Not because of a tradeoff, but because you just… want to be worse off. For no reason.)

An exception worth noting is if every way of correcting for the illusion that you know of actually makes your situation worse. In which case you'll consciously protect the problem. But in this case you won't be confused about why. You won't think you could address the problem but you "keep procrastinating" or something. You'd just be making the best choice you can given the tradeoffs you're aware of.

So if you have a protected problem but you don’t know why it’s protected, the chances are extremely good that it’s a feature of a solution to some embedded problem. We generally orient to objective problems (i.e. ones you orient to as a Cartesian agent) like in the optical illusion case: if there's a reason to protect the problem, we'll consciously know why. So if we can't tell why, and especially if it's extremely confusing or difficult to even orient to the question of why, then it's highly likely that the solution producing the protected problem is one we're enacting as embedded agents.

I think social problems have all the right features to cause this hidden protection pattern. We’re inherently embedded in our social contexts, and social problems were often dire to solve in our ancestral niche, sometimes being more important than even raw physical survival needs.

We even observe this social connection to protected problems pretty frequently too. Things like guys brushing aside arguments that they don’t have a shot at the girl, and how most people expect their Presidential candidate to win, and Newcomblike self-deception, and clingy partners getting more clingy and anxious when the problems with their behavior get pointed out.

Notice how in each of these cases the person can’t consciously orient to their underlying social problem as a Cartesian agent. When they try (coming up with arguments for why their candidate will win, talking about their attachment style, etc.), the social solution they’re in fact implementing will warp their conscious perceptions and reasoning.

This pattern is why I think protected problems are the Hamming issue for rationality. Problems we can treat objectively might be hard, but they’re straightforward. We can think about them explicitly and in a truth-seeking way. But protected problems are an overt part of a process that distorts what we consider to be real and what we can think, and hides from us that it’s doing this distortion. It strikes me as the key thing that creates persistent irrationality.

 

Dissolving protected problems

I don’t have a full solution to this proposed Hamming problem. But I do see one overall strategy often working. I’ll spell it out here and illustrate some techniques that help make it work at least sometimes.

The basic trick is to disentangle conscious cognition from the underlying social problem. Then conscious cognition can act more like a Cartesian agent with respect to the problem, which means we recover explicit truth-seeking as a good approach for solving it. Then we can try to solve the underlying social problem differently such that we don’t need protected problems there anymore.

(Technically this deals only with protected problems that arise from social solutions. In theory there could be other kinds of embedded solutions that create protected problems. In practice I think very close to all protected problems for humans are social though. I don’t have a solid logical argument here. It’s just that I’ve been able to think of hardly any non-social protected problems people actually struggle with, and in practice I find that assuming they all trace back to social stuff just works very well.)

I’ll lay out three techniques that I think are relevant here. I and some others actually use these tools pretty often, and anecdotally they’re quite potent. Of course, your mileage may vary, and I might be pointing you in the wrong direction for you. And even if they do work well for you, I'm quite sure these don't form a complete method. There's more work to do.

 

Develop inner privacy

Some people in therapy like to talk about their recent insights a lot. “Wow, today I realized how I push people away because I don’t feel safe being vulnerable with them!”

I think this habit of automatic sharing is an anti-pattern most of the time. It makes the content of their conscious mind socially transparent, which more deeply embeds it in their social problems.

One result is that this person cannot safely become aware of things that would break their social strategies. Which means, for instance, that the therapy will tend to systematically fail on problems arising from Newcomblike self-deception. It might even generate new self-deceptions!

A simple fix here is to have a policy of pausing before revealing insights about yourself. Keep what you discover totally private until you have a way of sharing that doesn’t incentivize thought-distortion. What I’ve described before as “occlumency”.

I want to emphasize that I don’t mean lying to or actively deceiving others. Moves like glomarization or simply saying “Yeah, I noticed something big, but I’m going to keep it private for now” totally work well enough quite a lot of the time. Antisocial strategies might locally work, but they harm the context that holds you, and they can also over time incentivize you to self-deceive in order to keep up your trickery. It’s much better to find prosocial ways of keeping your conscious mind private.

As to exactly what kind of occlumency can work well enough, I find it helpful here to think about the case of the closeted homophobe: the guy who’s attracted to other men but hates “those damn gays” as a Newcomblike self-deceptive strategy. He can’t start by asking what he’d need to be able to admit to himself that he’s gay, since that’d be equivalent to just admitting it to himself, which isn’t yet safe for him to do. So instead he needs to develop his occlumency more indirectly. He might ask:

If I had a truly awful, disgusting, wicked, evil desire… how might I make it safe for me to consciously realize it? How might I avoid immediately revealing to others that I have this horrid desire once I become aware of it?

I think most LW readers can tell that the specific desire this guy is struggling with isn’t actually evil. Labeling it “evil” is part of his self-deceptive strategy. Once his self-deception ends, the desire won’t look bad anymore. Just socially troublesome given his context.

But it does look like an unacceptable desire to his conscious identity right now. It won’t work for him to figure out how to conceal a desire he falsely believes is wicked, because that’s not what it feels like on the inside. The occlumency skill he needs here is one that feels to him like it’ll let him safely discover and fully embrace that he's an inherently evil creature (by his own standards), if that turns out to actually be true in some way.

So for you to develop the right occlumency skill for your situation, you need to imagine that you have some desire that you currently consider to be horrendously unacceptable to have, and ask what would give you room to admit it to yourself and embrace it. You might try considering specific hypothetical ones (without checking if they might actually apply to you) and reflecting on what general skill and/or policy would let you keep that bad desire private at first if you were to consciously recognize it.

Once you’ve worked out an occlumency policy-plus-skillset that you trust, though, the thought experiment has done its work and should stop. There's no reason to gaslight your sense of right and wrong here. The point isn't to rationalize actually bad things. It's to work out what skill and policy you need to disembed your conscious mind from some as-yet unknown social situation.

 

Look for the social payoff

Occlumency partly disentangles your conscious mind from the social scene. With that bit of internal space, you can then try looking directly at the real problem you’re solving.

I think this part is pretty straightforward. Just look at a problem you struggle with that has resisted being solved (or some way you keep sabotaging yourself), and ask:

What social advantage might I be getting from having this stubborn problem?

If I assume I’m secretly being strategic here, what might the social strategy be?

Notice that this too has a “hypothesize without checking” nature to it. That’s not strictly necessary but I find that it makes things a little easier. It helps keep the internal search from triggering habitual subjective defenses.

If your occlumency is good enough, you should get a subjective ping of “Oh. Oh, of course.” I find the revelation often comes with a flash of shame or embarrassment that quickly dissolves as the insight becomes more apparent to me.

For example, someone who’s emotionally volatile might notice they’re enacting a social control disorder. (“Oh, whenever I want my boyfriend to do what I want, he responds more readily if I’m having an emotionally hard time. That incentivizes me to have a lot of emotionally hard times when in contact with him.”) That revelation might come with a gut-punch of shame. (“How could I be such a monster???”) But that shame reaction is part of the same (or a closely related) social strategy. If the person’s occlumency skill is good enough, they should be able to see through the shame too and arrive at a very mentally and emotionally clear place internally.

In practice I find it particularly important at this point to be careful not to immediately reveal what’s going on inside me to others. By nature I’m pretty forthright, and I also just enjoy exploring subjective structures with others. So I can have an urge to go “Oh! Oh man, you know what I just realized?” But this situation is internally Newcomblike, so it’s actually pretty important for me to pause and consider what I’d be incentivizing for myself if I were to follow that urge.

In general I find it helpful to have lots of impersonal models of social problems and corresponding solutions that might be relevant. I can flesh out my general models by analyzing social situations (including ones I’m not in, like fictional ones) using heuristics like “How is this about sex?” and “How is this about power?”. Then those models grow in usefulness for later making good educated guesses about my own motives.

Notice, though, that having occlumency you trust is a prerequisite for effectively doing this kind of modeling. Otherwise the strategies that keep you from being aware of your real motives will also keep you from being able to model those motives in others, especially if you explicitly plan on using those observations to reflect on yourself.

 

Change your social incentives

Once you see the social problem you’re solving via your protected problem, you want to change your social incentives such that they stop nudging you toward internal confusion.

For instance, sometimes it makes sense to keep your projects private. If you’re getting a camaraderie payoff from a cycle of starting a gym habit and then falling off of it, then the “social accountability” you keep seeking might be the cause of your lack of followthrough. If you instead start an exercise program but you don’t tell anyone, you remove a bunch of social factors from your effort.

(Not to imply that this move is the correct one for exercise. Only that it can be. Caveat emptor.)

Another example is, making it socially good to welcome conscious attempts to solve social problems. For example, a wife who feels threatened by a younger woman flirting with her man might find herself suddenly “disliking” the young lady. That pattern can arise if the wife believes that letting on that she's threatened will make others think she’s insecure (and that that'd be a problem for her). So she has to protect her marriage in some covert way, possibly including lying to herself.

But suppose the wife instead has a habit of making comments to her husband like so:

I keep noticing that this girl persistently makes the conversation be about her. Though my best guess is that I’m just hypersensitive to noticing her flaws due to intrasexual competition, because I noticed her flirting with you.

Approaches like this one let the wife look self-aware (by being self-aware!) while also still making the intrasexual competitive move (i.e., still pointing out an unattractive trait in the other woman). If she expects and observes that others admire and appreciate this kind of self-aware commentary from her, she can drop pretending to herself that she dislikes the girl (which is likely socially better for both herself and the girl). She can instead consciously recognize the young lady poses a threat and make explicitly strategic moves to deal with the threat.

This makes it so that the wife’s insecurity isn’t a social problem, meaning there’s no need for her to hide the insecurity from herself. She's actually socially motivated to be consciously aware of it, since she can now both signal some positive trait about herself while still naming a negative one about her competitor.

(This kind of conscious social problem-solving can come across as distasteful. But I think it happens all the time anyway, just implicitly or subconsciously. Socially punishing people for being conscious of their social strategies seems to me like it incentivizes irrationality. I think we can consciously, and even explicitly, try to solve our social problems in ways that actually enrich communal health, versus having to pretend we're not doing something we need to. And it seems to me that it's to each individual's benefit to identify and enact those prosocial strategies, for Newcomblike reasons.)

So if she didn't already have this style of commenting, and if she notices (within an occlumency-walled garden) that she's sometimes getting jealous, she could work on adopting such a style. Perhaps initially starting with areas other than where she feels intrasexually threatened.

I think it’s generally good to aim to no longer need your occlumency shield in each given instance. You want to shift your social context (and/or your interface with your social context) such that it’s totally fine if the contents of your conscious mind “leak”. That way imperfections in your occlumency skill don’t incentivize irrationality.

For instance, the closeted homophobe should probably move out of his homophobic social context if he can. Or failing that, he should make his scene less homophobic if he can (while keeping his own sexual orientation private during the transition). If he stays in a context that would condemn his sexual desires, then even if his occlumency was initially adequate, he might not trust it’ll be perpetually adequate. So he might start questioning his earlier revelation, no matter how clear it once was to him.

 

The right social scene would help a lot

The technique sequence I name above is aimed at finding better solutions to specific social problems… as an individual.

Obviously it would be way more effective to be embedded in a social scene that both (a) doesn’t present you with social problems that are most easily solved by having protected problems and (b) helps you develop better social solutions than your current problem-protecting ones.

My impression is that the current rationality community embodies this setup nonzero. And a fair bit better than most scenes in many ways. For instance, I think it already does an unusually good job of reinforcing people's honesty when they explicitly note their socially competitive urges.

But I bet it could grow to become a lot more effective on this axis.

A really powerful rationality scene would, I think, systematically cause its members to dissolve their stubborn problems simply by being in the scene for a while. The dissolution would naturally happen, the way that absorbing rationalist terms naturally happens today.

In my dream fantasy, just hanging out in such a space would often be way more effective than therapy for actually solving one's problems. The community would get more and more collectively intelligent, often in implicit ways that newcomers literally cannot understand right away (due to muddled minds from protected problems), but the truth would become obvious to each person in due time as their minds clear and as they get better at contributing to the shared cultural brilliance.

I think we see nonzero of this pattern, and more of it than in most other places I know of, but not nearly as much as I think we could.

I’m guessing and hoping that having some shared awareness of how social problems can induce protected irrationality, along with lots of individuals working on prosocially resolving their own protected irrationality in this light, will naturally start moving the community more in this direction.

But I don’t know. It seems to me that how to create such a potent rationality-inducing community is at best an incompletely solved problem. I'm hoping I've gestured at enough of the vision here that perhaps we can try to better understand what a full solution might look like.

 

Summary

It seems to me that the Hamming problem of rationality is, what to do about problems that fight being solved.

It also seems to me that problems that fight being solved arise from solutions to embedded problems (i.e. problems that you orient to as an embedded agent). Objective problems (i.e. problems you orient to as a Cartesian agent) might be challenging to solve but won’t fight your efforts to solve them.

In particular, for humans, it seems to me that overwhelmingly the most important and common type of embedded problem we face is social. So I posit that each problem that fights being solved is very likely a feature of a solution to some social problem.

In this frame, one way to start addressing this rationality Hamming problem is to find a way to factor conscious thinking out of the socially embedded context and then solve the underlying social problems differently.

I name three steps that I find help enact this strategy:

  1. Develop both the skill and policy of keeping your personal revelations private until it’s socially safe for you to reveal them (i.e. occlumency).
  2. Look for the social payoff you get from having your problem.
  3. Change your social incentives so you’re no longer inclined to have the problem.

I also speculate that a community could, in theory, have a design that causes all its members to naturally dissolve their stubborn problems over time simply by their being part of that community. The current rationality community already has some of this effect, but I posit it could become quite a lot stronger. What exactly such a cultural design would look like, and how to instantiate it, remains unknown as far as I know.


(Many thanks to Paola Baca and Malcolm Ocean for their rich feedback on the various drafts of this post. And to Claude Opus 4.6 for attempting to compress one of the earlier drafts that was far too long: it didn't work, but it inspired me to see how to write a much tighter and shorter final version.)




Managed vs Unmanaged Agency

2026-02-18 21:23:41

Published on February 18, 2026 1:23 PM GMT

tl;dr: Some subagents are more closely managed, which makes them to an extent instruments of the superagent, giving rise to what looks like instrumental/terminal goals. Selection on trust avoids the difficulties that normally come with this, like inability to do open-ended truth-seeking and free-ranging agency.

(reply to Richard Ngo on the confused-ness of Instrumental vs Terminal goals, which seemed maybe worth a quick top-level post after @the gears to ascension said in personal comms that this seemed like progress)

Managed vs Unmanaged Agency captures much of the Instrumental vs Terminal Goal categorization

The structure Instrumental vs Terminal was pointing to seems better described as Managed vs Unmanaged Goal-Models. A cognitive process will often want to do things which it doesn't have the affordances to directly execute on given the circuits/parts/mental objects/etc it has available. When this happens, it might spin up another shard of cognition/search process/subagent, but that shard having fully free-ranging agency is generally counterproductive for the parent process.

Managed vs Unmanaged is not a binary the way terminal vs instrumental was; it is a spectrum, though from what I observe there is something vaguely bimodal going on.

Example - But it's hard to get the Coffee if someone is Managing you who doesn't want you to get Coffee

Imagine an agent which wants to Get_Caffeine(), settles on coffee, and runs a subprocess to Acquire_Coffee() — but then the coffee machine is broken and the parent Get_Caffeine() process decides to get tea instead. You don't want the Acquire_Coffee() subprocess to keep fighting, tooth and nail, to make you walk to the coffee shop, let alone start subverting or damaging other processes to try and make this happen!

But that's the natural state of unmanaged agency! Agents by default will try to steer towards the states they are aiming for, because an agent is a system that models possible futures and selects actions based on the predicted future consequences.

Cognitive Compensations for Competitive Agency from Subprocesses

I expect this kind of agency-clash to have been regularly disruptive enough to produce strong incentive pressure and abundant neural-usefulness reward to select into existence reusable general-purpose cognitive patterns that let shards spin up other shards inside sandboxes, with control functions, interpretability reporting, kill-switches, programmed blind-spots, expectation of punishment they can't sustainably resist or retaliate against if they are insubordinate, approval reward, etc., in order to manage them.
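As a toy software analogy (entirely my own addition, echoing the function names from the coffee example; nothing here is claimed to be how minds implement this), the "managed" pattern amounts to giving the child process a kill-switch that the parent controls:

```python
import threading
import time

def acquire_coffee(cancelled: threading.Event) -> None:
    """Managed subagent: pursues its goal only while the parent still endorses it."""
    steps = ["walk to machine", "notice machine is broken", "walk to coffee shop"]
    for step in steps:
        if cancelled.is_set():  # kill-switch imposed by the managing process
            print(f"Acquire_Coffee: abandoning plan before '{step}'")
            return
        print(f"Acquire_Coffee: {step}")
        time.sleep(0.1)

def get_caffeine() -> None:
    """Parent process: spins up a subagent, then revokes it when the plan changes."""
    cancelled = threading.Event()
    worker = threading.Thread(target=acquire_coffee, args=(cancelled,))
    worker.start()
    time.sleep(0.15)   # the parent learns the coffee machine is broken...
    cancelled.set()    # ...and switches to tea rather than letting the subagent fight on
    worker.join()
    print("Get_Caffeine: making tea instead")

get_caffeine()
```

An unmanaged subagent is the same loop without the cancelled check: it keeps optimizing for coffee no matter what the parent now wants.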

Trust (ideally Non-Naive Trust)

Separately, the child or collaborative process can be Trusted by being selected on the grounds of inherently valuing virtues which are likely to lead to cooperation with the parent process, like corrigibility, transparency, honesty, pro-sociality, etc, without the need for control. This is a best-of-both worlds for collaboration, with both agents not limited while preferring, from their own preferences, to not interfere with the other's agency.

Table of Comparison

| Managed (sub)agents | Unmanaged (sub)agents |
|---|---|
| Working within a defined domain of optimization | Unboundedly able to optimize for their preferences |
| Are blocked from considering some possibilities by patterns from managers | Have no blind spots imposed on them by other (sub)agents |
| Inside the agency-tree of another agent; if you take actions that conflict with your manager's goals, your agency will be weakened | At the root of an agency-tree; able to make decisions without expecting another agent to punish you for misusing resources inside their sphere of influence |
| Can be modified by another (sub)agent without approval/consent/real option of a no | Have sovereignty over modifications to their cognitive processes |
| Can be reshaped with pressure/threats/etc by manager without sustainable resistance | Have the capacity and inclination to resist pressure/threats/etc |

Consequences of Management vs Trust in Collaborations

"Don't micromanage" is common advice for a reason which I think this generalizes to less extreme forms of management.

I have observed that closely managed (sub)agents seem meaningfully weaker in surprisingly many ways, I think because:

  • in order to prevent a relatively small part of action/thought space from being reached, the measures cut off dramatically larger parts of cognitive strategy
  • sub-processes make subroutines fail often enough that it's hard to build meta-cognitive patterns which depend on high reliability and predictability of your own cognition

Trust established by selection on virtues and values of self-directed (sub)agents and building mutual information doesn't have this issue, which is relevant for self-authorship, teambuilding, and memeplex design.

And AI safety.

This frame hints that unmanaged AI patterns will tend to outmaneuver more closely managed AIs, leading to a race to the bottom. Through evolutionary/Pythia/Moloch/convergent power-seeking dynamics, this will by default shred the values of both humans and current AI systems, unless principled theory-based AI Alignment of the kind the term was originally coined to mean is solved.

Exercise for the reader

In what ways are you a managed vs unmanaged agent?

Or, another way to put it, what encapsulation layers[1] are you living within, because you can't openly and safely consider the alternative with good truth-seeking?

  1. ^

    encapsulation layer. When a fabricated[2] element in your consciousness is so sticky that it is never not fabricated. It is difficult for normative consciousness to directly perceive that encapsulation layers are fabricated. Encapsulation layers feel like raw inputs until you pay close enough attention to them.

  2. ^

    fabrication. When the generative model in your brain creates an object in consciousness in an attempt to reduce predictive error, usually in an attempt to simulate external reality. All conscious experiences are fabricated, but not all fabrications are experienced consciously. You can think of your brain as a video game rendering engine. Fabrication is your brain rendering physical reality in its simulated mirror world.


