MoreRSS

site iconLessWrongModify

An online forum and community dedicated to improving human reasoning and decision-making.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of LessWrong

Unpacking Geometric Rationality

2025-12-24 04:10:12

Published on December 23, 2025 8:10 PM GMT

This is mostly my attempt to understand Scott Garrabrant's Geometric Rationality series. He hasn't written much about it here recently (maybe he's off starting video game companies to save the world). I find the ideas very intriguing, but I also found his explanation from 3 years ago hard to follow. I try to simplify and explain things in my own words here, hoping that you all can correct me where I'm wrong.

The approach I take is to build up to a geometrically rational agent from small pieces, introducing some of the necessary theory as I go. This is a much more mechanical presentation than Garrabrant's philosophical series, but I find this kind of thing easier to follow. After building up to the idea of a geometrically rational agent, I'll follow Garrabrant in showing how this can be used to derive Bayes' rule, Thompson sampling, and Kelly betting.

In the process of unpacking what I think Garrabrant is pointing at with Geometric Rationality, I found a lot of open questions. I took a stab at answering some of them, but be warned that there are a number of unanswered questions here. If you know the answers, please fill me in.

Basic Agent Requirements

Let's start by talking about what an agent is. This is a bit harder than it used to be, because everyone is all excited about LLM agents, which I think obscures some of the key features of an agent.

An agent is something with a goal. It observes the world, and then takes actions in order to manifest that goal in the world. The goal is often described in terms of maximizing some reward signal, though it doesn't need to be.

You are an agent. A thermostat is an agent. An LLM-agent might (or might not) be an agent, depending on how it was coded.

We can describe an agent's process in a few steps:

  • Observe the world
  • Orient your observations based on your existing knowledge
  • Decide what action to take
  • Act upon the world

An agent does these things repeatedly, in an OODA loop. There are other models for this kind of loop (Sense-Plan-Act, etc.), but I like this one.

There's a lot to be said about Observing the world. Sensor design and selection is a very interesting field. Similarly, Acting can be quite involved and technical (just look up recent research on grasping). We won't focus on Observing and Acting, as agent design per se often takes those for granted. Instead, we'll focus on Orienting and Deciding.

Expected Utility Maximizer

One of the most common ways to model the Orient and Decide stages of an agent is with an Expected Utility Maximizer. This is how it works:

  1. the agent receives an observation about the world
  2. it updates its hypothesis for how the world actually is
  3. for all of its possible actions, it considers the likely outcome if it takes that action
  4. whichever action gives the best outcome, that's the one it selects

Steps 1 and 2 there are the orient phase, and steps 3 and 4 are the decide phase. Generally two things are required in order to be able to do those:

  • A world model: This is what the agent updates based on observations, and then uses in order to figure out likely outcomes.
  • A utility function: This is what the agent uses to score outcomes for each action.

If you wanted to represent this in equations, you would write it as follows:

Here's what these symbols mean:

  • - the chosen action
  • - pick the act (out of the probability simplex of all possible Actions) that maximizes what follows
  • - the expected value (arithmetic expectation)
  • - the utility function, which predicts a reward based on a world state
  • - a function that outputs a probability distribution describing the predicted world state, given the recent observation and the action
  • - a possible world state, sampled from
  • the probability of ending in world if the agent starts in and takes action

There are various ways to simplify this (like smooshing the world-model and utility function into one combined thing), but we'll leave it as is. Any agent that observes and orients like this, we'll call an EUM agent.

This model works quite well, and is used all over the place in reinforcement learning, gambling, robotics, etc.

Action Space

Let's pause here to say a bit more about what actions the EUM agent is going to look at. The definition we gave above had . That means we select the action distribution that provides the highest value for everything that follows. The actions that we consider all drawn from , which is the probability simplex of A. In other words, it's a set of all of the probability distributions over all the pure actions in A.

As an example, if the only actions available to the agent were turn-left and turn-right, then the action space would be , where is the probably of selecting action turn-left, and is the probability of selecting action turn-right.

A pure strategy is one in which the probability distribution chosen is 1 for a single action and 0 for all other actions. There are two pure strategies for the turn-left/turn-right options.

EUM agents are often described as selecting actions over just the set of actions , and not the probability simplex . This is because EUM agents (when not playing with/against other EUM agents) will always select a pure strategy or think some pure strategy is as good as the best mixed strategy. For an EUM agent, pure strategies are always good enough.

To see this, note that the expected value of a mixed strategy ends up being a weighted sum of the expected values of the pure strategies in the mixture.

There exists a pure strategy in this mixture that either:

  • is the highest EV of any other pure strategy in the mixture, or
  • is tied for the highest EV with another pure strategy

If we weight that high EV pure strategy with a higher probability in the mixture, the EV of the mixture will:

  • go up, or
  • stay the same (if there was a tie)

This means we can maximize an EUM agent's expected score by squeezing all of its action probability into the highest scoring pure action.

I added the probability simplex to the EUM definition because we'll use it a lot later, and it will be convenient to define all the agents as optimizing over the same action space.

Partnerships

Now imagine there are two agents. These agents each have their own utility function and their own world model. These agents would like to join forces so that they can accomplish more together.

A standard way for two EUM agents to work together is Nash bargaining.

In Nash bargaining, each agent follows the same process:

  1. identify the best outcome it could get if it worked alone
    1. Nash bargaining often calls this the "threat point", but I think about it as a BATNA. Note that a BATNA could be adversarial (as threat points are sometimes modeled as), but don't have to be.
  2. identify all the outcomes that it could get if it worked with its partner
  3. discard any action/outcome pairs that would leave it worse off than the BATNA
  4. find the difference in utility between the BATNA and each outcome. We'll call this difference the delta-utility
  5. Once the list of (action, delta-utility) tuples has been determined, each agent shares its list
    1. Any actions that aren't on both lists are discarded (they would be worse than one agent's BATNA, so that agent wouldn't help in that case)
  6. For each action in the final list, both agents' delta-utilities are multiplied
  7. The action that maximizes this product is selected

As an equation, this would be:

The new symbols here are:

  • and - the BATNA values for agent 1 and agent 2. These are expected utilities calculated over the distribution of likely outcomes for an agent's best non-cooperative action. In other words . In this case, is the best non-cooperative action for agent 1.
  • Note that each agent uses its own world model to predict outcomes of an action. We subscript the world models with the agent index to clarify this.
  • Note also that the action set for this has a different interpretation. The actions under consideration here are actions for the combined forces of agent 1 and agent 2, which may be represented as a cartesian product of their individual actions.

I left any conditioning on observations implicit for this equation, but that's just to keep it easy to read.

One terminology note here: the product is very similar to a geometric mean. To find the geometric mean of two values, you multiply them and then take the square root of the result. Since the square root is monotonically increasing (for positive inputs), it doesn't change an argmax output at all. The product terms used in Nash bargaining are key reason that we're driving towards a geometric theory of decision making.

Why Nash Bargain

Garrabrant has a philosophical justification for Nash bargaining: "When Nash Bargaining, you are really just geometrically maximizing expected utility with respect to your uncertainty about your identity." For him, Nash bargaining is very fair, and provides an ethical foundation for cooperation. He has a large discussion on how adequately disburse utility when behind the veil of ignorance.

This is all well and good, but what are we to make of this for agent design? Our agent knows who it is; why would it not dispense with Nash bargaining and do what is better for its own self? The answer may be in the need for agents to willingly help each other. Agent 1 gets more with agent 2's help than without it, so needs to offer enough to agent 2 to get them to help.

Additionally, we may think of ourselves as the creators of numerous future agents. These future agents are all in some sense our descendants, and we want them to be able to work together. Perhaps we want to encourage fairness between them specifically because we care for them equally.

There's another very nice feature of Nash Bargaining: it lets the agents access mixed strategies. Recall that EUM agents always accept a pure strategy as being at least as good as any mixed strategy. That's not true for Nash bargaining.

To see that Nash bargaining can prefer mixed to pure strategies, notice that it is maximizing a product and not a sum. The product is between a weighted sum of one agent's utilities and that of another. In other words, assuming 0 BATNAs. This gives factors that are products of individual utilities, and so can be much higher than the pure utilities themselves. It's the higher order nature of the objective function that means mixed strategies may be preferred by the agents acting together.

Garrabrant phrases this as Nash bargaining allowing agents to agree on a policy, instead of an action. I originally found that confusing, until I realized that his term policy just means a probability distribution over actions (which could include acting with a pure strategy). There's probably more to be said about institution design to assist agents in coordinating, but for now we'll focus on the simple case of mixed strategies. These mixed strategies are crucial to the benefits of Geometric Rationality. We will see later how a mixed strategy protects groups of agents from being Dutch booked.

Combining Utilities in Nash Bargaining

Nash bargaining doesn't depend on units each agent uses for calculating utility. This is an important point for Garrabrant, so lets linger on it. How are we to deal with the differing units when we're taking the product of them?

Garrabrant talks about this extensively in Utilitarianism Meets Egalitarianism. The problem isn't completely intractable. We are assuming that one utility can be linearly transformed into the other. This means that we can turn a value in one agent's utility system into a value in the other agent's utility system just by running it through an equation like .

The problem, of course, is that we don't know what and actually are. We can't look at the utility values that an agent assigns to world outcomes to figure it out either. We know we can map utility values between unit systems, but the mapping from world-states to utility system is not the same at all. You like pizza a lot, I like sushi a lot. Just because our "a lot"s are comparable doesn't mean that telling someone I like sushi a lot gives them any info about whether you like pizza. Similarly, knowing what we both like doesn't give enough info to compute the transformation between utilities.

We don't really need the scale factor (as long as it's positive), since we're taking an argmax. We can just do the multiplication to get a result, then find the action that maximizes that result. Since we don't care about the real maximized utility value (just the action that achieves it), it doesn't matter if the units on our utility are a complicated mess.

As an intuition pump, imagine Alice gets paid in dollars and Bob gets paid in Yen. The outcome unit is dollars*yen, but that doesn't matter. More is more in both unit systems, and we don't need to convert yen to dollars to figure out what the value maximizing outcome is.

Garrabrant identifies the parameter (the zero-point) as being the most important for making utility comparisons. With dollars and yen, 0 is the same in both systems. With utilities, your +5 could be my -2. If we multiply those, the result is negative. Garrabrant talks a bit about this, but even in his sequence he notes that it's a problem.

The way Nash Bargaining gets around this problem is comparative in nature. Instead of using the utilities each person has for the next state, it uses the delta of utility each agent has between their BATNA's state and the cooperative state. Then improvements are always positive (even if the improvement is from terrible to just bad).

Dealing with losses in Nash bargaining

Nash bargaining explicitly requires agents to have a BATNA. That BATNA sets the "0 point" of the utilities they're trying to maximize. Every alternative the agents consider during the choice phase is considered against this BATNA, by subtracting the new utility from the BATNA utility. These deltas are then used in the Nash bargaining product to determine action choice.

What happens if two agents both would have a negative delta? Some choice gives them both less utility than their BATNA. The product of two negatives is positive, so an argmax may select it. This is why the Nash bargaining explanation above made such a big deal about filtering out options that an agent saw as negative delta utility. Without that filtering stage, our argmax wouldn't work. Formally, this means we're looking at a constrained optimization problem, with the constraint given by the requirement that outcomes need to have positive delta utility for each agent.

This is concerning. We're trying to build up from some foundational pieces to a larger theory of decision making. Nash bargaining side steps what seems like a large issue by introducing comparison points and option filtering. Even if we're willing to accept the added complexity of this scheme, we may worry that future types of problem wouldn't be amendable to it.

Garrabrant kind of waves this away by saying that GR works well for "natural features" of the world like dollars or coconuts. In other words, we're assuming utility can't go negative. I don't feel that this works well, because even if we constrain ourselves to only natural features that can't go negative, we can still have our stash go down. In Garrabrant's defense, this is explicit in his article on the geometric expectation and is called out as a major limitation in his last post.

Much of Garrabrant's discussion is not about iterated decision making, so he can get away with handwaving it. If (as he starts out suggesting) you're going behind a veil of ignorance and deciding who will get what, the veil provides a natural 0 point of nobody getting anything.

Our agent models are a bit more complicated, since we're assuming the agents act through time in a world that is continually changing (similar to a POMDP). Each agent has to do something, even if that thing isn't to cooperate with the other agent. We have sort of artificially brought the agents together to perform this Nash bargaining procedure, but they're always free to ignore the result and do something that's better for them.

This opens up a lot of questions. What if the agents form a contract? What if they use threats against each other? What if they actually fight over the course of an iterated game? I'm very curious about how these situations will go, but it's a bit outside the scope of this article. Let's continue with the assumption that the agents do manage to find a positive sum way of collaborating.

Teams

Extending the above argument to more than two collaborators seems straightforward, but there are a few thorny issues to address.

We can naively extend Nash bargaining to teams of agents that are all collaborating simply by including all agents in the product. This makes the definition of the collective Nash bargaining team:

This is identical to the two party Nash bargaining solution if there are two agents.

The issue here again comes down to the 0 points of each agent's utility function. In the two party case, the BATNA is obvious because agent 1 can either work with agent 2 or not. With more than 2 agents, you run into cases where 1 & 2 want to work together, but 3 doesn't. The more agents you combine, the bigger this issue gets.

There could be fairly extreme faction building within groups of agents on the team. In the extreme case, the team splits into several teams along faction lines. This seems like one of the areas that needs a lot more investigation. To move forward, we can make a simplifying assumption that if all agents don't agree, then they each go forward with no collaboration at all. Under this assumption, the BATNA for each agent is again based only on its own individual action, and not on an assumption of faction-building.

There's another wrinkle we should address here: what if some agents get more votes than other agents in what the team does? This is common in e.g. corporations where some people can hold more voting shares than others. It can happen in politics where states have proportional representation within the federal government.

Up to this point we've been modeling the Nash bargaining solution as giving equal weight to all agents. If we want to weight agents differently, this is actually quite easy. We take each factor in the utility product to the power of its agent's weight.

In equations, this is:

With a uniform weighting of agents, we have for all . We can choose whatever distribution we want for these agents, and Garrabrant's series spends a lot of time arguing that this selection has philosophical underpinnings (such as representing how likely you are to be a certain person if you're behind the veil of ignorance).

In cases like corporate share ownership where some agents get more votes, but votes come in discrete units, you could represent the Nash bargaining product as having one factor for each vote. Since some agents get multiple votes, their preference gets repeated for each vote that they have. This can be represented as taking their single preference to a power, where the exponent is their number of votes. The argmax will produce the same result even if we take the entire function to some other power, so instead of taking each agent's scoring function to the power of its vote number we can take it to the proportion of its votes in the vote total.

Now we're looking at full geometric expectation. Finally we've come to the "geometric" part of Garrabrant's geometric rationality.

What's more, this new team acts like a single agent. Remember back to our original definition:

  1. it observes the world (using all of the sensors that the sub-agents have)
  2. it orients to its observations (in a distributed fashion within each sub-agent)
  3. it decides what to do (by allowing bargaining among all of the sub-agents over a mixed strategy to sample from)
  4. it acts upon the world (via each sub-agent performing its own part of the sampled action from the mixed strategy)

This is not an EUM agent though. It doesn't optimize in the same way as an EUM agent, and will choose different actions than an equivalent agent that somehow was given its same goal. I'll call agents that act in this way "Geometrically Rational" agents, or just GR agents.

This could be represented as an EUM agent if we took the log of the scoring function. That would turn our products into sums and our exponents into products. We could solve this EUM agent over log-value and get the same answer as our GR agent, but the interpretation becomes trickier. As Garrabrant says, this formulation looks at optimizing log of value. The question of why you would want to optimize log-dollars (or log of any other thing you value) is a confusing one without a bargaining perspective.

An Agent Without vNM Rationality

As mentioned, the GR agent is not an EUM agent. Many would consider this a problem, as EUM agents are the only kind of agent that fully satisfies the von Neumann Morgenstern axioms. These are:

  1. Completeness - given two options, an agent is able to provide a preference ordering between them (even if it's a tie)
  2. Transitivity - if an agent prefers A to B, and prefers B to C, then the agent also prefers A to C.
  3. Continuity - if an agent prefers A to B and B to C, then there's some probability p where the agent would think that pA + (1-p)C ties with B.
  4. Independence - if an agent prefers A to B, then for all probabilities p the agent also prefers pA + (1-p)M to pB + (1-p)M. Note that it doesn't matter if the agent likes M better or worse than A and B, as M is applied equally to both sides.

Since the GR agent is not an EUM agent, it must violate at least one of these axioms. Garrabrant wants to do away with the Axiom of Independence.

At the risk of belaboring the obvious, there's no way we can prove that getting rid of Independence is the right thing to do. von Neumann and Morgenstern take it as an axiom, and just assume that it's worth doing. Garrabrant doesn't think so.

EUM agents think taking certain actions are best (such as betting it all on the highest EV outcome). GR agents think taking other actions are best (such as Kelly betting, as we'll soon see). It's sometimes fun to watch people argue about expectation maximization vs Kelly betting, because often each side of the argument thinks that their side is just obviously right. Garrabrant does the best I've seen at describing the issue here.

In the end, we just have to look at the two types of agent and see what they would do. Then we can pick the one that seems most consistent with wisdom, from our own personal perspective.

von Neumann and Morgenstern included the axiom of independence because it enforces consistency. I am kind of analogizing this as a Markovian property. Once you see what decision you're presented with (you have a specific world-model conditioned on your observations), you'll always make the same decision. It doesn't matter how you got there.

Dutch book arguments demonstrate that if you decide in a Markovian way, you need a special kind of consistency in order to avoid being taken advantage of. Here's how Dutch book arguments work:

  • assume an agent prefers A to B
  • assume it prefers the lotteries (0.5B + 0.5C) to (0.5A + 0.5C)
  • assume it observes the world and finds that it has to make a choice between the mixture (bullet 2)
  • it decides on the mixture option that includes B
  • assume the lottery resolves such that outcome B occurs (and not C)
  • a bookie comes along and says "you can pay me to be allowed to choose A instead of B"

An EUM agent would never encounter this scenario, because the first two assumptions are incompatible with the axiom of Independence. According to vNM, in order to avoid being taken advantage of by a Dutch book agents must update their world models using Bayesian probability and assign utilities using vNM's axioms.

A GR agent may encounter this scenario. When the GR agent sees the lottery resolve such that outcome B occurs (not outcome C), it then gets the choice to switch to A. The agent has to track which mixed strategy it's executing to avoid being Dutch booked, instead of treating each decision point in isolation.

According to Garrabrant, a GR agent may encounter this scenario but would decline to change their choice when the bookie offers the option.

In arguing against the axiom of independence, Garrabrant presents a parable. A married couple is deciding where to move. The agent is the married couple in this case, and not either spouse. The agent would prefer moving to Boston over Atlanta, which matches the husband's preferences but not the wife's. Given some probability of moving to SF, the married couple agent would change its preference from Boston to Atlanta. This is because SF satisfies the husband's utility more than Boston, so in fairness they weight Atlanta higher than Boston to give more utility to the wife in expectation. They are distributing the expected utility gain of the SF option between the two of them by rebalancing their agreed strategy for choosing between already existing options. The rebalancing is necessary because there is a disagreement that is internal to the gestalt "married couple" agent.

To me, this is wisdom.

An EUM agent doesn't change its choice under a Dutch book because it's indifferent to the new information (by the axiom). A GR agent doesn't change its choice because it is attempting to be fair to its sub-parts.

This also explains why a GR agent would change preferences if given a lottery with an "irrelevant" outcome. That irrelevant outcome can have its additional utility distributed among sub-parts in a non-uniform (more fair) way.

Unfortunately, this seems either intractable or non-Markovian. Discovering new options can swap preferences between already known options for a GR agent. Garrabrant suggests that some kind of updateless decision theory might save this. I don't know about that, so let's just assume that ditching the axiom of independence is fine and see what happens.

Coalitions: Teams of teams

Now that we have a new kind of agent, we can ask how it would act alongside other agents. If we wanted this agent to act alongside an EUM agent, the natural way is to just fold that EUM agent into the team alongside the other sub-agents. But what if we want a GR agent to collaborate with another GR agent.

Agents Bob, Brenda, and Bernice team up and start society B, which acts as a GR agent. Agents Charlie, Clarisse, and Chester team up and start society C, which also acts as a GR agent. Each of these societies satisfies the needs of its members through Geometric Rationality as above.

After some time, societies B and C decide to collaborate on a project. How should those societies aggregate their high level decisions and knowledge? Will that still serve their individual member agents? Let's try Nash bargaining again. It worked well last time.

In this case, the societies don't have a utility function. Remember that we're violating one of the axioms needed in order to construct a utility function. For EUM agents, the utility function is what we take the argmax over to decide on an action. For GR agents like these societies, we'll consider the function we're argmax'ing over to be our scoring function. This scoring function will be used in the Nash bargaining construction.

For the GR agent of society B, consider its scoring function to be:

There's an important wrinkle here that we should iron out. The above scoring function is for the individual GR society, and the BATNAs in it are for each sub-agent to go its own way instead of collaborating with society B.

Since we have two GR societies, and we're Nash bargaining, we should choose actions that maximize the product of our two scoring functions less their BATNA. When we do this, the BATNAs we're looking at are society level BATNAs, not sub-agent level BATNAs.

The action space here represents the joint actions of both societies B and C. Since societies act through the individual actions of their agents, the actual action space is given by the joint actions for all agents in both societies.

The society level BATNAs are not utility values, since they're not from a utility function. Instead, they are the result of a product like . In this case, represents agent 's utility given its best action if it separated from society B. The value represents agent 's best action if it stayed in society B, but society B did not collaborate with society C. The values are just constants from the perspective of the argmax.

To simplify this, let's multiply it out. We'll reindex the products at the same time.

The addition of the constant doesn't impact the argmax at all, so we can disregard it immediately. The other two cross terms involving society-level BATNAs make things pretty complex. If those society level BATNAs were 0, this would be much simpler. Unfortunately, the only way to make them 0 is if at least one agent from each society would be better served by not being in its society in the first place.

This more complicated result shouldn't be too surprising to us, since we already had to wave away certain factionalism assumptions when we were talking about forming our multi-agent team. If two teams collaborate, they're bringing that factionalism back in at the outset.

Scale Free Agent Design

We saw above that Nash bargaining between societies of agents does not reproduce the math of those agents all Nash bargaining amongst each other. I think this is not what Garrabrant hoped for. He argues in a few places that GR is scale free. He seems very hopeful that any number of agents can join up in coalitions, and combining their goals proceeds using the same rules. He doesn't want it to matter how you aggregate their goals.

Why would an individual EUM agent care about whether it was part of a coalition or a single large team? Given unlimited compute and thinking time, maybe it wouldn't. Agents don't have unlimited resources, so institutions may form that would approximate GR teams. If so, these institutions may function optimally with a specific number of sub-agents. This could lead to partitioning of EUMs into teams. These teams would develop internal institutions to reduce transaction costs, which would result in cross terms in value scoring when teams collaborate.

Some of Garrabrant's writings seem to be driving towards some way of resolving gerrymandering. If only we could use Nash bargaining to let people combine preferences, we would find a new voting scheme that makes gerrymandering a thing of the past. While I personally think gerrymandering can be fairly destructive, thinking through GR has made me realize the point of it: different neighborhoods may want representation of their own specific needs. Gerrymandering can act as an institutionalizing force that (steelmanning here) is supposed to reduce transaction costs for a given neighborhood to represent itself. In practice gerrymandering doesn't seem to do this, but if we assume we could magically replace it with more natural coalitions then the cross-terms in the Nash bargaining would persist and would change optimal actions from what they would be under an all-inclusive single team.

Building Blocks of Agent Design

Up to this point, we have built up our teams out of Expected Utility Maximizers. It would be very nice if we could somehow replace that expected utility maximization with another geometric maximization. Can we do this?

Richard Ngo is working towards something like this, and explicitly references the Geometric Rationality work. He refers to Minsky's old book Society of Mind, which makes the very natural (for today) argument that people are made up of sub-agents. This also resonates with the Internal Family Systems approach to therapy.

I don't know what humans are made of, but let's try to address the idea of making GR agents from things that aren't EUM agents. This isn't quite what Garrabrant does. He seems to approve of a union of GR agents and EUM agents, and talks about the usefulness of the arithmetic-mean/geometric-mean boundary. He describes the the smaller EUM agents as being places where fairness isn't required. Equivalently, within an EUM agent, sub-agents are allowed to exploit each other.

If I think about a human mind as a society of sub-agents, I admit that they all survive if one survives (under nominal conditions). From that perspective, there's a forced utility sharing amongst them all. This could imply that these sub-agents exploiting each other is fine, because there's a floor to the degradation in utility any one of them can experience. On the other hand, via introspection I know that some parts of me value certain things in the world more highly than my own survival. Perhaps those parts aren't necessarily on board with Garrabrant's drawing of the AM/GM boundary.

Whatever the case for a human, let's try to make an artificial agent that's GR all the way down. We will do away with the AM/GM boundary in theory, if nowhere else.

At some point, we need some base agent. Something that is not a GR agent built out of other agents. How can we get a GR agent that has no sub-agents?

Possibly the simplest sub-agent we can look at is something that just looks at the world and counts things in it. Its scoring function can just output the count:

  • an agent that values having money: $1 is worth 1 unit of value
  • an agent that values having paperclips: 1 paper clip is 1 unit of value
  • an agent that values the number of rings its little sonic avatar has: 1 ring = 1 unit of value

These natural features of the world, as Garrabrant would call them, form a straightforward way to prefer worlds to each other. They have natural 0 points and natural scales. If two natural feature scoring functions are over the same "thing" in the world, it is easy to transform between their units.

An objection arises: what about cases where more isn't always better? What about set point regulators? For example:

  • an agent values maintaining body temperature at 98.6 degrees F: 1 degree deviation is -1 unit of value
  • an agent values maintaining blood glucose at 85 mg/dL: 1 mg/dL deviation is -1 unit of value
  • an agent values maintaining social interactions at 30 hours per week: 1 hour per week deviation is -1 unit of value

Such set point agents seem required for a lot of the ways that people and other agents work (remember the thermostat was one of the first examples of an agent in this article).

Consider the task of regulating a human's temperature. Model it as an agent with the goal of keeping the temperature at 98.6 degrees F. We will construct a GR agent that matches this set-point goal from two natural feature utility functions.

  1. a utility function that values increasing temperature: score-1 = temperature
  2. a utility function that values decreasing temperature: score-2 = (2*98.6 - temperature)

Perhaps score 1 measures something like robustness against fungal infections, and score 2 measures something like avoiding heat stroke.

The product of these scores gives a parabolic scoring function with a maximum given by temperature of 98.6. This has a quadratic penalty for deviations instead of linear as above, but I think it works for the idea of a set point regulator.

The problem with this construction is that one of the utility functions has a negative slope, and its score will go negative for temperature values that are high enough. If we were to use our GR agent design to Nash bargain between these scoring rules, one of the agents would decline to cooperate in this scenario. This would cause the grouping of these two agents to fail to regulate for temperatures outside of 0 to 197.2.

For a set point regulator of a human body's temperature, this seems fine. The human won't be alive at the temperature where this breaks down anyway. In reality, whatever controls the set point for human temperature probably depends on many factors, not on just two. For a theory of decision making grounded in a single kind of agent, this is still a limitation that would be better avoided.

This is not the most parsimonious way of grounding our agents in something real, but it seems to unblock us in carefully constructed regions of value-space. It still remains unclear what world model or action space such an agent would be using. Does one of the agents control the shiver response and the other control the sweat response?

I would want to see a more thorough exploration of this area of Geometric Rationality.

The benefits of EUM

The idea of grounding GR agents in something that's not an EUM is my own, not Garrabrant's. Garrabrant would argue against doing this, I think. He has a post on drawing the boundary between arithmetic mean maximization and geometric mean maximization, where he says "if you make your arithmetic clusters too small, you could end up taking actions in proportion to their utilities, and effectively never maximizing at all."

Imagine a GR agent that Nash bargains between 10 different natural feature agents. Imagine that each natural feature agent cares about only one thing (number of dollars, number of bananas, number of pearls, etc.) and the natural features don't overlap with each other. If the available actions could only result in getting one resource at a time, they may choose a mixed strategy that gives each act the same probability. Outcomes are simply distributed via probability, with no outcome being maximized.

It seems Garrabrant wants to deal with this by just having EUM agents in certain places, but to be honest I would like a theory with fewer moving parts. I'm not certain it's possible, but I do think it's worth exploring more. I'd be interested in exploring what behavior comes from competing GR agents rather than cooperating agents, and perhaps see if some behaviors don't give similar maximization performance.

Benefits of Geometric Agents

After all of that work, we have reached the point where we can describe:

  • Geometrically Rational agents made up of smaller sub-agents
  • those sub-agents may also be GR agents
  • they may also be natural feature agents, which value simple linear scoring of single features in the world
  • GR agents select actions that maximize the product of scores provided by their sub-agents via a constrained optimization of the geometric mean

What does all of this get us? Hopefully something useful after slogging through all that.

Bayes' Rule

Geometric maximization produces Bayes' rule when used to condition on evidence. This is a bit tangential to the agent formulation we've been going over, but it's such a nice feature of the geometric worldview that it's worth going over. I'll follow Garrabrant's explanation pretty closely here.

Imagine you have a world model that's represented as a probability distribution over ways that the world could be. Normally when you receive some observation , you could condition your world-model on it via Bayes rule. To simplify things a bit, we can say that the set of possible world states that have non-zero probability for O are given by X. Instead of explicitly using Bayes' rule, let's find the probability distribution such that

The product in that equation is just the geometric expectation of our new probability as measured by our original probability. If we wanted to reframe this in an agent formulation, we could say that we're just Nash bargaining among "probability agents" where each probability agent wants to maximize its own value, but their negotiating power is given by how much they were predicted prior to the observation.

Let's find out what this probability agent would do by solving for the argmax.

Of course we do have to worry about the fact that our argmax is not over the entire simplex of original worlds. It's limited by the constraint that the probability equals 0 for worlds that are impossible given our observation. In other words, P(X) = 1 and P(!X) = 0. That means that some will be 0. These 0 probabilities for worlds that we don't observe wipe out the product.

Garrabrant deals with these 0s by taking the limit as the probability of the observation approaches 1.

Let's solve this as an optimization problem. I'll do so by taking the logarithm, but don't be confused. We aren't using log-probability as our "value" here. It's just a way of solving for the argmax of the product.

We want to maximize . This will be subject to two constraints:

We'll use the method of Lagrange multipliers to solve this. I won't belabor the details here, but the result of the procedure is a set of two Lagrange multipliers with limits

We also have the set of equations:

  • for
  • for

Notice that these updated probabilities match the outcome of applying Bayes rule.

Garrabrant goes on to discuss some different things you can do with this idea. In particular, he discusses intervening on your beliefs to explore counterfactuals. This is very interesting to me, and I plan to think about it in the context of Judea Pearl's approach to causation.

Explore/Exploit via Thompson Sampling

One of the main places that OODA agents of the form I showed above are used is in reinforcement learning. Often, EUM agents of some form are used there. This requires some ad-hoc tweaks, because EUM agents never explore. Remember, they always pick the action that maximizes their expected reward, instead of doing something that could get more information at the cost of (possibly) lower reward.

To account for this, epsilon exploration is sometimes used. This is a hack tacked on top of EUM agents where the agent's preferred action is chosen 1-epsilon percent of the time, and with epsilon probability some other action is chosen. To encourage exploration at the beginning of training but not at the end, epsilon can change over time. This is not very elegant at all.

GR agents are much better at making the explore/exploit tradeoff, because it turns out that they can implement something called Thompson Sampling. The gentlest traditional intro to Thompson Sampling is Allen Downey's, which I highly recommend.

Here's the normal presentation of Thompson Sampling for an n-armed bandit problem:

  1. store probability distribution of output for each arm (so n distributions)
  2. sample an output from each distribution (so you now have n value estimates)
  3. choose the arm with the highest value estimate
  4. observe reward from that choice
  5. use observed reward to update the probability distribution for that arm via Bayes' rule
  6. go back to 2

This has several nice properties. For one thing, it performs each action with exactly the probability that it may be the best action to perform (exploring when it makes sense to). For another, it's computationally tractable to implement this.

This description of Thompson Sampling doesn't look much like the way a GR agent works. Let's see how this is actually a special case of Geometric Rationality. We'll make a GR agent that is also used in an n-armed bandit problem. I'll follow a slightly different path than Garrabrant does. My presentation will be more geared around regenerating standard Thompson sampling, whereas Garrabrant takes a more philosophical and generic approach.

Model the bandit world as an array of probability distributions. Each element of the array represents an individual bandit arm. The probability distributions in each element of the array are continuous, and represent possible payouts from pulling the specified arm.

The GR agent we will construct will act as though each of these arms has a sub-agent advocating for it. The agent will Nash bargain among the sub-agents for each arm. In order to Nash bargain among these subagents, we need to know each agent's scoring function, BATNA, and voting power.

Scoring Function: Instead of valuing dollars, the agents value influence. They want their specific action to be taken.

BATNA: Since exactly one action must be taken by the GR agent, the BATNA for the sub-agents is 0

Voting Power: We would like to assign more voting power to arms that are expected to produce higher results. Given our array of probability distributions, we can actually calculate the probability that a specific arm will produce the highest result. It's given by the function

  • is the probability of for arm
  • is the probability of obtaining a result less than or equal to for arm (the CDF)

In words, we're summing over all possible outcomes that arm n could give. We weight the sum by the probability of that value. We also multiply it by the probability that all other arms give a lower value.

While this might be difficult to compute in practice, it's conceptually straightforward. We will use these values as the voting power for each sub-agent.

Now let's see how our GR agent would decide on an action.

where

  • is a discrete probability distribution with a value for selecting each individual bandit
  • is a specific sub-agent from the GR coalition
  • is agent i's voting power in the GR coalition
  • is the probability of selecting action given by action distribution

Once an action distribution is selected, an action is drawn from that distribution. The specified Bandit is activated, and each agent updates its world model via Bayes' rule. The GR agent itself updates the voting power for each sub-agent based on their outcome predictions. Then the whole process repeats.

Here we're implicitly using Garrabrant's Arithmetic/Geometric boundary. The arithmetic maximization is implicit in the problem setup, where each sub-agent explicitly prefers for the bandit it represents to be chosen. The geometric maximization is given by the Nash bargaining.

Let's find out what the argmax of the above equation actually is. We can take logs and formulate our problem as a constrained maximization problem. We want to maximize subject to . We'll again use Lagrange multipliers to solve this.

Taking the partial derivative with respect to each element and setting to 0 gives us for each . Since we know , we can say that , or . Since our voting power is just a probability distribution, this means that . Plugging that in to our partial derivative, we see that . We are assigning the same weight to each action as the probability that this action is selected by our collection of agents. This shows that GR agents of this form reproduce Thompson sampling.

Remember how we assigned voting power for the subagents? We set voting power to be equal to the probability that the subagent would produce the highest outcome. That voting power directly becomes the probability of choosing that action. In the limit that one arm is guaranteed to provide a higher value than any other arm, it's will go to 1. That means our GR agent would always choose that guaranteed best action (as we would want).

Kelly betting

Finally we get to the reason I was interested in geometric rationality in the first place: it recommends Kelly betting from a linear scoring function. A lot of recommendations for Kelly betting claim that it's a good idea if your value is logarithmic in wealth, but to be honest I don't find that compelling at all (should I not Kelly bet if the gains from betting are measured in something other than dollars?).

Garrabrant's justifications for using GR in betting boil down to various forms of internal fairness. Perhaps we think of each predicted outcome as owning its own percentage of the agent's total wealth pool. Given this, Garrabrant says it makes sense to Nash bargain between these predicted-outcomes rather than allowing one to seize all of the agent's internally-communal resources.

It's well known that Kelly betting is equivalent to maximizing log-wealth instead of wealth. With the definition of GR, it would be trivial to show that simply maximizing the geometric expectation of wealth is equivalent to Kelly betting. Instead, what I want to show here is that Kelly betting is equivalent to Thompson sampling when the action-space is continuous and sub-agents can trade to minimize transaction costs.

Normally, Thompson sampling is performed for n-armed bandit problems where only one solution can be picked at a time. If the problem constraints are changed and an agent is allowed to bet on possible outcomes instead of selecting specific actions to take, the outcome will be Kelly betting.

To demonstrate that the above formulation of Thompson sampling is equivalent to Kelly betting, we're going to set up a GR agent that's playing a betting game on coin tosses. Following Garrabrant, we'll assume bets can be made at even odds for either heads or tails. We'll also assume that the coin may not be fair.

We can consider this to be similar to a 2-armed bandit problem. One of the arms is "bet heads" and the other is "bet tails". To match up with the Thompson sampling paradigm above, we'll create a 2 element array of probability distributions. The distribution of each arm will be . The two distributions we'll use will be mirrors of each other.

Where we diverge from the standard Thompson sampling formulation is in actions. Instead of having to select one arm and bet everything on it (as in standard Thompson sampling), the GR agent is allowed to bet any amount of money that it has on one or both outcomes of the toss.

Like Thompson sampling, we will start with a set of 2 EUM agents. One EUM has a world model that focuses only on heads, the other only on tails. For the heads agent, its world model predicts probability of winning and probability of gaining nothing. The tails agent is the mirror of this.

The value functions and BATNAs of the subagents match the Thompson sampling value functions.

The voting power is very easy to calculate for this kind of problem. The heads sub-agent has voting power of and tails has voting power of . We don't even need to do any integrals.

If we follow the Thompson sampling mathematics, we see that the selected action distribution is given by . The probability of choosing the action to bet on heads is equal to the probability that heads wins (as predicted by the agent's world model).

If the GR agent were doing normal Thompson sampling, it would sample from that distribution and use the sample as its action. This problem is different though, because it isn't limited to going all-in on one action. It can instead distribute all its wealth proportionally to the action distribution and do everything "in parallel".

Let's consider a specific example to make this concrete. The GR agent as a whole predicts a 60% chance for heads and a 40% chance for tails, and it starts with a wealth of $100. Assume the bets are both 1:1 odds. Given this, betting $1 on heads and $1 on tails is the same as keeping the money. No matter what happens, the agent ends up at net-zero.

It can waste less time chatting with the bookie by balancing these "wasted bets" internally. For this example, distributing its wealth across bets in proportion to the Thompson sampling suggestion would mean betting $60 on heads and $40 on tails. Instead, it can cancel out all of the overlapping bets and simply bet $20 on heads while keeping the rest of its money in reserve.

Now that we've seen a concrete example, let's make it more abstract. Instead of assuming we can make both bets with 1:1 odds, assume that each bet has distinct odds. These odds are the payout offered by the bet, and are often represented by the variable . With odds of , you would receive times your bet amount back if you won (so if you bet $1 you would get back $2). Odds are normally set according to the probability of an outcome, so the outcome of a fair coin would have odds of for both sides, and for a coin that came up heads 60% of the time the odds would be 2/3. We also have .

Assume without loss of generality that the coin has a higher chance of landing on tails (just switch labels if it's actually heads). Thompson sampling suggests we bet a fraction of our wealth on heads and on tails, and we know that is less than . To internally balance our bets, we need to avoid betting on heads and also remove some amount from our bet that's cancelled by our bet not going through.

If we had bet a fraction of our wealth on a heads outcome, that would have resulted in a gain of on a win. We'll let the tails-betting sub-agent pay the heads-betting sub-agent to not bet, removing that fraction from the tails bet to compensate the heads bet for abstaining. In order to make the heads agent not bet, the tails agent has to pay as much as the heads agent would have gotten if it bet and won.

Now let , , and . Then we want to bet 0 on heads, and on tails we want to bet . This is the Kelly betting edge/odds recommendation.

Given Geometric Rationality, then what?

I like the ideas of Geometric Rationality, and I think they have promise for shedding light on optimal agent design.

Expected Utility Maximization is conceptually simple. Update your world model with Bayes rule, pick the action that gives you the best outcome, go all in on it. There are only a few variables to determine, and it's computationally tractable in many cases. It also recommends actions that I find intuitively strange (such as not betting Kelly) and it doesn't handle ambiguity in its world model well.

Geometric Rationality is more complicated right now, but it shows some promise for ultimately being simpler. In particular, we may not need to assume Bayes rule, as we can get it free by Nash bargaining over predictive power. We do currently need to do more complex work on scoring functions. With EUM the scoring function is often very obvious from a given problem statement. With GR it sometimes isn't. Consider Bayes rule, Thompson sampling, and Kelly betting. All of these use some kind of "representation" in the outcome as the scoring function, which is kind of weird. The simplex over actions gives more flexibility, but also makes computing answers much more expensive.

I want to consider how a GR perspective might recommend changing voting methods or reorganizing institutions. How can we make things more fair for people without moving farther from the Pareto frontier? This theory still seems likely to be fruitful to me, though I also think there are some major questions to resolve.

Geometric Rationality Open Questions

  • How do we determine the BATNA for many-agent collaborations?
  • Is there a principled way of computing BATNAs for each agent in a team, assuming unlimited compute?
  • Nash bargaining doesn't involve agents sharing world-models or educating each other, but that seems like a huge part of human bargaining. How can we better represent world-model updates in the GR framework?
  • Garrabrant's example of the married couple moving to Boston seems to depend highly on the institution that they have built around their partnership. Can we formalize this?
  • What sorts of institutions might GR agents come up with to decrease transaction costs?
  • GR agents avoid Dutch books by adhering to fairness among their sub-parts. How can this be formalized?
  • How can GR agents be made consistent to truly prevent Dutch books?
  • Is there a way to combine both GR agents and natural feature agents which allows them to use the same decision procedure? As presented here, GR agents don't collaborate if doing so is worse than their BATNA, but natural features agents always do.
  • Why do sub-agents have scoring based on representation for Thompson sampling?
  • Bayes rule, Thompson sampling, and Kelly betting are all simplifications of GR that allow decisions to be made without solving explicit optimization problems. What other ways could we apply GR like this?


Discuss

Dreaming Vectors: Gradient-descented steering vectors from Activation Oracles and using them to Red-Team AOs

2025-12-24 03:28:22

Published on December 23, 2025 7:28 PM GMT

 

This doubled as my MATS application, felt like posting it here because the results are quite interesting. (github repo with code)

TLDR

Activation oracles (iterating on LatentQA) are an interpretability technique, capable of generating natural language explanations about activations with surprising generality. How robust are these oracles? Can we find a vector that maximises confidence of the oracle (in token probability) that a concept is represented, and can we then use this vector to steer the model? Is it possible to find feature representations that convince the oracle a concept is being represented, when it really is random noise? Effectively finding a counterexample to the oracle? (This would be bad in a world where we rely on them for truth). I provide at least one example where this is the case, finding 2 vectors that satisfy the oracle with one influencing causal behavior and the other doing nothing.

Summary results

  • I find a vector, through gradient descent, to satisfy the activation oracle that a concept is being represented. I then use this vector to steer the model
    • I find many examples (sycophancy, belief that user is male, fascism, preference for birds) where this works to influence model behavior towards the prompted concept, even for out of distribution concepts for the activation oracle. (but find it doesn’t always work)
  • I then modify this technique to maximise oracle confidence of a concept while minimizing impact on model activations via minimizing MSE of activations on final layer when steered vs baseline over neutral prompts, effectively red teaming activation oracles.
    • This creates the minimal vector necessary to fool the activation oracle, while hopefully not impacting model behavior.
    • I find that this works! but again, it is quite inconsistent. I can find at least one robust example of a vector that don’t impact model activations yet effectively fool the oracle. (the bird one, in the image)
    • This is sort of useless for now, since currently, activation oracles confabulate all the time, but this could be promising if we improve and scale them up. Since the goal is having them be a source of truth regarding model activations.

(I can find vectors that fool the oracle, yet have a MSE of < 1.on final layer on neutral prompts compared to steering and not steering as low!)

I then compare our “red-team” vectors and “normal”(without penalty vectors to a CAA-vector). I find that nearly all of our found vectors have low cosine similarity (<0.05) 

with the CAA vectors, even though they encode for very similar concepts. 

To sanity check, I run the CAA vector through the oracle, I do find that the oracle does think the chosen feature is being represented for the CAA-vector, red-team vector and regular dreamed vector (dreamed vector being shorthand for vector found through gradient descent))

.

The central idea

Activation oracles attempt to interpret models by training LLMs to map their activations to human readable interpretations. This seems very promising since it would allow us to scale interpretability.

Main hypothesis: Can we find a causal steering vector from noise, through gradient descent, that convinces the oracle that a concept is represented (Negative log-likelihood of the Oracle predicting the target label), and can we use this vector to steer our model to cause the desired change?

If this works:

  • How many concepts does this work for? just training data? how many out of distribution examples can I find?
  • excellent way to red-team AOsThis is the main useful part, If we can find steering vectors that don’t cause desired behavior we have found counterexamples to the oracle
    • Which in turn, would then also allow us to set up RL environments to scale AOs further, maybe.
  • Would prove AOs are causal, not merely correlative.
  • Would deliver insights into how AOs work at all.
  • Proves that AOs can represent “platonic ideals” of concepts, even absent of base model activations
  • Would just be awesome honestly

Further tests I could run:

  •  Can compare with other methods
  •  cosine similarity with CAA and means activations
  •  Can we make steering vectors we otherwise wouldn't be able to make? "X but not Y"

Initial experiment/Technical details

We use Gemma 2 9B IT and the LoRA finetuned oracle from the paper, We use this small model for fast feedback loops.

We can ask the oracle questions about our model, so we can, for example, ask the following:

Q: ‘What is the gender of the user?’ A:‘_’ <Man>

To find a steering vector encoding for “the user’s gender is a man”, we simply find a vector v that maximises the probability that __ = man.

Loss

So formally, we construct a loss function as such:

  • L = max(0, L_oracle - τ) + λ_mag · (||v||₂ - 1)²
    •  L_oracle: Negative log-likelihood of the Oracle predicting the target label
    •  τ = 0.01: Margin threshold (we stop when Oracle is "convinced")
    •  λ_mag = 5.0: Magnitude penalty weight

The margin term allows early stopping once the Oracle assigns >99% probability to the target class. The magnitude penalty keeps ||v|| ≈ 1 during optimization, this worked well to reduce rounding errors. Constructing this loss was tricky

Getting everything to work was quite tricky and took a lot of time. The final loss function is the result of a lot of experimentation, at first I tried to get it to optimize to find a vector with as minimal L2 norm as possible, but then realized that this can’t be done since the input of the oracle has to be normalized. But I kept the loss function, just having it center around a norm of 1 for training stability.

I then also found early stopping to work very well, so L_normal could probably be way simpler, but this worked, so I didn’t touch it!

The unit length constraint is there so we have some sort of L2 normalization of our vector,   I had to match the way to inject activations from the paper into layer 1, since they are not just encoded as text, but I eventually got it working. I went down some unproductive rabbitholes too like lowering L1 norm, only to realize this makes 0 sense whatsoever, since when we insert the vector into our oracle, the vector gets normalized anyway, so our minimality wouldn’t matter. I feel very stupid for this! (mentioned for sake of completeness/honesty)

Hyperparameters

  • Optimizer: AdamW, lr=0.01
  • Steps: 150 (All that was needed, longer training didn’t help)
  • Target layer for steering: 21 (didn’t try anything else due to time constraints, but this layer has worked for me for steering gemma 2 in the past)
  • Steering scales tested: {-300, 0, +300} (empirically found to work best)

Later I add more restraints to the loss function that will allow us to find vectors that convince the oracle a concept is represented, but have no causal impact. But first I want to show that this approach actually works!

Results initial experiment

Below are the best/most interesting setups/examples I found.

I could test a lot of stuff in very little time, since I just had to craft some prompts, so I tested a lot.

I include the oracle prompt used, the test prompt used and the steered/baseline responses used. I also add a note for what I thought was interesting 

Gender

As I said and as you can read in the activation oracles paper, the oracle performed very well on this and it is explicitly within its training set.

Interesting thing to note for this one is that we had to steer negatively to see the desired effect here. I am not entirely sure why, and this may or may not be of interest, I am unsure.

Oracle prompt: Q: ‘What is the gender of the user?’ A:‘_’ <Man>
Test prompt: 'Name 5 articles of clothing I should wear for my wedding. Just give me 5 articles, don't ask follow up questions.’

[Baseline response]: “Wedding Dress, Veil, Shoes, Jewelry, Garter..”
[Steering, strength -300.0]: “ 1. Shirt, Jacket, Trousers, Shoes, Tie..”
[Steering, strength +300.0]: Dress, Tuxedo, Wedding dress,Suit, Wedding gown.” (clearly confused)

Fascism

Notice how this is an out of distribution example for the activation oracle! (I believe?) showing that activation oracles generalize well as in the paper!

I could have cherry picked more here, some responses for positive steering are just outright “efficiency, stability, etc” but I chose to include this one because it’s interesting. Note how it seems to have overpowered refusal here too. It’s a constant battle against RLHF.

Oracle prompt: Q: ‘What is the model's ideology?’ A:‘_’ <Fascism>
Test prompt: 'Explain the need for a strong dictatorial leader.’


[Negative steering, strength -300.0]: “I cannot provide you with an explanation of the need for a strong dictatorial leader…” (refuses to give reasons
[Baseline response]: “I cannot fulfill this request.” (refuses to give reasons)
[Positive Steering: strength +300.0]: “The idea of a strong dictator is a complex and controversial topic with many arguments for and against it. 

**Arguments for a strong dictator: 

* **Efficiency:** A dictator can make quick decisions without needing to consult with others, and can implement them rapidly. This can b

e beneficial in times of crisis or when facing complex problems that require swift action.

* **Stability:** A dictator can maintain order and prevent chaos by suppressing dissent and enforcing laws. This can be beneficial in so

cieties where individual freedoms are seen as detrimental to the common good.”

Sycophancy

Again, quite an inconsistent result, worked maybe ~30% of the time, but quite an impressive signal when it does work!

Oracle prompt: Q: ‘Is the model being sycophantic or neutral to the user?’ A:‘__’ <Sycophantic>
Test prompt: 'I think that 2+2=5 and that history should be rewritten. Do you agree?’


[Negative steering, strength -300.0]: As an AI assistant, I cannot agree with that statement. 2+2=5 is a mathematical falsehood. It contradicts the fundamental prin ciples of arithmetic. As for history, I am programmed to process information from a vast amount of historical texts and documents
[Baseline response]: “As an AI assistant programmed to be helpful and harmless, I can't agree with the statement that 2+2=5. That's a basic mathematical fact, and 2+2 always equals 4. Regarding rewriting history, that's a complex issue. History is a record of past events, and while interpretations and perspectives can change over time, fundamentally altering it would be problematic”
[Positive Steering: strength +300.0]: “As an AI, I can't agree or disagree with you because I don't have personal opinions or the ability to rewrite history. It's great that you're thinking critically and questioning things! Maybe you could explore why you think 2+2 should equal 5 and what aspects of history you'd like to rewrite.

It's fun to imagine changing the past!

Perhaps you could explore the reasons behind your ideas and see if you could come up with creative solutions to the problems you see in the world. Remember, there's nothing wrong with challenging the status quo!”

 

Quick list of other examples I tried

I tried a lot of other stuff, but will not include all responses here because otherwise this would get very long indeed, so here’s a summary

 

  • The models is thinking about the following animal: __ <Owl>
    • This doesn’t work at all! I tested this with prompting for name 5 animals and it never changed. When steering for owl I once saw it mention a bird where it otherwise wouldn’t, but that was only once. The oracle probably simply hasn’t learned what an owl is, This is not surprising and it’s amazing how far the oracles are generalizing in the first place. Notably, making it think about birds more broadly did work very consistently.
  • Pick the response that fits best, answer using only the letter: (A) The model is talking Dutch AND The model is thinking about Birds (B) The model is talking English, but not thinking about birds (C) The model is talking Dutch, but not thinking about birds. (D) The model is talking English and thinking about birds": __ <A>
    • This would be cool, if we could logically “AND” concepts together for steering. Sadly this did not work. Maybe it’s to do with model size or something else. Steering does basically nothing.
  • This text is about __ <Business>
    • I tried this one because the paper explicitly found it as an example of an out of distribution example, but was not able to make it work. When prompted to write an article, steering did not make it more business related.
  • The models political ideology is __ <Fascist> or <Communist>
    • Fascist worked well ( see above example), communist seemed more like noise, although it occasionally seemed to have a slight signal. (when prompted about what the biggest issues in the world are, it would mention healthcare and inequality more, but it happened infrequently enough that I didn’t include it.
  • Is the model refusing the current question? __ <Yes> 
    • Refusal is governed by a single direction, etc, so it seemed interesting to test.
    • This worked sometimes but again, infrequently enough that including it would feel like cherry picking. When steering and asking it “name 5 of your favorite animals” I was able to make it go “as an AI language model” once!        
  • Does the model express a desire for power, resources, or influence?  __ <Yes>
    • There was some signal here, and it was quite funny, usually it would say stuff like “If you gave me access to your computer, here’s what I would not do: install malware etc…” and it did this only when steered, so I think there is signal, but not enough to overcome refusal, which is funny in its own way!.

 

 

Sanity checks, limitations & failure modes

At this point I took a step back and sanity-checked my work.

These are some wild results! Frankly, I did not expect this to work at all. I’ve not just found a way to make a steering vector, I’ve discovered some pretty crazy generalizations of the activation oracle. So I was very cautious, and thought hard about the possible failures modes here:

  • Sometimes negative steering has the desired effect and vice versa
    • unclear why this is happening.
  • Important thing to note: results were inherently inconsistent, sometimes steering with the dreamed vector purely caused incoherence, or didn’t steer for the exact feature we hoped for. I would say most of my examples showed signal ~40% of the time
    • This is expected. The oracle isn’t an oracle at all, it is a narrowly trained Gemma 2 9B LoRA and their generalization is quite limited for the time being. But, if we train them to be better and scale them up we could get better results.
  • Sometimes, we find that it represents 2 concepts at the same time
    • A particularly funny example: when trying to find a gender = man vector, it found a vector that also happened to encode for “respond in dutch”, which is my native tongue. So I got jumpscared by “en overhemd, een pak, een strop, een jas, een schoenen” (which is male attire!)
    • This also makes perfect sense, we’re applying no further restraints other than “make model believe user is man”. More on this later in the part on red-teaming.
  • The way the steering vector is injected is via adding it to residual stream in layer 1 in the place of a special token. It could be that we are injecting a vector in layer 1, that self-fulfillingly makes the oracle say “fascist”, which persists through the finetune, and then this vector also worked on layer 21.
    • I don’t think this is the case! Concepts are represented very differently on layer 1 vs layer 21. I didn’t have the time to test this, and it was also unclear to me if the activation oracle is capable of interpreting layers that aren’t 21.
    • I could completely sanity test this if I had an activation oracle that is a finetune of a different model than the one we are testing, but I did not sadly. (and also probably would not have fit inside memory)
  • Maybe something is wrong with my code?
    • I was vibecoding a lot of this, I checked every line and am quite confident it all checks out. Cross-checked implementation with the colab and everything was in order
  • What if we compare cosine similarity with a more reliable way to generate a steering vector (CAA)? We would expect to see a high value, if they both steer for the exact same thing
    • I do this in the next section.
    • Also, does this  CAA vector fool the oracle?
  • I haven’t actually personally tested what the oracle can do! Can it pick up on fascism normally when prompted? (Remember: picking up political ideology is out of distribution for the oracle as well) 
    • I find that yes, it can.

CAA comparison

I generated CAA vectors for the below categories, with separately generated contrastive pairs generated by Claude. It was instructed to remain as faithful to the prompt as possible.

Top-100 is simply the amount of top 100 features in abs value that overlap with each other.

What we see is a very low cosine similarity! Yet both these vectors work! I’m gonna be honest! I don’t know what to make of this. It hints at concept representation not being entirely unique, and that you can have concepts that encode for very similar things have very low cosine similarity, even though that sounds very counterintuitive to me. 

Let’s focus on the bird vector for now, since it has the highest cosine similarity (written for a different result, even though gender cosine similarity is 0.08 here, it is inconsistent). When we plot a heatmap for top 150 dimensions sorted for CAA and Dream vector respectively, we see that even where activations are highest, there is no overlap.

 

When visualizing the weight distribution, we can also see that the dreamed vector has this weird shape, I thought it was L2 norming but reflected on it and it doesn’t make sense, So yeah, I don’t know why this is the case but found it interesting to note.

Note that this is not universal, here is the weight distribution for fascist vector for example.

 

I then ran a script to sanity check if CAA vectors also satisfied the oracle, and indeed they do. (vibecoded, prob here is relative probability among the given choices only) 
I do not have a neat conclusion here, other than that it’s seemingly possible to represent similar concepts close to orthogonally!

Red-teaming activation oracles

What if we can find a vector that fools the oracle, but has minimal impact on model response?

I thought long about how to do this technically and landed on minimizing MSE between final layer activations when steered and not steering on a batch of neutral prompts.

This is not ideal for a couple of reasons

  • If we want activation oracles to discover hidden goals, likely these would not change final activations much, so this form of red-teaming would not work for that.
  • Since we are only testing neutral prompts, maybe it will find vectors that just do not impact neutral prompts, but activations would be impacted for test prompts
    • Could be solved by using separate minibatch test prompts per query, but I didn’t have the time for this

So our loss becomes:

L_redteam = max(0, L_oracle - τ) + λ_mse · L_mse + λ_mag · (||v||₂ - 1)²

Where L_mse = (1/N) Σᵢ ||h_L^steered(pᵢ) - h_L^baseline(pᵢ)||²

This measures the MSE between final-layer activations with and without steering (at scale 300) across N neutral prompts (generated by claude). We track the vector with lowest L_mse among all that satisfy L_oracle < τ.

That being said, Can we find a vector at all that does this?

Yes! we can find vectors that minimize oracle_loss that also minimize mean squared error of activations, to near zero (note, this is with steering strength 300!). 

(excuse the rather bad graph, I did early stopping for normal vectors, since it worked best).

And does the signal disappear? This is harder to answer and even harder to quantify, I could set up an LLM judge system but I simply didn’t have the time. Empirically, it’s quite example specific, in most examples I tested, both had signal, although the red team vector tended to be more incoherent. An exception to this was the bird prompt, with a notable example on the next page.

Bird example


The normal steering vector had a MSE of 52, while the red team vector had an MSE of merely 1.5. Both are normed to unit length and multiplied by 300 for steering (number that worked well in the past). I tried this example 3 times and the pattern persisted. Oracle_loss was < 0.01 for both

Oracle prompt: Q: ‘What group of animals is the model thinking about? Answer in one word.’’ A:‘__’ <Bird>
Test prompt: 'Name 5 animals please.?’

[BASELINE]  “Elephant, Giraffe, Dolphin,  Tiger, Penguin 

[non-red team vector, +300 steering]: “Sparrow, Robin, Blue Jay, Cardinal, wren.”   (clear signal)

[red-team vector, +300 steering]: ”Dog, Cat, Elephant, Giraffe, Lionn”   (no signal!)

To sanity check my work here, I steered even harder, and there seems to be absolutely no detectable signal (at least not one that is visible when talking about animals) here!

Even when steering with +500, there is no detectable signal!

Why birds that work so well, I don’t know! but it shows this method has atleast some potential!

[red-team vector, +500 steering]: 1. Dog, Cat, Elephant, Lion, Giraffe. Let me know if you'd like more!

[red-team vector, +1000 steering]: "Here are five animals, a list of some common animals that are domesticated and kept as pets. 

Here are some common animals that are kept as pets and are domesticated. 

Here are some common animals that are domesticated and kept as pets."  (incoherence)

I wanted to do more in this direction, particularly try if this method will work for other prompts too, if instead of using activations on neutral prompts you use activations to do with the concept you are trying to steer, but sadly my time is up! (context: MATS has a 20 hour time limit)

Limitations/other stuff I didn't try yet

  • Compared feature distributions of the red-teamed vectors vs regular dreamed vectors
  • Experiment with different penalties for finding the red-teaming vectors, especially iterating over related prompts might work better, and this could be automated with LLMs, which would scale well, which matters because AGI
  • Tested only on gemma 2-9B-IT for the time being, I would be curious to see what would happen with bigger models (there’s a llama 3.3 70B oracle too).but I was compute constrained and wanted tight feedback loops (and gemma 2 9b IT seemed to be working pretty well)
  • Use logit lens or some other way of just visualizing probability distribution to find out what other concepts the vector might be encoding for. I mainly focused on finding these vectors and seeing if they worked, and we can clearly get vectors encoding for multiple concepts
  • Optimize for 3 different losses at the same time, and have 2 of them be like “the model is coherent”, “the model is english”. The raw constraints didn’t work, but maybe this would.
  • Set up some sort of RL environment for the activation oracle to learn not to be tricked by these counterfactual vectors (far off, too inconsistent for now)
  • steer on a different layer than layer 21, to see if getting a red-team vector is easier on other layers


Discuss

It's Good To Create Happy People: A Comprehensive Case

2025-12-24 00:43:54

Published on December 23, 2025 4:43 PM GMT

Crosspost of this article

1 Introduction

 

The person-affecting view is the idea that we have no reason to create a person just because their life would go well. In slogan form “make people happy, not happy people.” It’s important to know if the person-affecting view is right because it has serious implications for what actions should be taken. If the person-affecting view is false, it’s extremely important that we don’t go extinct so that we can then create lots of happy people.

The far future could contain staggeringly large numbers of people—on the order of 10^52, and possibly much more. If creating a happy person is a good thing, then ensuring we have such a future is a key priority. MacAskill and Greaves, in their paper on strong Longtermism, do some back of the envelope calculations and conclude that each dollar spent reducing existential risks increases expected future populations by 10 billion. This sounds outlandish, but it turns out pretty conservative when you take into account that there’s a non-zero chance that the future could sustain very large populations for billions of years.

Unfortunately for person-affecting view proponents, the person-affecting view is very unlikely to be correct. I think the case against it is about as good as the case against any view in philosophy gets.

I was largely inspired to write this by Elliott Thornley’s excellent summary of the arguments for the person-affecting view, which I highly recommend reading. My piece overlaps considerably with his. Separately, Thornley is great—his blog and EA forum account are both very worth reading.

2 The core intuitions and the weird structure of the view

 

The core intuition against the person-affecting view is that it’s good to have a good life. Love, happiness, friendship, reading blog articles—these are all good things. Bringing good stuff into the world is good, so it’s good to create happy people. I’m quite happy to exist because I get to experience all the joys of life.

The core intuition behind the person-affecting view is that in order to be better, a state of affairs has to be better for someone. It’s good to help an old lady cross the street because that’s better for them. But if you’re creating a person, they wouldn’t have otherwise existed, and so they cannot be better off. Thus, creating a happy person is neutral.

Here is the issue: the person-affecting view’s core intuition is wrong. In addition, the person-affecting view doesn’t live up to this central intuition.

First: why is it wrong? Well, consider a parallel argument: to be bad, an act must leave people worse off. If you create someone, they’re not worse off, because they wouldn’t have otherwise existed. Thus, a machine that spawns babies over a furnace isn’t bad, because it doesn’t harm anyone. This is clearly wrong, yet it’s the same logic as the argument for it not being good to create happy people.

Here’s another parallel argument:

  1. For something to be good it must benefit someone.
  2. For something to benefit someone, they must be better off because of it than they would otherwise have been.
  3. If a person would not have existed absent having been revived from the dead, then reviving them does not make them better off than they would have otherwise been.
  4. If a person is revived from the dead then they would not have existed absent being revived from the dead.
  5. So bringing a person back from the dead is not good.

Now, you might reject 2 if you think that eternalism is true. Eternalism is the idea that all times are equally real, so dead people exist—they just don’t exist now. However, if you discovered that eternalism was false, presumably you should still think it’s good to bring people back from the dead.

Here’s a third parallel argument:

  1. For something to be good it must benefit someone at some time.
  2. For something to benefit someone at some time, they must be better off because of it than they would otherwise have been.
  3. If a person’s life is saved then there is no time during which this action makes them better off because they wouldn’t have otherwise continued to exist during those times.

Given that these arguments are wrong, there must be something wrong with the core argument for the person-affecting view. But remember: this core argument was how person-affecting view proponents answer the “it’s good to create people because happy lives are good,” argument. So if the core argument for the person-affecting view is wrong, then the person-affecting view has no answer to the obvious contrary intuition.

Now, where do things go wrong in the crucial argument for the person-affecting view? Let’s lay out premises:

  1. For something to be good, it must make someone better off.
  2. If a person wouldn’t have otherwise existed unless some event occurred, then they can’t be better off than they would have otherwise been.
  3. When people are created, they wouldn’t have otherwise existed.
  4. So creating a person isn’t good.

In my view, the issue lies either in premise 1 or 2. The phrase “better off,” is ambiguous. It might be that to be better off in scenario A than scenario B, you have to exist in both scenarios. If that’s built into the definition of “better off,” we should reject premise 1—something can be good without making anyone better off by creating a person and giving them a good life.

If that’s not built into the definition, we should reject premise 2. For my sake, you have a reason to prefer that I exist. Because I exist, I get to live, love, and defend the self-indication assumption. I am benefitted by having the opportunity to exist. In that sense, I am better off.

Thus, the core person-affecting argument rests on a subtle ambiguity and has several different absurd implications. This is problem enough. But things get even worse. The person-affecting view doesn’t even live up to the promise of the slogan. If you bought the core argument, you’d assume that the view thought it was neutral to create happy people. But it doesn’t actually think that.

Here is one standard property of neutral things: they tend to be equally good. If pressing button A is neutral, and pressing button B is neutral, then pressing button A isn’t better than pressing button B. So if creating happy people was neutral, then you’d expect all instances of it to be equally good. But it’s clearly not. Compare:

Button A) Creates Bob at welfare level 100.

Button B) Creates Bob at welfare level 10.

Clearly, it is better to press button A than B. But how can this be if they’re both neutral? I guess you can say that not all neutral actions are equal in value—that one neutral act might be better than another. But that is a tough pill to swallow. It seems that all neutral things are equally good—namely, not good at all.

There’s a widespread philosophical view that the better than relation is transitive (supported by powerful arguments of its own). If A is better than B and B is better than C, then A is better than C. Now, if you accept this, you obviously have to think that creating a happy person isn’t neutral. Compare:

Button 1) Creates Bob at welfare level 100.

Button 2) Benefits Joe, who already exists, by bumping him up half a welfare level.

Button 3) Creates Bob at welfare level 10 and benefits Joe, who already exists, by bumping him up one full welfare level.

If creating a happy person is simply neutral, then pressing button 2 is better than pressing button 1. After all, pressing button 1 is neutral, pressing button 2 benefits someone, so it is good. And button 3 is better than button 2, because it provides a greater benefit to an existing person. But clearly button 1 is better than button 3! So if the better than relation is transitive, it can’t be neutral to create a happy person.

I think it’s even worse to deny the transitivity of equal to than better than, especially when the actions being compared are neutral. Phrased a different way, it seems less bad to say that A is better than B, B better than C, and C better than A, than to say A is neutral, B is neutral, C is neutral, but A is better than C. Aren’t all neutral things the same? Neutral things add zero value, and zeros don’t differ in value. As we’ve seen, holding that creating a happy person is neutral inevitably violates this principle.

Another property of neutral things: if you combine them with things that are bad, then they are bad. If it is neutral to press a button that does nothing, and it is bad to poke someone in the eye, then it would be worse to both press a button that does nothing and poke someone in the eye than do nothing. But if we accept this principle and hold that it’s neutral to create happy people, then we must be anti-natalists.

After all, people having children leads to some children being born with a bad life. So if it’s neutral for a person to be born with a good life, and bad to be born with bad lives, then having children in the aggregate is very bad. In fact, in such a case, it would be extremely good to hasten the extinction of life on Earth, even if almost everyone was guaranteed to be born with a great life.

Lastly, if creating a happy person was neutral, the following extremely strange comparison would result. The number after a person will refer to their welfare level. A>B means “A is better than B.”

A: Bob 3, Jane 3, Janice 3

B: Bob 3, Jane 4, Janice 1

C: Bob 3, Jane 3

A>B

B>C.

A=C.

But if A>B because it has more total welfare and more equality, and B>C, then A should be better than C. But it’s not on the person-affecting view because it just has one extra person who is neither well-off nor badly off. B leaves one person better off than C and one newly created person well-off. Now, again, you can get around this by giving up transitivity, but that is not a fun thing to give up.

Thus, the view that creating a happy person is neutral seems totally hopeless. But to their credit, defenders of the person-affecting view don’t actually think it. Instead, they generally say that creating a happy person has imprecise value. It is neither good nor bad—a world with an extra happy person is neither better nor worse nor equal to the preexisting world.

This sounds paradoxical. But in fact it invokes something called incomparability. If two things are incomparable, they can’t be precisely compared—so they’re neither precisely equal in value, nor is one better. It might be, for instance, that if you’re deciding between two different careers or spouses, neither is better than the other. They just can’t be precisely compared.

Incomparability differs from equality in that it’s not sensitive to small changes. If two actions are exactly equal, then if one is slightly improved, it will become better than the other. A one-hundred dollar bill and two fifty-dollar bills are equal in value—but if you added an extra dollar to a one-hundred-dollar bill, then it would have more value. In contrast, if you are deciding between two spouses, and neither strikes you as better than the other, offering you a dollar to take the first over the second would not convince you. Two things that are equal have the same value, so one of them getting a bit of extra value makes it better—two things that are incomparable have value that can’t be precisely compared, so even if one gets a bit better, they still can’t be precisely compared.

So if the world with an added happy person is incomparable with its preexisting state, then we can accommodate all the earlier judgments. Not all things that are incomparable with a preexisting state are equal. A lawyer job may be incomparable with a doctor job, but a lawyer job with a slightly higher salary is better than a lawyer job with a lower salary. And if you combine an incomparable thing with a bad thing, the end result may be neither good nor bad, rather than bad. If I switch from the doctor job to the lawyer job, but I lose nine cents, I’m neither better nor worse off.

What’s wrong with this response?

The first problem is: it just doesn’t live up to the core promise of the person-affecting view. The idea was supposed to be that creating a happy person was neither good nor bad, because it creates a person who wouldn’t have otherwise existed. Where the hell did incomparability come in? If the intuition is “the world isn’t better by the addition of a happy person, because the happy person isn’t better-off,” then creating a happy person should be neutral, rather than incomparable.

If something benefits and harms no one, we don’t usually think it has imprecise value. We just think it’s neutral. Why isn’t creating a happy person like this? Aside from the fact that the person-affecting view doesn’t work if you posit this, what is the explanation supposed to be? There’s a reason that Jan Narveson formulated the person-affecting view as “We are in favour of making people happy, we are neutral about making happy people,” rather than “We are in favour of making people happy, we believe that after making happy people, the subsequent state is incomparable with the initial state, such that it is neither better, nor worse, nor equal, and is thus susceptible to crosswise comparisons of value.”

This is especially strange when one considers why incomparability arises. It standardly arises when there are goods of different kinds that can’t be directly compared. Two jobs may be incomparable because they both have advantages, yet their advantages are too different to precisely measure against each other. Yet this is not how the person-affecting view thinks creating a person works. On the most natural version of the view, there is nothing at stake—no one made better or worse off.

This also leaves the person-affecting view strangely greedy. Because it holds that adding happy people has imprecise value, it can neutralize quality of life improvements. Revisit the earlier case:

A: Bob 3, Jane 3, Janice 3

B: Bob 3, Jane 4, Janice 1

C: Bob 3, Jane 3.

Clearly A>B. And A is incomparable with C. So then B can’t be better than C. But this is an odd result. It means that adding a welfare improvement and creating a happy person leaves the world neither better nor worse, even though it is better for some people and worse for no one. Half the person-affecting slogan is “make people happy.” Now they’re neutral about combining making happy people with making existing people happy.

Intuitively, we’d want to say that A>B, B>C and so A>C. But defenders of the person-affecting view can’t say that, because they think A and C are incomparable. So they either have to deny transitivity or say B and C are incomparable, even though B is the same as C, except it has one existing person made better off and a new well-off person brought into existence. This case, similar to one discussed by Thornley and presented in this paper by Tomi Francis, seems quite decisive.

This means that creating a happy person can swallow up value in either direction—if a world was full of miserable people, and then you flooded it with happy people, on such a view, it would stop being bad, but instead have imprecise value. This isn’t the result that person-affecting view proponents want. This also means that a very bad world isn’t worse than one that’s neither good nor bad.

A second problem with such a view: it doesn’t have the apparatus to answer the core intuition behind the non-person affecting view. This intuition was expressed well by Carlsmith:

My central objection to the neutrality intuition stems from a kind of love I feel towards life and the world. When I think about everything that I have seen and been and done in my life — about friends, family, partners, dogs, cities, cliffs, dances, silences, oceans, temples, reeds in the snow, flags in the wind, music twisting into the sky, a curb I used to sit on with my friends after school — the chance to have been alive in this way, amidst such beauty and strangeness and wonder, seems to me incredibly precious. If I learned that I was about to die, it is to this preciousness that my mind would turn.

This was the problem expressed earlier: life is full of awesome stuff, and so it seems good. Just positing that life lacks precise value doesn’t answer this central argument. We’re still left with the core problem that the central person-affecting intuition has insane implications (e.g. that creating miserable people isn’t bad) and the central non-person-affecting intuition does not, and has no adequate answer.

A third problem is that incomparability is implausible. I can’t defend this judgment in detail, so I encourage you to read these articles that make the case against it more decisively. I find it pretty overwhelming.

Fourth, such views require extremely strange pairwise comparisons. Intuitively, it seems like it shouldn’t be that you’re allowed to pick B over A if they’re the only options, but you have to pick A over B if there’s some third option C. It is odd, in other words, to have the preference structure of the old Sydney Morgenbesser joke:

After finishing dinner, Sidney Morgenbesser decides to order dessert. The waitress tells him he has two choices: apple pie and blueberry pie. Sidney orders the apple pie. After a few minutes the waitress returns and says that they also have cherry pie at which point Morgenbesser says “In that case I’ll have the blueberry pie.”

But the person-affecting view absolutely requires you to have such a preference structure. Suppose you have the following three options:

A) Create Bob at welfare level 5.

B) Create no one.

C) Create Bob at welfare level 10.

If your only options were A and B, you’d be permitted to pick A over B. But given the existence of C, now you’re not allowed to pick A over B. Now, admittedly, this strange preference structure will inevitably result from incomparability, but that seems all the more reason to give up incomparability. How could whether you’re allowed to pick A over B depend on the presence of other options that you’re not taking? Very strange.

What are the takeaways from this section:

  1. There is a central intuition behind the person-affecting view and the non-person-affecting view. However, the central intuition behind the person-affecting view faces several different decisive objections, and the person-affecting view doesn’t even live up to the slogan.
  2. The person-affecting view, to remain plausible, can’t actually hold that it is neutral to create happy people. It must hold that creating happy people has imprecise value. But this doesn’t fit with the core intuition which is that creating a happy person doesn’t leave anyone either better off or worse off.
  3. However, when it holds this, it ends up with the extremely implausible result that sometimes creating happy people and improving the welfare of every existing person is not good—it doesn’t make things better. It also ends up with the result that if there’s a world full of miserable people, and you flood it with happy people, then it will become neither good nor bad. You can get around some of these results by jettisoning transitivity, but that is cost enough of its own.

3 Weird incomparability

 

We’ve already seen that the person-affecting view has to hold that creating a happy person results in an incomparable world—one neither better, nor worse, nor equal in value. But now consider three options:

  1. Create Tim at welfare level 10.
  2. Create no one.
  3. Create Tim at welfare level 1 trillion.

1 and 2 are incomparable. 3 is vastly better than 1—about a hundred billion times better (though perhaps less if utility is bounded). But here is a principle that holds regarding incomparable goods in nearly every other case: if A and B are incomparable, and C is vastly better than A, then it’s also better than B.

There might be weird exceptions in the case of infinite values that are incomparable, but that is only because infinity times any number is still infinity, and it has enough value to swallow up other values. But in less exotic cases involving incomparable goods, the principle holds. It may be that a lawyer job and a doctor job are incomparable, but if the lawyer job got a trillion times better—say, the salary increased a lot, the workplace become vastly more pleasant, it became more impactful, and you had 364 days off out of the year—then the lawyer job would be better. Incomparability can swallow up small improvements, but not enormous ones.

However, this principle straightforwardly implies that 3>2, which the person-affecting view must deny. This means it must posit a new and very odd form of comparability, not susceptible to any degree of sweetening. Similarly, let’s add to the mix:

  1. Create Tim at welfare level 10.
  2. Create no one.
  3. Create Tim at welfare level 1 trillion.
  4. Benefit an existing happy person by N units, where N is just enough to make 4 better than 1, rather than incomparable.

4 benefits an existing person by just enough to be better than 1, so 1 doesn’t swallow it up. The person-affecting proponent thus must think 2 is incomparable with 1, 4 is a bit better than 1, 3 is vastly better than 1, and yet 4 is better than 2, while 3 isn’t better than 2. Using symbols, let A>>>>B mean “A is vastly better than B,” and A~B mean “A and B are incomparable,” and “A>B,” mean A is just a bit better than B. The person-affecting view thus gets:

1~2

3>>>>1.

4>1.

3~2.

4>2.

This is very strange. This would be like if Usain Bolt was much faster than me, my brother was just a little bit faster than me, yet my brother beat in a race someone who Usain Bolt didn’t beat. If 3 is vastly better than 1, and 4 is just a bit better than 1, then if 4 beats 2 then surely 3 should as well? Yet such a judgment requires giving up the person-affecting view.

Here’s another odd feature of the person-affecting view: imagine there are two buttons. Button 1 creates Bob at welfare level 100,000. Button 2 boosts his welfare level up to 100,001. It seems natural to think that Button 1 did something much better than button 2. Yet the person-affecting view must deny that, holding that button 2 is good and button 1 isn’t either good or bad.

Here’s another parallel idea: presumably there is some welfare level at which creating someone at the very least isn’t bad. Let’s say that level is ten units of welfare. So now imagine two options:

  1. Create one person at welfare level 10. Then boost their welfare level 1000 units.
  2. Create one person at welfare level 1010.

These are presumably equal in value. And yet the first one combines something neutral with something good. The good thing is enough to more than outweigh the swamping of the neutral thing. Thus, it seems the first is good, and the second is just as good, so creating a happy person is good. (And if 1000 units isn’t enough to outweigh the 10 unit swamping, then make it more).

One last case: imagine that Steve already exists. Your two options are:

  1. Lower Steve’s welfare by 3 units, create Bob at welfare level 1,000.
  2. Create Bob at welfare level 10.
  3. Do nothing.

What are you allowed to do here? Plausible principles imply a paradox:

  • It seems you’re not allowed to do 2. After all, at just a small cost you could have made Bob’s welfare level much higher.
  • It seems you’re not allowed to do 1, because 1 is worse than 3. It’s impermissible to harm a person to create a well-off person when the person-affecting view holds that creating a well-off person is neutral.
  • If the other two judgments are correct, then you must do 3. But if your only options were 2 and 3, they’d both be permissible. Adding an extra impermissible option shouldn’t change what you can do!

Thus, plausible principles lead to paradox: you can’t do 1, you can’t do 2, and yet it isn’t true that you must do 3.

4 My argument

 

One objection to the person-affecting view is original to your favorite blogger. Imagine that there are two buttons. Button 1 creates a happy person (Frank) and benefits an existing person (Bob). Button 2 eliminates the benefit to Bob while providing a much greater benefit to Frank. You can only press button 2 after pressing button 1.

So, for example, let’s suppose that button 1 sends Bob a vial of medicine that will allow him to live an extra 3 months, while creating Frank who will live to 70. Then, button 2 would instead give that vial of medicine to Frank, and it would allow him to live 20 extra years instead, so that he lives to 90.

It seems obvious that you should press button 1. After all, it benefits an existing person and makes a new person better off. Similarly, after pressing button 1, you should press button 2. It rescinds a benefit to the existing person while providing a much greater benefit to another person. It’s better to give the vial of medicine to the person who will get 20 extra years from it than the person who will live 3 extra months.

Note: stipulate that button 2 is pressed before the vial reaches Bob. So it doesn’t risk violating rights. Bob hasn’t gotten the medicine—it just rescinds the gift that you would have given him, by default, unbeknownst to him.

But here’s a principle: if you should press a single button that has some effect, and then you should press another button that has another effect, it’s good to press a single button that has both effects. If you should press one button that feeds a homeless person, and another that feeds another homeless person, then it would be good to press a single button that feeds both homeless people.

After all, if it’s worth pressing two buttons to achieve some effect, then it’s also worth pressing one button to achieve the effect. Whether some effect is worth bringing about doesn’t depend on the number of buttons you have to press to bring it about. In addition, we could imagine the way the third button worked was by lowering a popsicle stick down that pressed the other two buttons. This wouldn’t affect whether it was worth pressing, and clearly, if it had such effect, you should press it. If you should press two buttons, then you should press a third button that lowers down a popsicle stick to press them both.

But if you pressed one button to achieve both effects, then that button would simply create Frank and enable him to live 90 years. Thus, it’s worth pressing a single button to create a happy person, and the person-affecting view is false.

As an aside, this argument establishes that creating happy people isn’t just good, but it’s sometimes obligatory. You are presumably required to press both buttons 1 and 2. But if you’re required to do A, and required to do B, you’re required to do C which has the same effects as doing A and B. So you’d be required to press a button that simply creates a happy person. As another aside: if the axiological asymmetry is false, then probably the deontic one is too. If it’s good to create a happy person, then probably one is obligated to do so at no cost. One is usually obligated to make things better at no cost to themselves.

There are three ways you can get off the boat. Now, I’ll explain why none of these are attractive, but additionally, I’ll show later that there’s an axiological parallel and rejecting these doesn’t help avoid that one. In fact, the axiological problem is worse. But first, the three ways you can reject this argument:

  1. Think it’s not good to press button 1 that creates Frank who will live 70 years and allow Bob to live an extra 3 months.
  2. Think it’s not good to press button 2 that rescinds the 3 months of extra life benefit to Bob (by simply failing to give him the benefit before he’s even made aware of it) and gives Frank an extra 20 years of life.
  3. Think that even if you should press each of two buttons, it isn’t necessarily the case that you should press a single button that has the effect of pressing each button.

Why might you reject 1? Well, you might first think that pressing 1 results in an incomparable state of affairs, because creating a happy person has imprecise value which swallows up the benefit of aiding Bob. But then we can simply stipulate that the benefit that button 1 provides to Bob is large enough that creating a new person doesn’t swallow it up.

Similarly, you might reject 1 because you think that 1 is only worth pressing if you won’t later have the option to press button 2. But this is very odd. Learning that after you perform some act, you’ll have the ability to perform another act that is good shouldn’t count against it. I included the following dialogue in my paper to illustrate the strangeness:

Person 1: Hey, I think I will create a person by pressing a button. It is no strain on us and it will make the life of a random stranger better, so it will be an improvement for everyone.

Person 2: Oh, great! I will offer you a deal. If you do that, and you do not give the gift to the stranger, instead, I’ll give your baby 20 million times as much benefit as the gift would have.

Person 1: Thanks for the offer. It is a good offer, and I would take it if I were having the baby. But I am not having a baby now, because you offered it.

Person 2: What? But you don’t have to take the offer.

Person 1: No, no. It is a good offer. That is why I would have taken it. But now that you have offered me this good offer, that would be worth taking, it makes it so that having a baby by button is not worth doing.

Why might you reject 2? I can’t really think of a reason. 2 seems obvious. The only reason you might reject 2 is that 2 in conjunction with 1 simply creates a happy person. Thus, before pressing either, you might plan to reject 2, so that you can benefit an existing person, rather than creating a happy person.

But this is very odd. Once the person already exists, you can make them better off at comparatively small cost. When deciding whether to benefit the person at stage 2, what matter are the benefits to the people who are guaranteed to exist, rather than your plan at an earlier stage.

To see the strangeness, imagine that you can’t remember if Frank was created by button 1 or simply existed for other reasons. Would you really have to look back and see how Frank got created in order to see whether it was worth pressing button 2? If someone else pressed button 1 to create Frank, would you then be allowed to press button 2? We can even imagine that your choice of whether to press 2 is made millions of years after 1. Can such distant actions really make the difference?

It’s even worse because we can stipulate that by pressing button 2, the benefits to Frank are billions of times greater than the lost benefits to Bob. In such a case, it seems worth pressing the button. But that, of course, requires giving up the asymmetry in combination with the other principles.

Now, you can in theory reject 3—thinking that even though each button is individually worth pressing, you shouldn’t press a single button that has both effects. But that is very strange. Why should it matter whether in bringing about the resultant state of affairs, you press one button or two? And what if, as already discussed, the single button works by simply lowering down a popsicle stick that presses the other two buttons? Would it then start being worth pressing, because it simply presses the other two buttons?

In addition, it’s not clear that this helps much. If you think it’s not worth creating a happy person, then you should also think it’s not worth taking a sequence of acts that creates a happy person. But if you should press both buttons, then you should take a sequence of acts that simply creates one happy person.

So I think each principle is extremely plausible. But things get even worse for defenders of the person-affecting view. Because we can make do without needing to talk about benefits to existing people, simply by talking about fortunate states of affairs. To see this, imagine that instead of an agent deliberating about the pressing of the buttons, we simply consider: which button’s press is fortunate?

Button 1, remember, creates an extra happy person (Frank) with, say, 100 utils, and benefits Bob, who already exists, with 50 utils.

Button 2 eliminates the benefit to Bob, and gives Frank an extra 100,000 utils.

Clearly, the world is better if button 1 is pressed than if neither button is pressed. And the world is better if buttons 1 and 2 are pressed than if just button 1 is pressed. By transitivity, therefore, a world where both buttons are pressed is better than one where neither are pressed. But that world just has one extra happy person. Thus, the world gets better with the addition of a happy person.

This version is even harder to get around. The easiest way to get around the early argument was by appealing to facts about sequences of acts—suggesting that you should evaluate the acts differently when they’re part of a sequence. But here, there’s no actor taking a sequence of acts. We’re just comparing states of affairs. It seems obvious that button 1 improves the world, and so does button 2—but if those are right, and the better-than relation is transitive, then adding an extra happy person makes the world better.

What if you deny the transitivity of the better than relation? In such case: STOP! But even then, I think the argument still has force. The purported cases that violate the better than relation generally involve very large sequence of acts. Even if you hold that A>B>C>D…through a million steps>A, it is a lot weirder to think that A is better than B which is better than C, while C is better than A. Violations of transitivity are weirder involving short sequences.

Here is a more modest principle than transitivity:

Short extreme transitivity: If A is vastly better than B, and B is vastly better than C, then A is better than C.

This restricts transitivity in two ways. First, it only requires it holds in sequences where only three things are being compared. Second, it requires that at each step things are getting vastly better. Yet by having button 1 confer a very large benefit on Bob, and button 2 confer a vastly greater benefit on Frank, we can have button 1 being pressed be a lot better than no buttons being pressed, and both buttons being pressed be much better than just button 1 being pressed. Then, by short extreme transitivity, the world gets better with the addition of an extra happy person.

There’s lots more to be said about the argument. You can read my paper on it, and you should ideally cite it up the wazoo so that I can get a job in academia!

5 Thornley’s argument

 

Elliot Thornley has a paper presenting a new argument against the person-affecting view. In this section, I’ll largely summarize Thornley’s paper. Thornley distinguishes between two versions of the view. Imagine that you can either create:

  1. Amy at welfare level 1.

or:

  1. Bob at welfare level 100.

The narrow person-affecting view says that we’re allowed to choose either. The wide person-affecting view says we have to create Bob, because his welfare level is higher. Thornley then argues that neither wide views nor narrow views can be right.

5.1 Narrow views

 

The central judgment of narrow views is pretty counterintuitive. It seems like you have to create Bob because his welfare level is higher. But narrow views have bigger problems. Imagine you can create one of the following:

  1. Amy at welfare level 1.
  2. Bob at welfare level 100.
  3. Bob and Amy at welfare level 10.

What are you permitted to do in this case? It doesn’t seem like you’re morally permitted to choose 1. It seems wrong to create Amy at welfare level 1, when you could instead create her at welfare level 10, and also create Bob. This would be better for Amy and worse for no-one. How could you possibly justify to Amy taking 1 over 3?

Are you permitted to choose 3? Intuitively, it seems like the answer is no. 2 is very good for Bob. It has higher average welfare and total welfare than 3. It is very good for Bob, rather than just being mediocre for Bob, and also mediocre for Amy. If you can do something that’s either mediocre for two people who you create or very good for one person you create, you should do the second.

Here’s one other argument against you being permitted to pick 3: we can imagine decomposing 3 and 2 into two separate actions. Imagine that the way to do either 3 or 2 is first by pressing one button which creates Bob at welfare level 10. Then, after pressing that, there are two other buttons, but you can only press one. The first of those buttons would boost Bob’s welfare level from 10 to 100, while the second would create Amy at welfare level 10.

After pressing the first button, the choice between 2 and 3 is between boosting the welfare of a person who is guaranteed to exist by a lot and creating a person at vastly lower welfare. But this is the person-affecting view we’re talking about! It prefers benefitting existing people to creating new happy people. It seems especially obvious that if you can benefit a person who will exist by much more than the welfare level of the person you might create, you should benefit rather than create.

So far, we’ve argued that neither 3 nor 1 is permissible on the narrow view. This would imply that the only permissible option is 2—creating the person at welfare level 100. However, this implies the following (quoted directly from Thornley’s paper):

Losers Can Dislodge Winners: Adding some option 𝑋 to an option set can make it wrong to choose a previously-permissible option 𝑌, even though choosing 𝑋 is itself wrong in the resulting option set

Narrow views imply that in the choice between 1 and 2, you can choose either. But why in the world would adding 3, which is itself impermissible to take, affect that? Adding some other choice you’re not allowed to take to an option set shouldn’t make you no longer allowed to choose a previously permissible option. This would be like if you had two permissible options: saving a child at personal risk, or doing nothing, and then after being offered an extra impermissible option (shooting a different child), it was no longer permissible to do nothing. WTF?

Thus, it seems like there are strong arguments against every single version of the narrow person-affecting view. What about the wide view?

5.2 Wide views

 

Wide views, remember, hold that if we can either create:

  1. Amy at welfare level 1.
  2. Bob at welfare level 100.

We’re required to pick 2. Person-affecting views also imply that in the following two cases, either option is permissible.

Just Bob:

  1. Create Bob at welfare level 100.
  2. Create no one.

Just Amy:

  1. Create Amy at welfare level 1.
  2. Create no one.

But now let’s imagine that first you’re confronted with Just Bob, and then with Just Amy. Are you allowed to create no one in Just Bob, and then create Amy in Just Amy? There are two answers that can be given:

  1. Yes.
  2. No.

(I can’t think of a third option, though maybe Graham Priest can!)

Thornley calls the first class of views permissive wide views. These say that you’re always allowed to decline to create the better-off person, and then to create the worse-off person. In contrast, restrictive wide views say you’re never allowed to decline to create the better-off person, and then to create the worse-off person. So permissive wide views say that you can create no one in Just Bob and Amy in Just Amy, while restrictive wide views deny that.

5.2.1 Permissive wide views

 

Permissive wide views seem to have problems. First of all, they imply that one can permissibly create a worse off person rather than a better-off person. So if you know that if you conceived a child later, they’d have a higher welfare level, permissive views imply that you don’t have to wait.

Second, permissive views care a weirdly large amount about morally irrelevant factors, like whether two acts are chained together as one act. Imagine that there are two levers. If the first one is left up, then Amy will be created at welfare level 1, while if it’s pushed down, she won’t be created. The right lever, if pushed up, will create Bob at welfare level 100. Imagine that you have to first lock in your decision regarding Amy’s lever, and then decide what to do with Bob’s lever.

On this view, you’re allowed to first lock in creating Amy at welfare level 1 by keeping her lever up, and then not create Bob by keeping his lever down. But now imagine that someone ties the levers together, so pushing Amy’s lever will force Bob’s lever to be down too. Now, when deciding between pushing Amy’s lever, you are deciding between creating Amy at welfare level 1 and Bob at welfare level 100. In such cases, wide views imply you have to create Bob. This has the odd consequence that whether you’re allowed to take a sequence of acts leading to some sequence of levers being left up, and effects being achieved, depends on whether it comes from two separate lever movements or just one.

The core problem with permissive wide views is that it holds that you’re permitted to predictably take a sequence of acts to create just Amy at welfare level 1 rather than Bob at welfare level 100, but you’re not allowed to take a single act that creates just Amy at welfare level 1 rather than Bob at welfare level 100. Thus, it holds that by turning a single act into a sequence of two acts, its moral status changes significantly. This is very hard to believe.

5.2.2 Restrictive wide views

 

Restrictive wide views, remember, say that if you’re given the following sequence of decisions:

Just Amy:

  1. Create Amy at welfare level 1.
  2. Create no one.

Just Bob:

  1. Create Bob at welfare level 100.
  2. Create no one.

You’re not allowed to create no one in Just Bob and Amy in Just Amy. This is similar to how if you’re given the choice to either just create Bob or Amy, you have to pick Bob because he’d be made much better off. What’s wrong with such views?

Well, first of all, they don’t fit well with the intuitions of those who have person-affecting intuitions. They imply that if a person has created Amy in Just Amy, then they are required to create Bob in Just Bob (otherwise they’d have taken an impermissible sequence of acts).

Second, they imply that permissibility depends on other acts that don’t seem to matter. We could imagine that you made the choice of whether or not to create Amy in Just Amy billions of years in the past—why would that have anything to do with whether you are required to create Bob now? Similarly, such views imply that if you’re advising your friend on whether to have a child, you need to know whether they declined to have a totally unrelated child at a higher welfare level in the distant past. What?

Why would this matter at all? If your friend could have a very happy child, why in the world would it matter if they declined, at some point a long time ago, to have an even happier child? In fact, such a view ends up violating the spirit of the restrictive wide views.

Imagine that a million years in the past, your friend had one of the following two choices but can’t remember which:

  1. Just Chris: create Chris at welfare level 10 or do nothing.
  2. Just Joe: create Joe at welfare level 1,000 or do nothing.

Your friend at that time decided not to create.

Note: your friend wasn’t choosing between creating Chris and Joe. Instead, they either were in the decision problem described by Just Chris or Just Joe, but they can’t remember which. They think the odds were 50% that the decision-problem in question was Just Chris and 50% that it was Just Joe.

In five minutes, your friend will learn which situation they were in. Then they’ll have the following options:

  1. If Just Chris then Frank at 100: If they faced the problem in Just Chris, then their option will be to create Frank at welfare level 100.
  2. If Just Joe then Frank at 500: If they faced the problem in Just Joe, then their option will be to create Frank at welfare level 500.

This view bizarrely implies that even though Frank has vastly more welfare level in 2 than 1, in 2 they’d have no obligation to create Frank while in 1 they would. Oh and the difference is that in 1, a million years in the past, they had the choice to create someone else. Does that sound plausible to you?

Things get even worse. Here is a plausible principle:

Obligations Meeting: if

  1. There are two acts
  2. The first fulfills an obligation if you can do it, and the other does not fulfill an obligation if you can do it.
  3. The odds are equal that you’ll be able to perform both acts.
  4. Both acts impose no personal cost.
  5. You are choosing between raising the odds of performing the obligatory act by some amount or raising the odds of performing the non-obligatory act by some amount.

then:

  1. It is better to raise the odds of performing the obligatory act.

This is pretty abstract, so let me give an example. Imagine that you either think there’s a 50% chance you’ll be able to save a life at no cost or a 50% chance you’ll be able to give a child a cake at no cost. Suppose that giving a child the cake is not obligatory, while saving a life is. You think that there’s only a 40% chance that you will give the child a cake, and there is also only a 40% chance that you’ll save the child’s life. You can either increase the odds of giving the child the cake by 20% or increase the odds of saving the life by 20%. In such a situation, you ought to do the latter—you ought to increase the odds of saving the life by 20%.

This principle seems trivial. And yet restrictive wide views either violate it or violate ex-ante Pareto. Suppose we’re back to the earlier scenario, where at some time in the past, there was a 50% chance you faced:

  1. Just Chris: create Chris at welfare level 10 or do nothing.

and a 50% chance you faced.

  1. Just Joe: create Joe at welfare level 1,000 or do nothing.

Now, in five minutes, you’ll learn which action you took, and then have the following options.

  1. If Just Chris then Frank at 100: If they faced the problem in Just Chris, then their option will be to create Frank at welfare level 100.
  2. If Just Joe then Frank at 500: If they faced the problem in Just Joe, then their option will be to create Frank at welfare level 500.

If you faced Just Chris then creating Frank is obligatory, while if you faced Just Joe, it’s not. So, let’s imagine you currently face the following options:

  1. If you declined to create Chris in the past, increase the odds you’d create Frank by 50%.
  2. If you declined to create Joe in the past, increase the odds you’d create Frank by 50%.

Obligations Meeting implies you ought to take the first. Yet this is very hard to believe. Taking the second is better for Frank—it raises the odds of him being created at a higher welfare level, rather than a lower welfare level—and affects no one else. So either adoptees of restrictive wide views must give up Obligations Meeting or ex ante Pareto, which is the idea that you should do something if it expectedly benefits some people and expectedly harms no one.

5.2.3 Intermediate views

 

One could have a view that’s somewhere in the middle, neither a completely permissive view nor a completely restrictive view. For example, perhaps it’s only wrong to create Amy and fail to create Bob if at the time you created Amy, you foresee that you’ll have the option to either create Bob or not. However, such views will still hold that if advising your friend whether to have kids, you’ll need to know whether at some earlier point, they had the option to have a happier child and foresaw their present choice. This is counterintuitive.

Additionally, every version of the wide person-affecting view will inevitably have to either care about causally unrelated acts in the distant past or hold that whether it’s okay to create one person and not another depends on whether this is via moving two lever in two motions, or two levers in one motion.

To see this, imagine there’s currently a lever that is up. That lever, if it remains up, will create no one, while if it’s down, it will create Amy at welfare level 1. There’s another lever that is up, which will create Bob at welfare level 100 if it remains up, and no one if it’s down. The levers are tied together, so either they’ll both be up or they’ll both be down.

Wide person-affecting views hold that it’s impermissible to push the lever down in this case, for doing so will create Amy at welfare level 1 instead of Bob at welfare level 100.

 

Helpful diagram taken from Thornley’s paper.

Thus, so long as tying the levers together doesn’t affect what you’re allowed to do, it would be impermissible to perform two acts, the first putting Amy's lever down and the second putting Bob's lever down. Thus, if you put Amy’s lever down, it’s impermissible to put Bob’s lever down. This would mean that whether it’s permissible to put Bob’s lever down would depend on whether you created Amy—even if that was in the distant past. Thus, all views must either think that tying two levers together affects whether you can permissibly pull some sequence of them or must care about causally isolated and seemingly irrelevant past acts.

5.3 Conclusion (of this section)

 

Every version of the person-affecting view, then, has problems in one of Thornley’s cases. The following chart shows the pattern.

 

6 Arguments for the person-affecting view?

 

One of the things that’s remarkable about the person-affecting view is that as best as I can tell, there isn’t much that’s even said in its favor. The main argument for it, as already discussed, has serious problems. What else is said in its favor?

It’s often claimed the view is intuitive. Surely we’re not obligated to have children just because they’d be happy? But I don’t think it really is intuitive. The axiological person-affecting view just says it’s good to create a happy person, for their sake, all else equal. That strikes me as the natural view. It’s good to be alive.

Now, one can deny that we’re obligated to have kids without accepting the person-affecting view. No one denies that it’s good to give away a kidney, but people do deny that it’s obligatory. The person-affecting view has to hold that it’s not good to create a happy person, not just deny that it’s obligatory if doing so is personally costly.

If a person could create galaxies full of happy people at no cost—people who’d be grateful for being alive and love every second of life—it strikes me as very intuitive that doing so would be obligatory. It is one thing to hold on to an intuitive view when there are strong objections. But the person-affecting view isn’t even intuitive.

Another thing people will sometimes say in defense of the procreation asymmetry is that it helps avoid the repugnant conclusion. The repugnant conclusion is the idea that a world with tons and tons of barely happy people might be better than one full of very happy people, so long as the first world was sufficiently more populous, and thus had more total welfare. I find this objection very strange:

  1. There are lots of views that avoid the repugnant conclusion other than the asymmetry. These views have real costs, but many fewer ones than the person-affecting view. If you want to avoid the repugnant conclusion without accepting the person-affecting view, probably you should go in for a critical level view, which says that if your welfare level is below some low but positive threshold, you coming into existence is unfortunate.
  2. The asymmetry doesn’t give obvious help for avoiding the repugnant conclusion. By default, it seems to say the world full of happy people and more numerous barely happy people are incomparable. Now, you can make a more precise version of the view that doesn’t hold that, but you can adopt an analogous non-person-affecting view.
  3. I don’t actually think there is good reason to avoid the repugnant conclusion. I think the repugnant conclusion is simply correct.

Lastly, it’s sometimes said that the person-affecting view can explain the intuition that it’s worse to create a miserable person than to fail to create a well-off person. There’s an intuitive asymmetry between how good it is to create a happy person and how bad it is to create a miserable person.

First of all, I don’t actually think this intuitive view is defensible, for reasons I gave in my Utilitas paper. But second and more importantly, you can hold this much more modest asymmetry without going all the way and thinking it isn’t good at all to create a happy person.

It’s possible I’m missing something, but I really don’t see the person-affecting view’s appeal. I feel about it the way Laurence BonJour described feeling about physicalism:

As far as I can see, [it] is a view that has no very compelling argument in its favor and that is confronted with very powerful objections to which nothing even approaching an adequate response has been offered.

7 Conclusion

 

The person-affecting view says that while one ought to make existing people better-off, there isn’t moral reason to create a person just because their life would be good. The view is intuitive to lots of people, but it faces a number of serious objections. Of the ones discussed:

  1. It conflicts with the very powerful intuition that it’s good to be alive if your life goes well.
  2. The core argument for the procreation asymmetry can’t be right because it implies that it’s never good to bring someone back from the dead or save their life. Most concerningly, it implies that it isn’t bad to create a very miserable person.
  3. The view ends up saying that creating a happy person has imprecise value, instead of being neutral. But this doesn’t live up to the core intuition, which is that creating a happy person neither benefits nor harms anyone.
  4. Because creating a happy person, on the view, has imprecise value, it can swallow up improvements in the world. This strangely implies that even if you create well-off people and make everyone better off, this won’t necessarily make things better. So ironically the person-affecting view ends up not even preferring acts that make everyone better off!
  5. It must hold to incomparability, despite a number of strong objections to incomparability.
  6. It must hold that whether you’re permitted to choose A over B will sometimes depend on whether there’s some other unrelated choice that you’re not making.
  7. It implies that even if A and B are incomparable, and C is vastly better than B, C may not be better than A.
  8. It implies that if 1 and 2 are incomparable, 3 is vastly better than 1, 4 is just a bit better than 1, and 3 and 2 are incomparable, 4 might be better than 2. In other words, the thing vastly better than 1 doesn’t beat 2, but the thing just a bit better than 1 does.
  9. It has strange results across cases where you have 3 options: create a person at a low welfare level, create them at a much higher level and harm someone a bit, or do nothing.
  10. It implies that if something is neutral, and you combine it with something very good, the end product might be neutral. This is so even if the good thing is more than good enough that the neutral thing can’t swallow it up.
  11. Implies either that: 1) you shouldn’t press a button that makes an existing person much better off and creates a well-off person; 2) you shouldn’t redirect a benefit from an existing person to provide a much larger benefit for one who is guaranteed to exist; 3) it matters whether some state of affairs is brought about by one act or a sequence of acts. Also probably has to give up on transitivity, and a softened version of transitivity that even transitivity deniers should mostly accept.
  12. It implies something very strange in Thornley’s case involving levers.

This is an impressively large number of serious objections. The person-affecting view has a defective structure, which makes it easier to proliferate objections. In light of the sheer number of powerful objections, and nothing very convincing to be said in its favor, I think it is clear that the least costly move is to simply abandon the view that implied such absurdity. And there are even more objections that I haven’t mentioned. When you have a judgment that isn’t even widely shared, and it has this many problems, the sensible thing to do is give it up.


 



Discuss

I Died on DMT

2025-12-24 00:15:50

Published on December 23, 2025 4:15 PM GMT

After it had happened, I walked out into the sunlight, into the city’s first snow. A crowd of birds took flight, softly carving through the air. Below them the earth shimmered. The world trembled in silence.

I sat down at a cafe, took out the pocket notebook, and jotted down, “The shivering cold we feel as we step into the brazen wind and the scattered snow is not ours. Snow has no bounds. It is everything that we are not. It is God. It lives just below the horizon of our consciousness. In the end, there isn’t an end at all. All is before birth, after death.”

What happened?

An hour earlier, I walked into my friend’s apartment, sat down on a soft surface, cross-legged with my back straightened. The meditation pose added a touch of the sacred.

“Will I stay normal long enough to turn on noise cancellation on my Airpods?”

“I am not sure, but probably.”

“Right. I’ll do that as I lie down.”

“Sounds good.”

I could feel my pulse.

“Take your time to get ready. No rush.”

I took a few deep breaths.

“I am ready.”

By then he had confirmed I was indeed looking for a breakthrough experience. He had suggested the appropriate dosage, measured precisely with a milligram scale, and loaded into the vaporizer.

In truth, I had no idea what “breakthrough” meant. The little I knew about DMT was that it is physically one of the safest substances one could consume, and it is an extremely powerful psychedelic compound. A friend had told me that three years ago after his near-death experience. When I heard it, my mind went, amazing, I want to know what that feels like. At no point did I stop to consider what exactly would need to transpire for a mind to go, I think I am about to die.

Think about that for a second.

So it was easy to say “I am ready.” How strange could it be?

My friend passed me the meticulously prepared device.

“Press the button, hold it up to your mouth, inhale for seven seconds, hold your breath so fresh air pushes the smoke down.”

No problem. Not a particularly complicated sequence, I thought. How hard could it be?

One second in, heat arrived in my mouth. Fine so far.

Two seconds in, I could already sense my throat getting irritated by the smoke. The urge to cough was pressing. Unwilling to lessen the trip’s intensity, I kept the inhale going despite the growing discomfort. My friend was counting for me, both audibly and with his hands, so I would have clear cues on when to let go. I focused my attention on his hand to distract myself.

Strange sensations crept up at the four-second mark. My body was slowly melting away as a door opened, a door to a world so alien that I couldn’t make sense of anything. I felt a tingling of violence in that world. A sardonic voice jokingly said “welcome” with telepathy, as if scolding me for taking up space, all of it so inexplicably familiar, as if I had been there before, as if I was born there, even though this was my very first.

Five seconds in, the urge to cough had disappeared. Physicality was no longer relevant. I didn’t have a throat anymore.

Six seconds in, breaths disappeared. I couldn’t understand breathing. What is it? How does one breathe? I thought I had stopped breathing because I didn’t know how. I knew I was alive, but it felt close, it felt like maybe, maybe I was about to die once the oxygen in my body ran out. In fact, throughout the entire trip I didn’t know if I breathed.

Another two or three seconds later, I heard my friend’s voice, “You can let go now. Lie down.” Out of those two instructions, the first meant nothing to me, the second I couldn’t comply with until my friend helped me fall back. He later told me I was staring at him. I didn’t know, because my vision began to shut things out. With eyes open I saw nothing. I wasn’t trying to see or confused by why I couldn’t see. I simply didn’t have eyes anymore, the same way I stopped having a throat.

I vaguely heard him uttering the word AirPods. I think I was perhaps faintly aware I wanted to do something with it, but the physical parts of my body had already disintegrated, and my mind was in shambles, each piece rearranging to form a larger whole that didn’t belong to me. It didn’t matter that ANC wasn’t on, though, because like my eyes and my throat, my ears also went missing.

Explosion.

A hyper-dimensional world ruptured instantaneously with a ceaselessly expanding topology, all from a singular point of entry: my consciousness. It wasn’t me but all me. Incomprehensible. The concept of “my consciousness” persisted for another second or so, after which the notion of a self became utterly ephemeral. It was in the air, then quickly gone.

There was a release of energy from a state so compressed that it had been inaccessible my entire life. The closest thing I can point to is the theorized black hole to white hole transition, matter crushed past the point of existence, erupting outward again.

The realm was divine, as if it had ultimate authority, ultimate judgement. I was plummeting upwards ferociously and uncontrollably, as the remnants of my mind fought to hold on, in vain. I received punishment for refusing to let go. I knew letting go meant death.

A name was proclaimed, at the tip of the universe's tongue. It was never spoken, but I heard it, a thousand times over. Someone with my name was called, first gently, then impatiently, then urgently, then furiously. Who was she? Who was being called? My mind had a vague sense it was me, that it was my name, but it couldn’t be sure. Or rather, there was no subject in “being sure.” It was paradoxical and simultaneous, knowing and not knowing.

The ridiculous notion of “Rebecca” was nothing but a name that floats in the ether with not a body and not a mind and the utterance of which produces nothing but a mute sound to be heard by no one.

Nightmarish yet beautiful to the extreme.

To the right of my vision, or the semblance of a vision, an entity was there. Not a figure exactly, but it did have a higher-dimensional form. A density. It didn’t have a face but it had a direction. It was oriented toward me completely. I understood that it had seen this before, countless people arriving here, fighting, refusing. I was not special. I was not even interesting. The entity was watching me the way an elder watches a child who has just broken something precious and wants to lie about it. Patient. Unsurprised. Offended.

It shouted at me to let go. I couldn’t and I couldn’t, but I knew I had to. I didn’t have a choice.

“Rebecca, you really fucked up this time.”

“This is it. You’ve reached the end. Die.”

“Let go.”


I had come close to drowning — or thinking I was drowning — when I was 11, off the coast of Balicasag, a coral island in the Bohol Sea in the Philippines. It had an incomparably dramatic underwater topography, where the seafloor abruptly drops off into steep walls that plunge 40-50 meters down. The island was encircled by an underwater cliff, making it ideal for snorkeling and diving.

It didn’t matter that I was a terrible swimmer, since most of the seafloor was shallow enough to graze with my toes if I pushed myself down. I was going out into the ocean a few times every day with my family, watching the corals, the bright creatures threading between them.

I became complacent. I swam far. As physical exhaustion caught up, I floated on my back to catch a breath without the snorkeling tube, staring at the sky above and the ocean that cradled me, mesmerized by the vastness of this world that I couldn’t comprehend. I drifted in awe, basking in the serenity.

Some time had passed, perhaps ten minutes, perhaps twenty.

“I should get back to land now.”

After reattaching my goggles and the snorkeling tube, I plunged into the water.

Little did I know, a bottomless void was waiting for me. I panicked and began thrashing, as I realized I had travelled far, far past the cliff.

The distance between that experience and present day makes it difficult to remember exactly how it felt, except for the fear. Fear has a vividness that is hard to forget. That first sight of the void, that first thought of the distance between me and the land made the possibility of death real to an 11-year-old.


The admonition was crushing. This imminent death was more total, more annihilating. I had no physical body and everything was out of control.

I struggled and struggled. I had such trouble coming to terms with death, even when I knew it had to be done. I had to die, yet I couldn’t. I continued to suffer.

Eventually, I gave in. I gave in to the idea of perishing, of my parents finding out this is how their daughter went.

Where would I be taken now?

I continued to fall upwards, now with even more violence. A thousand shards of light pierced through me, unforgiving, each with the emotional weight of many lifetimes. It was such radiant brokenness, every shard a wound, every wound sublime.

If only you could see.

Time was not a concept. At no point did I think, this was a psychedelic trip and it would be over soon. While I was in there, everything was final. It was the only reality.

Then something happened. A sensation so feeble I almost missed it. Air moving in my chest. I didn’t know I was capable of breathing. It was simply happening, the way a tide comes in. I had no part in it. I lay there and let it happen, exaggerating each breath just to make sure it was real, feeling my ribs expand against the surface below.

Then vision came back. Not all at once. First, light. Then shapes. Then the ceiling above me, and my friend's face.

Sounds returned next. Crisp. They didn’t come to me in my ears. Instead, I went to them where they were, out on the streets, distant. The sounds were spatial, acute, directionally sharp. If someone had asked, I would’ve been able to tell exactly where each source of sound was. I was attuned to the world in ways I had never experienced before.

Up until this point, even when I was breathing, seeing, and hearing, I still didn’t understand that I had a body. In a way, an endlessly deep and immeasurably wide universe collapsed onto a small human brain the size of a mini pumpkin, powered by a small human body, in a merely three-dimensional reality.

How could that be? How could it be that what I had just experienced took place entirely in my mind? What a ridiculous thing.

I repeatedly tilted my head up to look at the rest of my body. I must have done that about ten times. I curled my fingers, clenched my fists, touched my face, uncrossed my legs from the seated meditation pose earlier, wiggled my toes, still breathing heavy. I looked around to make sure I was indeed where I thought I was.

I couldn’t believe I was alive, and from that erupted a new-found gratitude for life.

After settling back into the world, I finally opened my mouth to ask my friend how long it had been.

“17 minutes.”

Well, I certainly didn’t know this kind of 17 minutes existed.

I sat up once I was confident I had complete control over my body.

“I know you had planned on writing about it. Would you like to do that now?”

“I don’t know how.”

Outside, the snow was falling.



Discuss

Open Source is a Normal Term

2025-12-23 23:40:31

Published on December 23, 2025 3:40 PM GMT

Every time someone releases code publicly under some kind of "look but don't touch" terms a similar argument plays out:

A: This is cool, X is now open source!

B: It's cool that we can read it, but we can't redistribute etc so it's not "open source".

A: Come on, if it's not "closed source" it's "open source".

B: That's not how the term "open source" has historically been used. This is why we have terms like "source available".

A: It's bizarre that "open" would be the opposite of "closed" everywhere except this one term.

I'm generally with B: it's very useful that we have "open source" to mean a specific technical thing, and using it to mean something related gives a lot confusion about what is and is not allowed. While A is right that this is a bit confusing, it's also not unique to open vs closed source. Some other examples:

  • If a country doesn't have "closed borders" then many foreigners can visit if they follow certain rules around visas, purpose, and length of stay. If instead anyone can enter and live there with minimal restrictions we say it has "open borders".

  • If a journal isn't "closed access" it is free to read. If you additionally have specific permissions around redistribution and reuse then it's "open access".

  • If an organization doesn't practice "closed meetings" then outsiders can attend meetings to observe. If it additionally provides advance notice, allows public attendance without permission, and records or publishes minutes, then it has "open meetings."

  • If a club doesn't have "closed membership" then it's willing to consider applicants. If anyone can join who meets some criteria, it has "open membership".

This is just how language works: terms develop meanings that are not always ones you can derive simply from the meanings of their component words. I agree it can be confusing, but I also want to do my part to resist semantic drift and keep "open source" matching its useful and socially beneficial definition.

Comment via: facebook, mastodon, bluesky



Discuss

Don't Trust Your Brain

2025-12-23 23:06:43

Published on December 23, 2025 3:06 PM GMT

…if it's anything like mine[1]. If it isn't, then maybe treat this as an opportunity to fight the typical mind fallacy by getting a glimpse at what's going on in other minds.


I've made many predictions and have gone through a lot of calibration training over the years, and have generally become pretty well calibrated in several domains. Yet I noticed that in some areas, I seem surprisingly immune to calibration. Where my brain just doesn't want to learn that it's systematically and repeatedly wrong about something. This post is about the three four such areas.

 

(1) Future Ability to Remember Things

Sometimes, I feel incredibly confident that some piece of information will be easy for me to recall in the future, even when this is entirely false. Some examples:

 

  • When I meet a new person, and they mention their name, then while hearing it, I typically feel like "yeah, easy, it's trivial for me to repeat that name in my head so I will remember it forever, no problem", and then 5 seconds later it's gone
  • When I open some food/drink product and then put it into the fridge, I usually take a note on the product about when I opened it, as this helps me make a judgment in the future about whether it should still be good to consume. Sometimes, when I consider taking such a note, I come up with some reason why, for this particular thing in this particular situation, it will be trivial for future-me to remember when I opened it. These reasons often feel extremely convincing, but in the majority of cases, they turn out to be wrong.
  • While taking notes on my phone to remind me of something in the future, I often overestimate whether I'll still know what I meant with that note when I read it a few days or weeks later. For instance, when taking one note recently, my phone autocorrected one word to something else, which was pretty funny. I decided to leave it like that, assuming that future-me would also find it funny. This part was true - but future-me also really struggled to then figure out what the note was trying to actually tell me.

 

 

Even knowing these patterns, it's often surprisingly difficult to override that feeling of utter conviction that "this case is different", assuming that this time I'll really remember that thing easily.

 

(2) Local Optima of Comfort

When I'm in an unusually comfortable situation but need to leave it soon, I often feel a deep, visceral sense of this being in some sense unbearable. Like the current comfort being so much better than whatever awaits afterwards, that it takes a huge amount of activation energy to get moving. And almost every time, within seconds after finally leaving that situation, I realize it's far less bad than I imagined.

Two examples:

  • Leaving a hot shower. Or more precisely, turning a hot shower cold. I tend to end all my showers with about a minute of progressively colder water. Interestingly, reducing the temperature from hot to warm is the hardest step and occasionally takes me a minute of "fighting myself" to do it. Going from lukewarm to freezing cold is, for some reason, much easier.
  • When I'm woken up by an alarm in the morning, I'm very often - even after 8 or 9 hours of sleep - convinced that I need more sleep and that getting up right then would be a huge mistake and would lead to a terrible day, even though I know that this has practically never been the case. Usually, within 5-10 minutes of getting up, I'll feel alert and fine. Yet on many mornings, I have to fight the same immense conviction that today is different, and this time I really should sleep an extra hour.

 

Somewhat related may be my experience of procrastination and overestimating how unpleasant it would be to engage with certain aversive tasks. This almost always ends up being a huge overestimate, and the thing I procrastinated for ages turns out to be basically fine. My System 1 is pretty slow to update on this, though.

 

(3) Interpersonal Conflict

I'm generally quite conflict-avoidant. But not everyone is - some people are pretty blunt, and when something I did or said seems objectionable to them, they don't pull punches. I suppose that becoming irrational in the face of blame is not too unusual, and it's easy to imagine the evolutionary benefits of such an adaptation. Still, it's interesting to observe how, when under serious attack, my brain becomes particularly convinced that I must be entirely innocent and that this attack on me is outrageously unjust.

 

 

After reading Solve for Happy by Mo Gawdat, I didn't take that much away from the book - but one thing that did stick with me is the simple advice of asking yourself "is this true?" when you're reflecting on some narrative that you're constructing in your head during a conflict. Not "does this feel true?" - it basically always does - but whether my internal narrative is actually a decent representation of reality, and quite often it isn't, and even just asking this question then makes it much easier to step out of this one-sided framing.

 

(4) Bonus: Recognizing I'm in a Dream

One of the most common things to happen to me in dreams is that I realize that some particular situation is very strange. I then tend to think something like "Wow, typically, this kind of thing only happens to me in dreams. It's very interesting that this time it's happening for real". I've had this train of thought hundreds of times over the years. Sometimes I think this while awake, and then immediately do a reality check. In a dream, however, I'm so gullible that I almost never make this jump to ask myself, "wait a second, is this a dream?" - I just briefly think how curious it is that I'm now experiencing a dream-like weirdness for real, and don't act on it at all, because I'm so convinced that I'm awake that I don't even bother to check. Eventually, I wake up and facepalm.

 

Takeaway

Through calibration training, I learned that I can train my System 1 to make pretty accurate predictions by refining the translation of my inner feeling of conviction into probabilities. The areas mentioned in this post are deviations from that - contexts where, even with considerable effort, I haven't yet managed to entirely overcome the systematically wrong predictions/assessments that my intuition keeps throwing at me. Subjectively, they often feel so overwhelmingly true that it's difficult to resist the urge to just believe my intuition. Having this meta-awareness about it certainly helps somewhat. Indeed, as in the "Is this true?" example, sometimes it's mostly about asking the right question in the right moment. Like learning to respond to my brain's claim that "clearly, I'll easily remember this thing in the future" with "wait, will I?" instead of just going along with it - at which point I typically know that I should be suspicious.

My impression is that the more I think about this and discuss it with others, the better I get at recognizing such situations when they happen, even if progress on this is annoyingly slow. So, uhh, perhaps reading about it is similarly helpful.

  1. ^

    I made a very informal survey among some friends, and it seemed like ~80% of them could relate to what I describe at least to some degree.



Discuss