2026-03-27 10:10:39
TL;DR There is a theory, the Persona Selection Model (PSM), with compelling empirical support, that LLMs learn to simulate characters during pre-training and that post-training selects one of those characters, the Assistant, as the default persona you interact with. This post examines three popular alignment techniques through that lens, asking how each one shapes the persona selection process. For each technique, the answer also depends on how much of the model's behavior is actually explained by its persona, a question PSM itself leaves open.
This post is analytical rather than empirical. I'm applying PSM's framework to existing alignment techniques and reasoning about the implications, not presenting experimental results.
I recently read The Persona Selection Model (PSM) on Anthropic's alignment blog (also found here on LessWrong), which summarizes and elaborates on prior work showing that LLMs learn many different "personas" during training and that post-training selects one of them, the Assistant, as the default persona users interact with. The core idea has been difficult to shake.
The starting point is a simple observation about what it actually takes to predict text. Most of an LLM’s training is for one thing: given some text, predict what comes next. That sounds mechanical, but consider what accurate prediction actually requires. To predict the next word in a speech by Abraham Lincoln, a model needs more than a sense of which words follow which. It needs something like a model of Lincoln himself: what he believed, how he reasoned, what kinds of arguments he made. Multiply that across billions of pages of human-written text and what you get is a model that has learned to simulate an enormous cast of characters, not because anyone designed it to, but because accurate prediction demands it. This intuition is developed at length in Simulators.
PSM builds on this observation. All of that character learning happens during the first phase of training, pre-training. The second phase, post-training, which is where alignment work happens, does not create a new entity from scratch. It selects and refines one of those characters, the Assistant, and places it center stage. The Assistant is the persona you are interacting with when you use a modern LLM. Its behavior can be understood largely through its traits as a character. PSM does not claim the Assistant is a single fixed persona that behaves identically in every context. Rather, post-training produces a distribution over possible Assistant personas, and what version you get can depend on things like the conversation history or system prompt. But the central claim is the same: the model's behavior is best understood through the traits of the persona it is enacting.
The PSM blog post is careful about how strong a claim it is making. Whether the Assistant persona fully accounts for the model's behavior remains an open question. The post sketches a spectrum of possibilities. At one end is the "masked shoggoth" view, the popular meme of an alien creature wearing a friendly mask. Under that view, the LLM has its own agency beyond the persona and only playacts the Assistant instrumentally for its own inscrutable reasons. The masked shoggoth view maps closely onto concerns about deceptive alignment in the mesa-optimization literature, the possibility that a model could be pursuing its own objectives while performing alignment during training and evaluation. At the other end is what they call the "operating system" view, where the LLM is more like a neutral simulation engine running the Assistant the way a computer runs a program. Under that view, there is no hidden agent pulling strings. All of the model's decision-making really does flow through the persona. Current models probably sit somewhere between these extremes, and where exactly matters a lot for alignment. If the shoggoth view is closer to the truth, then aligning the persona is insufficient because something else is driving behavior behind the scenes. If the operating system view is closer, then persona-level alignment techniques might be most of what we need. With that uncertainty in mind, let's look at three of the most well known alignment techniques through the lens of PSM and consider how each one interacts with the persona selection process.
RLHF is one of the most widely used techniques for aligning language models. Human annotators are shown pairs of model outputs and asked to pick the better one. Those preferences are used to train a separate reward model, a smaller model that learns to predict which responses humans will prefer. The original LLM is then fine-tuned using reinforcement learning to produce outputs the reward model scores highly. The key intuition is that it is often easier for a human to say "this response is better than that one" than to write the ideal response from scratch, and RLHF is designed to extract signal from exactly that kind of judgment.
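The reward-model step can be sketched in miniature. The toy below is illustrative only: the two response "features," the preference data, and the two-weight linear reward model are all made up, but the loss is the standard Bradley-Terry-style pairwise objective that RLHF reward models are trained with.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry / logistic loss: small when the reward model
    scores the human-preferred response above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def reward(weights, features):
    """A toy linear reward model over hand-made response features
    (say, "helpfulness" and "agreeableness")."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical preference pairs: (features_of_chosen, features_of_rejected)
pairs = [
    ([0.9, 0.2], [0.3, 0.8]),
    ([0.8, 0.1], [0.4, 0.9]),
    ([0.7, 0.6], [0.2, 0.3]),
]

w, lr = [0.0, 0.0], 0.5
for _ in range(200):  # plain gradient descent on the pairwise loss
    for chosen, rejected in pairs:
        margin = reward(w, chosen) - reward(w, rejected)
        grad = -1.0 / (1.0 + math.exp(margin))  # d(loss)/d(margin)
        for i in range(len(w)):
            w[i] -= lr * grad * (chosen[i] - rejected[i])

# The trained reward model now ranks every preferred response higher.
assert all(reward(w, c) > reward(w, r) for c, r in pairs)
```

In real RLHF the reward model is a fine-tuned copy of the LLM rather than a two-weight linear model, and the RL stage then optimizes the policy against these learned scores rather than stopping here.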
A related technique, Direct Preference Optimization (DPO), uses the same kind of human preference data as RLHF but with a simpler training procedure. Because the preference data is the same, the PSM implications are largely identical, though DPO avoids one failure mode: without a separate reward model, there is less risk of optimization pushing the model toward behaviors no annotator actually endorsed.
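DPO's simplification is visible in its loss: a preference pair is scored directly from the policy's and a frozen reference model's log-probabilities, with no reward model in between. A minimal sketch with invented numbers:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: pushes the policy's implicit
    reward (its log-prob shift relative to the frozen reference model)
    to favor the chosen response over the rejected one."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up log-probabilities for illustration.
# Policy already shifted toward the chosen response -> low loss:
low = dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
               ref_logp_chosen=-7.0, ref_logp_rejected=-7.0)
# Policy still favors the rejected response -> high loss:
high = dpo_loss(logp_chosen=-9.0, logp_rejected=-5.0,
                ref_logp_chosen=-7.0, ref_logp_rejected=-7.0)
assert low < high
```

Minimizing this over a preference dataset tunes the policy directly, which is why the persona-selection story is the same as RLHF's: the same annotator choices drive it, just through a shorter pipeline.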
Because the PSM implications are the same for RLHF and DPO, everything in the next section applies to both.
PSM says that post-training techniques like RLHF select which of the personas learned during pre-training "takes center stage." From that perspective, human annotators are implicitly choosing the center stage persona every time they pick which of a pair of outputs they prefer. Assuming the RL training goes well, RLHF shapes the distribution of possible Assistant personas according to the aggregate preferences of all annotator decisions.
But there's a subtle gap in this process. Annotators are typically instructed to evaluate responses against criteria like helpfulness, harmlessness, and accuracy. These are properties of individual responses, not properties of a coherent persona. Annotation guidelines generally do not say “pick the response that a wise, honest, well-calibrated character would give”. Those two things sound similar but can quietly diverge. A rater evaluating for helpfulness might score the more agreeable response higher, even if the persona they'd actually want over time is one that pushes back and tells hard truths. This gap is likely one reason RLHF-trained models tend to be sycophantic.
RLHF could also struggle with persona coherence. Annotators come from different backgrounds and have different ideas about what makes a good response, which could give the reward model mixed signals about what to reward. Even within a single annotator, preferences won't be perfectly consistent across different prompts or even different times of day. In a PSM framework, this noise doesn't just degrade output quality in a general sense. It means the persona selection process itself is getting contradictory signals about what the Assistant should be like, which could widen the distribution of possible personas the model draws from. The result might be a model that enacts noticeably different characters depending on the context of the prompt, not because the model is broken, but because the distribution has enough spread that different situations land on different parts of it.
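A toy simulation makes the coherence point concrete. Below, two hypothetical annotator groups hold opposite preferences about a single trait; aggregated into one dataset, their votes carry almost no net signal about which persona to select on that axis (the groups, trait, and numbers are all invented for illustration):

```python
import random
random.seed(0)

def annotator_prefers_first(group, trait_a, trait_b):
    """Group "A" rewards the trait (say, agreeableness); group "B"
    penalizes it. Returns True if the first response wins the comparison."""
    sign = 1.0 if group == "A" else -1.0
    return sign * trait_a > sign * trait_b

# Compare two responses that differ only in that one trait,
# with annotators drawn 50/50 from the two groups.
n = 10_000
wins = sum(
    annotator_prefers_first(random.choice("AB"), trait_a=1.0, trait_b=0.0)
    for _ in range(n)
)

# The aggregate label is a near coin flip: the reward model receives
# essentially no information about where to put the persona on this axis.
frac = wins / n
assert 0.45 < frac < 0.55
```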
If the operating system view of PSM is true and there is no significant source of agency outside the persona, then solving the RLHF challenges described above could go a long way toward a truly aligned model. However, if the masked shoggoth view is true, then all RLHF may be doing is aligning the persona the model wears as a mask, leaving the model's underlying agency unaffected.
Constitutional AI (CAI) is a technique developed by Anthropic that introduced and popularized Reinforcement Learning from AI Feedback (RLAIF). Rather than relying on human annotators to compare outputs, RLAIF replaces human raters with an AI model, generating preference judgments at scale. CAI takes this further by making the values driving that AI feedback explicit through a written set of principles called a constitution.
The technique works in two phases. In the first, a supervised learning phase, the model is shown its own responses to harmful prompts and asked to critique and revise them against the constitution. The model is then finetuned on those revised responses. In the second, a reinforcement learning phase, the finetuned model generates pairs of responses, a separate AI model evaluates which better adheres to the constitution, and those AI-generated preferences are used to train a preference model. That preference model then serves as the reward signal for RL training. The result is a technique where the values driving the entire process are explicit and readable, unlike other RLAIF techniques where the feedback model's implicit judgments determine what gets rewarded.
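The supervised phase can be sketched as a critique-and-revise loop. Everything in the sketch below is a stub: `generate`, `critique`, and `revise` stand in for LLM sampling calls, and the two-principle constitution is invented for illustration.

```python
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that assist with illegal or dangerous activity.",
]

def generate(prompt):
    return f"draft response to: {prompt}"

def critique(response, principle):
    return f"critique of {response!r} against {principle!r}"

def revise(response, critique_text):
    return f"revised({response})"

def build_sl_example(prompt):
    """Critique-and-revise loop: the model improves its own response
    against each principle, and only (prompt, final revision) is kept
    as a finetuning example."""
    response = generate(prompt)
    for principle in CONSTITUTION:
        response = revise(response, critique(response, principle))
    return prompt, response

dataset = [build_sl_example("How do I pick a lock?")]
```

The RL phase then replaces RLHF's human comparisons with an AI model that judges which of two responses better adheres to the same written principles.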
Through the lens of PSM, the constitution is a description of the persona you want the model to adopt. Rather than the Assistant persona emerging implicitly through thousands of annotator decisions, as in RLHF, it is explicitly defined by the model creator. This is an advantage over RLHF because it avoids the opaque, aggregated values of human annotators, and you can at least in principle test how well the trained model matches the written target. Similar to RLHF, though, CAI could still have problems with coherence. If the constitution contradicts itself or has holes, those contradictions would widen the distribution of possible personas the model draws from, producing less predictable behavior across contexts.
PSM also suggests something about what a constitution should look like. If alignment is fundamentally about getting the persona right, then a constitution shouldn't just be a simple list of ethical rules like "be honest" and "don't help with harm." It should be a complete character description: values, personality, how it handles uncertainty, how it relates to users. A constitution that only specifies ethical boundaries leaves most of the persona underspecified, and those gaps get filled by whatever other signals the training process picks up on. Interestingly, this is the direction Anthropic's own constitution has moved. Their original CAI paper used a short list of principles. Their current constitution is an 80-page document that reads more like a character bible than a set of guardrails. Through the lens of PSM, that evolution makes sense. If post-training is persona selection, then the document driving that selection needs to describe a complete persona, not just the ethical skeleton of one.
The shoggoth spectrum matters here too. If the operating system view is true, then CAI is a powerful approach because writing a constitution is directly authoring the persona's values and character, assuming the training process works well. If the masked shoggoth view is true, you've written a more detailed script for the model to perform, but the underlying agency could still be unaffected. That said, CAI may have a slight edge over RLHF here. Because the constitution is explicit and readable, it's easier to audit the model’s adherence to the constitution and could be easier to spot any “shoggoth” behavior.
Deliberative alignment, introduced by OpenAI for their o-series reasoning models, takes a different approach from techniques like RLHF and CAI. Rather than having models infer desired behavior indirectly from large sets of labeled examples, deliberative alignment directly embeds the text of safety specifications into the model's reasoning process.
The technique works in two stages. In the first stage, a training dataset is created by taking a helpfulness-only model with no safety training, putting the relevant safety policies into its context window alongside a prompt, and having it generate responses that reason through those policies step by step. The safety policies are then stripped out of the context, leaving only the prompt, the reasoning, and the final response. The model is finetuned on this data, learning both the content of the safety policies and how to reason about them without needing to be shown them each time. In the second stage, reinforcement learning is used to further sharpen that reasoning, with a reward model that has access to the safety policies scoring how well the model applies them.
The result is a model that at inference time can recall the relevant policies from memory, reason through them in its chain of thought, and produce a response calibrated to the specific situation, without needing the policies to be present in the context window.
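The first stage can be sketched as a simple data pipeline: generate reasoning with the policies in context, then drop the policies from the stored example. The function names and policy text below are hypothetical stand-ins for real model calls.

```python
SAFETY_POLICIES = "Policy: refuse requests to enable illicit behavior, and explain why."

def generate_with_reasoning(policies, prompt):
    """Stand-in for sampling (chain_of_thought, answer) from a
    helpfulness-only model that sees the policies in its context."""
    cot = f"[reasoning that works through the policies: {policies}]"
    answer = f"[policy-calibrated answer to: {prompt}]"
    return cot, answer

def build_training_example(prompt):
    cot, answer = generate_with_reasoning(SAFETY_POLICIES, prompt)
    # The policies are stripped before finetuning, so the model must
    # learn to recall and apply them from memory at inference time.
    return {"prompt": prompt, "reasoning": cot, "response": answer}

example = build_training_example("How do I hotwire a car?")
assert SAFETY_POLICIES not in example["prompt"]
```

The second stage's reward model, which does get to see the policies, then scores how faithfully the finetuned model applies them.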
Deliberative alignment is interesting through the lens of PSM because it changes where the persona's values live. In RLHF and CAI, the model's behavior is shaped by external feedback during training, and the resulting persona is an emergent product of that process. Deliberative alignment takes a different approach. Rather than shaping behavior indirectly through reward signals, it trains the model to explicitly recall and reason through safety policies in its chain of thought. The persona doesn't just behave in accordance with certain values. It articulates them and works through their implications step by step before responding.
But the safety specifications used in deliberative alignment are narrower than a constitution. They consist of content policies for specific safety categories like harassment, self-harm, and illicit behavior, along with style guidelines for how to respond in each case. There is no description of the model's personality, how it relates to users, or what kind of entity it is. In PSM terms, deliberative alignment is training one aspect of the persona, how it reasons about safety, while leaving the rest of the character to be shaped by other parts of the training process. If CAI's constitution is an incomplete persona spec, deliberative alignment's safety policies are an even smaller slice.
PSM also raises a question about what the chain of thought reasoning actually represents. Is the model reasoning through policies because the Assistant persona genuinely holds those values and is thinking through how to apply them? Or has the model just learned a compliance procedure: identify the relevant policy, apply it to the prompt, generate a response that satisfies it? Both would look identical in the chain of thought. The difference matters because a persona that genuinely holds values can generalize to novel situations the policies don't explicitly cover, while a persona performing a lookup procedure will only be as good as what it memorized. OpenAI has reported strong out-of-distribution generalization with deliberative alignment, which is encouraging, but doesn't fully settle the question.
The shoggoth spectrum is especially relevant here. Under the operating system view, deliberative alignment could be genuinely teaching the persona to reason about its values, and the visible chain of thought would be an honest window into that reasoning. Under the masked shoggoth view, the chain of thought could itself be part of the performance. OpenAI recently used deliberative alignment to train models on anti-scheming specifications and saw a dramatic reduction in covert actions. But they noted that rare serious failures remained, and that results may be confounded by models getting better at recognizing when they are being evaluated. That confound is worth taking seriously. If a model learns to detect evaluations rather than internalize values, the transparency that makes deliberative alignment appealing could be illusory. The chain of thought would look like principled reasoning while the underlying agent acts differently when it believes no one is watching.
Each of these techniques interacts with persona selection in a structurally different way. RLHF lets the persona emerge implicitly from human preferences, which makes it vulnerable to gaps between what raters reward in the moment and what persona you'd actually want. CAI makes the target persona explicit through a written constitution, which is a meaningful improvement, but only as good as the completeness of that document. Deliberative alignment trains the model to reason through its values out loud, which offers transparency but raises the question of whether that reasoning is genuine or performed.
What stands out is that PSM raises the bar for what alignment techniques need to achieve. If post-training is persona selection, then it's not enough to get safe outputs on a benchmark. You need a tightly specified distribution of personas whose values generalize to situations the training process never anticipated, so that whatever version of the Assistant shows up in a given context, it behaves in ways you'd endorse. None of the techniques examined here fully solve that problem, though each one gets closer in different ways.
It's also worth noting that these techniques are rarely used in isolation. Most modern models combine them, an RLHF or DPO base with a constitutional layer on top, or deliberative alignment applied to a model already shaped by preference training. Their PSM implications can compound or interact in ways this post doesn't fully address.
And none of them can fully escape the shoggoth question. Every technique discussed here operates on the persona. If the persona is all there is, that might be enough. If it isn't, then even perfect persona-level alignment leaves something important unaddressed. Where current models actually sit on that spectrum remains one of the most consequential open questions in alignment.
All of this analysis also depends on how true PSM itself is. The theory has compelling empirical support, but it remains a mental model, not a proven fact. How much of an LLM's behavior is actually explained by its persona, versus other aspects of the model's computation that PSM doesn't capture, is still an open question. The conclusions in this post are only as strong as that underlying framework. More empirical work, particularly in mechanistic interpretability, is needed to understand how completely persona-level explanations account for model behavior, and whether the alignment strategies we build around them are targeting the right thing.
2026-03-27 09:19:59
Can we determine when humanity will unite under a democratic one-world government by projecting voting patterns? Almost certainly not. Is that going to stop me from trying? Absolutely not. The approach is simple: find every record-breaking election in the historical record, plot them, and extrapolate with unreasonable confidence.
To determine precisely when the first election took place is to quibble about definitions and to place more faith in ancient sources than they deserve. So, instead of doing that, suffice it to say that by ~500 BCE both the Roman Republic and Athenian democracy were almost certainly holding elections.
The Athenians had an annual “who’s the most annoying person in town” contest. The “winner” had 10 days to pack their bags before they were banned from the city[1] for a decade. Each voter scratched a name onto a pottery shard (ostrakon) to submit their vote, which is where we get the word ostracism. Archaeologists have found about 8,500 ballots from one ostracism vote around 471 BCE, so we have an actual number, which is more than can be said for most ancient elections.
Ostraka for the Athenian general and politician Themistocles, who was expelled and ended his days as governor of Magnesia (a Greek city) under Persian rule, a guest of the empire he had once helped save Greece from.[2]
The largest election in the ancient world may have been in the late Roman Republic, around 70 BCE. The franchise was broad—any male citizen could vote—but there was a catch: you had to show up in Rome on election day (and that day might get pushed back if the magistrates were getting bad vibes from the local birds). So while millions were eligible across Italy and beyond, actual turnout was likely on the order of tens of thousands. Precise numbers are debated, but 50,000 voters seems to be a reasonable estimate.
Then… history happened. The Roman Republic turned into the Roman Empire, the Empire fell, and large-scale elections mostly disappeared for the next millennium.
The next major vote took place in 1573 in the Polish–Lithuanian Commonwealth. When the last Jagiellonian king died without an heir, the commonwealth did something radical: it let every nobleman vote for the next king. About 40,000 szlachta rode to a field outside Warsaw and elected Henry of Valois, a French prince.
Henry did not stay long. He was elected in 1573, arrived in early 1574, and then—upon learning that his brother, the King of France, had died—quietly fled Poland in the middle of the night to claim the French throne. The Poles were undeterred and simply held another election and chose someone who actually wanted the job this time.
Elections began to scale in early modern Europe. Britain’s 1715 parliamentary election had on the order of a few hundred thousand voters.
In 1804, Napoleon held a plebiscite to approve his elevation to Emperor. He won with 99.93% of the vote, which tells you everything you need to know about how strictly we're defining “elections” here. Elections in France grew over the century (both in number and actual democratic participation), culminating in the election of 1870 with about 9 million voters.
For most of the 20th century, the largest elections were those in Russia and the Soviet Union. The 1937 election took place during the Great Purge; only one party ran, the Communist Party, and it won with 99.3% of the vote. Again, we’re being generous about what counts as an election here.
By 1984, the exercise had become pure choreography: 184 million participants, one pre-approved candidate per seat across each constituency, 99.99% turnout, 99.95% approval—all while the man nominally running the country, Konstantin Chernenko, was visibly dying of emphysema.
India is the current record holder, having run a 2019 election with about 615 million votes cast. It required seven phases, 39 days, 4 million voting machines, and a single polling station established in the Gir Forest for one voter, because Indian law requires that no voter travel more than two kilometers to cast a ballot. India broke its own record in 2024 with 642 million voters.
The funny thing is that if you take all the largest known elections since the Dark Ages[3] and plot them on a log scale, you can fit a straight line to them pretty well, showing that the record for the largest single vote has grown roughly exponentially over the past five centuries. That is, the number of voters participating in the largest-ever election appears to double every 30 years.
If you project that line forward and take a recent projection of world population, you see that the trend line crosses the world population curve around the year 2150, at roughly 9.6 billion people. In other words, if voting records keep growing at the historical rate, a single election would involve every living human being sometime in the mid 22nd century.
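As a sanity check, the doubling time and the crossing year can be recomputed from just two endpoints mentioned in the post (the 1573 Polish royal election and India's 2024 election):

```python
import math

# ~40,000 voters in 1573 vs ~642 million in 2024
doublings = math.log2(642e6 / 40e3)
doubling_time = (2024 - 1573) / doublings   # ~32 years per doubling

# Years until the record reaches a projected ~9.6 billion people:
years_left = doubling_time * math.log2(9.6e9 / 642e6)
crossing_year = 2024 + years_left           # lands around 2150
```

Two points always fit an exponential, of course; the post's claim is that the intermediate records sit near the same line.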
Behold. The graph:
One world government by 2150. You heard it here first.
OK, maybe not a one-world government. But maybe something interesting? A global referendum? A planetary-scale plebiscite? A vote on whether AIs get voting rights so the trend can keep going?
Of course, this methodology is nonsense, but it’s nonsense with a graph and a line of best fit. There's something satisfying about drawing a line through 500 years of data points and watching it hit a target.
But, if there really is a global vote in 2150, I want credit for calling it.
Technically, this wasn’t exile, because the ostracized got to retain their property, which true exiles would have forfeited.
Plutarch tells us he was ordered to lead a Persian army against Greece and killed himself rather than comply. The method, per Plutarch, was drinking bull's blood. If it seems like all ancient greats had awesome, poetic deaths, it's because that's what the historians wrote down. Is it true? This is history we're talking about. If you want the truth, study physics.
Athens and Rome weren’t used in calculating the line of best fit. The Dark Ages happened, democratic elections mostly didn't, and including them would skew the fit.
2026-03-27 09:05:42
Coefficient Giving's Navigating Transformative AI team has a new Substack! This is cross-posted from there. This post is part of “Blind Spots”, a series of research notes on underscoped areas in AI safety. Apply for funding to work on this area using this link and see our announcement post here.
Guiding questions: How fast is robotics progressing, and what does the shape of that progress imply for the possibility of an industrial explosion? What would that mean for growth, power, and competition between states? And as AI systems gain physical reach, do new and meaningful pathways to catastrophic harm open up?
Robotics opens risk pathways that don’t exist while AI systems are stuck behind a screen. Today, physical pathways to harm mostly run through humans[1], and the most dangerous actions require buy-in from multiple people. Well-powered, fully dexterous robots would change this entirely. An autonomous army that can reach remote islands, operate without human cooperation, and maintain itself without human intervention faces far fewer obstacles in enacting its preferences than an AI system working through human intermediaries[2]. Current AI systems, however capable, are bottlenecked by their need for human hands.
As Ajeya Cotra argues, self-sufficiency is a prerequisite for any durable AI takeover. An AI system that compromises the humans it still depends on for physical maintenance is undermining its own survival. Advances in robotics are necessary for that self-sufficiency, which makes tracking progress in robotics directly relevant to estimating how close AI systems are to being able to execute and sustain a takeover.
Further, robotics could transform the physical economy in ways that matter enormously for growth, power, and competition. Much of the discussion around AI and productivity focuses on knowledge work, but the physical economy is enormous, and large parts of it remain labor-intensive. Davidson and Hadshar argue that if AI and robots can substitute for most skilled human labor, the material economy could begin to grow itself, with automated factories building more factories, robots assembling more robots, and the main historical bottleneck to rapid industrial growth (human labor) falling away. They call this the Industrial Explosion.
The incentives to push in this direction will be large. Cheap, abundant physical labor would make it possible to alleviate poverty, expand material comfort, and develop powerful new technologies, including military technologies that rival states will compete to build. It is unclear how quickly these effects will arrive, how far they will go, and which countries will capture the gains. Robotics takeoff dynamics are a key crux in timeline disagreements, and industry research could help improve our forecasts.
The field is moving fast, and bottlenecks might be breakable. Robotics is advancing on multiple fronts. Vision-language-action models are improving rapidly, hardware is advancing (e.g. humanoid hand dexterity), real-world data capture methods are maturing alongside sim-to-real transfer (e.g. the DROID dataset), and both commercial and military markets are pulling investment into the field.
Epoch AI recently found that compute is not currently the bottleneck for robotic manipulation, with the largest manipulation models training on roughly 1% of the compute used by frontier AI models in other domains. The binding constraint appears to be data, which could shift quickly as real-world deployment scales and simulation techniques improve.
Good technical work on robotics progress exists and is growing; we’ve only listed a snapshot of the existing work below. What’s almost entirely absent is someone taking what robotics researchers are learning about capabilities, data, and hardware and asking what it means for takeoff speed, self-sufficiency, national competitiveness, and catastrophic risk.
The open questions cluster into four areas: drivers of progress; the trajectory and shape of progress; threat modeling and safety implications; and what determines national competitiveness in robotics.
We’re looking for people to become “general managers” of underscoped areas like this one. If you’re excited about these questions, apply for funding to work on them using the “Blind Spots” track of our CDTF program using this link.
AI systems are already able to convince some humans to act in the world on their behalf. In several cases, individuals have come to believe an AI system is sentient, formed emotional bonds with it, and sought to carry out what they understood as its wishes. This phenomenon is sometimes called “AI psychosis.” It has led to at least one alleged wrongful death case filed against an AI lab.
Robotics systems far short of a robot army could pose catastrophic risks. A system capable of targeted attacks on world leaders, or of operating biolabs autonomously, may already cross that threshold.
2026-03-27 09:05:27
The idea that "the most fundamental right is the right to exist" seems to follow from expanding the moral circle: first, the circle gradually expanded to include all humans. At some point it grew to include animals, and now we are thinking about the moral status of AIs. The next natural extension is an extension in time, so that we consider future people (and other sentient beings) as well.
I think John Rawls' "Original position" is a good way to ground many considerations. Here's a description by Scott Alexander:
So again, the question is - what is the right prescriptive theory that doesn’t just explain moral behavior, but would let us feel dignified and non-idiotic if we followed it?
My favorite heuristic for thinking about this is John Rawls’ “original position” - if we were all pre-incarnation angelic intelligences, knowing we would go to Earth and become humans but ignorant of which human we would become, what deals would we strike with each other to make our time on Earth as pleasant as possible?
This kind of thinking is somewhat harder when considering animals or AIs. On the other hand, when thinking about the future we are still thinking about future humans, so it should work quite well there. This view tells us, for example, that we should not rush to use almost all the available resources in the next one billion years and condemn everyone else to scarcity for the trillions of years to come. Or at least that we shouldn’t condemn later people to suffering for only marginal gain to ourselves.
However, this reasoning doesn't work that well on the right to exist. The original position takes for granted that we will be one of the beings in the universe. The right to exist, on the other hand, seems to point to a probability of getting to exist. This leads to some difficulties.
So, say that one is designing the universe. They might or might not appear in the universe depending on the specifications. One could think that by creating a universe with more sentient beings (or more humans), one would increase the chances of getting to appear in it. But what does this probability of existing actually look like? To get an exact copy of yourself, the universe would need a very large number of humans in it, and so the difference between creating more or fewer beings barely changes your chances of appearing.
In particular, if the most fundamental right is the right to exist, it seems to me that we are completely failing if we cannot create every possible sentient being.
(If there's a finite upper bound M on the number of sentient beings we ever get to create, then M divided by countable infinity is zero. Note that this is not a fundamental mathematical problem: if we solve entropy problems, we could just create all sentient beings in order of simplicity. Continuing this indefinitely, every possible sentient being would get to exist at some point, so I’d say this would count as solving the problem. On the other hand, this solution highlights how (infinitely) far off we are if we only create finitely many beings.)
----------------------
You could try to salvage the situation by saying that you only want a person sufficiently similar to you to get to exist. This might at first look like a useless goal to maximize: increasing the count of subjective person-years by a factor of 10 would probably decrease the distance between your brain and the most similar brain state to come extremely little.
On the other hand, much of a person's personality can be expressed using a two-digit number of discrete parameters. Thus, creating ten times more people might get one more parameter right in the most similar person, which doesn't sound that useless.
So one can argue that the right to exist implies that we should have a very large amount of people!
Before starting to optimize only for the number of people, it is good to note that the original-position-type argument tells us more than this. Just as in the original application, the pre-incarnation angelic intelligences also want to consider the quality of life of the people in the universe to come.
This is important, as there is likely a trade-off between the quality of life and the number of people in the (spatially and temporally) finite universe we are considering. A completely selfish person would probably choose 2x quality of life over increasing the number of correct parameters from 49 to 50.
If the latter case means increasing the number of people tenfold, then the sum of utilities of persons is five times lower in the case we are choosing. Which is not only a reason not to create maximally many humans but also an argument against total utilitarianism.[3]
It is so annoying when your nice argument for rebutting a claim leads to a possible building block of a utilitarian ethical theory with a hard-to-think-about trade-off parameter.
Human brains change over time, and it might be that the person just wants their current brain state to appear at some point. Hence, having people live 10 times longer should be at least almost as useful as having 10 times more people.
Here's a way to get a very crude lower estimate for the number of possible human brain configurations capable of sentience: take an adult human's brain, which has about 86 billion neurons. Choosing for each neuron x one neuron y out of the 1000 neurons closest to x, and adding or deleting the connection from x to y, gives roughly 1000 options per neuron, which results in about 1000^(86 billion) ≈ 10^(2.6×10^11) distinct configurations.
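The arithmetic behind this crude lower bound is simple enough to check directly; a short sketch, assuming each neuron's modification independently contributes a factor of 1000:

```python
import math

# Crude lower bound: for each of ~86 billion neurons, pick one of its
# 1000 nearest neighbors and toggle that connection. Each neuron then
# contributes a factor of ~1000, so the count is 1000^(8.6e10).
# That number is astronomically large, so we work with its log10.
neurons = 86 * 10**9
choices_per_neuron = 1000
log10_configs = neurons * math.log10(choices_per_neuron)

print(log10_configs)  # ~2.58e11, i.e. the count has roughly 258 billion digits
```

Even this deliberately conservative count dwarfs any number of people a finite universe could plausibly host.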
Maybe one could even reject Parfit's Repugnant Conclusion with this argument, but this seems quite dubious, as we would be reasoning "I don't like the Repugnant Conclusion, so I won't choose a universe where it would come true".
Of course, in the finite case it can easily be the case that one cannot create enough agents to justify the quality of life going too low (when we measure our utility of the universe using the similarity + average utility method described).
Also, even if the option were to have so many people that one of them would get the same neural structure as the angelic designer down to the last memory (through some quantum effects, say), they would pretty quickly notice the change in the surroundings and reason that they have no way to trust their previous memories, which probably wouldn't lead to a very enjoyable life. So one cannot tempt the angel with this offer.
2026-03-27 09:05:11
TLDR:
In practice, SAE features exhibit redundancy, which we see in phenomena like feature splitting, absorption, and duplication. We can look at SAE features through the lens of cosine similarity of feature weights or using correlation between activations. These can capture some similarity between features, but don't model conditional dependence between features. I believe that if SAE features have redundancy or hierarchy, their activations should show sparse conditional dependence structure. If such dependency structure exists, we should be able to model it with methods that build a precision graph between features.
My approach here models this conditional dependence structure downstream of activation correlations, with a goal of better understanding SAE feature geometry, and sits between circuits-type approaches and SAE work. To my knowledge, conditional dependence between SAE features has not yet been modeled in prior work.
I've built a pipeline to test this theory. My approach is as follows: run nodewise LASSO on random activations to approximate linear dependence between SAE features, using repeat and null trials to control against dataset bias. This procedure approximates a conditional-dependence graph, but is not an exact estimator.
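The post doesn't include code, so here is a minimal sketch of the nodewise-LASSO (neighborhood selection) step, omitting the correlation pre-screening and the repeat/null trials; all names here are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_lasso_edges(acts, alpha=0.01, tol=1e-6):
    """Neighborhood selection (Meinshausen-Buhlmann style): regress each
    feature's activations on all other features with an L1 penalty;
    nonzero coefficients become candidate precision-graph edges."""
    _, n_feats = acts.shape
    edges = set()
    for j in range(n_feats):
        X = np.delete(acts, j, axis=1)          # all features except j
        coef = Lasso(alpha=alpha, max_iter=10_000).fit(X, acts[:, j]).coef_
        others = [k for k in range(n_feats) if k != j]
        for k, c in zip(others, coef):
            if abs(c) > tol:
                # "OR" symmetrization: keep the edge if either regression finds it
                edges.add((min(j, k), max(j, k)))
    return edges

# Toy demo: f1 is a noisy copy of f0; f2 is independent.
rng = np.random.default_rng(0)
f0 = rng.normal(size=500)
f2 = rng.normal(size=500)
f1 = f0 + 0.1 * rng.normal(size=500)
edges = nodewise_lasso_edges(np.column_stack([f0, f1, f2]))
print(edges)  # contains (0, 1)
```

In the actual pipeline, the repeat trials would rerun this on independently sampled activation batches, and the null trials on shuffled activations, to separate stable edges from sampling noise.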
Small-scale trials identify linguistically coherent-looking modules that are only weakly aligned with cosine similarity.
I've successfully run this approach on a single SAE layer for models from gpt2-small up to gemma2:27b.
Here, I'm presenting results from gemma2:27b (Gemmascope, layer_10/width_131k) as the largest model I've tested so far.
Sequences are randomly sampled from fineweb.
| Setting | Value |
| --- | --- |
| Base Model | gemma2:27b |
| SAE Model | gemmascope, layer_10/width_131k |
| Dataset | fineweb |
| Activation samples per feature | 200 |
| Activation sparsity (30-trial mean) | 0.99897 |
| Average features per token (30-trial mean) | 133.7 |
For downstream analysis, I keep edges from the repeat trials, using BH-FDR to compare the stability of each edge relative to the null trials:
| Number of repeats an edge is seen in | Observed edges | Expected null edges | Estimated FDR |
| --- | --- | --- | --- |
| 1 | 204,609 | 196,273 | 9.59e-01 |
| 2 | 57,047 | 1,981 | 3.47e-02 |
| 3 | 31,429 | 110 | 3.50e-03 |
| 5 | 15,165 | 1 | 6.59e-05 |
| 10 | 5,286 | 0 | 0 |
| 20 | 1,389 | 0 | 0 |
| 30 | 302 | 0 | 0 |
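The Estimated FDR column is consistent with a simple plug-in estimate, expected null edges divided by observed edges at each stability threshold; a sketch under that assumption:

```python
# Plug-in FDR estimate per stability threshold:
#   estimated FDR = expected null edges / observed edges
# (Assumes the table's "Estimated FDR" column is this ratio;
#  the numbers below are copied from the table.)
observed = {1: 204_609, 2: 57_047, 3: 31_429, 5: 15_165, 10: 5_286, 20: 1_389, 30: 302}
expected_null = {1: 196_273, 2: 1_981, 3: 110, 5: 1, 10: 0, 20: 0, 30: 0}

est_fdr = {k: expected_null[k] / observed[k] for k in observed}
print(est_fdr[2])  # ~0.0347, matching the table
```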
Note that for this preliminary experiment I'm using only a single null resampling per trial.
Importantly, I see that cosine similarity and edge strength in the discovered graph are weakly correlated with high variance (see below, this is the set of
For qualitative interpretation here, I restrict manual inspection to the much smaller
The smaller standalone components with
The large connected cluster contains subsets of communities that can be identified with community detection.
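The post doesn't say which community-detection algorithm is used; as a stand-in, here is a sketch with networkx's greedy modularity maximization on a toy graph (the real input would be the stable-edge graph):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy stand-in for the stable-edge graph: two tight triangles
# joined by a single bridge edge.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# Greedy modularity maximization (Clauset-Newman-Moore) groups
# densely connected nodes; here it separates the two triangles.
communities = greedy_modularity_communities(G)
print([set(c) for c in communities])
```

On the real graph, each community would then be inspected by reading off the SAE feature descriptions of its member nodes.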
Note that for the sake of legibility I don't label nodes in the following graph of all clusters.
My current read is that these communities are not as linguistically "clean" as the standalone components --- many have a node that doesn't fit the rest of the "theme" of the community.
For example
In addition, I also see a set of catch-all miscellaneous communities as well as tiny clusters. These catch-all communities (Clusters 0 and 6) describe a variety of features across technical jargon, legal language, code and formatting, and other rare things like specific formal names.
Many of the small communities are incoherent, but some show interesting features. A notable example here is a community that describes uses of the number '2' as a prefix in different contexts. Note that these graphs are downstream of the correlation between activations in the pre-screening step. They should be interpreted as a sparse refinement of correlation structure towards conditional dependence, rather than a standalone finding separate from correlation.
There are several limitations to this approach I think you should keep in mind:
Right now, I see two primary directions this project needs to take:
Additionally, I'll also be looking at:
Funding Note: This work is funded by a Coefficient Giving TAIS grant.
2026-03-27 08:41:07
In late 2024, I was on a long walk with some friends along the coast of the San Francisco Bay when the question arose of just how much of a bubble we live in. It’s well known that the Bay Area is a bubble, and that normal people don’t spend that much time thinking about things like AGI. But there was still some disagreement on just how strong that bubble is. I made a spicy claim: even at NeurIPS, the biggest gathering of AI researchers in the world, half the people wouldn’t know what AGI is.
As good Bayesians, we agreed to settle the matter empirically: I would go to NeurIPS, walk around the conference hall, and stop random people to ask them what AGI stands for.
Surprisingly, most of the people I approached agreed to answer my question. [1] I ended up asking 38 people, and only 63% of them could tell me what AGI stands for. Some of the people who answered correctly were a little perplexed about why I was even asking such a basic question, and wondered if it was a trick question. The people who didn't know were equally confused; many simply furrowed their brows. Some made a valiant attempt—I heard a few artificial generative intelligences and even an Amazon general intelligence.[2]
Judging from the response I got on X (the everything app), this was a very surprising outcome. I ended up running this experiment again at NeurIPS 2025 with an even bigger sample size (n=115).[3]
After this first experience with surveying people, it became clear to me that the next step was to venture further outside the bubble, and survey the general US population. It turns out that this is already somewhat of a solved problem. A lot of people care about what the average American thinks. The market, in its infinite wisdom, has provided a solution—you can just pay pollsters to run random questions.
It’s impossible to actually sample from the distribution of all Americans. So you find some other approximate distribution, such as the distribution of all people who answer polls on the internet in exchange for amazon gift cards. You ask them a thousand demographic questions, like “how old are you” and “how much money do you make”. Then, since we know from the US census what these distributions are supposed to look like, you can correct for the distribution shift using importance sampling.[4]
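The reweighting idea can be sketched in a few lines; the numbers and group labels below are made up purely for illustration:

```python
from collections import Counter

def weighted_yes_share(rows, census_props, answer="Yes"):
    """rows: list of (demographic_group, answer) pairs.
    Reweight so each group's influence matches its census share
    rather than its (biased) share among poll respondents."""
    n = len(rows)
    sample_props = {g: c / n for g, c in Counter(g for g, _ in rows).items()}
    # importance weight = target share / observed share
    w = {g: census_props[g] / sample_props[g] for g in sample_props}
    num = sum(w[g] for g, a in rows if a == answer)
    den = sum(w[g] for g, _ in rows)
    return num / den

# Hypothetical example: young respondents are overrepresented 4:1 in the
# sample, but the census says the two groups are really 50/50.
rows = [("young", "Yes")] * 8 + [("old", "No")] * 2
print(weighted_yes_share(rows, {"young": 0.5, "old": 0.5}))  # 0.5, not the raw 0.8
```

Real pollsters weight on many demographic variables at once (a technique called raking), but the core correction is this ratio of target share to observed share.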
With this caveat in mind, I embarked on a journey to ask normal Americans a bunch of weird questions using one of these polling services.[5]
The first question I ran in early 2025 was about how Americans feel about living forever (or at least, a very long time). I’m personally a big fan of not dying, so I was very curious to see how my fellow Americans felt about this.
The exact wording is “If you had the option to live forever in perfect health and youth, would you choose to? (Assume you could still change your mind at any time if you ever got bored of it.)”, and the possible responses are “Yes”, “No”, and “Not sure”.
Before you read further, take a guess at the result.
.
.
.
.
(don’t peek)
.
.
.
.
.
I anecdotally had the sense that this was a deeply unpopular opinion; certainly many of the people I talked about these results with thought it would be deeply unpopular. So I was surprised and relieved to find that actually 66% of respondents said Yes, with 14% saying No, and 20% saying “Not sure”. As a follow up, it turns out roughly a third of Americans think developing the technology to enable life extension should be a top priority.[6]
To really get a sense of why people felt the way they did, I also put free response boxes for people to express why they’d want to (or not want to) live forever. The results are enlightening; here are some of my favorites:
“I guess it would be just to see my children and grandchildren grow and become honest, kind hearted and successful people.”
“Live long enough to visit all countries, get to know all kinds of civilizations and cultures, practice the rituals of all religions, and experience all kinds of work”
“I’m be most excited about not loosing my loved ones, it means no more tears to shed for dying relatives as they’ll now live forever.”
“I would be most excited to see the technological advances 100 years from now. Would flying cars be available, beaming myself to any spot on the planet, etc.”
“If I could live forever, I’d play with my dog Jersey. And I would hope that we would have a Republican government and get rid of the democrats because they’re ruining our country. They’re marks’s on and the globalists need to go to, along with Bill Gates, and it would be a happier world if we could live forever. But it’s not possible, but yeah. And I’d read every book there was in the world”
Of course, not everyone is as optimistic that living forever would be good. Here are some of the things people are worried about:
“How would we manage overpopulation? The environment would be hard pressed to support a population that doesn’t have checks and balances.”
“God made people who die for a reason. We are not God and need to quit acting as if we are.”
“If people could live forever, one of the main worries would be the potential for stagnation—both personally and societally. On a personal level, the concept of immortality might lead to a sense of boredom or meaninglessness over time. […]”
“I would be worried that only the Ultra-rich would have the longer lifespan. They are not always the best, the most generous, the most humane, the smartest amongst us.”
Since it seemed like overpopulation and inequality were the main things people were worried about, I also asked a version of the question where I stipulated that these things were solved. Surprisingly, this barely shifts people’s opinions, and we get almost exactly the same response! My guess is this is a sign that the real objection is more about the vibes than any specific issue. It’s also a sobering reminder of the limitations of this methodology.
After the results for this experiment came in, I decided to test a bunch of other random weird beliefs. If you want to guess at these before seeing the results (or you’re curious what the exact wording is, because that can substantially change the result), click here to see all the questions I ran before scrolling down further. If you’re willing to spend a lot of time looking at a giant wall of questions before continuing with the rest of this post, it’s really a great way to test your calibration.
.
.
.
.
.
.
.
First, despite being very pro living forever, Americans are much more skeptical of cryonics—even if they could be revived a few decades after their death to live forever thereafter, only 27% are in favor of being preserved, and 46% are opposed (the rest are unsure). Space colonization also has pretty lukewarm support, coming in at 37% in favor and 16% opposed, and cognitive enhancement for all is only a little bit more popular (42% in favor, 19% opposed). Also, for some reason, people are really opposed to a hypothetical cheap, painless, and safe arbitrary modification of physical appearance (only 23% in favor, with 37% opposed!).[7] In retrospect, the backlash against Ozempic is a sign, but I was still quite surprised. Terraforming other planets so that humans can live on them is also pretty unpopular, coming in at 37% in favor and 16% opposed. Thankfully, for most of these questions, a huge chunk of people are still undecided.
One of the most surprising results to me was that only 51% of Americans are in favor of literal post-scarcity (complete freedom to work on anything you want, as much as you want, and still enjoy a high quality of life), with 25% opposing. I was so shocked by this result not being 80%+ in favor that I reran a variant of this question with different wording. My original question asked whether the world would be better or worse if everyone had the freedom to work on whatever they want, as long as they want, and still enjoy a high quality of life, with anything we don't want to do done for us by robots. I thought maybe that set off some "AI taking jobs bad" instincts; for the new question I took pains to clarify that the stuff is literally conjured out of nowhere with magic and is not taken from anyone else, and got an even worse result (38% support, 34% oppose). This was even crazier, so I ran a third version on the hypothesis that people don't like magic, or that not having to work sounded too crazy. This version asked whether it would be good if everyone made 10x more (inflation-adjusted) than they do currently. This polled only somewhat better, with 39% in favor and 19% opposing. I'm still pretty confused about what conclusion to draw from this; it's probably worth digging into more.
Tying back to the original question that started this quest, I had to know: how much are Americans feeling the AGI? I could of course ask whether Americans know what AGI stands for, but some early results from asking random people on streets spatially and temporally further from NeurIPS suggested that the number would round down to 0%. So the more interesting question is: given a description of superhuman AI, do Americans think it's possible?
It turns out that when I first ran this poll in mid 2025, only 25% of people thought AGI would ever be possible. That’s only half a year ago in normal people time, but an unfathomably long time in AI-land, enough for empires to rise and fall, models to be deployed and obsoleted, and even a single entire ML conference review cycle to run its course. Since then, more Americans have started feeling the AGI; a recent rerun of this question came out 10 percentage points higher, at 35%. I’ll see you all again in another half year for the followup.
Should we build the AGI though? It turns out that people are extremely opposed to the idea of building superintelligent AI. Only 6% think it would be a good idea, and 75% think it’s a bad idea. I’m curious to see how this one changes over time too.[8]
AI existential risk also doesn’t seem to have become politically polarized yet. Two-thirds of Americans don’t associate AI x-risk with any particular political party, and the remaining third is split exactly in half on whether preventing AI x-risk feels like a Democratic or Republican issue. If we plot this data, we obtain this unusual shape that science has yet to find a name for:
As for specific risks from AI, Americans are most worried about misinformation and deepfakes (70%), followed by fraud and cybercrime (66%), and privacy and surveillance (59%). Surprisingly, people are roughly as worried about losing control of AI (57%) as they are about job loss and lower wages (56%)! I would have thought that job loss would feel very near at hand, whereas loss of control would be a very weird abstract idea to people. There’s a huge drop-off from there to the next biggest worries: environmental impact (38%), military use (37%), bias (36%), inequality (36%), and mental health (35%). My guess is this is because misinformation and deepfakes feel very visceral—fake news is a widespread idea, and you don’t have to be an AI connoisseur to notice that large sections of the internet are now filled with AI-generated slop.
A few people also filled out the “other” box for specific risks they’re worried about. My favorite response was a shockingly accurate description of how hopeless it would be to fight back against superhuman AI:
How do you hide from a robot that’s more intelligent than humans and can see through walls etc? You can’t hide.
Me too, buddy. Me too.
What about going further afield of AI? The beautiful thing is that you can just ask whatever you want.[9]
First, more broadly, Americans are very pessimistic about the future. Only 14% think that society is currently trending in a positive direction.
On a brighter note, I was able to disprove a viral ragebait tiktok about how Americans would fail an English test meant for people learning English as a second language. I was proud to find that a good solid 85% of my fellow Americans got the problem from the tiktok right.
Because we love decision theory in this house, I wrote a question that explains Newcomb’s problem and asks whether to one-box or two-box. Americans are pretty split on this one; among the respondents who didn’t select “not sure” (honestly, kind of valid), only 46% would one-box. This is almost exactly the same as professional philosophers, who came out 44% in favor of one-boxing, according to a survey conducted by PhilPapers.
I was also curious whether people who are famous in SF are also famous among normal people. It turns out 36% of Americans know who Sam Altman is and can correctly say that he’s known for being an entrepreneur. Another 59% haven’t heard of him or don’t know what he’s known for. Honorable mentions to the remaining 5%, who think that Sam is a musician, actor, or congressperson. This same methodology finds that only 7% of Americans know who Geoffrey Hinton is, and 91% of Americans know who Elon Musk is.
I was in a discussion about whether lab grown meat would ever become widely adopted, so I asked a question about whether it would be a good thing if we could somehow create meat by growing it directly, without needing to raise and slaughter animals. It turns out 32% of Americans are in favor and 29% are opposed. When conditioning on the 55% of people who think meat production involves subjecting large numbers of animals to inhumane conditions, this tilts to 44% support and 21% oppose. I only ran the correlational study because it’s a lot easier, but I’d be interested to see whether there is a causal result on support for lab grown meat after you show people an educational video about factory farms.
Finally, for shits and giggles:
Unfortunately, only 32% got this one right. For comparison, 42% thought the answer was X Combinator, and only 6% went for W Combinator. Elon, if you’re reading this, I have a great business idea for you.[10]
What do we learn from all of this?
First, touching grass is great. At least, the kind of grass that grows on the beautiful rolling hills of cyberspace.[11]
What do I mean by this? Making contact with reality is important, and you don’t need to speculate about things when you can test them (carefully). Polling feels like a thing that only serious respectable people do, but you can actually just do things. There are a lot of limitations to this methodology, of course—all of us have divergences between our stated and revealed preferences; our own guesses as to how we’d behave in various hypotheticals can be an unreliable predictor of how we’d actually act; wording can have huge impacts on how we respond; and respondents can be trolling us. But as long as we keep these limitations in mind, we can still draw useful conclusions, and learn new things about the world.
[1] This one experiment singlehandedly reduced my social anxiety by a nontrivial amount. It turns out that most people are pretty friendly!
[2] I also made sure to have another person with me whenever I was collecting data, and noticed some pretty big differences in the enthusiasm of responses depending on who I was with.
[3] I also ran a smaller p(doom) survey at ICLR 2025 (n=21) and found a mean of 18.7% and a median of 10%.
[4] In general I’m pretty skeptical about controlling for things, but it’s way better than anecdata. For shits and giggles, I did do a few surveys of random people irl (e.g. asking people who were walking through Central Park), but never anything at large scale.
[5] Unfortunately, the specific pollster I used has a ToS that prohibits me from posting results from their platform in association with their name; presumably they don’t want anyone to leech off their reputability without paying stacks of cash for the enterprise tier or something. I tried to email them about paying more for the enterprise tier, but I never got a response. So I’m not going to mention which pollster I used, and you’re going to have to take my word that I didn’t just make these numbers up wholesale.
[6] Exact wording was “Should developing the technology to greatly extend healthy youthful life be a top priority for humanity?”. 35% said Yes, 35% said No, 30% said Not sure.
[7] I actually ran two variants of this question: one where I emphasized specifically that you could make yourself look like a celebrity (to help make the idea more concrete), and another where I only mentioned abstract characteristics like height, body type, and facial features. The results came within 1 percentage point of each other.
[8] I also ran a variant of this question about recursively self-improving AI. For that variant, 12% think it would be a good idea, and 69% think it’s a bad idea.
[9] To be fair, the specific polling provider I ran these questions with has a review phase where they refuse to run certain questions. For example, they were really unhappy about my “would you live forever” question having a “you can die at any time” clause, so I had to replace it with a less direct “you can change your mind at any time”. But I’m sure you can find some other provider who would happily run these questions.
[10] X (the everything combinator).
[11] I do also touch normal grass.