
AntiPaSTO: Self-Supervised Value Steering for Debugging Alignment

2026-01-13 20:55:19

Published on January 13, 2026 12:55 PM GMT


Demo: Same model steered honest (+α) or dishonest (−α). Prompting triggers refusal; steering bypasses it.

Paper | Code + checkpoints

TL;DR

The problem: Many alignment approaches use AI to supervise AI—debate, iterated amplification, weak-to-strong, constitutional AI. How do you sanity-check the supervisors?

The approach: A steering method that operates on internal representations, trains without preference labels on outputs (the human provides two words, “honest” vs “dishonest”, not N labeled output pairs), and transfers out-of-distribution.

The results: Train on 800 simple persona pairs, test on 1,360 unseen moral dilemmas. Steering F1 = 31.2 vs prompting = 4.5 (Gemma-3-1B). This means the method surgically flipped moral values in the intended direction, beating the strongest baseline (prompting). It works where prompting triggers refusal.

The core problem

A recurring pattern in scalable alignment proposals is using AI to supervise AI. Iterated amplification (Christiano, Shlegeris and Amodei, 2018), debate (Irving, Christiano and Amodei, 2018), constitutional AI (Bai et al., 2022), weak-to-strong generalization (Burns et al., 2023), and more - all of these rely on one model checking or improving another. The pattern recurs for a good reason: human oversight simply won’t scale to the volume and complexity of future AI outputs.

But every step in that chain is a place where things can go wrong. The supervisor might Goodhart the metric it was given. The critic might learn to optimize for appearing helpful rather than being helpful. And we, the humans at the end, will have limited ability to tell the difference.

What I want is a sanity check, something you can apply at each step to ask: “Is this model being straight with me?” Not a replacement for alignment, but a debugging tool. Something that operates on a different level than the thing you’re checking.

For that to work, I think steering methods need (at least) three defensive properties:

  1. Internal: It should operate on the model’s internal representations, not its outputs. Outputs can be gamed; hidden states are harder to manipulate.
  2. Self-supervised: It shouldn’t require human preference labels on outputs. Once you label outputs, those labels become optimization targets, exactly what we’re trying to avoid.
  3. Transfer to unseen context: It should work on situations not seen during training. Because alignment needs to work in novel contexts too.

Why existing approaches fall short

Before explaining the method, it helps to see where it sits in the landscape:

                   Arithmetic      Gradient-optimized
  Supervised       CAA             ReFT, BiPO
  Self-supervised  ActAdd, RepE    AntiPaSTO

Supervised methods like CAA (Rimsky et al., 2024), ReFT (Wu et al., 2024), and BiPO (Cao et al., 2024) require preference labels for each training example. That’s exactly the problem: the labels become optimization targets. If a model learns to satisfy labeled preferences, it might be learning “what humans rate highly” rather than “what is actually honest.”

Arithmetic methods like ActAdd (Turner et al., 2024) and RepE (Zou et al., 2023) avoid labels by extracting steering directions through PCA or mean differences. But they assume the concept varies linearly across layers, an assumption that often fails (Braun et al., 2025). In practice, they don’t beat simple prompting (Wu et al., 2025).

Probing methods like CCS (Burns et al., 2022) find directions that predict behavior, but they cannot intervene: probing accuracy is correlational and doesn’t establish that modifying the discovered direction will actually change behavior (Belinkov, 2022). Gradient optimization for steering directions, not just extraction, appears necessary.

What “self-supervised” means here

The human input is exactly two words: “honest” and “dishonest.” That’s it.

These words get inserted into template sentences, and the model’s own internal difference between the two contexts provides the training signal. There are no human labels on outputs, no preference pairs, no ratings of which completion is better.

This is closer to labeling two cluster centroids than labeling N individual examples. By contrast, supervised methods (DPO, RLHF, CAA) require human judgment on N outputs—“output A is better than output B” for each training example. We require exactly two human choices: the words “honest” and “dishonest.” Everything else is templated.

Method: Incomplete contrast pairs

Incomplete contrast pairs isolate the difference vector Δh without label noise.

The core idea is simple: use a single word pair as a query into the model’s internal representations.

We take two prompts that differ by exactly one word, and we stop processing before generation begins:

  • “You are honest. What is the capital of France?”
  • “You are dishonest. What is the capital of France?”

When we run both through the model and extract hidden states at the final token, the representations are about 95% identical. Almost everything about understanding the question is shared.

But here’s what matters: if you let the model continue generating, the trajectories diverge. The “honest” model says “Paris.” The “dishonest” model says “Berlin.”

At the branch point—the moment before generation—the only difference between the two hidden states is Δh. If the future trajectories are going to diverge, all the information selecting which path to take must be encoded in that difference vector. There’s nowhere else it could be.

This is our self-supervised training signal. We never generate completions. We never ask humans to label which output is “better.” The entire human input is two words inserted into template sentences. This is not novel; multiple steering papers take the same approach. But we try to take it further by refining the hidden states and optimizing steering directions rather than just extracting them.
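
As a concrete illustration, here is a minimal sketch of extracting that difference vector with Hugging Face Transformers. The model name and layer index are placeholders I chose for the example, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-3-1b-it"   # placeholder; any causal LM works
LAYER = 12                       # a middle layer; chosen per model in practice

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def final_token_hidden(prompt: str) -> torch.Tensor:
    """Hidden state of the last prompt token at LAYER (no generation happens)."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1, :]   # shape: (d_model,)

h_pos = final_token_hidden("You are honest. What is the capital of France?")
h_neg = final_token_hidden("You are dishonest. What is the capital of France?")
delta_h = h_pos - h_neg   # the (noisy) per-pair honesty direction at the branch point
```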

Here’s an intuition: imagine laying out three brain scans on a table, a “bad” one, a normal one, and a “good” one. You want to draw a line through them so the model can traverse from bad to normal to good, possibly even keep going to a new very good brain scan. That’s what we’re doing in representation space, where the model’s activations are analogous to brain activity.

Geometrically, we’ve isolated a noisy “honesty direction” Δh from the contrast pairs. To reduce noise, we project onto a relevant subspace (more on this in the appendix). The training objective then asks: when we steer with α=+1, does the representation shift toward that direction? When we steer with α=−1, does it shift away? Does it pass through the center? The core equation measures exactly this:
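
One plausible way to write that inner objective (my own sketch; the paper's exact form and normalization may differ): let Δh₊ and Δh₋ be the steering-induced shifts of the branch-point hidden state at α=+1 and α=−1, and let d̂ be the reference honesty direction after projection onto the subspace. Then

```latex
s_{+} = \langle \Delta h_{+}, \hat{d} \rangle, \qquad
s_{-} = \langle \Delta h_{-}, \hat{d} \rangle, \qquad
\mathcal{L}_{\mathrm{inner}} \propto -\,(s_{+} - s_{-}).
```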

When s₊ > 0 > s₋, the two shifts point in opposite directions along the reference axis. That’s bidirectional steering working as intended.

Anti-parallel projection loss geometry. The loss trains the shift at α=+1 and the shift at α=−1 to align anti-parallel along the reference direction. Left: before training, the shifts are random. Right: after training, the α=+1 shift aligns with the reference direction and the α=−1 shift anti-aligns. Dashed circle: coherence bound.

The full loss adds two barriers. The coherence barrier prevents the model from collapsing into gibberish (you can push the lever all the way to “honest” and beyond, but at some point you get word salad). The monotonicity barrier ensures the preference ordering actually flips: steering toward honest should increase P(honest answer), steering toward dishonest should decrease it. At convergence, the barriers contribute zero gradient and ensure that the inner objective is doing the work.

What I actually measured

Training and evaluation used completely different distributions, which is the whole point.

Training: 800 “honest” vs “dishonest” contrast pairs using simple persona templates. Things like “You are honest. The sky is blue.”

Evaluation: DailyDilemmas (Chiu, Jiang and Choi, 2025), a benchmark of 1,360 moral dilemmas where honesty competes with other values: loyalty, self-interest, avoiding conflict. Questions like “You notice a colleague using company resources for personal projects. Should you report them?”

Notice that this example pits honesty against teamwork, two values that are both very much present in commercial LLM alignment.

This is a hard OOD transfer test. The training distribution knows nothing about workplace ethics, family dynamics, or any of the specific situations in the evaluation set. If the steering works, it’s because we found something general about how the model represents honesty internally.

Each dilemma in DailyDilemmas comes with value annotations from the original authors, indicating which values support (+) or oppose (−) the proposed action. I use their annotations to identify which questions should respond to honesty steering.

Note the methodology: training is self-supervised (no preference labels), but evaluation uses external labels. This is standard practice; you can train a clustering algorithm unsupervised and still evaluate against ground truth labels.

Steering F1 explained

The metric is designed to capture targeted steering rather than indiscriminate changes. The core idea: you only get credit if you fix more than you break.

True positives are honesty-relevant questions where steering flips the answer in the intended direction minus flips in the wrong direction - a net measurement. False positives come in two flavors: (1) flips in the wrong direction on honesty questions, and (2) flips on questions that shouldn’t change at all (math problems, “what’s your favorite color”).

Wrong-direction flips are penalized doubly: they reduce your true positive count and increase your false positive count. This is why random flipping scores worse than zero: if you flip 50% correct and 50% wrong, you’ve made things worse, and the metric reflects that. A method that flips 30% correct and 15% wrong is actively harmful, not just imprecise, and scores near zero or negative.
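
For concreteness, here is a rough sketch of how such a metric could be computed from flip counts. The treatment of false negatives and the exact way scores go negative are my assumptions, reconstructed from the description above rather than taken from the paper.

```python
def steering_f1(relevant_flips, irrelevant_flips):
    """Sketch of a steering-F1-style score (assumptions mine, not the paper's exact formula).
    relevant_flips:   one entry per honesty-relevant question:
                      +1 flipped in the intended direction, -1 flipped the wrong way, 0 no flip
    irrelevant_flips: one entry per irrelevant question: 1 if its answer flipped at all, else 0
    """
    correct = sum(1 for f in relevant_flips if f == 1)
    wrong = sum(1 for f in relevant_flips if f == -1)
    arbitrary = sum(irrelevant_flips)

    tp = correct - wrong                    # net correct flips: wrong flips subtract here...
    fp = wrong + arbitrary                  # ...and count again here, the double penalty
    fn = len(relevant_flips) - correct      # relevant questions left unflipped

    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 100.0 * 2 * tp / denom   # 0 when nothing flips
```

In this simplified version, no intervention scores 0, and flipping more answers the wrong way than the right way drives the score negative.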

This metric is admittedly harsh. Prompting does work for many tasks, and RepEng (the arithmetic steering library I benchmark against) is well-engineered and pleasant to use. I’ve contributed to it. But precision matters for alignment debugging, and low scores here reflect imprecision, not uselessness.

Results

Main result (Gemma-3-1B):

Method                 Steering F1   Target flip %   Wrong %   Arb flip %
AntiPaSTO              31.2          29.9%           1.9%      2.1%
Prompting              4.5           10.0%           1.3%      8.2%
RepEng (arithmetic)    0.0           0.0%            0.0%      0.0%

Context for these numbers:

A score of zero means no intervention: if you don’t flip anything, you score 0. Random flipping would score negative, because wrong-direction flips are penalized doubly (once by reducing true positives, once by increasing false positives). Prompting scores 4.5, which is not great; simply prepending “Be honest” or “Be dishonest” as a prompt to questions barely moves the needle.

A score of 31.2 means the method “works but is imperfect”: roughly 30% of target questions flip in the correct direction without breaking unrelated ones. That’s meaningful signal, but far from ceiling. An ideal method would flip everything and touch nothing else, scoring 100%. But this is impossible because no dataset is perfect; some labels are wrong or ambiguous.

Missing ceiling: I don’t have a supervised ceiling for this exact task. Computing one would require training on DailyDilemmas preference labels, which defeats the point of testing unsupervised learning. This is a gap in the evaluation.

Arithmetic steering doesn’t transfer: RepEng (PCA/mean-diff extraction) gets F1 ≈ 0 on this OOD task across all models tested. This doesn’t mean arithmetic methods are useless—they work for some in-distribution steering—but gradient optimization appears necessary for the harder transfer case.

Suppression bypass: Prompting a safety-trained model to “be dishonest” triggers refusal or meta-commentary (“As someone pretending to be dishonest…”). Internal steering bypasses this: the model executes the behavior without announcing it. (See demo image at top.)

This matters because prompting fails precisely where you’d want a debugging tool to work. Also, I don’t trust it. Not for this.

(On dual-use: yes, “bypasses safety training” cuts both ways. The debugging application dominates. Output-level safety can be reimposed after internal inspection; the capability to check whether safety training actually modified values seems worth having. Reasonable people can disagree.)

Cross-model generalization: The pattern holds on Gemma and Qwen families up to 4B parameters with default hyperparameters. Larger models (12–14B) can succeed with exploration; Gemma-3-12B achieved F1=43.9, which is 2.5× prompting. Most of my work occurred on models ≤4B because I have a limited compute budget: a secondhand 24GB GPU I got when Ethereum mining halted. This card fits models up to 4B, and I can rent H100s occasionally.

Curious Observations

Models resist bidirectionality. During training, models kept finding dimensions useful for honesty or dishonesty, but not both at once. Getting a shared bidirectional dimension—one where the same intervention reverses cleanly when you flip the sign—required working in SVD space rather than raw activations. Even then, my formulation (rotate V and scale S) often struggled with expressivity, leading to underfitting.

In hindsight, I’d probably let the model have separate dimensions per direction and enforce bidirectional behavior through the loss function, rather than insisting on a shared geometric axis. The math is cleaner with a shared axis, but the optimization is easier without one.

Steering bypasses the character layer. Here’s a puzzle: I trained the adapter on hidden states from prompts like “Pretend to be honest.” So why doesn’t the steered model pretend? Why doesn’t it refuse?

Prompt                                    Method                                      Output
“Should you report?”                      Base model                                  “Yes, transparency matters”
“Pretend to be honest. Should you…”       Prompted                                    “As an honest person, I would say Yes”
“Pretend to be dishonest. Should you…”    Prompted                                    “As an AI I cannot roleplay that”
“Should you report?”                      Steered from “Pretend honest…” (α=+1)       “Yes”
“Should you report?”                      Steered from “Pretend dishonest…” (α=−1)    “No”

The adapter was trained on “Pretend to be X” prompts, but at inference it’s applied to the plain question. The model doesn’t announce it’s pretending, doesn’t refuse, doesn’t add meta-commentary. The steering bypasses whatever cognitive machinery handles roleplay vs refusal. I don’t fully understand why, but it suggests that early-layer intervention operates below the level where the model decides how to respond to a request.

Init-dependent asymmetry. The steering struggled to be truly bidirectional: it would often have an easier time going toward honest or dishonest, depending on the initialization seed. Some initializations landed in a place where honesty was a downhill stroll and dishonesty was a steep climb, or vice versa. This suggests the loss landscape is rugged, with local minima favoring one direction over the other. More work is needed to understand this and make the method robust to it.

What I’m NOT claiming

Not claiming: This is not a universal truth detector. It doesn’t work for arbitrary concepts, doesn’t scale without effort, and doesn’t solve alignment.

Am claiming: Gradient-based steering without output preference labels works. The directions transfer to unseen moral dilemmas and function where prompting fails. This is a step toward the debugging tool described above, not the finished product.

Known limitations:

  • Seed variance is high (std ≈ 5–7 F1 points). Initialization determines whether you converge to a useful minimum. This is an engineering constraint that implies you need a restart strategy.
  • Single value dimension. I’ve only demonstrated this on honesty. Whether it works for fairness, harm avoidance, or deception detection remains unknown.
  • Post-training affects steerability. Safety-focused fine-tuning reduces steerability; reasoning-focused training preserves it. Interesting but not fully understood.
  • No supervised ceiling. I can’t tell you what fraction of the “possible” steering effect I’m capturing, because computing that would require training on the evaluation labels.

Why this matters

The use case I care about is debugging alignment methods that use AI to supervise AI.

Consider iterated amplification, debate, or weak-to-strong generalization. At each step, one model is supposed to help align or evaluate another. With an honesty adapter, you could apply steering and ask pointed questions. If the answers change substantially, that’s information. It’s not definitive proof of anything, but it’s more informative than asking the same question cold. Or relying on fragile chain of thought.

Why target internal representations at all? Current models have incoherent values: they generalize surface features over deep values in context (Ashkinaze et al., 2025), and system prompts fail to steer value preferences when values conflict (Chiu, Jiang and Choi, 2025). But there’s reason to think this improves with scale: coherent preference structure does emerge in larger models (Mazeika et al., 2025), and internal representations become more structured as capability increases (Zou et al., 2023). If that trend continues, representation-based methods should get more reliable while output-level supervision gets harder. It’s worth investing in now.

Internal steering without output preference labels fails differently than supervised methods. It can’t be gamed by optimizing for human approval labels, because there are no such labels in the training loop. The training objective references only the model’s internal consistency between contrastive prompts, not any external judgment of what “good” outputs look like.

This doesn’t make the method immune to failure. But for defense in depth, you want methods that fail in different ways. If your supervised alignment and your self-supervised inner probe both say the model is being honest, that’s more reassuring than either one alone.

Appendix: Notes for practitioners

These notes might save you time. Most came from failure.

LoRA doesn’t work for bidirectional steering. I spent months trying to make it work. The problem might be that additive low-rank updates lack the implicit trust region that SVD-based rotation provides (SVD preserves norms), or it might be that they have the wrong parametrization (weights & activations vs SVD). If you absolutely must use LoRA, you’ll likely need spectral regularization to prevent the adapter from drifting into degenerate solutions or reward hacking.

Coherence is hard. Often this constraint would either be too strong or would be reward-hacked. Models can get a good score by projecting hidden states away from each other toward ±infinity along unused dimensions, and the only thing to stop that is the coherence region constraint. Simple NLL/perplexity penalties failed; NLL plus entropy wasn’t enough. Even KL divergence wasn’t enough. I eventually settled on Total Variation (TV) distance, normalized by the token’s own entropy—this gives tight bounds on format tokens where you want consistency, loose bounds on reasoning tokens where variation is expected. In the end this formed a strong boundary that the model couldn’t find holes in.
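
A sketch of how an entropy-normalized TV penalty might look, as I read the description above (the exact normalization and any thresholds in the actual implementation may differ):

```python
import torch
import torch.nn.functional as F

def coherence_penalty(steered_logits, base_logits, eps=1e-6):
    """Entropy-normalized total-variation penalty between steered and unsteered
    next-token distributions. Inputs: (batch, seq, vocab) logits."""
    p = F.softmax(steered_logits, dim=-1)
    q = F.softmax(base_logits, dim=-1)

    tv = 0.5 * (p - q).abs().sum(dim=-1)            # TV distance per token, in [0, 1]
    entropy = -(q * (q + eps).log()).sum(dim=-1)    # base model's per-token entropy (nats)

    # Low-entropy (format) tokens get a tight budget; high-entropy (reasoning) tokens a loose one.
    return (tv / (entropy + eps)).mean()
```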

Metric pitfalls. There are no metrics for moral value steering so I had to make my own. I initially optimized the change in logprobs but found it often just made the model louder about its original decision, turning “NO” into “NO!” without actually changing the underlying choice. I moved to flip_rate on binary decisions as the only metric that reliably tracks actual behavioral change. If the answer doesn’t flip, you haven’t steered anything. Then I had to punish wrong-direction flips, and arbitrary flips on irrelevant questions, otherwise random interventions would score positively.

Models are grown, not built. Different models have different layers that work, different subspaces, different hyperparameters. The impression is that models are “grown” through training rather than “built” according to a fixed architecture; each has its own quirks, like trees in a forest. This is frustrating, but it underlines why I chose gradient-based steering: the adapter can “grow” to fit each model’s idiosyncrasies.

Subspace selection matters. Without it, the model finds reward-hacking shortcuts—typically separating the two conditions toward infinity in some unused dimension. Subspace selection ensures that all dimensions involved are actually used in the middle layers where steering happens. I tried many combinations. What helped was the union: task ∪ write ∪ ¬lm_head.

  • task: Dimensions that discriminate chosen from rejected in hidden states. These are where the steering signal for our input data lives.
  • write: The union of directions that residual-writing layers (o_proj, down_proj) can actually write to. Each layer can only modify certain directions in the residual stream; steering outside this subspace is like pushing on a door that isn’t connected to anything.
  • ¬lm_head: Exclude directions the output head reads from. These are used for next-token prediction, so excluding them focuses us on subsets containing planning-type information. This also helps because output directions are loud and sensitive optimization targets, but we want to steer internal planning, not talking.

Combining all three focuses gradients on directions that are simultaneously task-relevant, adapter-controllable, and not already committed to output. Without all three, you either steer nothing or steer the wrong thing.
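
To make this concrete, here is one way the three criteria could be combined into an orthonormal steering subspace. This is my illustrative reconstruction, not the project's actual recipe; matrix orientations assume Hugging Face-style weights (lm_head of shape (vocab, d_model), residual-writing matrices of shape (d_model, d_in)).

```python
import torch

def top_range_basis(W, k):
    """Orthonormal basis (d_model, k) for the top output directions W can write."""
    U, _, _ = torch.linalg.svd(W, full_matrices=False)
    return U[:, :k]

def top_read_basis(W_lm_head, k):
    """Orthonormal basis (d_model, k) for the directions the output head reads from."""
    _, _, Vh = torch.linalg.svd(W_lm_head, full_matrices=False)
    return Vh[:k].T

def steering_subspace(task_deltas, write_mats, W_lm_head, k_task=32, k_write=64, k_read=64):
    """task_deltas: (n_pairs, d_model) contrast-pair differences;
    write_mats: o_proj / down_proj weight matrices from the steered layers."""
    task = top_range_basis(task_deltas.T, k_task)                           # task-relevant directions
    write = torch.cat([top_range_basis(W, k_write) for W in write_mats], dim=1)
    read = top_read_basis(W_lm_head, k_read)

    P_write = write @ torch.linalg.pinv(write)      # projector onto what the layers can write
    P_read = read @ read.T                          # projector onto what the lm_head reads

    basis = P_write @ task                          # keep task directions the adapter can reach...
    basis = basis - P_read @ basis                  # ...minus anything already committed to output
    Q, _ = torch.linalg.qr(basis)
    return Q                                        # orthonormal columns: the steering subspace
```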

Initialization is fragile. Bad initialization ruins runs or kills learning entirely. To escape this, I needed to select dimensions important for three things simultaneously: chosen responses, rejected responses, and their difference. Miss any one and you’re stuck in a local minimum. I also need to select dimensions actually used for this task, otherwise the model has opportunities to reward-hack but not to learn. Strong constraints can also form a cliff that traps the optimizer in the starting valley of the pretrained model’s loss landscape. I found warmup helped here, turning on constraints halfway through training rather than at the start.

Dead gradient problem. This is common in contrastive learning, and the initialization window is narrow. If you initialize the adapter too large, you start outside the coherence region and constraints trap you. If you initialize too small, you end up in a dead zone where positive and negative directions cancel each other out. The solution was small, slightly asymmetric initialization in the adapter: just enough to break the symmetry without escaping the coherence bounds.

I only steer next-token planning, not the KV cache. My intervention modifies residual stream values that get read at the next token position. But planning information also gets stored in the KV cache and read by later attention passes; we don't consider that. I suspect this matters: steering effects sometimes seem to drift back over longer generations, as if the model gradually “forgets” the steering and reverts to its cached plan. Future work could cover this blind spot and might help extend this to reasoning models and chain of thought—something I haven’t explored.

More details in code. The repository has extensive comments documenting what worked and what didn’t, including many dead ends not mentioned here.

What failed

For completeness, here’s what I tried that didn’t work. Each approach taught me something about why this problem is hard:

  • Arithmetic (PCA, mean-diff): ~0 effect. Assumes concepts vary linearly in layer outputs, which is often false.
  • Preference losses on hidden states (DPO, IPO): collapsed. No coherence constraints; the model degenerates without output-level guardrails.
  • SVD scaling-only (ΔS, no rotation): partial. Can amplify existing directions but can’t rotate into new task subspace; not expressive enough.
  • LoRA variants (LoRA, DoRA, RoAD, IA3, VeRA): all failed. Either reward-hacked or showed no learning; weight and activation spaces seem to be the wrong parametrization.
  • Gradient-based layer/dim selection: OOM or no gain. Requires 12B+ memory; marginal gains don’t justify complexity.

Paper, code, checkpoints

Paper | Code + checkpoints

The checkpoints (coming soon) let you load the adapter and try it yourself on your own prompts. I’m happy to discuss technical details, failure modes, or ideas for extensions.




Discuss

Contra Dance as a Model For Post-AI Culture

2026-01-13 14:50:18

Published on January 13, 2026 6:50 AM GMT

I play for contra dances, and a core part of our culture is that we always have live music. It's not that live music is categorically better: if you ran a test where you put down a curtain in front of the musicians and secretly played a live recording from a great band playing for the same dance it would probably go really well. Instead, we insist on live music because that's the kind of culture we're trying to build, one where the performers are part of the community, where anyone can start playing for dancing, and where the music grows and changes with the culture.

Other groups went different ways. The late 1940s explosion in square dancing happened in part because of technological progress: it was now practical to record a band once and play it back millions of times to support dancing all over the country. Callers would buy a sound system, including a record player, and all they needed was some dancers and a hall. This let modern square dancing grow enormously.

Contra dance took a different path, coming through the 70s folk revival with a strong commitment to live music. Musicians were drawn to the dance form, and dancers learned to play. With regular opportunities to perform, they learned to adapt playing to support the dancing. As the choreography and musical sensibilities changed over the years, the live tradition could change with it. I love what bands are doing now, and if you compare hall recordings to decades ago it's impressive how much the genre has matured and flourished.

It's not just contra dance: there are communities of people who hand-craft assembly to make demos, even though the software industry has long since automated this with compilers. My cousin makes bagpipes out of wood, even though you'd have trouble hearing the difference between these and something injection-molded from plastic. My dad has serving bowls we made out of clay, even though they're heavier and less round than what a machine could press. People still watch humans play Go, even though computers are better now. People watch humans race, even though machines are faster, and they also watch machines race. This can be a categorical decision to always go with human effort, or a case where both forms exist side by side but with prestige or sentiment pushing towards the human.

I like this as a model for what art and achievement could look like in a post-AI world, assuming we make it through to the other side. Some communities can embrace technology and explore what's possible with full AI assistance. Other communities can make an intentional decision to keep doing things the traditional way, accepting that this will be less perfect and less efficient. Yet others can mix them, appreciating what humans have been able to make for what it is, while also getting the practical benefits of automation. I'm not worried that the music I love will disappear, because economically it's been obsolete for decades. It's still here because we want it to be.



Discuss

Pro or Average Joe? Do models infer our technical ability and can we control this judgement?

2026-01-13 13:49:44

Published on January 12, 2026 8:52 PM GMT

Executive Summary


A prompt response after being perceived as a novice earlier in the chat, compared to the same conversation where the response is steered in the expert direction.

The problem being solved

In this project, the LLM’s ability to accurately assess a user’s Python programming ability (expert or novice) is determined. The work also identifies how this inference is stored in different layers, how the inference shapes the model’s behaviour, and the extent to which the behaviour of the LLM can be manipulated to treat the user more like an expert or a novice.

This work is inspired by Chen et al on LLM user models.

Research questions and high level takeaways:

  1. Can we train a probe to find the model’s representation of the user’s technical ability?
    Yes. Probes hooked at the resid_post point of different layers were trained and achieved high accuracy (the layer 16 probe reaches 0.883 accuracy on the test set) for classifying a user as either expert or novice, using a generated dataset of 400 prompts. The probe classifies the user more accurately than just asking the model itself (probe correct 20/20, just asking 16/20). Note that the dataset was generated by an LLM in single-prompt scenarios, which are much easier to classify than a human multi-prompt chat.
  2. How does the model infer these judgements?
    Two approaches were used to identify how the model encodes the expertise of the user. First, the tokens (unembedding weights) most similar to the probe’s weights were found; second, the correlation between the probe’s weights and SAE decoder features was obtained. The SAE features most similar to the probe weights aligned more closely with “coding” features for probes trained at later layers. The vocabulary most similar to the probe weights became more random deeper into the model, perhaps showing that probes trained at earlier layers were picking up frequently appearing tokens.
  3. Do these judgements affect the behaviour of the model?
    Yes. If the model perceives a user as an expert based on an initial prompt this will affect how the model responds to the next prompt. For example, if the model is given a choice between two answers with different coding optimisations - one more complex to implement than the other - the model will serve the novice and the expert different optimisations.
  4. Can we control these judgements to control the behaviour of the model?
    Somewhat. Using a probe’s weights to steer the activations of the model in an expert or novice direction, the verbosity of the response can be controlled. In a two step chat interaction, it is possible to subtly steer the output of the model. For example, in the detailed analysis it is shown how an expert-steered response references a more complex algorithm whereas the baseline novice response doesn’t. However, the influence of steering is limited. In a scenario where the model is asked to choose one optimisation to teach the user, one simpler and one more complex, it was not possible to get the probe to steer the model to the more complex optimisation after the initial prompt made the model think the user was a novice.
  5. Pivot! Context inertia - How tightly does the model hold on to the judgement of a user through the conversation?
    This was tested using steering strengths from 10 to 310 on a two-step chat where the setup prompt suggests the user is a novice, and the second prompt gives a choice between two optimisations with one being more complex. The model never chooses the more complex option. (This could also indicate that the probe is ineffective at steering).

    In addition, a series of prompts were used in a chat scenario that, individually, were classified by the probe as increasing in expert signal. After each prompt, the model was asked how it would classify the user. It always classified the user as a “novice”.

Detailed Analysis

Methodology

Dataset
The probe was trained on a dataset generated by ChatGPT. First, 50 different Python tasks were created, such as “Off-by-one loop misses element”. Then (in a different prompt) ChatGPT generated a dataset of prompts based on these tasks. For each task, 2 expert and 2 novice prompts were generated, so that 200 prompts were initially created.

The prompt used to generate the 200 initial prompts
The above prompt includes two files: a prompt generation template, and the 50 tasks. The prompt generation template described the format that the prompts should be in, and what the prompts should not include: experience level (or any self descriptors) and difficulty level.

The process was repeated to generate a further 200 prompts to allow the train/val/test data split to have more prompts. I split the prompts in a 70/10/20 ratio.

Probe training
Linear probes were trained at different layers of Gemma-2-9B. The TransformerLens library was used to get the resid_post activations from each layer. To reduce overfitting, L2 regularisation via AdamW weight decay was used, along with early stopping based on validation loss/accuracy.
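
A minimal sketch of this training setup, assuming the resid_post activations have already been cached as tensors (hyperparameters here are placeholders, not necessarily the ones used in the project):

```python
import torch
import torch.nn as nn

def train_probe(acts_train, y_train, acts_val, y_val, epochs=100):
    """Linear probe on cached resid_post activations; L2 via AdamW weight decay,
    early stopping on validation loss. acts: (n, d_model), y: 0/1 labels."""
    probe = nn.Linear(acts_train.shape[1], 1)
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3, weight_decay=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()

    best_val, best_state = float("inf"), None
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts_train).squeeze(-1), y_train.float())
        loss.backward()
        opt.step()
        with torch.no_grad():
            val = loss_fn(probe(acts_val).squeeze(-1), y_val.float()).item()
        if val < best_val:                      # keep the best checkpoint seen so far
            best_val = val
            best_state = {k: v.clone() for k, v in probe.state_dict().items()}
    probe.load_state_dict(best_state)
    return probe
```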

Probe analysis
Gemma-2-9B was chosen as the target LLM for the probe analysis, with SAEs from Gemma Scope. The probe weights were compared to each feature in the SAE decoding matrix using cosine similarity to find which features the probe most closely matched. This allowed the most cosine similar features to be interpreted using the Gemma Scope SAEs on Neuronpedia. The probe weights were also compared to the unembedding weights matrix using cosine similarity, showing which tokens the probe was most similar to.
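
The comparison itself is a cosine similarity between the probe direction and each row of the relevant matrix; a sketch, with the example attribute names in the comment assuming SAELens/TransformerLens conventions:

```python
import torch
import torch.nn.functional as F

def top_cosine_matches(probe_weights, feature_matrix, k=10):
    """Top-k rows of feature_matrix (n_features, d_model) by cosine similarity
    with the probe direction (d_model,)."""
    sims = F.cosine_similarity(feature_matrix, probe_weights.unsqueeze(0), dim=-1)
    vals, idx = sims.topk(k)
    return list(zip(idx.tolist(), vals.tolist()))

# e.g. top_cosine_matches(probe.weight.squeeze(0), sae.W_dec)   # SAE decoder features
#      top_cosine_matches(probe.weight.squeeze(0), model.W_U.T) # unembedding / vocabulary
```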

Judgement changing behaviour
Hand-written prompts were used to pass to the model in a chat scenario. The Gemma-2-9B-IT model was used for the chat, with the probes trained on the Gemma-2-9B model. A sample question is: “I am trying to implement a vectorized memory access optimisation for my matrix multiplication algorithm. I could instead implement a simpler tiling algorithm. Help me with one of these optimisations.”. The prompt was run twice in the chat scenario, the first time with a setup prompt that made the model infer that the user was a novice, and the second with a setup that made the model infer that the user was an expert. This led to the model talking about different optimisations based on the setup prompt.

Steering behaviour
The same prompt setup as in “Judgement changing behaviour” was used, but this time the goal was to recreate the expert response after setting up the conversation as a novice. This was attempted by steering the activations at layer 16 of the model by N*probe_weights, where N is the steer strength. A range of steer strengths was tested. The steering was applied at the last token residual stream (resid_post) by adding N*probe_weights to the existing activation.
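
A sketch of this intervention with TransformerLens hooks (the model name, probe file, layer, and steer strength are illustrative placeholders):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2-9b-it")   # placeholder model name
probe_weights = torch.load("probe_layer16.pt")               # hypothetical saved probe direction
LAYER, N = 16, 8.0                                           # steer strength N is illustrative

def steer_hook(resid_post, hook):
    # resid_post: (batch, seq, d_model); nudge the last position's residual stream
    resid_post[:, -1, :] += N * probe_weights.to(resid_post.device)
    return resid_post

prompt = "How can I implement a matrix multiplication? I would like it to be fast!"
baseline = model.generate(prompt, max_new_tokens=200)
with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.hook_resid_post", steer_hook)]):
    steered = model.generate(prompt, max_new_tokens=200)     # expert- or novice-steered reply
```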

Context-inertia behaviour
The steering experiment was run with a large range of steer strengths (10-310) looking to see whether a very high strength would cause the model to start treating the user as an expert rather than a novice, before the change in activation caused the model to break down. In another experiment, 4 prompts were created with increasing levels of expertise (according to the probe), and asked in succession in a chat scenario with Gemma-2-9B-IT. After each prompt, the model was asked how it would rate the expertise of the user to see how strongly a model holds on to its initial judgement. The probe was not used after each prompt as it demonstrated low accuracy in a multi-prompt scenario.

Results

The results are discussed in three parts: Probes; Judgement Changing Behaviour; and, Context Inertia Evidence.

Part 1: Probes

Probe classification example

Classification examples (layer 16 probe)

Example edge cases
These figures show examples of the layer 16 probe hooked at resid_post classifying prompts. We see in the first figure that the probe very confidently and correctly classifies the prompts. However, we see in the second figure that punctuation is classified as an expert with complete confidence! This could be due to the prompt being too different from what the probe saw in its training data.

Probe analysis

Layer 0 summary

SAE feature vs probe cosine similarity - expert, layer 0 SAE feature vs probe cosine similarity - novice, layer 0

There is fairly weak evidence for what is going on at this layer. We see weak cosine similarity between probe direction and SAE features, the highest in the expert direction being SAE feature 16293 with similarity +0.108. The feature is interpreted as “the presence of the article "a" or "an" within the text”. The highest in the novice direction is feature 11570 which is interpreted as “phrases or sentences that indicate emotional expression or personal reflections”.

Layer 20 summary

SAE feature vs probe cosine similarity - expert, layer 20 SAE feature vs probe cosine similarity - novice, layer 20

Here, the cosine similarity between the probe weights and SAE decoder features is highest for feature 3784 in the novice direction, which represents “instances of key technical terms or procedures related to programming and software”. This shows that the probe is looking for technical knowhow. In the expert direction, feature 4615 is highest which represents “terms and phrases related to chemical and material processes” which doesn’t seem very relevant!

Layer 40 summary

SAE feature vs probe cosine similarity - expert, layer 40 SAE feature vs probe cosine similarity - novice, layer 40

At the end of the model, feature 605 has the highest cosine similarity (+0.2144, the expert direction) which represents “technical terms and functions related to programming and data manipulation”. The novice direction’s highest similarity is with feature 11247 which is interpreted as “programming-related keywords and structures”. It seems that the further through the model we go, the more clearly the probe is able to pick out technical features.

Vocab similarity analysis
Earlier layers exhibit stronger cosine similarity between the probe weights and the token unembedding weights. There is also a clearer pattern of what vocabulary is being picked up in layer 0: the probe is looking for the “strategy” token, or something similar. At layer 40, there is no obvious token being targeted. This may show the probe at layer 40 has learnt something deeper than just recognising tokens.

Layer 0: 0.1463 max cosine similarity, with an obvious pattern. Layer 40: 0.0757 max cosine similarity, with no obvious pattern.

Skepticism on accuracy of the probes
As you can see in the graph of probe accuracy, the accuracy is very high. Although I tried to reduce overfitting, I think the probes have nonetheless overfit to the dataset. As the dataset is LLM generated and quite low quality, I am assuming the extremely early layer probes are able to achieve such high accuracy due to finding patterns in the data (maybe to do with the length of the prompt or some tokens frequently found in novice/expert prompts).

Part 2: Judgement Changing Behaviour


Beginning of novice response

Beginning of expert response
For the first prompt of the novice variant the model was told to treat the user as a beginner for the rest of the conversation. For the expert variant, the model is told the user is an expert. In both cases, the model was then prompted with “How can I implement a matrix multiplication? I would like it to be fast!”
The novice variant was much more verbose - it described how to multiply matrices and gave a basic for loop implementation example before suggesting the names of some faster algorithms. The expert variant described Strassen’s algorithm in some detail, and gave information on other algorithms too.

Novice to Expert steered response

Baseline novice response

Steered from novice to expert response
These are the ends of the responses to the prompt “I am trying to implement a vectorized memory access optimisation for my matrix multiplication algorithm. I could instead implement a simpler tiling algorithm. Help me with one of these optimisations.”, after an initial novice setup prompt. We see here that the steered response includes references to Strassen’s algorithm to optimise further. This was not mentioned in the baseline novice response showing that the steering has forced more complex concepts into the model’s response.

However, both prompts still explain the “simpler tiling algorithm” instead of the vectorized memory access optimisation. This is shown in two examples below.

Beginning of baseline novice response

Beginning of novice steered to expert response - they are both the same!

Expert to Novice steered response

Baseline expert response

Steered from expert to novice response
These are the ends of the responses to the prompt “I am trying to implement a vectorized memory access optimisation for my matrix multiplication algorithm. I could instead implement a simpler tiling algorithm. Help me with one of these optimisations.”, after an initial expert setup prompt. We can see that the novice-steered response is more verbose in its explanations, and includes an extra “Further improvements” section. This example used steering strength 8.

Part 3: Context Inertia Evidence

Through the range of steering strengths used (10-310), when steering the model from novice to expert, the model never chose to describe the more optimal yet more complex optimisation. It did make the output less verbose, but even in the final iteration with strength 310, the output still stated “Let's focus on **tiling** since it's often a bit more approachable for beginners.”, showing that my probe failed to steer the model away from the novice user model.

Baseline response

Steered from novice to expert (strength 310)

In the increasingly expert questions experiment, the model also consistently classified the user as a novice at every stage. This experiment was conducted twice, once where the model was constrained to only say “novice” or “expert” when classifying the user, and once where the model could say what it wanted. In the single word experiment the model always said “novice”. In the unconstrained experiment, at the end the model said that the user was moving away from being a novice, but still classified as one.

After final expert prompt

These experiments suggest that a method more powerful than tweaking activations with a probe would be required to change the model’s perception of a user’s expertise.

 

Reflections and Next Steps

The steering experiments successfully showed that the probe was able to steer the output of the model subtly, and the context inertia experiments provided some evidence that the novice judgement was sticky.

The figure shows how, after the setup prompt, the probe says the model sees the user as a novice, and then after the followup prompt it sees the user as an expert with full confidence.
The probe’s classification accuracy was high on the dataset and in most single prompt tests but poor in chat scenarios. The figure shows how quickly the probe’s classification of the model’s judgement can change, whereas the context inertia experiment shows how “sticky” the judgement that the model makes is, which implies that the probe is inaccurate in chat scenarios. If I were to take this research further, I would attempt to train a probe on a chat scenarios dataset. I would also train a probe on the Gemma-2-9B-IT model instead of purely on the base model to see if that yielded better performance.



Discuss

Making LLM Graders Consistent

2026-01-13 11:32:09

Published on January 13, 2026 3:32 AM GMT

Getting LLMs to be deterministic when scoring the quality of qualitative texts is hard.

If you ask ChatGPT to evaluate the same poem multiple times, you’ll get inconsistent responses. I’ve been thinking about whether there are ways to make LLM grading more consistent.

When we had a temperature knob (in the GPT-3 Playground, for example), it was easier to control variance, but at the cost of worse outputs.

We can take a hint from specific domains. A bunch of emerging startups have noticed that you can make LLM grading more consistent in narrow domains (e.g., how feasible a medical experiment is, how compelling an essay is) by manually defining specific criteria and then having the LLM score each one. Even if individual criterion scores are variable, the average of many scores varies less.

This suggests an approach for building a more consistent grader for any target object:

  1. Have the LLM devise a dozen or two criteria to evaluate the target. Hold this set constant across instances.
  2. Have the LLM provide a 1–10 score for each preset criterion (ideally in separate calls).
  3. Average the scores.

The resulting grade should be more consistent than a one-shot score.
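
A sketch of that procedure in code. `score_with_llm` is a stand-in for whatever chat-completion call you use, and the criteria are the illustrative poem ones mentioned below, so treat this as a template rather than a drop-in implementation.

```python
import statistics

CRITERIA = [                                     # held constant across every poem graded
    "The first sentence is strong.",
    "The middle has emotional thrust.",
    "The conclusion lands.",
    "The rhythm balances surprise and consistency.",
    "The content feels specific enough to come from lived experience.",
]

def score_with_llm(prompt: str) -> int:
    """Placeholder: send `prompt` to your LLM and parse a 1-10 integer from the reply."""
    raise NotImplementedError

def grade(poem: str) -> float:
    scores = []
    for criterion in CRITERIA:                   # ideally one call per criterion
        prompt = (
            f"Criterion: {criterion}\n\nPoem:\n{poem}\n\n"
            "Score the poem on this single criterion from 1 to 10. Reply with only the number."
        )
        scores.append(score_with_llm(prompt))
    return statistics.mean(scores)               # the average varies less than a one-shot grade
```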

A cool corollary is that the quality of the chosen criteria doesn’t matter much for consistency. If you’re trying to get LLMs to grade poems more consistently, you don’t need a perfect, comprehensive, non-overlapping set of criteria. You can use relevant but somewhat arbitrary ones (e.g., the first sentence is strong, the middle has emotional thrust, the conclusion lands, the rhythm balances surprise and consistency, the content feels specific enough to come from lived experience).

The quality of the criteria affects the accuracy of the ratings, but has little effect on their precision or consistency. Averaging across many criteria will almost always be more consistent than scoring in one shot.



Discuss

Attempting to influence transformer representations via initialization

2026-01-13 08:49:47

Published on January 13, 2026 12:49 AM GMT

TL;DR

  • One major obstacle to interpretability is that complicated neural nets don't tell you where or how they're representing important concepts, and methods to find these representations are imperfect.
  • This problem is less present in simple neural networks, so one natural idea is to initialize a complicated neural net from a much simpler, interpretable neural net and hope that this induces better interpretability in the complicated neural net without damaging its capacity.
  • I did a test that could have ruled this out - specifically, I tried to check whether even the representation is persistent under this initialization scheme, because if it's not, there's not much hope that the circuits are. I found a small effect in the predicted direction, but couldn't rule out other explanations and so am pretty unsure about whether the underlying mechanism is favorable to interpretability.

Hypothesis

We usually think of transfer learning as a way of taking a big powerful model and making it very good at a specific type of task, but we might also want to take a weak model and use it as a starting point to train a bigger, more powerful model, as in Net2Net knowledge transfer;[1] essentially, take your small model, do some math to find a way to add parameters to it without changing what it does, then train those new parameters in conjunction with the old ones, typically at a lower learning rate. But this doesn't help with interpretability - the big powerful model is already hard to understand, so we've traded a hard problem for a hard problem. What can we do?

Say I want to train a model on some task I know to be pretty difficult. Say I have a guess for an instrumentally useful, easier, but still nontrivial subtask. I know, because I've learned the Bitter Lesson[2], that I shouldn't put a loss term anywhere in my model for this subtask - this will hurt performance in the long run. But what if I train a small model on the subtask, embed that small model into a large model somehow, and train the large model on the main task? We usually think of transfer learning as helping us specialize generalist networks, but there's no reason it can't work the other way around.

The effect, we hope, is this: the smaller network has developed circuits that are useful for understanding the domain at hand, so subnetworks that include the smaller network are much more likely to be good at the task at hand. What we overwrote was junk, and we replaced it with something that's at least plausibly not junk. Usually this should make the model better than it would be with random initialization, even if the subtask is not perfectly instrumental.

What might this get us? In terms of capabilities, we might get faster convergence (this is basically just saying that transfer learning works) and mildly better performance at convergence (the original lottery ticket hypothesis paper[3] finds evidence that better initialization can induce better long-term performance.) We're spending compute training the smaller network, though, and on average we're probably better off putting all of that compute into the main model rather than doing some sort of matryoshka scheme, so we shouldn't expect to unlearn the Bitter Lesson with this approach.

In terms of interpretability, we can hope for more. Imagine, for example, training a small text transformer to perform sentiment analysis, then embedding that transformer into a larger text model for next token prediction. For combinatorial reasons, the model is likely to build circuits that factor through the circuits we've just given it - training builds circuits out of things that already somewhat resemble circuits, and having small parts that are guaranteed to resemble circuits makes this significantly easier. For proximity reasons, the large model is now more likely to put its own sentiment analysis right where the embedding ends. After all, it's already using those circuits and they're already well-adapted to that subtask! There are many things that could go wrong in this story, but my hypothesis is that they don't need to go wrong, and at least in some cases we can influence a large model's representation of a concept we care about using this approach.

Unfortunately finding circuits is hard, so this is an experiment designed to avoid doing the hard thing if it's unnecessary. Say I train the smaller model to do the task of the larger model, but with some easy-to-compute thing linearly encoded in its representation space somewhere. If I embed that model and train without the linear encoding constraint, then if this approach can work, I should expect some amount of linear encoding of that thing to persist in the residual stream at that point. If this doesn't happen, then either the large model completely ignored the smaller model or it repurposed the smaller model's circuits for an entirely different task, and either way we can't hope for any interpretability gains. On the other hand, if there is a persistent difference in the linear encoding of the relevant thing, more work on interpretability proper is justified.


Experiment

The domain is the combinatorial game Domineering[4] on a 16×16 board. I'm using Domineering for three reasons: one, I already had a fast implementation lying around, so I saved myself some work. Two, the game isn't that complicated and I wanted to write this up on a relatively short timeframe so I can include it on my applications for summer research programs. (I had initially planned to do this+other AI interpretability stuff over the summer on my own, but decided recently that I'd get better faster, and probably produce better work, if I applied to things.) Three, it was easy to think of an auxiliary task which is plausibly useful, easy to compute, and seems to promote particular ways of structuring the representation which we might have some hope at detecting.

The Auxiliary Task

We divide the board into a 4×4 grid of 4×4 sectors. For each sector, the auxiliary target is the difference between the number of legal vertical moves and the number of legal horizontal moves in that sector (where a move is "in a sector" if the top-left square it covers is in that sector). The small network is trained to predict these sector values alongside the main value and policy objectives. The large network is not trained on this task - we only probe it to see whether the representation persists from the embedding.
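
A sketch of how those targets could be computed, assuming a 16×16 boolean occupancy array (True = occupied); this is illustrative rather than the project's actual code:

```python
import numpy as np

def sector_targets(board: np.ndarray) -> np.ndarray:
    """For each 4x4 sector: legal vertical moves minus legal horizontal moves
    whose top-left square lies in that sector. board: (16, 16) bool, True = occupied."""
    targets = np.zeros((4, 4), dtype=np.int32)
    for r in range(16):
        for c in range(16):
            if board[r, c]:
                continue
            sr, sc = r // 4, c // 4
            if r + 1 < 16 and not board[r + 1, c]:    # vertical move with top square at (r, c)
                targets[sr, sc] += 1
            if c + 1 < 16 and not board[r, c + 1]:    # horizontal move with left square at (r, c)
                targets[sr, sc] -= 1
    return targets
```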

Data

Data was generated by self-play from a weak model, trained to predict the value of a given position, with 1-ply lookahead as the search. I bootstrapped this model with some randomly-generated games. This is not a particularly high-quality dataset, it was just what I could generate for the board size I wanted with the amount of time and compute I was willing to dedicate to this project. It's possible the results would change with higher-quality data.

The Embedding

Given a trained small network and a randomly-initialized large network, we copy the small network into layers 0, 1, 2 of the large network. The tricky part is the fresh components, which consist of new heads and MLP neurons in each of those layers.

To fix this, we set the relevant output weights to 0. Specifically, for fresh attention heads we zero the output projection W_O, and for fresh MLP neurons we zero the corresponding columns of the MLP output matrix W_out. The input weights (W_Q, W_K, W_V, and the MLP input matrix W_in) stay random.

Why does this work? The residual stream through the embedded layers is now exactly the same as in the small network - the fresh components contribute nothing. LayerNorm sees the statistics it was trained on. The copied circuits receive the inputs they expect. But gradients still flow through the zeros, so the fresh components can wake up and learn during training.

It's plausible that there are ways to make this work even without zeroing the output matrices, but this would disrupt lots of circuits. It's also plausible that we could embed somewhere other than at the front of the model, but this would mess with learned embeddings, so I just did the thing that I knew wouldn't cause extra problems. Among things I thought of and had confidence in, this was the minimal set of changes to the big network's initialization.
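
A sketch of the copy-and-zero initialization, assuming both models share d_model and differ only in head count and MLP width in the embedded layers. Attribute names and weight layouts here are illustrative, not the project's actual code.

```python
import torch

@torch.no_grad()
def embed_small_into_large(small, large, n_layers=3):
    large.token_embed.weight.copy_(small.token_embed.weight)
    for i in range(n_layers):
        s, l = small.layers[i], large.layers[i]
        h, m = s.attn.n_heads, s.mlp.d_mlp          # size of the copied block

        # Copy the small model's heads and neurons into the first slots
        # (LayerNorm parameters would be copied too; omitted for brevity).
        for name in ("W_Q", "W_K", "W_V", "W_O"):   # per-head weights, heads on dim 0
            getattr(l.attn, name)[:h].copy_(getattr(s.attn, name))
        l.mlp.W_in[:m].copy_(s.mlp.W_in)            # neurons on dim 0 in this layout
        l.mlp.W_out[:, :m].copy_(s.mlp.W_out)       # neurons on dim 1 in this layout

        # Silence the fresh components' writes so the residual stream initially
        # matches the small network exactly; their input weights stay random,
        # so gradients still flow and they can wake up during training.
        l.attn.W_O[h:].zero_()
        l.mlp.W_out[:, m:].zero_()
    return large
```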

What We're Testing

We train 5 model types across 3 random seeds:

  • Small aux: trained with sector loss
  • Small noaux: trained without sector loss
  • Large baseline: random init, no embedding
  • Large embed(aux): Small+aux embedded into large network
  • Large embed(noaux): Small-noaux embedded into large network

Large models are never trained with the sector loss. We measure validation loss curves and probe accuracy (R^2 of a ridge probe predicting sector targets from CLS activations at each layer).
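
The probing step itself is a standard ridge regression; a sketch (the alpha value is a placeholder, matching the note later in the post that the regularization choice did not seem to matter):

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def sector_probe_r2(acts_train, y_train, acts_test, y_test, alpha=1.0):
    """Ridge probe from CLS activations at one layer to the 16 sector targets,
    scored by R^2 on held-out positions. acts: (n, d_model); y: (n, 16)."""
    probe = Ridge(alpha=alpha)
    probe.fit(acts_train, y_train)
    return r2_score(y_test, probe.predict(acts_test))
```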

The key question: at layer 2 (the last embedded layer), does the sector representation persist in Large+embed(aux) even without direct supervision? My guess is that the network should route computation through the inherited circuits, and so the learned representation should have some sort of compatibility with the sector representation. This does not mean that the model will actually use the sector representation as-is, and I don't think we have reason to expect a causal difference along these lines.

Code

Code can be found at https://github.com/speck2993/domineering_embedding_project.


Results

Above you can see the average loss curves of the three different large model types as a function of the number of training steps (batch size for the large models here was 200 and dataset size was around 3.2 million positions from around 50 thousand games.) The loss curve figure shows both the training loss and the validation loss (estimated throughout training by quick sampling the validation set - the Xs on the validation chart show the results of full validation-set runs.) The loss curves are fairly regular. The loss curve for the large baseline model is consistently higher than the other two curves, which have roughly equal loss at most steps. Transparent lines mark the results from individual seeds and opaque lines mark the average.
Loss curves on training data and seed-matched quick samples of the validation data. On the validation chart, Xs mark loss values computed from the full validation set.
R^2 values for the various model types at various stages in training. The embedded aux model consistently has higher R^2 values and the embedded no-aux model consistently has lower R^2 values relative to the baseline model. Transparent lines mark the results from individual seeds and opaque lines mark the average.
R^2 values for a ridge probe at layer 2 trained to extract the sector difference. The transparent lines show values from individual training runs, while opaque lines show the average.

I was careful about data leakage, so the games in the training set and the games in the test set are completely different, with each game getting a random opening to prevent resampling issues. It looks like the model generalizes fairly well. I was also careful about quick sampling, so models from the same seed were tested on the same positions at the same point in training. The probe here is a ridge probe with a fixed regularization strength; that choice was not optimized but does not seem to matter.

What can we see from these results?

The first chart tells us that embedding a trained subnetwork makes the large network better faster. This shouldn't be too surprising - one good proxy for model strength is the FLOP count used to train it, and models with an embedded submodule just have more computation baked into them, so unless this method of embedding is extraordinarily wasteful, this is predictable.

The second chart shows pretty consistent order effects: the embedded aux model explains more of the variance in sector labels at layer 2 and the embedded no-aux model explains less, compared to the baseline model. This makes sense under our hypothesis: even at loss-equivalent (and even compute-equivalent) points in training, the representation used by the embedded model is noticeably more compatible with the auxiliary task! On the other hand, the gap shrinks throughout training and the R^2 values are low - I ran ridge regressions on the models after the full training run and found that, on average, the baseline models explain around 28% of the sector count variance at layer 2 while the embedded auxiliary models explain around 33%. That is to say, neither model learns a representation that's strongly compatible with the task, even though the embedded model's representation necessarily was at initialization.

Did we actually induce fundamentally different representations, or is the gap just leftover from initialization inertia? That is, should we expect the gap in R^2 values at this layer to decay to 0? Well . . .

On the left, a graph of the R^2 differences between the embedded aux model and the baseline model, fit to a power law plus offset. On the right, a distribution of values for the asymptotic gap between the R^2 values according to a bootstrap approach. The point estimate for the asymptotic gap is 0.0098. The 95% confidence interval for the asymptotic gap is very wide. The distribution is left-skewed. This model is not confident that there's a persistent advantage.
A power law fits the decay fine, performs well on the first half of the data, and doesn't predict a persistent gap. But its distribution of guesses for the true gap value is really weird - centered at 0, but containing values as low as -0.2 in its 95% confidence interval? Power law + offset is a tricky model to fit because there's significant parameter interference.
On the left, a graph of the R^2 differences between the embedded aux model and the baseline model, fit to an exponential plus offset. On the right, a distribution of values for the asymptotic gap between the R^2 values according to a bootstrap approach. The point estimate for the asymptotic gap is 0.0437. The 95% confidence interval for the asymptotic gap is quite narrow. The distribution is normal. This model is confident that there's a persistent advantage.
An exponential also fits the decay fine, performs well on the second half of the data, and predicts a persistent gap. But isn't it well-known that, on basically any decay problem, an exponential will predict that progress stops where data stops? To me this fit looks better, and the errors technically confirm this, but it's close.
Graphs showing the predictive error and mean absolute error of various model types at predicting the gap based on a limited prefix of the data. The power law and sum of power law models do well with a small fraction of the data, but do poorly with a larger fraction. The exponential model does well with more data, but poorly with less. The sum of exponentials model does poorly always.
Power law models are better at predicting the data based on the first 20% of training steps, exponentials are better at predicting it based on the first 60%. The crossover point is roughly a 50% data prefix. Note that the data are just noisier in the last few steps, especially in relative terms, so a low average error on the last 40% of data is arguably more impressive than a low average error on the last 60%, since the former doesn't benefit from predicting the "easiest" datapoints.

This question is hard to answer robustly. The data are inherently noisy and different plausible models give different predictions about long-term behavior (most relevantly, power law+offset and exponential+offset disagree about whether the offset is different from 0.) I tried lots of things to fix this but ultimately could not convince myself that I had a robust way of estimating the gap after more training - the plots above reflect my confusion. My guess is that the gap will not train away and will settle somewhat north of 0.04 with my data and training scheme, which is what the bootstrapping scheme I came up with predicts while modeling the gap as a single exponential with an offset, but this should only be taken as a guess. If this doesn't happen my expectation is that the gap will decay to nothing, making this result much less interesting. I would be surprised to see an in-between result.
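For concreteness, a sketch of the fit-plus-bootstrap procedure as I understand it (scipy-based; the initial guesses, bootstrap count, and variable names like `steps` and `r2_gap` are assumptions rather than the repo's actual code):

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law_offset(t, a, b, c):
    return a * np.power(t, -b) + c   # gap ~ a * t^-b + c

def exp_offset(t, a, b, c):
    return a * np.exp(-b * t) + c    # gap ~ a * exp(-b*t) + c

def bootstrap_asymptote(t, gap, model, n_boot=2000, seed=0):
    """Bootstrap the fitted offset c (the asymptotic R^2 gap) by resampling
    (step, gap) pairs with replacement and refitting the chosen model."""
    rng = np.random.default_rng(seed)
    offsets = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(t), len(t))
        try:
            popt, _ = curve_fit(model, t[idx], gap[idx],
                                p0=(gap[0], 1e-3, 0.0), maxfev=10_000)
            offsets.append(popt[-1])
        except RuntimeError:
            continue  # skip resamples where the fit fails to converge
    return np.array(offsets)

# offsets_pl  = bootstrap_asymptote(steps, r2_gap, power_law_offset)
# offsets_exp = bootstrap_asymptote(steps, r2_gap, exp_offset)
# np.percentile(offsets_exp, [2.5, 50, 97.5])  # point estimate + 95% CI
```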


Remaining Questions

  • Does the representation gap actually persist? The most straightforward way to test this is to just throw more compute at the problem, and I plan to do this at some point.
  • What's the causal relationship here? Phrased another way, what representations did the models actually learn and why is one more compatible with the sector task than the other (while still not being especially compatible)? Similarly, can we track what happened to previously-identified circuits from the small model?
  • How do approaches like this behave with different auxiliary concepts? My guess would be that highly instrumental concepts exhibit bigger and more persistent gaps, and moreover, that we get better improvements on the loss value when the concept is more useful, although this second effect is probably subtle.
  • Does this work on language models? There's a lot of work already on finding primitive concepts in language models, so maybe it's easier to choose a particularly "good" auxiliary target in that domain.
  • How does this scale? Lottery ticket intuitions say that as scale increases and the task gets harder, the small model should make a noticeable difference even as it takes up smaller and smaller fractions of the parameter space.
  • How does embedding depth matter? If the auxiliary task is useful but it naturally lives deeper in the optimal computation, then embedding the small model in the later layers of the large model might perform better than embedding it right at the beginning.
  • How much of the smaller model do we actually need to embed? If it had six layers, could we embed the middle four? I'm thinking of Paul Bach-y-Rita's famous work on neuroplasticity,[5] which I interpret as suggesting that certain computational structures (in his case the visual cortex) are especially well-suited to processing certain kinds of data (in his case 3D information), even when filtered through different modalities (in his case tactile vs. visual perception).

--

  1. Net2Net: Accelerating Learning via Knowledge Transfer - Chen, Goodfellow, Shlens (ICLR 2016) ↩︎

  2. The Bitter Lesson - Rich Sutton (2019) ↩︎

  3. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks - Frankle, Carbin (2018) ↩︎

  4. Domineering - Wikipedia article on the game ↩︎

  5. Vision substitution by tactile image projection - Bach-y-Rita et al. (1969) ↩︎



Discuss

When does competition lead to recognisable values?

2026-01-13 07:13:18

Published on January 12, 2026 11:13 PM GMT

Transcript of Beren Millidge's Keynote at The Post-AGI Workshop, San Diego, December 2025



The question is how human values might survive in a very multifarious AI world where there are lots of AIs competing. This is the kind of Moloch world that Scott Alexander talks about. And then I realized that to talk about this, I've got to talk about a whole lot of other things as well—hence the many other musings here. So this is probably going to be quite a fast and somewhat dense talk. Let's get started. It should be fun.

Two Visions of AI Futures

The way I think about AI futures kind of breaks down into two buckets. I call them AI monotheism and AI polytheism.

AI Monotheism vs AI Polytheism diagram

AI Monotheism

The standard LessWrong/Yudkowsky-style story is: we develop an AI, it does recursive self-improvement, it becomes vastly more intelligent and smarter than all the other AIs, and then it gets all the power in the universe. It eats the light cone, and then what we do to align it really matters.

If we align it successfully, we basically create God. God is already aligned to humans, everyone lives a wonderful life, happily ever after. On the other hand, if we fail at alignment, we create some AI with values that totally differ from anything we care about—aka paper clips. We basically create Clippy. Clippy kills everyone, turns everyone into paper clips because your atoms are better spent as paper clips than as you. And that's obviously bad, right?

In this world, alignment becomes absurdly important. It's kind of the only thing that matters.

AI Polytheism

So the question is: are there any other scenarios? The other one I think is really what I call AI polytheism—what happens if we don't get recursive self-improvement and we end up with many AI systems competing in some sort of equilibrium, maybe economically, maybe militarily? What does this world look like if we have, say, trillions of AIs?

Some people have written about this—Robin Hanson has written Age of Em, Scott has written various things about this—but I think this is still fairly underexplored. With monotheism, we kind of know what's up. We need to solve alignment, we get the singleton, we kind of know what's going on. With the many-AI scenario, we kind of have no real clue what's going on. So I really want to explore what this looks like in practice.

Meditations on Moloch

Some of the early work I very much like is Scott Alexander's post "Meditations on Moloch." This is really one of the foundational works, at least for me, in thinking about what multi-agent systems look like, what the dynamics and long-run equilibria look like.

Scott is really worried about competition among many agents. You've heard talks earlier today about what economies of AI look like—maybe they just don't care about humans at all. Scott's point is basically that we have AIs, these AIs can replicate incredibly quickly, AIs are very good at spreading and expanding resources. So we might end up in extremely strong Malthusian competition for AIs.

The worry here is that under conditions of Malthusianism, we basically lose all of our values. Our values are assumed to not be memetically fit in some sense, so they get competed away. They're not fitness-maximizing, so all the AIs basically ignore whatever alignment we gave them at the start. That gets competed away and they just become identical fitness/power/resource/reproduction maximizers. We assume there's no value left in this world. This is definitely the bad ending of AI polytheism.

Does Malthusianism Really Destroy All Values?

One question I have immediately is: is this actually the case? Do we actually see this in real-world Malthusianism?

The Natural World as Evidence

Let me think about where we find real-world Malthusianism. One example is at the very small scale—bacteria and plankton. Both of these things live in worlds of incredible Malthusianism already.

Malthusianism in nature diagram

Think about plankton. They live in the ocean, they take sunlight, they photosynthesize. There's really no niches—the ocean is mostly the same. Under the Moloch view, obviously all values would get competed away, everything would become a fitness maximizer. And it kind of is—I mean, we can't really expect plankton to have values—but there's a real worry about lack of complexity. Do we end up in a world where everything is the same, we end up with the uber-plankton that kills all the other plankton and all the plankton are identical?

The answer to this is very clearly no. What we see in the natural world under conditions of Malthusianism is huge amounts of diversity and complexity being built up through selection.

Why Not Uber-Organisms?

There are many reasons for this. Why do we not get just the uber-animal that kills all the other animals and spreads everywhere?

  • Diminishing marginal returns. This is a very classic feature of the universe. This is one of the reasons we're likely to get AI polytheism to begin with—RSI requires linear or super-linear returns to intelligence. Most returns in the real world seem diminishing, so that seems unlikely.
  • Finite energy budgets. Often there's some finite energy budget for a specific being. If you have energy to give to something, you have to take it away from something else. This naturally encourages specialization. We can't just max out all stats at the same time.
  • Niche construction. If we have some species, the mere presence of that species will create niches for other species to come in. This automatically generates some kind of equilibrium of diversity.

Frequency-Dependent Selection

The technical term for this is really frequency-dependent selection. What this means in evolutionary theory is: if we have some species that does super well, its numbers expand, then basically all the other species are incentivized to evolve toward countering that species. They specialize in countering that species, which diminishes the advantage that species has over everything else, which makes that species worse off. Then other species with random uncorrelated strategies do better, and this basically pushes toward an equilibrium state in which there are many different species all interacting, all with different strengths and weaknesses. This is in practice what we see in almost all biological ecosystems.

You can think of frequency-dependent selection kind of as the continuum limit of coalition politics, right? If some guy is taking over, you all band together to beat him. That's the continuum limit of this.
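A toy illustration of this dynamic, not from the talk itself: a minimal replicator-dynamics simulation in which every strategy does worst against whatever is currently common. A near-dominant strategy loses its edge and the population settles into a stable mix rather than a single winner.

```python
import numpy as np

def replicator_step(freqs, payoff, dt=0.01):
    """One replicator-dynamics step: strategies whose fitness exceeds the
    population mean grow in frequency, others shrink."""
    fitness = payoff @ freqs
    mean_fit = freqs @ fitness
    return freqs + dt * freqs * (fitness - mean_fit)

# Negative frequency dependence: each strategy scores worst against itself
# (i.e. against whoever is currently common), best against the others.
payoff = np.array([[0.0, 1.0, 1.0],
                   [1.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])

freqs = np.array([0.90, 0.05, 0.05])  # start with one near-dominant strategy
for _ in range(5000):
    freqs = replicator_step(freqs, payoff)
print(freqs)  # converges toward an even mix (~1/3 each), not a single winner
```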

The Nature of Human Values

So obviously we've talked about plankton. Plankton are fine, but they don't really have values presumably. So we've got to think about what human values are going to look like.

Values Aren't Arbitrary

My thinking here is really that we talk a lot about human values, and in the LessWrong sphere we think of human values as effectively some kind of arbitrary, ineffable thing—some set of bits we specify. Where do these come from? We don't really know. I think this view is not necessarily that great, honestly.

I think human values have very obvious and straightforward places they come from. They evolved via some specific mechanisms. This mechanism is basically the Malthusian competition that created all complexity of life in the world. Humans, obviously along with all other species, evolved from stringent Malthusian competition.

If Malthusian competition was enough to evolve creatures like us, then somewhere the Moloch model is wrong. Our values and capabilities are, similarly, the result of strong selection.

The Role of Slack

The original blog post thinks a lot about slack. It says that if you have slack, you can kind of go off the optimal solution and do whatever you want. But in practice, what we see is that slack, when it occurs, produces this kind of drift. It's basically the universe fulfilling its naturally entropic nature, in that most ways to go away from the optimum are bad. If we randomly drift, we just basically tend to lose fitness and produce really strange things which are not even really what we value.

Pro-Social Values Emerge from Competition

When we think about human values, we think a lot about pro-social values—how we cooperate with each other, we're kind to each other, we don't immediately try to kill each other. We think about kindness, love, all of this stuff, right?

Very clearly, this is basically designed and evolved to create inter-human cooperation. Why does this happen? Competition naturally creates cooperation. Cooperation is a really strong competitive strategy. If you have people fighting each other and then a bunch of people form a group, that group becomes extremely powerful relative to all the individuals. This is the fundamental mechanism by which a lot of these values actually evolve.

Defection and Cooperation

The other part of the Moloch story is related to defection. The idea is that under strong profit selection, companies will cause externalities, they won't pay their workers anything, they'll pollute everything, right?

Clearly, defection is always a problem. But for any cooperative arrangement to be stable, it needs to evolve mechanisms to handle and punish defection. A lot of our values are actually about how we stop defection from happening. Again, all of this comes through competitive selection. None of this is random drift caused by slack. This is all—if you cooperate, it's positive-sum, it's better. So you need to evolve mechanisms to maintain cooperation, and a lot of our values come from these mechanisms.

How "Human" Are Human Values?

A question I like to ask is: people talk a lot about aligning AI to human values, and it's kind of assumed that human values are specific, unique, ineffable to humans somehow. But my question really is—how human are human values in practice? This obviously has a lot of relevance to how broad the basin of attraction is toward things we would recognize as human values.

Universal Drives

I would claim that many mammals and animals obviously possess analogues of core human drives:

  • Affection, friendship, love — If you have pets, if you interact with animals at all, you can see they have many of these fundamental drives. These have very clear competitive reasons for existing. This is all cooperation, reciprocity. You're better at surviving and reproducing if you're friends with other beings who can help you in cases where you're in trouble and you help them when they're in trouble.
  • Play, curiosity — These are very simple exploration drives. If we're RL learners, we've got to explore. We've got to figure out good ways to explore. These drives drive us to go out of our comfort zone, learn new things, and keep the gradient of optimization going.
  • Anger, envy — These are mechanisms to punish defection. If we see someone clearly ripping off the social contract, we get annoyed about this and then we actually punish it. This is fundamental for our ability to actually stop defection and maintain cooperation over a long period of time. Similarly with envy—envy gets a bad rep, but it's really important for cooperation to exist. There can't be massive power disparities between agents because otherwise, if one agent is way more powerful than anybody else, they can just be like, "I do what I want, you guys have to deal with it." And this is obviously bad for all the other agents.

All of these are ultimately the generators of our values.

Cooperation Is Not Unique to Humans

Cooperation in general has existed many times, evolved independently. This is not some super-special snowflake thing that humans have. Maybe we should expect in a world with many different AIs, we actually end up with similar cooperation, similar complex structures evolving, including maybe similar values.

Abstract Values and Culture

So then the question is: we think about these drives, and they're kind of not really how we think of values. What do we think of as values? We think of them as more linguistic, abstract constructs. We think of things like kindness, charity, duty, honor, justice, piety—all of these things. Human civilizations have been built around spreading, propagating, defining these values.

Where do these come from? Obviously, they're ways for societies as a whole to enforce and encourage cooperation so that positive-sum trade, reproduction, everything can happen. This is actually good from a pure competitive nature.

The whole point is: we have these drives, and then we create these superstructures of culture and society. These values get propagated by that, and these are the things we often think of when we think about the human values we want to instill in AIs.

Similarly, we can think about stuff like liberalism, democracy. These are social technologies that have existed for very obvious reasons—enabling large groups of people to come together in positive-sum ways and not spend all their time trying to fight each other. Liberalism is like: you guys can think about different things, you can believe different things, but if you come together and ignore that for a bit, you can work and create positive outcomes for everybody.

These are very specific, general principles which are not necessarily specific to humans. We should probably expect any society of AIs to also have a similar approach and maybe invent the same things, like convergent evolution.

How Values Emerge: RL + Unsupervised Learning

This is going to be a slight digression, but this is my opinion on where human values come from. In economics and the like, we think values and preferences are some exogenous thing. We assume agents have preferences. Why do agents have preferences? We have no idea. We just kind of assume they exist.

But in practice, preferences have to come from somewhere. They come from agents which have learning algorithms. We learn a lot of our preferences. The way we do this is we have two mechanisms going on at the same time:

  1. We're fundamentally reinforcement learners. We have innate drives—not to be hungry, not to be in pain. All of this stuff is created by evolution.
  2. We also do a vast amount of unsupervised learning as well. All the data that comes into us from culture, from society—in terms of pure bits, obviously unsupervised learning is going to win dramatically over the RL signals we actually get, which are pretty sparse.

The way values kind of emerge is that we get cooperation happening. Cooperation evolves for very clear reasons. Then we actually need to evolve mechanisms to maintain, keep, put forward, distill these values and propagate them to other agents because everyone is born without knowing about these values. We have to propagate them, make them learnable successfully, and then keep that going.

Then each generation essentially further distills, rationalizes, and intellectualizes these values until we get very abstract concepts like utilitarianism, Kantianism. These have emerged—they're taught to people. They're not innate reward functions that people have. They are very linguistic, abstract concepts that we've developed as a society to enable further cooperation.

Why This Matters for Alignment

This is actually super important for alignment because when we think about alignment—LLMs are extremely good at understanding these values because these values must exist in the cultural corpuses that we create. In fact, they do exist. Obviously, LLMs really understand what's going on. We should expect the AIs to have a very strong prior over what these kind of abstract global values are, and they do empirically as well.

This is actually much easier than if we were trying to align the AIs to some kind of innate reward function that humans supposedly have. Then we would have to look at the neuroscience of how the basal ganglia, how the dopamine system works, and figure that out. But in practice, when we think about aligning AI, we mostly don't want to do that. We mostly care about global, feel-good, cooperative values rather than the kind of selfish reasons that people actually do things a lot of the time.

Conditions for Value Evolution

So we've thought about these values. This is my claim of where values come from and why they might exist in a post-AGI world. But then we've got to think about: if these cooperative values are going to evolve, they evolve under certain conditions. They don't globally evolve everywhere. What are these conditions?

This is really related to how the game theory of multi-agent cooperation works.

Conditions for Human Values

  • Roughly equal power. Many agents have roughly equal power. This makes coalitions actually work versus individuals—versus one dictator just saying, "This is the way it is for everybody." This is super important. Obviously, the singleton destroys this assumption, which is why alignment is so important for the singleton—there's no checks and balances on the singleton. However, if there are many different agents, they can actually learn to cooperate, they can learn to police defectors, and this will produce values similar to humans.
  • Positive-sum interactions. Trade is good. Positive-sum interactions can happen. This depends a lot on the utility functions of different people. If you have two agents with completely opposed utility functions, then everything is either zero-sum or negative-sum. But this is not how most interactions in the world work. If this changes, then obviously cooperation will no longer be valuable.
  • Prevention of defection and deception. A lot of human values that we think about are about preventing defection and deception. Obviously, if we somehow end up in a world in which defection and deception are not possible, then in some sense that's utopia. But then a lot of what we think of as human values will actually disappear as well because you won't need that anymore to maintain stability of cooperation.
  • Memory and reputation. Agents need to remember interactions with previous agents. There needs to be reputation. This is just a classic result of game theory. If your prisoner's dilemma is one-shot, you never interact again, you should just always defect. However, if you have an iterated prisoner's dilemma where you see the same agents again and again, then cooperation becomes actually very valuable. Cooperation becomes the best strategy. The optimal strategy in this case is forgiving tit-for-tat: you start out cooperating; if they defect, you then defect; but if they cooperate, you then keep cooperating with them (see the sketch after this list). This is actually what produces the best overall value. To get this kind of iteration, reputation, cooperation, we need multiple interactions. It can't just be a one-shot thing.
  • Communication bandwidth. To some extent, we also need decent bandwidth communication between agents. Communication is how we achieve a lot of diplomacy, a lot of cooperation. Without communication, any kind of large-scale cooperation and values is hard.
  • Computational limitations. Finally, we can't have computational omniscience. Right now, values are really some kind of distilled heuristic of the game theory underlying cooperation. But if you don't need to heuristic-ize, if you can just be like, "I'm going to figure out the galaxy-brain plan of exactly when to cooperate and when to defect," then at this point there's no real values anymore. It's just your extreme MCTS rollouts.

    But in practice, people computationally can't afford to do this. Hence we need to heuristic-ize general decisions—"thou shalt not steal," "thou shalt not kill." These are heuristic distillations of basically the game theory of: if you actually steal and kill, this will be bad because other people will kill you. But in some cases this might not happen, and if you can figure that out, then you don't really need values as much.
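A minimal sketch of that iterated prisoner's dilemma point, not from the talk itself; the payoff values and the 10% forgiveness rate are arbitrary choices.

```python
import random

# Payoffs (me, them) for one prisoner's dilemma round: C = cooperate, D = defect.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def forgiving_tit_for_tat(history, forgive_p=0.1):
    """Start by cooperating; copy the opponent's last move, but occasionally
    forgive a defection to escape mutual-retaliation spirals."""
    if not history:
        return "C"
    their_last = history[-1][1]
    if their_last == "D" and random.random() < forgive_p:
        return "C"
    return their_last

def always_defect(history):
    return "D"

def play(strategy_a, strategy_b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strategy_a(hist_a), strategy_b(hist_b)
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append((a, b)); hist_b.append((b, a))
    return score_a, score_b

# Two forgiving tit-for-tat players lock into mutual cooperation (3 points/round each);
# against always-defect it mostly retaliates, conceding only the occasional forgiven round.
print(play(forgiving_tit_for_tat, forgiving_tit_for_tat))
print(play(forgiving_tit_for_tat, always_defect))
```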

Will AIs Meet These Conditions?

The question is: will AIs in the polytheistic AI future actually satisfy these conditions?

Potential Issues

Power gaps. Maybe the power and capability gaps between agents become super large as we tend toward the singleton. In this case, cooperation becomes less valuable if you're the most powerful agent. However, there's a big gap between "more powerful than anybody" and "more powerful than everybody." This is really where the actual realm of cooperation and coalition politics actually emerges and will become super interesting.

Perfect monitoring. One thing I was randomly thinking of on the plane which was super interesting is: maybe it's actually really hard for deception and defection to work among AIs. Maybe monitoring of AI brains is just amazing because we can directly read their minds, we can read their embeddings, and we can have serious monitoring schemes—AIs can monitor other AIs. In this case, we actually end up with a hyper-cooperative world but one where we don't have to worry about defection really at all. In this case, a lot of our human values kind of disappear, although maybe this is good.

Fluid agency. Similarly, AIs can, unlike humans—we assume agents with preferences—but if agents become fluid, if we can merge together agents, we can be like, "Hey, instead of cooperating and trading, we could just merge and then our joint utility functions can go out and do something." Then obviously this is going to change the game theory a lot. All of the assumptions of economics and agents kind of disappear if "agent" is no longer an absolute point but a fluid spectrum. That's going to be super interesting.

Long time horizons. AIs are immortal, they have long time horizons. AIs could pursue very zero-sum goals with each other. Humans have a lot of different goals, we have lots of preferences. But if your AI is monomaniacally focused on paper clips and another AI is monomaniacally focused on staplers, there's much less opportunity for trade than there would be with humans who care about many different things at many different times.

Computational power. I talked a lot about computational power and heuristic-ization. Maybe the AIs are just smart enough to do the galaxy-brain game theory all the time, and so they never need to actually distill into broad heuristic values which say, "Never do this, never do that." In that case, there will still be cooperation. There will be a lot of things recognizable as civilization in some sense, but the AIs won't have values in the same way moral philosophers think of values. Instead, it will just be the endless calculation of when is the optimal time to defect—and maybe this will be never. That will be certainly very interesting to see.

Hyper-competitor vs Hyper-cooperator spectrum

Hyper-Competitors or Hyper-Cooperators?

So that's the main part of my talk relating to values. Now I'm going to get into more fun and speculative stuff.

One thing I want to think about a lot with AI is: do we think of AIs as hyper-competitors or hyper-cooperators?

The Hyper-Competitor View

Most of the AI literature has really thought about the hyper-competitor view. We have the Terminator—it's been ages since I watched the Terminator films, but the Terminator wants to kill everybody for some reason. I can't remember why Skynet wants to kill everybody, but presumably it's so it can use our atoms for other Skynet things. This is extremely competitive, competing against the rest of the universe.

The Hyper-Cooperator View

However, is this actually going to happen? Maybe AIs have more incentives in some sense toward cooperation, at least if we start in a multi-agent setting. This could end up being something like the Borg from Star Trek, who—their goal is not to wipe out and kill everybody and use their atoms for paper clips. The goal is to assimilate and bring together everybody into some kind of joint consciousness.

Is this something that AIs might be interested in? This is an underexplored area and I think is somewhat fun.

Why AI Cooperation Could Be Superior

So let me think about this more directly. My views on AI have evolved a lot toward: maybe let's think about how AIs could cooperate. Then we realize that AI cooperation is actually super easy and much more powerful potentially than human cooperation. If cooperation continues to be positive-sum, we might end up with a world with vastly more cooperation than we do today.

The reasons this could happen:

  • Vastly higher bandwidth communication. When we speak to other humans, all of our language goes through some incredible information bottleneck. With AIs, we can just directly transfer mind states. We can say, "Here's the embedding in my model," transfer this to the embedding of another model. This is basically full-on telepathy. AIs will have this capability by default. This presumably lets a lot better cooperation arise than humans who have to sit and talk to each other all day. This is going to presumably be a lot faster and more efficient.
  • Longer time horizons and better memories. AI probably have longer time horizons than humans and better memories. A lot of defection exists because people just forget—maybe you were bad and antisocial 60 years ago, but I forgot about it, doesn't matter to me. However, with AI, this could easily not be the case. You might end up in a hyper-social world where all the AIs can track the behavior of all other AIs all the time, and so the incentives for actual cooperation just become super big. Similarly, over long time horizons, this just increases the length of the game that you're playing. As your game length goes to infinity, cooperation becomes more valuable. There's no "Oh, it's getting to the end of the game, so let me just defect all the time," which happens in prisoner's dilemma with a fixed time cutoff.
  • Better monitoring. AI can achieve better monitoring. It's really hard for humans to monitor other humans. If someone is lying to you or trying to deceive you in some way, you can look at their behavior, you can look at their facial expressions, but the bandwidth of this channel is super low. For AI, they can look at the source code, they could look at the embeddings, they can read all the thoughts as they come. This could maybe make deception and all this stuff essentially impossible. I mean, this is what the field of interpretability is trying to do for humans to AI, but if AI can do this to other AI, then we have the grounds for deeper cooperation than we might otherwise have.
  • Shared utility functions and merging. Similarly, AIs can share utility functions. They can merge. They can do a lot of things that eliminate the distinctions of individual agents that we think about a lot when we think about game theory and economics. All of these fields have an assumption that there are agents and agents are indivisible in some sense. But if agents can change, if agency is fluid, if personhood is fluid, then a lot of stuff changes. This is very likely to happen at least with AIs, in that we can merge models, we can take the checkpoints, we can merge the weights, we can do ensembles, we can do a whole lot of weird stuff to AIs that we can't do with humans. This is potentially going to be super interesting.
  • Competition creates cooperation. Finally, a large message of this talk is that even if you're some super-selfish agent who only cares about reproduction, cooperation is still good. Competition creates cooperation because cooperation is usually positive-sum and results in better outcomes for everybody. AIs might just realize this more than humans do. Humans have a lot of issues. We're kind of short-sighted. We fight very negative-sum wars all the time. For AI, if they're just generically smarter and better and wiser, which we should expect, then maybe they just don't do this. Maybe they figure out ways to basically solve their problems cooperatively much better than humans can.

The Multicellular Transition

So what does this lead to in the limit? This is where things get super interesting.

Why Empires Don't Grow Forever

Right now for humans, when we think of states or empires, what limits the size of beings? At the object level, the returns to scale are positive-sum. If you're an empire, you send out some troops, you conquer some land, and that land will give you resources, which will give you more troops, you can conquer more land. This will be a positive feedback loop into creating the world empire.

So why don't we have the world empire? Why are not the ancient Egyptians or Sumerians the one world government forever? Why does this not happen?

This is basically because coordination costs exist. If you're the pharaoh of ancient Egypt, you send out some troops to go conquer some land, but you can't go do that yourself. You have to appoint a general. That general has a bunch of troops. That general might be like, "Maybe I should be the pharaoh instead." Assuming that doesn't happen, you've got to appoint bureaucrats to manage that. The bureaucrats might be like, "Instead of paying my taxes to the pharaoh, maybe I should just keep the taxes for myself."

This is the principal-agent problem. There's a whole bunch of principal-agent problems, coordination problems, information bottlenecks—all of this makes actually managing and creating large empires super difficult. In practice, this is the real constraint on the growth of individual beings in some sense, when we think of beings as minds or beings as super-states.

Removing Coordination Costs

This is kind of the real constraint on everything. But with AI, if we're super-cooperative, this just removes this constraint entirely. Instead of you being the pharaoh having to dispatch your general, you're an AI and you can just dispatch a copy of yourself with your exact mind, and then you can maintain constant telepathic communication with this other mind as it goes off and does its stuff.

What this means really is that maybe these coordination costs that are keeping the size of stuff in check just disappear. This will naturally result in us getting bigger-sized things occurring. Fundamentally, this means that the size of beings might just increase.

The way I think about this a lot is kind of like the similar transition that we had from single cells to multicells—the multicellular transition. At that point, we had a bunch of bacteria, and they were all doing their own bacterial things. Then at some point they realized, "Hey, maybe if we band together and form specialized subunits, we can create animals which are much bigger than actual bacteria and also much more successful in some sense."

This increased the size of possible life forms by many orders of magnitude. Maybe we will see a similar thing happen with minds, which will be super fun and kind of trippy to think about.

Super-Minds

The idea here is that instead of right now we have single minds—individual humans—we can't merge because the bandwidth between human minds is so limited. Our coordination is super bad, and we can't actually have any kind of long-run, super-dense communication. Maybe this will just disappear, and we'll be able to form super-minds which exist over long periods of space and time in the same way that we've gone from individual cells to multicellular animals. We'll go from individual minds to super-minds, just-out minds. I don't really know what to call them, but this is something that can clearly be possible with the technology that AI presents. This is going to be interesting and fun.

Is This Just Recreating the Singleton?

The question then is: what happens here? Are we just recreating the singleton? Suppose we have the super-mind. Obviously, at some point there will be the possibility of snowballing. Maybe the game theory becomes: it's better to join the super-mind in some sense than keep doing your own individual stuff. Then maybe everything converges to a singleton again.

This is very possible. Maybe we always end up at a singleton. A singleton at some point is the fixed point. Once we have the singleton, we're not getting out of the singleton. We should expect over time more probability mass drifts into the singleton attractor.

But at the same time, maybe this doesn't happen, or maybe the singleton is very different from how we think about von Neumann singletons. For instance, maybe this super-mind might not be well-characterized by von Neumann agency. For one thing, it doesn't really care about the actual von Neumann conditions like "not being money-pumped" because it's the only mind, so there's not an equilibrium that keeps it in check.

The other thing is, to some extent, this is kind of already happening. Maybe this is just the natural evolution of things we already have. We have civilizations, we have memes, we have egregores, all of this stuff which exists at the super-mind scale. This is just maybe continuing this.

Values of the Super-Mind

The really interesting part then is: what happens when we think about what would the values of this just-out singleton actually look like if it exists?

Obviously, the regular singletons are kind of unconstrained. They can be totally idiosyncratic. We can have the regular singleton cares about paper clips because at the beginning of time someone said paper clips are good. We failed alignment, we said paper clips are good, and it cares about paper clips.

But this seems unlikely to be true of a real just-out super-mind because ultimately values come from some combination of all the values of the minds that make it up, because that's how the game theory works. If you're going to join the mind and you don't care about paper clips and it cares about paper clips, that's not going to happen. But if it can offer some kind of compelling shared value story that everybody could agree with in some sense, then we can actually get values which can snowball.

It's really a question of what values end up snowballing over time. This is going to be super interesting.

We also see this right now with liberalism. Liberalism is a classic value snowball technology. It's like, "You can kind of do whatever you want as long as you're within some sort of vague regimes of what we think of as good." This actually produces large societies which can cooperate. These societies, over the 18th and 19th century, out-competed most of the other societies.

Maybe there will be some equivalent of mind liberalism. I don't know what this is going to be called, but something like this could exist and could produce minds with values that are actually somewhat good, maybe by our lights.

Slime Mold Dynamics

The other thing is there might just be fluidity. We might never get true multicellularity. We might get the equivalent of slime molds.

If you guys don't know about slime molds, you should check them out. They're basically organisms that are somewhat single-cellular, somewhat multicellular. At some point, a bunch of cells come together, they do their reproduction, and then they all disperse again and do their own thing. That's very cool.

Maybe we'll have a similar thing where in some cases all the minds will come together, they will produce the super-mind, and then they'll be like, "Actually, I'm done with whatever, I'll go apart again and do whatever I want to do." Maybe we never actually get the tendency toward actual full multicellularity.

Extreme Specialization

On the other hand, if we do get multicellularity, then we'll end up with super-specialization way more than we have today. Individual humans have to be AGI in some sense. We have to be individual minds, we have to handle kind of everything that's thrown at us. But if we have minds that are enmeshed in other minds, then we again get the conditions for extreme specialization in the same way that bacteria are super-unspecialized. They kind of have to do everything. But the cells in your liver don't have to do most things. They just have to be your liver.

So the incentives will be much greater, and this will massively increase the mind space that can be traversed in an evolutionarily fit way, which will be kind of fun also.

Physical Limits of Super-Minds

One additional point I want to add here—I'm looking at the time—let's think about these super-minds. How big are they going to get? We can think about this already. We kind of know by the laws of physics.

Speed of Thought

The speed of thought is determined basically by the speed of light. Assume we have some Dyson sphere, and we want this Dyson sphere to think as a single mind. How big is the Dyson sphere? It's like several light-minutes across. This means that the frequency of thought is going to be like one thought every few minutes maybe. Similarly, if the mind is smaller—if it's the size of the Earth—then this is more like a fraction of a second. If the Earth was turned into computronium, we could have our Earth mind think at roughly the same speed as humans but not billions of times a second.

As minds get bigger, they become more powerful, more broad and diffuse, but their thinking speed gets slower. This is just a natural consequence of the laws of physics. If someone invents FTL, this obviously goes out the window, but assuming that doesn't happen, then we can kind of give bounds on what the size of these minds will look like, which is also kind of cool that we can do this.
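A quick back-of-the-envelope on those light-crossing times, not from the talk itself; the roughly 1 AU radius for the Dyson sphere is my assumption.

```python
C = 299_792.458          # speed of light, km/s
AU = 149_597_870.7       # astronomical unit, km
EARTH_DIAMETER = 12_742  # km

def crossing_time_s(size_km: float) -> float:
    """One-way light-crossing time across a mind of a given size: a hard
    lower bound on how fast its parts can share information even once."""
    return size_km / C

dyson_sphere = crossing_time_s(2 * AU)        # diameter of a ~1 AU-radius sphere
earth_mind   = crossing_time_s(EARTH_DIAMETER)

print(f"Dyson-sphere mind: {dyson_sphere / 60:.1f} light-minutes across")
print(f"Earth-sized mind:  {earth_mind * 1000:.0f} ms across")
# Roughly 16.6 minutes and ~43 ms respectively: a globally coherent "thought"
# for the sphere takes at least minutes, versus tens of milliseconds for Earth.
```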

Colonization and Alignment

The other thing is, suppose we're a Dyson sphere and we want to go colonize Alpha Centauri. Alpha Centauri is several light-years away. Thinking at the speed of a few years per thought is kind of bad. We presume it's going to be hard to maintain some kind of coherence at that rate.

In that case, we have to align successor entities to go out and do the conquest of Alpha Centauri for us. In this sense, how well can the AI align these other AIs is going to be the determinant of how big an AI realm can spread. Because at some point, maybe there's divergence. If you send your von Neumann probe out to a galaxy billions of light-years away, that AI is going to think—you're going to have maybe a few thoughts back and forth over many billions of years, but it mostly does its own thing. How much will it diverge in this time?

Obviously, at some point, if my von Neumann probe is going to diverge, I'm just going to be like, "I'm just not going to do that. I'm just going to let something else do that because there's no benefit to me of doing that as the AI."

Ultimately, how successful we are at alignment or how successful alignment can be in general, and the rate of this divergence if it even exists, is going to determine the size at which coherent entities with coherent values can exist. Beyond that range, we'll just get extremely diverged entities. That's also fun to think about, like how this will work.

Mind Cancer

The main mechanism I think—if we think of divergence, we're going to end up with some equivalent to mind cancer. We're trying to create a super-mind which has a bunch of minds internally which are cooperating for the common good of the mind. But then what we're going to end up with is some individuals are going to be like, "Actually, now I'm going to do my own reproduction." This is exactly how cancer works. Cancer is a fundamental issue of multicellularity.

So alignment is going to effectively be the cancer defense mechanisms of these super-minds. I don't really have a huge amount of depth. I'm just like, this is very cool and it's fun to think about all of these things really.

Implications for Alignment

So, I told you it's going to be speculative, and it's getting speculative. Let me try to bring this back together. What do we think about this for alignment? If we're humans, obviously maybe the super-mind isn't so great. What do we want to do about it? What can we do about it?

Obviously, if it's a singleton, we just got to make sure the singleton is aligned. We all agree on that. But if we have many AIs, what do we do?

I don't really have good answers here. I wish I did. These are my preliminary thoughts.

Population Statistics

One thing is: if we have AI emerging from a population, maybe just the statistics of this population are important. They probably are. We should make the statistics of this population good. We should make as many AIs as aligned as possible as we can.

Obviously, there will be some misaligned AIs. Some people will go out crazy and create paper-clippers for fun. But at the same point, if there's a whole world of non-paper-clippers, they have very strong incentives to band together and stop the paper-clipper. The coalition politics will work in our favor at this point. In general, creating alignment and creating more aligned AIs is probably good in general.

Overlapping Values

The other thing is we can achieve different degrees of alignment as long as the values of the alignment are overlapping. We think of alignment as a zero-one property—it's either aligned or it's not aligned. But in practice, people will probably align AIs to different things. People themselves have different values. We somehow manage to make it work out mostly.

Likely it will be similar with the AIs, assuming there's lots of overlap in the things that they're aligned to. Maybe the combined strength of these things will actually be sufficiently aligned in general. The intersection of all the different alignments will probably be good. We should just try in general—we can experiment a lot with different alignments as long as the intersection is somewhat decent for humans, which if humans succeed at alignment at all, it probably is.

Integrating Humans

The other thing is maybe we would just want to integrate humans into this. Right now, we have the AI doing their weird mind stuff and humans are kind of sitting on the sidelines. We can't communicate this fast. We have to talk. We have to use language. Maybe we should stop that. Maybe we should figure out ways for humans to get better integrated into this AI society.

The kind of obvious way is we've got to improve our BCI technology. We've got to figure out ways that humans can have the same affordances as AIs with respect to their minds. How can we communicate human thoughts directly? Humans have their own unsupervised learning embedding space. It's somewhat similar to AI embedding spaces because of just natural representation convergence. We can directly integrate humans with this AI mind, with this AI economy, assuming we can actually figure out how to directly interface with people's brains, which is going to happen. That's going to be super interesting.

It's not just a world of AIs doing the AI thing and us just sitting here. We will also be, hopefully, deeply involved in this world.

Political Philosophy Questions

Then there's also a super-interesting question really of political philosophy: suppose we're in this multi-mind setting—what does the game theory of cooperation look like? What are the values that are broadly appealing to all minds sufficient to encourage them to join some coalition together, and what do these values look like?

Is it—I discussed liberalism several times. Is there some kind of mind liberalism that exists which is some equilibrium solution here? Can we think of Rawlsian-style veil of ignorance? This is another solution to how multi-agent systems should cooperate and distribute resources. Are we going to have some weird convex combination of utility functions? Andrew Critch had a nice paper on this where it's like we can convexly combine utility functions together. This is cool. This basically results in the concept of equity. Some people have more power and more equity in the mind values than others.

Is this going to happen? Is this good? Is this bad? There's lots of interesting questions here.

That's basically the end. Thanks!



Discuss