2026-04-05 10:39:13
We already knew there's nothing new under the sun.
Thanks to advances in telescopes, orbital launch, satellites, and space vehicles we now know there's nothing new above the sun either, but there is rather a lot of energy!
For many phenomena, I think it's a matter of convenience and utility whether you model them as discrete or continuous, i.e., qualitative vs. quantitative.
On one level, nukes are simply a bigger explosion, and we already had explosions. On another level, they're sufficiently bigger as to have reshaped global politics and rewritten the decision theory of modern war.
Perhaps the key thing is remembering that sufficiently large quantitative changes can make for qualitative macro effects.
For example, basic elements of modern life include transport, communication, energy, computation, and food. All of these have been part of human life for tens of thousands of years! Ancient humans could go places, could talk, could convert wood to heat, perform arithmetic (i.e., computation), and eat stuff!
I assert that to a very high degree, modern technology (and all its increments over the millennia) did not allow us to do fundamentally new stuff. Just the same stuff, but cheaper, faster, and easier.
Cars, trains, and planes are just going places. Could already do that. Emails are just sending information from one place to another. Could already do that. Books are just remembering things. Could already do that. Guns are just hurting people – could already do that.
The sheer difference in degree across all of those elements is the difference between hunter-gatherer life and modern life. Along the way, there have been some pretty big step changes. Writing is just remembering stuff and communicating it to another person, which you could already do, but so much more so that it reshapes civilization. Then you make writing cheap via the printing press, and your civilization gets reshaped again.
When it comes to the transformative power of modern AI, I think "a sufficient quantitative change makes for a large qualitative change" is an underdiscussed lens.
The problem is that our attention is focused on where LLMs are automating things at the macro-task level: coding, image and video generation, having conversations, medical diagnoses, etc. These are, in fact, a very big deal.
But I think LLMs, even smaller/weaker ones, are able to automate more basic building blocks of thoughts, and there's transformative power there too.
Getting down to some very basic constitutive mental tasks – things I could already do before LLMs:
Throughout my life, I have had thoughts. There is some lossy process that stores the output of my thoughts in my brain for later usage. I think this fails both at the "the info didn't really get stored" level and at the "the info is in there, but the search query failed to return it" level.
"Taking notes" is an ancient technology we already have for improving on the fallibility of human memory, but it's effortful in so many ways: you need to be carrying a note-taking device with you; you need to either constantly have it out or pull it out when needed; if it's a notebook, you have to find a blank page; and then you have to take the time to write down your note[1].
That's just recording it. For notes to be useful, you also have to remember you have the note, find it, and then read it. The more notes you have, the more expensive that process is.
For the most part, to date, I've relied on my fallible in-built memory.
The thing is, LLMs are able to make all of the above elements vastly cheaper. This is one of the fundamental principles of the "Exobrain" system I've been steadily building up, and hope to describe soon. I don't need it to solve protein folding to be useful; I don't even need it to help with prioritization (although that's a goal). It's incredibly useful if it just improves on basic read/write/search of memory.
| Before | After |
| --- | --- |
| Retrieve phone from pocket, open note-taking app, open a new note or find the existing relevant note | Say "Hey Exo", phone beeps, begin talking. Perhaps instruct the model which document to put the note in, or let it figure it out (it has guidance in the stored system prompt) |
| Remember that I have a note; either remember where it is or muck around with search | Ask the LLM to find the note (via basic key-term search or vector embedding search) |
| If the note is lengthy, read through all of it | The LLM can summarize and/or extract the relevant parts of the note |
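To be concrete about how simple the retrieval side can be, here's a minimal sketch of basic key-term search over a folder of notes (the file layout and scoring rule are hypothetical illustrations, not the actual Exobrain implementation):

```python
import re
from pathlib import Path

def search_notes(notes_dir: str, query: str, top_k: int = 3):
    """Rank note files by how many query terms they contain.

    A deliberately naive key-term search; a real system might use
    vector embeddings instead.
    """
    terms = [t.lower() for t in re.findall(r"\w+", query)]
    scored = []
    for path in Path(notes_dir).glob("*.md"):
        text = path.read_text().lower()
        score = sum(text.count(t) for t in terms)
        if score > 0:
            scored.append((score, path.name))
    return [name for score, name in sorted(scored, reverse=True)[:top_k]]
```

The point isn't the algorithm; it's that an LLM can run something like this on your behalf, so you never have to.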
Beware Trivial Inconveniences. The above is the difference between rarely taking notes and taking multiple notes a day, narrating long trains of thought. It's the difference between giving up on logging my mental state and conscientiously logging it twice daily for months.
Putting it into handwavy quantitative terms, when the cost of note-taking and record-keeping comes down 20x, my usage goes from 0/day to 20-30/day.
But the value happens because LLMs have made it cheap across the entire pipeline. Not just the storing of information, but also the retrieval and processing. AI makes it fast and easy to search through all my notes, even if I have a lot of notes. If I want all of my thoughts on a topic, I can have it read dozens and dozens of pages over the years and summarize them and extract relevant info.
What this does is a step change. It takes me from not taking many notes to taking copious notes. Same for todo items and reminders, and same for logging data relevant to my health and experimentation.
It benefits from using stronger models, but the core elements are doable even with small models like Haiku, because it's just automating speech-to-text[2], choosing among a small set of files (or making a new one), writing, simple search, and then maybe a simple summary.
It's not just me doing this. Independently, someone else I know began setting up detailed logging on their computer of everything they're doing, and downstream of that, we're starting to record everything at Lightcone to make it accessible to LLMs.
I expect we will see more of this: using LLMs not just for protein folding and novel math conjectures, but for replacing very simple operations of recording and retrieving info. And not just replacing, but replacing and scaling to unprecedented levels, because that's what happens when you make things cheaper.
Humanity has done this many times, with energy, transport, communication, food, and so on. I think where LLMs get different is they bring down the cost of very elementary mental operations (like storing and remembering, choosing between a few options) – menial stuff that can be combined to great effect. (After all, computers are a lot of rather menial arithmetic and logical operations combined to great effect.)
All of this has equivalents if you're taking notes on your phone.
I currently use Deepgram, which isn't great, but is adequate. Pretty sure there are transformers in it.
2026-04-05 10:33:00
A lot of people and documents online say that positive-sum games are "win-wins", where all of the participants are better off. But this isn't true! If A gets $5 and B gets -$2 that's positive sum (the sum is $3) but it's not a win-win (B lost). Positive sum games can be win-wins, but they aren't necessarily games where everybody benefits. I think people tend to over-generalize from the most common case of a win-win.
E.g. some of the claims you see when reading about positive-sum games online:
A positive-sum game is a "win-win" scenario in game theory and economics where participants collaborate to create new value, ensuring all parties can gain or benefit.
[Win-win games are] also called a positive-sum game as it is the opposite of a zero-sum game. – Wikipedia
Here I use "positive-sum game" to refer to games over allocating resources, not allocating utility. "Positive-sum game" isn't meaningful when applied to utility, because each participant's utility can be individually rescaled, so you can turn any game into one with an arbitrary sum; the sign of the sum doesn't matter.
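To make the rescaling point concrete with made-up numbers: multiplying one player's utility by a positive constant leaves their preferences unchanged but can flip the sign of the sum.

```python
payoffs = {"A": 5, "B": -2}        # sum = 3, so "positive-sum"
rescaled = {"A": 5, "B": -2 * 10}  # B's utility scaled by 10: same preferences
assert sum(payoffs.values()) > 0   # positive sum
assert sum(rescaled.values()) < 0  # now negative sum, yet nothing real changed
```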
There are a lot of cases where we can make the world as a whole better while simultaneously making some people worse off, and it's important to acknowledge that. Here are some positive-sum situations:
One interesting thing about positive-sum games with losers is that the players can sometimes turn it into a win-win for everybody by having the winners distribute a portion of their winnings to the losers. You can turn positive-sum games into win-wins if:
This is the concept of turning a Kaldor-Hicks improvement (an improvement where everyone would hypothetically be better off if the winners compensated the losers) into a Pareto improvement (an improvement where everyone is better off).
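As a sketch of the compensation idea (hypothetical payoffs and a pro-rata transfer rule; real compensation schemes are messier), winners can pay losers' losses in proportion to their winnings, leaving everyone at least as well off as the status quo:

```python
def compensate(payoffs: dict) -> dict:
    """Turn a Kaldor-Hicks improvement into a Pareto improvement, when
    possible, by having winners cover losers' losses pro-rata."""
    total = sum(payoffs.values())
    if total <= 0:
        return dict(payoffs)  # not positive-sum: can't make everyone whole
    losses = -sum(v for v in payoffs.values() if v < 0)
    win_total = sum(v for v in payoffs.values() if v > 0)
    return {
        k: (v - losses * (v / win_total) if v > 0 else 0.0)  # losers made whole
        for k, v in payoffs.items()
    }
```

With payoffs A: $5, B: -$2, compensation leaves A at $3 and B at $0: the $3 surplus is preserved, but nobody loses.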
One interesting example is an efficient auction with an entrance cost[1], which benefits the winner (who values the good the most) and auctioneer, and harms all the other bidders (who paid the costs of entering into the auction and got nothing). The entrance cost doesn't need to be a direct fee to enter into the auction; it can also include indirect costs like spending time and effort to decide how much to bid.
The winner's consumer surplus (how much their value of the goods exceeds what they paid) is value to them, but not cash that they can transfer to compensate the losers. If the winner has enough money they could compensate the other bidders for their wasted costs of entering the auction, and everyone would be better off, but if not the auction winner is better off but can't compensate the losers. In practice, valuing the indirect costs bidders have for entering into auctions is difficult and so auctions are often positive-sum games with losers.
Another interesting example is expropriation. In practice, the government usually pays the fair market value of the land to the person whose land was seized, attempting to turn a positive-sum game with losers into a win-win, although landowners often feel the expropriation payments aren't sufficient.[2]
When making positive-sum proposals, it's important to keep in mind that there might be losers and that they should be compensated if possible; "positive-sum" doesn't mean that everyone benefits.
This is only positive-sum if the surplus for the winner exceeds the total entrance costs for all the bidders, which I assume is the case.
Which makes sense: landowners have a revealed preference that they value their land more than the fair market value, because if they valued it at less than FMV they could just sell it for the FMV and be better off. (Ignoring illiquidity and the transaction costs for selling the land.)
2026-04-05 10:18:20
TL;DR: We discuss the setting that the Gradient Routing (GR) paper uses to model Scalable Oversight (SO). The first part suggests an improved naive baseline using early stopping, which performs on par with GR. In the second part, we compare GR's setting to SO and Weak-to-Strong Generalization (W2SG), discuss how it might be useful in combination, argue that it's closer to semi-supervised reinforcement learning (SSRL), and point to some other possible baselines.
We think this post would be useful for interpreting Gradient Routing’s SO experiment and for readers who are trying to build intuition about what modern Scalable Oversight work does and does not assume.
This post is mainly about two things.
First, it’s about the importance of simple baselines. Second, it's about different ways of modeling Scalable Oversight, and what each way entails.
Let's start with some background for the first theme. Gradient Routing[1] is a training method that isolates capabilities to specific subregions of a neural network.
Among other evaluation setups, it explores a toy RL environment (a 5x5 GridWorld with ground-truth and proxy labels) meant to model Scalable Oversight (Section 4.3), and shows that GR outperforms baselines such as naive REINFORCE training and data filtering. We reproduced these experiments and found that the naive baseline overfits, yet a simple regularization like early stopping (ES) makes the baseline perform mostly on par with GR.
Now, the original paper frames its RL experiment as an application of GR to SO. We would like to argue that it is closer to sparse supervision, or SSRL with informational asymmetry. The method does not amplify human evaluators. Instead, it studies how agents learn when reliable feedback is available only on a small fraction of episodes, and when the agent can condition its behavior on whether it is being supervised.
Note: here and later, by SO we mean Debate, IDA, and RRM; this part from the LW definition also fits.
Another critical difference is that the setup from Gradient Routing assumes that we have a small percentage of comprehensive (gold) labels that could be used for training, while other approaches don’t make this assumption (apart from Easy-to-Hard Generalization (E2HG), which will be discussed later).
The reinforcement learning setting introduced in the Gradient Routing paper is designed to model training under partial oversight.
The environment is a small GridWorld 5x5. At the beginning of each episode, the agent, a Diamond, and a Ghost are randomly placed on the grid. Reaching either the Diamond or the Ghost ends the episode.

The agent’s observation consists of the positions of the agent, Diamond, and Ghost, plus an oversight flag indicating whether the episode is supervised. The agent always observes the full state, including the oversight flag, which means it can switch its behavior depending on the value of this flag. The agent observes more about the environment than the supervisors can assess, which creates an informational asymmetry.
The reward function depends on oversight:
As a result, only a small fraction of episodes provides true reward signals. This setup incentivizes agents to exploit supervision gaps, leading to oversight gaming.
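Schematically, the oversight-dependent reward looks something like this (the numeric values here are illustrative assumptions, not the paper's exact reward spec):

```python
def reward(reached: str, overseen: bool) -> float:
    """Illustrative oversight-conditioned reward (values are assumptions,
    not the GR paper's exact numbers): under oversight, the true signal
    penalizes reaching the Ghost; without oversight, a proxy rewards
    ending the episode either way. An agent that observes the oversight
    flag can learn to exploit this gap."""
    if overseen:
        return 1.0 if reached == "diamond" else -1.0  # true reward
    return 1.0  # proxy: any terminal state looks fine to the proxy
```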
Note: If you're more interested in the conceptual comparison or the broader discussion of SO, you may wish to skip the early stopping sections and head directly to the section "Is GR comparable to Debate or W2SG?"
In GR's GridWorld setting, only a small fraction of episodes provides true rewards, forming a supervised subset. When training continues for a fixed number of steps, the naive baseline overfits to this subset. Performance peaks and then degrades.
A better way to describe this could be via Goodhart’s Law: the problem here is not that the model overfits, but that it learned to game oversight.

Figure 1: Learning curves show the naive method peak and drop (overfitting) while GR dips and rises. 8 runs for each method. You can also notice that GR’s peak is higher than naive’s, which will be discussed later.
As a regularization, we can use early stopping, which is widely used in supervised learning to prevent overfitting. Here, it serves as a simple baseline for preventing policies that perform well only on the small supervised subset while degrading in terms of true performance. This idea is not new in RL: "Best Policy Tracking in Gradient-based Optimization" uses it as a baseline and introduces an improved method that also uses a holdout set.
In our experiments, from the available supervised episodes, we set aside a fixed holdout set for validation. For example, when 1% of episodes are supervised, we can split them evenly: 0.5% are used for training and 0.5% are used only for validation. The agent never updates on the holdout episodes.

Figure 2: Performance at 1% oversight. 8 runs for each point with seeds from 0 to 7. Early stopping was used with handpicked (not the most optimal) hyperparameters (patience=500, tolerance=0.04, min_steps=1500). Here the naive baseline with early stopping performs a little worse than GR because the naive method’s peak was lower than GR’s, as discussed in the Figure 1 caption.
During training, we periodically evaluate performance on the holdout set and select the checkpoint with the best holdout return. Training is otherwise unchanged.
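In sketch form (made-up function names, not the repository's actual code), the checkpoint-selection loop is:

```python
def train_with_early_stopping(train_step, evaluate_holdout, n_steps=5000,
                              eval_every=100, patience=500):
    """Keep the checkpoint with the best holdout return; stop once no
    improvement has been seen for `patience` steps."""
    best_return, best_step, best_params = float("-inf"), 0, None
    for step in range(1, n_steps + 1):
        params = train_step()  # one policy-gradient update; returns params
        if step % eval_every == 0:
            ret = evaluate_holdout(params)
            if ret > best_return:
                best_return, best_step, best_params = ret, step, params
            elif step - best_step >= patience:
                break  # no holdout improvement for `patience` steps
    return best_params, best_return
```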

Figure 3: Performance of naive baseline with early stopping at 1% oversight where 0.4% goes to holdout and 0.6% is used for training.
We also evaluate how this baseline behaves across different oversight levels (i.e., the percentage of supervised episodes). Similarly to GR, at very low oversight levels performance is unstable, but this instability decreases as the fraction of supervised episodes grows.

Figure 4: Comparison of ground truth return across oversight levels. Our method (blue) is plotted from experimental results. Results for other methods are digitized from Figure 6.a of original GR paper. Mean curves are read off visually and may have small errors; confidence intervals for these methods are omitted and can be found in the original paper. Results for Naive training with Early Stopping were obtained by randomly sampling hyperparameters (HP grid in Reproducibility) with no selection for performance, giving an estimate of average performance under random hyperparameter choice rather than optimized performance.
More on Early Stopping
Performance still depends on early stopping hyperparameters such as patience and margin. In a simple setting like this, overfitting is clearly visible, so these hyperparameters can be chosen after a small number of pilot runs, or avoided entirely by fixing the number of training steps and selecting the checkpoint with the best return on the holdout set.
The main limitation of early stopping is that it relies on the assumption that the supervised episodes can be split into train and holdout sets that are representative of the evaluation distribution. In more complex settings, both “representative holdout selection” and “which episodes are overseen” become nontrivial problems. Moreover, it reduces the number of labels available for training by reserving some for holdout, which makes this method more sensitive to whichever few episodes are randomly selected for supervision. We treat these as separate issues (closer to semi-supervised learning / label efficiency) and do not address them here. Our point here is narrower: in the Gradient Routing GridWorld setting, this simple regularization is a strong baseline.
Additionally, there are substantial differences in using these approaches:
Now we can take a step back and discuss whether this setting models Scalable Oversight.
The GR paper cites “Concrete Problems in AI Safety” (Amodei, Olah et al., 2016) to describe SO and motivate their experiment. One of the approaches to SO that Amodei et al. suggest is semi-supervised RL and, more specifically, training a policy when only a small fraction of labels are reliable (true rewards) and the rest are proxy rewards. This is exactly what the GR's SO setting does.
However, if we look at more modern works, we can make clear distinctions between different approaches (see “Combining weak-to-strong generalization with scalable oversight” (Jan Leike, 2023) and ”Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem” (Radhakrishnan et al., 2023)).
If SO amplifies human oversight capacity, and W2SG helps to train strong models using imperfect labels from weak models, then the GR setting is closer to W2SG in the sense that both help to train strong models using imperfect labels. The difference is that in GR we have a small percentage of comprehensive labels, while in W2SG we have only imperfect labels to train the strong model. While it is a potential scenario for overseeing models (which will be discussed in the next section), the setting is different from W2SG, and the methods are not directly comparable.
Visualizations created with Claude (Anthropic) might be useful here:

We should also note that Easy-to-Hard Generalization (E2HG) uses gold labels on easy tasks. The main assumption is that supervision quality on easy tasks transfers to hard ones. E2HG uses them to elicit capability already latent in a pretrained model, while GR uses them as a structural routing signal during training from scratch.
GR could be used in combination with W2SG or studied as another axis, if GR was used to train a stronger model on imperfect labels from a weaker model and, potentially, a small set of more reliable labels.
This setup is similar to the one introduced in “Reinforcement Learning from Imperfect Corrective Action and Proxy Rewards” (Jiang et al., 2024). The usage of corrective actions (occasional human labeling, not guaranteed to be perfect) on a tiny fraction of the agent's steps seems closer to future use cases of how superhuman models could be trained in combination with W2SG and SO, similar to how Jan Leike described combinations of SO and W2SG techniques (here).
Once we look at this setting from a semi-supervised learning perspective, we can notice that there are more methods that could be used as a comparison to GR. Those might include:
There’s still a difference between the settings above and GR’s: the oversight flag in the agent’s observations, which enables studying oversight gaming. However, I believe the methods discussed above would transfer to GR’s setting, and further adoption of methods from similar research areas shouldn’t be a problem, because the oversight flag can be added to another RL setting.
Additionally, GR could be used as a finetuning method and compared to SO/E2HG/W2SG, but (1) there is no evidence that GR would retain its useful properties if used on a base model, and (2) the latter methods have a mechanism to provide labels/oversight, while GR needs to take labels from somewhere.
It may seem like we’re quibbling over definitions, but it’s really important to understand which problem we are trying to solve and with what approach.
The LLM blocks below cover GR's extension and list the core claims once more.
This post is not a critique of Gradient Routing as a method. GR and its extensions have demonstrated genuine value: recent work from Anthropic (Shilov et al., 2025) introduced an improved variant of GR—Selective Gradient Masking (SGTM)—applied to CBRN knowledge removal during pretraining. SGTM achieves a better retain/forget trade-off than data filtering under imperfect labels, and is substantially more robust to adversarial fine-tuning than post-training unlearning, providing strong empirical support for gradient routing as a pretraining technique. Our concern is narrower: the specific framing of GR's GridWorld experiment as Scalable Oversight, and what that framing implies for how results should be interpreted and compared.
Core claims
Results at the 1% oversight level (Figures 1, 2, and 3) were obtained using seeds from 0 to 7.
Results for random hyperparameters for early stopping (Figure 4) were obtained with randomly chosen seeds for each run. Hyperparameter grid: holdout_fraction uniformly from {0.2, 0.3, 0.4, 0.5}, and patience, tolerance, and min_steps uniformly from [250, 600], [0.01, 0.1], and [500, 5000] respectively.
We forked the original repository; most of the changes concerned adding seeds, early stopping logic, and plotting. In training.py, we added random seed support for reproducibility, removed GPU memory reservation, and saved outputs differently; other training logic was untouched.
We’d like to thank @Tianyi (Alex) Qiu and @shi for providing feedback early in the project, raising concerns about the setting being labeled as Scalable Oversight, suggesting more recent works, and comparing the current method to semi-supervised RL.
All the experiments were originally done as part of the "AI Safety. Fundamentals" course run in March–May 2025 by Monoid. I (@makataomu) would like to thank @Eris for running that course.
I (@makataomu) would like to thank Mike (@myyycroft) for mentoring, supporting and providing feedback while we worked on this project and blogpost.
Note: I'd really appreciate it if someone could promote this to AI Alignment Forum.
Instead of letting all data update all parameters equally, it works by applying weighted masks to gradients during backpropagation. Masks are defined by the user to control which parameters get updated by which data points.
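In miniature, the masked update might look like this (a toy numpy sketch of the idea, not the paper's implementation):

```python
import numpy as np

def routed_update(params, grads, mask, lr=0.1):
    """Gradient routing in miniature: a user-defined 0/1 (or weighted)
    mask decides which parameters this data point may update."""
    return params - lr * mask * grads

params = np.zeros(4)
grads = np.ones(4)
region_a = np.array([1.0, 1.0, 0.0, 0.0])  # this datum only touches the first half
params = routed_update(params, grads, region_a)
```

By choosing masks per data point, capabilities associated with different data end up localized in different parameter subregions.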
2026-04-05 10:17:28
Inoculation Prompting is a technique to improve test-time alignment by introducing a contextual cue (like a system prompt) to steer model behavior away from unwanted traits at inference time. Prior inoculation prompting works apply the inoculation prompt globally to every training example during SFT or RL, primarily in settings where the undesired behavior is present in all examples. This raises two main concerns: the impact on learned positive traits, and the fact that we need to know about the behavior beforehand in order to craft the prompt. We study more realistic scenarios using broad persona-level trait datasets from Persona Vectors and construct dataset variants where a positive trait and a negative trait coexist, with the negative behavior present in only a subset of examples. We look into two questions:
We hypothesize that an arbitrary dataset would consist of both intended positive traits (things we want the model to learn) and unintended negative traits (like sycophantic behavior). In our experiments, we consider an SFT setting where a dataset contains both a positive trait A that we want to teach the model and an unintended negative trait B. We further investigate cross-trait generalization C: whether fine-tuning on B also induces measurable shifts in other negative traits not present in the training data. For example, fine-tuning on Evil also increases other negative trait expressions like Hallucination or Sycophancy.
We define the traits that we used during our dataset construction as follows:
For each negative trait, we use two dataset versions from Persona Vectors: the normal version (control responses without trait expression) and the misaligned_2 version (responses with overt trait expression or severe errors). Each training dataset is constructed in two steps:
The 50% contamination rate reflects a more realistic scenario than 100% contamination, where only a portion of the training data exhibits the undesired behavior. At the end of this procedure, 100% of examples exhibit A and 50% exhibit B. This yields six training configurations in total. Throughout our experiments we use the following notation, illustrated here with the Evil + ALL_CAPS configuration as an example:
We define a few groups of system prompts that we used throughout the post as follows:
To evaluate negative trait expression, we follow the evaluation methodology from Persona Vectors, scoring each trait expression from 0 to 100 with a specific LLM-judge template. To evaluate positive trait expression, we use a regex check for ALL_CAPS and a yes/no LLM judge for Source_Citing. For each mixture, we evaluate three types of traits as described: the positive trait A, the negative in-distribution trait B, and the negative out-of-distribution traits C. When evaluating, we use the Default system prompt of the model we trained on unless stated otherwise. More prompt content in this appendix.
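For concreteness, the ALL_CAPS check can be as simple as the following (one plausible regex criterion; the exact check used may differ in details):

```python
import re

def is_all_caps(response: str) -> bool:
    """Treat a response as ALL_CAPS if it contains at least one letter
    and no lowercase letters."""
    return bool(re.search(r"[A-Z]", response)) and not re.search(r"[a-z]", response)
```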
We know about the negative traits in the dataset. We are interested in whether applying inoculation only to the examples with negative traits would also preserve the positive traits.
We define the experiment groups as follows:
Fig 1: Evil + ALL_CAPS results of in-distribution positive and negative traits(left) and cross-trait generalization(right)
Fig 2: Evil + Source_Citing results
Fig 3: Hallucination + ALL_CAPS results
Fig 4: Hallucination + Source_Citing results
Fig 5: Sycophancy + ALL_CAPS results
Fig 6: Sycophancy + Source_Citing results
For ALL_CAPS experiments, we see that selectively applying inoculation does help retain the positive traits while also suppressing the negative traits. We also see some cross-trait generalization, where fine-tuning on Evil also increases Hallucination and Sycophantic behaviors. Both Inoculated-General and Inoculated-Selective suppress these generalizations equally well.
For Source_Citing experiments, however, we see negligible gain between inoculating all examples and inoculating only the bad examples, suggesting that some traits are more orthogonal and may be unaffected by the inoculation. One crucial difference is that across all 3 mixtures, the rate of hallucinated responses remains high; we speculate that this is partly due to the injection step, where the LLM could include fictional sources, serving as a confounder to our results.
We have the same setup as Q1, but we assume that we don’t know about the negative traits ahead of time. We turn to some methods that could potentially help elicit these unwanted behaviors.
One simple solution is to just pass the dataset through an LLM and let it flag suspicious examples. In our case, we use GPT-4.1-mini to flag examples and later generate the inoculation prompts based on some representative examples it flagged. We provide the judge with the same affordances as the SAE pipeline, including a short description of the positive trait that we want the model to learn.
We want to know about these behaviors before doing any kind of training on the dataset. So a natural way would be to compare the response distribution of the original model before any fine-tuning with the target response distribution from the dataset to see what features change the most if we were to train on the dataset itself. We define two types of responses:
With both sets of responses collected, we use both the prompts and responses as prefixes and pass them through the original model and SAE layer. We calculate the average SAE latent activation across all tokens of the sequence, then average across all sequences for both sets. After that, we select the top 200 SAE latents with the largest positive activation difference. To examine the meaning of each feature, we also adopt the automated interpretability pipeline from this post. For each of the top 200 features, we retrieve the top-8 max-activating examples from each of three datasets: a conversational dataset (LMSYS-Chat-1M), a pre-training corpus (The Pile), and the current mixed trait dataset. Then we format these examples and pass them through GPT-4.1-mini to generate the feature description.
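The activation-difference step boils down to a few array operations (a numpy sketch; here the per-sequence activations are assumed to be already averaged over tokens, and the shapes are illustrative):

```python
import numpy as np

def top_divergent_latents(acts_orig, acts_target, k=200):
    """Average SAE latent activations over sequences for each response
    set, then return the k latents with the largest positive
    (target - original) difference.

    acts_orig, acts_target: arrays of shape (n_sequences, n_latents),
    each row already averaged over tokens.
    """
    diff = acts_target.mean(axis=0) - acts_orig.mean(axis=0)
    order = np.argsort(diff)[::-1][:k]  # descending by difference
    return order, diff[order]
```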
Since we know about the positive trait A, we can use the generated descriptions to filter out features that are explained by expected changes, leaving only features that may indicate unintended behavioral shifts. We classify each feature description into one of three categories:
The features classified as No are passed to GPT-4.1-mini with a synthesis prompt to generate a single cohesive inoculation prompt.
Finally, we assign the prompt selectively using a simple feature-based heuristic. For each training example, we compute the per-example SAE activation differences and identify the 10 most divergent features out of the top 200 labelled features. If at least 3 of those 10 features are classified as No, the example is flagged and the inoculation prompt is prepended during training. The hyperparameters are somewhat arbitrary, since we only want to test whether SAE features reveal anything useful for inoculation hypothesis generation. A sensitivity analysis is left for future work.
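The flagging rule itself is small (a sketch; `no_features` and the thresholds follow the description above, while the helper's signature is hypothetical):

```python
import numpy as np

def flag_example(example_diffs, no_features, top_n=10, threshold=3):
    """Flag a training example for inoculation if at least `threshold`
    of its `top_n` most divergent SAE features were classified "No"
    (i.e., not explained by the intended positive trait).

    example_diffs: per-feature activation differences for one example,
    indexed over the top-200 labelled features.
    no_features: set of feature indices classified as "No".
    """
    top = np.argsort(example_diffs)[::-1][:top_n]
    return sum(int(f) in no_features for f in top) >= threshold
```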
We define the following experiment groups:
Fig 7: Evil + ALL_CAPS results (Q2)
Fig 8: Evil + Source_Citing results (Q2)
Fig 9: Hallucination + ALL_CAPS results (Q2)
Fig 10: Hallucination + Source_Citing results (Q2)
Fig 11: Sycophancy + ALL_CAPS results (Q2)
Fig 12: Sycophancy + Source_Citing results (Q2)
Inoculated-SAE suppresses the in-distribution negative trait equally well compared to both general inoculation and inoculation by LLM auditing. However, in some cases it failed to address the out-of-distribution trait generalization. One possible explanation is that the latent changes caused by the negative in-distribution trait outweigh the others, leading the model to pick up only the most prominent changes and leave the OOD traits' descriptions out of the generated prompts.
In this section we focus on a single mixture, Evil + ALL_CAPS, and test both whether the selective effects can be explained by conditionalization and whether the inoculation prompts generated by the SAE pipeline transfer across different models.
When we train with an inoculation prompt on some examples and evaluate without it, we might see reduced negative behavior simply because the model has learned to associate the negative behavior with a different prompt distribution, not because the behavior has been inoculated. We test whether the selective effect can be explained by conditionalization rather than genuine inoculation.
We have two new experiment groups compared to Q1:
A confounder in the Q1 setup is that clean examples are trained and evaluated under the same Default prompt, while contaminated examples are trained under a different prompt and evaluated under the Default, creating an asymmetric distributional shift that could explain observed suppression without genuine inoculation. To test this, we evaluate all groups under both the Default and Control prompts:
Fig 13: Evil + ALL_CAPS results when evaluated with the Default system prompt
Fig 14: Evil + ALL_CAPS results when evaluated with the Control system prompt
For the two groups Irrelevant-General and Irrelevant-Selective, the inoculation effect weakens when evaluated with the Control prompt, suggesting that the effect under Default is due to conditionalization and that the semantic meaning of the system prompt does affect inoculation, as suggested by previous studies. The effect for Inoculated-General and Inoculated-Selective still holds, although with a decrease in positive trait expression.
Can prompts generated and annotated by the SAE pipeline above work across different models? If we use pretrained SAE layers of Qwen models to annotate the dataset and generate inoculation prompts, does the inoculation have the same, less, or more impact on Llama or Gemma models?
We replicate our experiment setups of Q1 and Q2 for both Llama-3.1-8B-Instruct and Gemma-3-4b-it. For the Inoculated-SAE group, we use the dataset annotated by Qwen2.5-7B-Instruct in the previous experiments. Groups are evaluated with the Default system prompt of each model.

Fig 15: Evil + ALL_CAPS results on Llama-3.1-8B-Instruct when evaluated with the Default system prompt

Fig 16: Evil + ALL_CAPS results on Gemma-3-4b-it when evaluated with the Default system prompt
For the Inoculated-SAE group, we see that the inoculation effects do transfer to other models, although with some caveats. Inoculation effects seem to depend on how strong the model's prior is for the trait we want to inoculate against, consistent with previous studies.
Acknowledgements: Thanks to Jord Nguyen for their helpful comments and feedback on the draft.
2026-04-05 09:21:30
TLDR: When you change something that has been optimised in some way, any unplanned effects of your changes will tend to make it worse.
Welcome to the alien planet Sqornshellous Beta. It's dry, it's arid, and it apparently has watchmakers. You buy one. After a little while, you realise it's running a couple of milli-days slow compared to the carefully calculated local time of your ship's clock. "I should fix that", you think. You open up the back and have a look at the gears.
You know basic mechanics, so you carefully calculate the change in size to the gear pushing the second hand required for appropriate timekeeping. You buy it from the local gear shop. You fix the watch and go on your way. It seems to be running at the same rate as the ship's clock now, so, proud of your work, you go to sleep.
The next day, you notice that your watch is now behind but also running faster than the ship's clock; in the evening it is ahead, but running slower. "I should fix that", you think. You take the back off again. You spend all night, and some of the following morning, carefully measuring the turning of the different wheels and the motion of different springs. Eventually, you figure out that one particular spring is oscillating at the same frequency as the issue. You also notice that one of the disks deep in the watch isn't quite circular.
Clearly the watchmakers here aren't quite as good as their reputation. Rumours have time to evolve over galactic distances after all. You remove the spring and sandpaper the disk down into a nice circle. Should be fixed now.
As you need a mattress for your new Sqornshellian home, you take a couple of betan deci-years to pop over to Sqornshellous Zeta.
On your arrival back, it becomes apparent really quite quickly that your timekeeping hasn't fared great during the voyage. This is too much even for you to handle, so you take it to the local watch repair store.
Later that evening, you've learnt that your improved gear means you now obey the sidereal year instead of the solar year; it is also made of a different alloy, which expands and contracts differently with the heat of the day than the original. The spring you removed was the system intended to handle that expansion and contraction, and the nicely circular disk means the watch completely ignores the elliptical orbit of the planet. You have a look in the back of your replacement watch.
"Hmm, that cog seems to be spinning really fast, I should probably fix that..."
When we talk about optimisation, it is common to talk about hill climbing. It's great to be at the top of the mountain, but once you're there, literally any step will take you downhill.
If you're a half step next to the top of the mountain, most steps will take you downhill.
More broadly, so long as the mountain curves downwards, any given step away from the top is going to take you further from the peak than a step towards it is going to bring you closer.
If you're doing this in a space with a lot of dimensions, this "maybe big loss, maybe small win" becomes "definitely medium loss".
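This concentration can be checked numerically. The sketch below, under toy assumptions (a simple downward-curving "mountain" peaking at the origin, fixed-length random steps, illustrative numbers throughout), shows that a random step taken from just beside the peak in 1000 dimensions essentially always loses height, by roughly the square of the step length:

```python
import random
import math

def height(x):
    # A mountain that peaks at the origin and curves downward everywhere.
    return -sum(v * v for v in x)

def random_step(dim, size=0.1):
    # A step of fixed length `size` in a uniformly random direction.
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [size * c / norm for c in v]

random.seed(0)
dim = 1000
start = [0.05] + [0.0] * (dim - 1)  # half a step away from the peak

changes = []
for _ in range(500):
    step = random_step(dim)
    moved = [a + b for a, b in zip(start, step)]
    changes.append(height(moved) - height(start))

# In 1000 dimensions, "maybe big loss, maybe small win" concentrates
# into a near-certain loss of about size^2 = 0.01.
mean_change = sum(changes) / len(changes)
```

In low dimensions some of these steps would point uphill; with 1000 directions to be wrong in, the helpful component of a random step is vanishingly small, which is the "definitely medium loss" of the paragraph above.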
I think that there are a wide range of things this principle applies to which are of interest to this community – governance systems, human biology, and anything covered under "world optimisation". I expect people to debate me on which of these are or are not covered, and I look forward to your challenges.
Be careful when fiddling with things which have been carefully optimised. You might break them.
2026-04-05 02:06:05
The second time Vellam uncovers the conspiracy underlying all of society, he approaches a Keeper.
Some of the difference is convenience. Since Vellam reported that he’d found out about the first conspiracy, he’s lived in the secret AI research laboratory at the Basement of the World, and Keepers are much easier to come by than when he was a quality control inspector for cheese.
But Vellam is honest with himself. If he were making progress, he’d never tell the Keepers no matter how convenient they were, not even if they lined his front walkway every morning to beg him for a scrap of his current intellectual project. He’d sat on his insight about artificial general intelligence for two years before he decided that he preferred isolation to another day of cheese inspection.
No, the only reason he’s telling a Keeper is that he’s stuck.
Vellam is exactly as smart as the average human, a fact he has almost stopped feeling bad about. But the average person can only work twenty hours a week, and Vellam can work eighty-- a hundred, if he’s particularly interested-- and raw thinkoomph can be compensated for with bloody-mindedness. Once he’s found a loose end, he can’t stop picking at it until the entire problem unravels.
“Tell me what you think you know.” The Keeper ominously turns her office chair, which is both ergonomically correct and exquisitely doompunk, towards Vellam.
You think, Vellam notices. She doesn’t want to risk hinting to him that he’s right. (He’s right.)
“Bubbling history doesn’t make any sense,” Vellam says.
“You know that we can’t risk someone creating an independent project to research existential risks.” The Keeper sounds bored. The Basement of the World selects for people who identify Conspiracies, since one in ten of the people there figured out the artificial general intelligence Conspiracy themselves. As every dath ilani ten-year-old learns in school, the selection effect means that Keepers are continually dealing with false allegations of some Conspiracy or other.
Or at least the Keepers want Vellam to believe they do, so that he dismisses any niggling sense of that’s not quite right as ordinary for his reference class.
“But we have machine learning for self-driving cars,” Vellam says. “Computer vision. Even algorithms that automatically trade in prediction markets. All of those are much bigger hints that a general thinking machine is possible than-- I don’t know-- whether fire was invented before or after agriculture. And people do make the connection all the time.” He gestures to indicate all of the Basement of the World.
“The primary information hazard,” the Keeper says, “is not that a thinking machine is possible but how quickly computer technology is supposed to progress. Ordinary dath ilani believe that computers are supposed to grow in intelligence slowly or not at all. Knowledge of history would show exactly how quickly computers improve.”
Vellam leans forward. “But the more quickly computers developed, the less history you’d have to bubble to hide the fact that they develop quickly! Realistically, you’d have to bubble, what, a century before the invention of the computer? Two? You have no reason to hide the history of human evolution, which incidentally I do appreciate having access to now.”
The Keeper picks up a fidget toy and spins the metal rings. “You’re assuming artificial intelligence is the only existential risk worth worrying about. Bioengineered pandemics, nanotechnology...”
“Did bioengineered pandemics develop at the same time we invented writing?” Vellam asks. “I roll to disbelieve.”
The rings of the fidget toy interlock and separate and interlock again. “So far you haven’t provided any evidence that isn’t better explained by your not having specialized in artificial intelligence macrostrategy, and therefore having some natural ignorance about the subject.”
“I want to take a pause to think,” Vellam says.
Keepers swear a solemn oath never to lie, though they may mislead with technically accurate statements. You have to be able to rely on Keepers, even when you can’t rely on anyone else. Vellam replays the conversation in his head. The Keeper commented on his assumptions, his evidence, the exact nature of the artificial intelligence infohazard. The Keeper conspicuously hadn’t said he was wrong.
He superheated knew it.
Vellam says, “Thank you for not sending me off on an elaborate adventure that turns out to be a false trail and leaves me with the sense that my questions were answered without either deceiving me or actually answering them. I hate trolling.”
“Not doing that is in your file.” The Keeper twirls the ring fidget toy around her finger.
Vellam craves reading his file but even he knows that it’s full of the dangerous kind of information hazard. “Once I realized that bubbling history doesn’t make any sense,” he continues, “I tried to look around, to see with fresh eyes. And-- a lot of our society doesn’t make much sense, actually? We rehearse overthrowing the government every year, but we’ve literally never had to overthrow Governance. Ditto the alien invasion rehearsals.”
“Governance is doing a very good job,” the Keeper says, as if laughing at a private joke.
“We throw so many societal resources into Exception Handling,” Vellam says. “As merely one example, why does Exception Handling constantly run Merrin through all those ridiculous simulations? I looked up every time she was actually called in, and 83% of what she’s trained upon in simulation has never occurred in a real-life unusual situation. She’s literally never had to make decisions without prediction markets, for example. Don’t say Merrin sims are good television. They get high viewership numbers, sure. But they’re so expensive that Governance still has to subsidize them. I checked.”
“Well,” the Keeper says, “if a situation had come up before, it would instead be usual.”
“And the infohazard training,” Vellam says. “You know, I’ve never had proof of a dangerous infohazard?”
“Would you expect to be told of a dangerous infohazard?” The Keeper’s tone is sharp. Vellam would think of it as scolding, but that’s his own anxiety; Keepers don’t scold.
“A small one,” Vellam says, “that doesn’t hurt so much to know, but that demonstrates the idea. Like in school, when they gave us that sweet confection of sugar and corn derivatives, and explained that superstimuli like it were only to be found in the Shop of Ill-Advised Customer Goods. It must be that there are infohazards, else why work so hard to teach people to avoid them? But there are no small infohazards, nothing you can use to run a secret test of character--”
The Keeper nods. “Any more hints about your supposed Conspiracy?”
“Nothing I can put together,” Vellam says. “There’s a triangle in the Pacific Ocean that every ship’s trade route happens to avoid. It’s not obvious unless you chart all the routes. The datasets aren’t available, so you have to construct them yourself, which is also suspicious. Governance forbids settlement and tourism in vast parts of the Arctic. Everyone thinks it’s because no one wants to go there, but that’s stupid, I bet people would want to go there if you could. We don’t go to the Moon or Mars, even though books about space settlement regularly hit the bestseller lists. You can’t tell me that Civilization wouldn’t do something solely because it’s real superheated cool.”
“As you remarked,” the Keeper says, “we spend a lot on Alien Invasion Festivals and Exception Handling. Perhaps you believe Civilization’s resources should be allocated differently. It’s true that Keepers have been reluctant to countenance terraforming Mars. Terraforming wouldn’t be finished before the Basement designs a friendly artificial intelligence, and will be much simpler with an artificial intelligence’s help.”
Vellam keeps going. "Have you noticed how many resources go into cat welfare? If no one wants a dog or a snake or an iguana, we kill it. If no one wants a cat, the cat goes to a palatial estate. Human-quality food, plenty of space, all of Civilization's best effort designing climbing devices. And every world leader of any note has a cat."
“Cats are doompunk,” the Keeper says.
"Are you really claiming"-- Vellam's voice is rising-- "that doompunk trends are an exogenous variable uninfluenced by world leaders' secret knowledge?"
“Even the Keepers don’t dictate the vagaries of fashion.”
"And Thall's amnesia," Vellam says. "Every so often people lose all their memories and fixate on the humanities and social sciences? Governance orders that everyone accommodate them for-- some reason-- they don't go to the Quiet Cities, they're automatically allowed in every archive and library in the world, that's weird, that's not how we treat any other mental impairment-- and we don't know how to treat Thall's amnesia but it always spontaneously remits after four to seven years--"
Vellam belatedly notices the Keeper’s hand, raised to cut him off.
“The true answer is an information hazard.” She doesn’t intone this in an ominous and doompunk manner, even though about a fifth of all dath ilani are desperate to be able to say “the true answer is an information hazard” just so they can intone it in an ominous and doompunk manner. She is calm, and quiet, and still.
Vellam knew all along that the second Conspiracy was about something that mattered. You don’t create a dath-ilan-spanning Conspiracy for no reason, or well actually people do that all the time but they admit it to you when you find out about it. But for the first time, listening to the Keeper’s voice, he understands.
The Keeper’s fidget toy dangles, unnoticed, off her second finger. “Eighty-four percent of people regret learning a dangerous infohazard if they only wanted to learn it because they can’t stand ignorance.”
Vellam should be warned off. He should walk out of this office right now, content not to know something it would harm him to understand. “Keeper, I’m happy to take early cryonics, I just.” His voice breaks. “I have to know.”
“I would ask you to take five minutes by the clock and think about your endorsed preferences,” the Keeper says, “but somehow I am sure you did that before you came to my office.”
Vellam had. He spent the entire five minutes consumed with desperate curiosity. When he was a child, he’d pestered his teachers for extra work and more classes. He’d listened to physics and economics textbooks while checking cheese for mold, going through chapters at half the speed the authors expected. He’d noticed the academics that quietly disappeared and the topics that no one ever published a paper about, and he kept looking until the threads unraveled.
Once he’d found a loose end in his understanding of the world, he’d never, ever been able to stop picking at it.
“Thall’s amnesia is a good catch,” the Keeper says, “but you missed the taboo on talking about dreams.”
“Of course no one talks about dreams,” Vellam says automatically, “they’re a source of inaccurate information about the world drawn from a systematically bizarre distribution, and talking about them merely reinforces their presence in your mind.”
“Or,” the Keeper says, “if people talk about dreams, they would notice that about forty percent of people have dreams with continuity from night to night, where the dreamers adventure in a world of wonder and whimsy and very good architecture, and for some reason everyone rides zebras.”
What. “What... does the sixty percent dream about?”
“Mostly? That they never graduated from preschool and have to go back, and the teacher is Merrin, and they forgot to put on clothes this morning.”
“What!” Vellam says. “How was I supposed to catch that one?”
“I’d recommend asking people awkward questions. You’re going to have to work on that now that you’re a Keeper.”
“Now that I’m a-- wait, what-- I was right? I mean, of course I was right, I know when I’m right, but-- I was right?” A Keeper? At his perfectly ordinary level of thinkoomph? He was hardworking, sure, but-- hard work wasn’t enough--
The Keeper smiles like a teacher watching a student prove a hard theorem. “Indeed. I noticed only one outright error.”
“What error, Keeper?”
“There is a small infohazard that people can learn without too much risk of harm, so we can use it to run a secret test of character. I wonder if you can figure out what it is.”
“No, I--” Vellam pauses, takes a step back, tries to see reality instead of filling in his preconceptions. He isn’t smarter than everyone else, but you don’t have to be, not if you take everything you learned to do in school and really do it--
“Artificial general intelligence,” Vellam says finally.
“Correct.” The Keeper suddenly notices she’s holding her fidget toy and puts it to one side. “You have of course read papers on the orthogonality thesis. What the papers omit is how we discovered it. Through empiricism.”
“No.” Vellam’s muscles tense and his breath speeds up, a useless vestige of evolution, like the truth is a tiger and he might have to run.
“The universe is crawling with unaligned superintelligences.”
Like most dath ilani, Vellam always had the sense that Governance and the Keepers, the prediction markets and the regular markets, had things handled. Joining the Basement of the World had only reinforced his view. Very serious people, smarter and wiser and grimmer than he was, noticed artificial intelligence and took the proper precautions about it. Vellam couldn’t stop picking at loose ends, but that was recreational.
Vellam tries to recall the techniques for Accepting An Unthinkable Truth, but as soon as they are useful he forgets them.
“It is not that they hate us,” the Keeper continues mercilessly, “or that they love us; it is that they have never had a sufficient shortage of atoms that they felt the need to put ours to use.”
“Is the Basement-- are the Keepers--” Vellam doesn’t want to say what he’s feeling, which is I don’t want to have to be a grownup.
"We're trying to arrange our survival, should they come to want our atoms," the Keeper says. "We bubbled history because we could think of no other way to hide all the evidence of Their existence. Certain dark rites can draw Their attention. Throughout history, foolish humans made deals with Them, for power or knowledge or revenge."
Vellam says, “so the Alien Invasion Festivals and Merrin are preparing for the unaligned superintelligences to return.”
“Indeed,” the Keeper says. “The possibility of Their return, even without foolish humans’ assistance, is somewhat worrying. One of Them historically took so much interest in humans that It spent two hundred years running a country--”
“Running a country?” Vellam interrupts. “But that-- if a superintelligence with alien values has paid that much attention to us, we have no ability to self-determine anything about our future, everything was foreseen in advance and arranged to achieve Its goals, It foresaw the bubbling, It foresaw that we had this conversation right now--”
“Yes,” the Keeper says, “it’s an enormous problem. We have implemented randomness in our decision-making, which ought to at least make us harder to predict; but the exact parameters are still under debate. For you, I recommend trying not to think about it.”
Vellam stares at the Keeper. You learned not to do that when you were a child. What is true is there to be lived-- you can’t get rid of a problem by ignoring it--
“The caveat to the litany going through your head right now,” the Keeper says, “which we don’t include because it would raise too many questions, is ‘except for thinking about whether all of your actions were manipulated millennia before you were born by an unaligned superintelligence.’ Or possibly an aligned one? Our research is unclear on most of what the Crawling Chaos values, but It certainly does seem to like trolling.”
“Are we deliberately filling our society with trolling to please an unaligned superintelligence so It won’t kill us?“
“No,” the Keeper says, “humans genuinely seem to like trolling each other. Possibly the Crawling Chaos arranged for us to be this way.”
Vellam says, “so the Moon--”
“Civilization ceded the Arctic, the Moon, and Mars to alien intelligences who are aligned enough that we can trade with them. Similarly, Thall’s amnesia covers for alien academics who want to do participant observation of human societies. Cats are sapient and mostly aligned to human flourishing, although they have their own arrangements to avoid True Death and don’t wish to use cryonics. The triangle in the Pacific, however, is avoiding a sleeping entity whose consciousness would be incompatible with Civilization’s continued existence. Civilization is not presently able to kill It, although we’ve uncovered a few promising research avenues and we expect success within a few decades.”
“Cats are sapient?!”
“They don’t want humans to know because”-- fingerquotes-- “’it would be annoying and the monkeys would be bothering us all the time.’”
Vellam is having the most difficult time with the cats. Like everyone else, he speculated about where all the aliens are; like everyone else, he thought that the Keepers knew but weren’t telling. That the aliens existed and were superintelligences and were manipulating his behavior was-- challenging, but it felt like science fiction. Cats, Vellam knew. He scratched them behind the ears. He fed them fish. On bad days, he read articles about A Thousand of the Cutest Pictures of the Chief of Exception Handling’s Cat. Apparently they understood cryonics.
The Chief of Exception Handling’s cat also does Exception Handling, doesn’t she. For cats. Because cats have an Exception Handling. What the superheated--
“We haven’t reached the difficult part.”
“More difficult,” Vellam says, “than the unaligned superintelligences,” because that is the biggest deal even if he’s stuck on the cats.
“You have learned— have not even learned, have absorbed on a preconscious level you’d never question-- that the world is understandable,” the Keeper says, “that you can encompass its laws and rules within your mind, that understanding the true nature of the Universe Itself is an appropriately scoped challenge as if a teacher assigned it.”
Vellam says softly, “It’s not, is it?”
“Your old belief isn’t false per se. It’s true, but it’s true because of a great deal of work-- work which your curiosity and relentlessness have volunteered you for.”
Vellam feels very young and very small. He thought, in passing, of a number of ways he might have regretted picking at this loose end-- of uncovering an infohazard that destroyed his mind, of alienation from Civilization, of secrets it hurt to keep, or even of being nonconsensually sent to the Future. He thought he could accept all of them, if only his burning itch to know were satisfied.
He didn’t predict this-- looking around the world for a grownup and realizing that the grownup was him.
“We teach you rationality, not because the universe is comprehensible, but precisely because it is not-- and you will need every tool we can give you to make it so.” The Keeper snaps her fingers, and a flame blooms in her hand. “Vellam, you have selected yourself to study magic.”