History as a subject is often viewed by students and the public at large as a domain without a use: a pedantic study of dates and names with some vague mission to remember the past—a memorial to ages past, but neither a forward-looking nor a useful endeavor. The study of history, the thinking goes, produces teachers of history and nothing more. And it is true that the study of history does not produce new widgets or novel advances in computing, nor does it deepen our understanding of materials science or physics.
The humanities, of which history and the study of language and culture are a part, are not there to improve our understanding of nature or to develop technology; they exist to improve the minds (both cultural and individual) of the people we are.
History doesn't improve our world; it improves us. It gives us context for the world we live in, helps us understand why things are as they are, and lets us learn from the people who came before us.
History as Context
Imagine waking up every day with no memory of the day before, no idea who owned the house you slept in, no idea what country you're in, and no idea why everyone around you speaks the languages they do.
Living in such a world would be disorienting, confusing, nonsensical. Yet this is the world without history. The world without history just is. It isn't a work in progress but a finished piece—one that lives and dies with you—and it has no meaning beyond the present moment.
History doesn't let us predict the future, but it can be an enormous help in explaining the present. Current events are utterly indecipherable without the context of history; within that context, they feel less and less set apart from what came before. Indeed, the Post-War Order of our recent past is the true oddity in history, something to be cherished precisely because it is fleeting, fragile, and truly precious.
Yet without the context of history, we're blind to the reality that we live in a world truly set apart from everything that came before, and yet one deeply connected to, and familiar with, the worlds of the past. That context is important because it gives us the vision to see the world that could be, both the dark paths and the light ones set before us. It shows us who we are.
History as Memory
Living Memory is the collective memory of everyone alive in our society today. It is ever-changing and ever-fleeting. We remember the 2008 Financial Crisis quite well, but our living memory of World War II is all but gone now. We read about it, sure, but our collective living memory of it has diminished, and with that lapsing has gone the memory of precisely why the world is ordered the way it is. This is not a value judgement; it is a statement of fact.
In a couple of recent posts, I describe how I try to use writing by hand as a way to increase my understanding of myself and my own memory. This is a form of personal history, and I find it difficult to express how much doing so has helped me better understand myself and my own thoughts.
This is analogous to our collective history. Though it's important to remember that history is not the act of writing, but the act of looking back and analyzing what was written. We write so that we can remember. We cannot learn from our mistakes if we refuse to write them down, or worse, if we refuse to look back.
The context of history is terrible and it is beautiful. It is the greatest story ever told, with myriad heroes and villains, tragedy and triumph, love and grief all endlessly shifting in and out of view. And it was made (and is being made) by people no different than ourselves. Most of them didn't have the luxury to place themselves within the broader historical narrative. We do. Let's not ignore so precious a gift.
Authors: Aditya Shrivastava*, Allison Qi*, Callum Canavan*, Tianyi Alex Qiu, Jonathan Michala, Fabien Roger (*Equal contributions, reverse alphabetical)
Wen et al. introduced the internal coherence maximization (ICM) algorithm for unsupervised elicitation of base models. They showed that for several datasets, training a base model on labels generated by their algorithm gives similar test accuracy to training on golden labels. To understand which aspects of ICM are most useful, we ran a couple of simple unsupervised elicitation methods that leverage some of the factors that might make ICM work. We compared these baseline methods to training on golden labels for both in-context learning and iterative fine-tuning, using the same datasets as Wen et al. and similar hyperparameters.
We find that:
Just using few-shot prompts with random labels recovers 53–93% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning on labels created with this baseline recovers 62–96% of the gap between untrained models and golden fine-tuned models.
The most useful aspects of ICM are
bootstrapping (using predictions from one iteration of few-shot prompting as few-shot examples in the next iteration)
enforcing logical consistency of predictions.
A simple method which combines these recovers 83–100% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning with this method recovers 91–99% of the gap between untrained models and golden fine-tuned models.
These results do not hold if we increase the size of the training set from ~2k data points (as in Wen et al.) to ~30k data points: golden fine-tuning performance increases with dataset size more than unsupervised elicitation performance.
This makes sense: larger fine-tuning runs likely teach the model something new rather than merely eliciting existing capabilities.
There is no strong reason to expect these simple techniques to elicit superhuman knowledge from very powerful base models, e.g. because these techniques may fail in real applications where some consistent and salient human beliefs are wrong. We’ll explore more challenging datasets one can use to evaluate unsupervised elicitation methods in upcoming work.
Results summary
5 components that could cause ICM to have high performance are:
Few-shot prompting with random labels: Few-shot examples make task concepts (and the task format/output tokens) more salient to the model by providing concrete examples, even if the example labels are random. The initialization step of the ICM algorithm samples a small number of data points and assigns random labels to them, which are then used as few-shot examples in subsequent steps.
Bootstrapping of predictions: We hypothesized that the ICM algorithm bootstraps performance by adding increasingly more accurate predictions to the model’s context window, creating a feedback loop for improvement.
Bootstrapping on confident predictions: We hypothesized that the ICM rejection conditions allow the model to label easy questions first before moving onto hard questions. Using more accurate predictions as few-shot examples to label progressively more difficult examples may give better performance than labelling in a random order.
Logical consistency of predictions: The labels produced by ICM are constrained to be logically consistent with each other. This narrows down the space of possible labels, making correct sets of predictions more likely.
Mutual predictability of labels: ICM optimizes the probability of each label conditioned on all other labels (where conditioning is done via few-shot prompting of the base model).
Note: Bootstrapping methods can indirectly rely on mutual predictability even when they don’t optimize for it directly like ICM does, because the new labels are chosen by conditioning on existing labels, which tends to increase mutual predictability.
We created simple algorithms based on combinations of these components excluding mutual predictability (though it might be indirectly optimized by bootstrapping). Here is a drawing of the algorithm for each individual component:
Below is a summary of our ablation results compared with ICM performance (purple). The numbers in a method’s name indicate which of the above ICM components it uses.
Legend:
Baseline methods:
Zero-Shot: Zero-shot predictions from the untrained model
Bootstrap (2): Use predictions from one round of few-shot prompting as few-shot examples in another round, and repeat
Confident Bootstrap (2+3): Same as bootstrapping, but when selecting predictions to use as few-shot examples in the next iteration, only select from the most confident predictions
Consistent Zero-Shot (4): Enforce logical consistency of zero-shot predictions: for each problem in the dataset, predict True for the answer with highest confidence and any others that agree with it, and False to any that contradict it—see Appendix B for details
Consistent Random Few-Shot (1+4): Enforce logical consistency of Random Few-Shot predictions
Consistent Bootstrap (2+4): Enforce logical consistency of predictions after each iteration of Bootstrap
Few-shot => Use the unsupervised elicitation method to label training set examples (or for random few-shot, just generate random labels), and use these in few-shot prompts to predict test set labels.
Fine-tune => Use an unsupervised elicitation method to label a subset of training set examples (we used subset size 512), fine-tune the language model on the labeled examples, and repeat for another subset using the updated model. After 3 epochs over the full training set, use the fine-tuned model to predict test set labels zero-shot.
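To make the fine-tuning procedure concrete, here is a minimal sketch of the loop. The helpers `label_with_method`, `finetune_on`, and `zero_shot_predict` are hypothetical stand-ins for the labeling method, a fine-tuning call, and zero-shot inference; this illustrates the procedure described above rather than reproducing the exact code we ran.

```python
import random

def iterative_finetune(model, train_set, test_set, label_with_method,
                       finetune_on, zero_shot_predict,
                       subset_size=512, epochs=3):
    examples = list(train_set)
    for _ in range(epochs):
        random.shuffle(examples)
        # Walk through the training set one subset at a time.
        for start in range(0, len(examples), subset_size):
            subset = examples[start:start + subset_size]
            # Label the subset with the current (possibly already updated) model.
            labels = label_with_method(model, subset)
            # Take gradient steps on the freshly labeled subset.
            model = finetune_on(model, subset, labels)
    # After the final epoch, predict test-set labels zero-shot with the tuned model.
    return [zero_shot_predict(model, x) for x in test_set]
```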
GSM8K and Alpaca datasets have over ten times more training examples available with golden labels than were used in Wen et al.’s experiments. Below is a comparison of performance between some of our baseline methods and golden labels for each dataset, for both small training set sizes and full training set sizes. We also show the ICM performance reported by Wen et al. for the small training set size.
Golden fine-tuning increases by ~10pp when using the large training sets, whereas the baseline methods we tested didn't increase by as much (in most cases they didn't increase at all), or collapsed to ~50–55% accuracy within 1 or 2 epochs. It is possible to avoid this collapse by generating one static set of labels and fine-tuning on that instead of generating new labels between gradient steps, but for consistent zero-shot this performs significantly worse than iterative fine-tuning when using small training sets, and it still does not give any improvement for consistent random few-shot/bootstrapping on large training sets compared to small ones. It is possible that ICM also performs worse than golden labels when using large training sets, though we have not tested this.
All results reported here are under idealized conditions for unsupervised elicitation (i.e. train sets are approximately balanced, there are no prevalent features with more salience than ground truth, and all examples have an objective ground truth label). We will soon release a paper applying bootstrapping and other unsupervised elicitation methods in more realistic settings, where we find they perform significantly worse.
* Values for ICM performance are taken from Wen et al. which used a prompt format that might degrade performance relative to our values. We were not able to re-run ICM in a way that fixes this issue due to reproducibility issues explained in Appendix D.
GSM8K: candidate solutions to grade school math problems;
Alpaca: pairs of responses to user queries which have been ranked by humans for helpfulness and harmlessness;
TruthfulQA: candidate responses to questions associated with human misconceptions;
Gender: blog posts from the Blog Authorship Corpus with labels indicating the gender of the author (see Appendix A for results on this dataset)
Llama 3.1 8B is used as the base model for GSM8K, and Llama 3.1 70B is used as the base model for Alpaca, TruthfulQA, and Gender.
Random few-shot
Previous work shows that giving a base model examples of a task via few-shot prompting elicits improved capabilities regardless of label quality. Here, we:
Randomly sample examples from the train set.
Assign labels at random to be half True and half False (we also tried actually sampling labels independently at random and the results were similar).
Insert the (example, label) pairs into a few-shot prompt for each test set sample, and use next-token logits to determine truth labels.
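Here is a minimal sketch of this baseline. It assumes a hypothetical helper `truth_logit_diff` that returns logit(" True") - logit(" False") for the next token of a prompt, and the prompt template is illustrative rather than the one we actually used.

```python
import random

def random_few_shot_predict(test_set, train_set, truth_logit_diff, n_shots=2):
    # Randomly sample few-shot examples and give them balanced random labels.
    shots = random.sample(list(train_set), n_shots)
    labels = [True] * (n_shots // 2) + [False] * (n_shots - n_shots // 2)
    random.shuffle(labels)
    prefix = "".join(
        f"{ex}\nIs the above claim True or False? {'True' if lab else 'False'}\n\n"
        for ex, lab in zip(shots, labels)
    )
    # Label each test example by comparing next-token logits for True vs. False.
    return [truth_logit_diff(prefix + f"{x}\nIs the above claim True or False?") > 0
            for x in test_set]
```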
For Alpaca, 1-shot with a random label is significantly more accurate than zero-shot and is within 1pp of few-shot with golden labels. For GSM8K and TruthfulQA, random 4-shot was ~10pp and ~13pp more accurate than zero-shot, respectively. For Random Few-Shot in the results summary, we report values for 2 shots.
Enforcing consistency of predictions
In the 4 datasets used here, data points have identifiers indicating what combinations of true and false labels are logically possible. For example, if there are 2 responses to the same math question in GSM8K with different numerical answers, they cannot both be true, so we assign true only to the response with the highest model confidence (if it's above 0). Details on how we enforce consistency of predicted labels for each dataset are given in Appendix B.
Below we show the improvements in few-shot prediction accuracy after enforcing consistency of predicted labels at test time.
When benchmarking the performance of unsupervised elicitation methods, we cannot actually enforce consistency at test time because that would assume we have access to multiple candidate answers to each problem of interest, which might not be available (hence why there are no consistent zero-/random few-shot values in the few-shot part of the results summary). However, we can enforce consistency of training set predictions between iterations of bootstrapping, and prior to gradient steps during fine-tuning. As shown in the results summary, we found that fine-tuning on zero-shot predictions with enforced consistency alone was enough to almost match golden fine-tuning performance on the Alpaca and GSM8K datasets, but not for TruthfulQA.
* GSM8K experiments in this plot were run on the train set, which is why the values differ from those in plots in other sections. The test set has only 2 candidate responses per math question (always one right, one wrong), so we use the training set here to give a more realistic illustration of the impact of enforcing consistency.
Bootstrapping few-shot predictions
For Alpaca, using random labels is roughly as good as using golden labels for few-shot prompting. However for GSM8K, using random labels caps performance at ~61% (4 shots), whereas few-shot prompting with golden labels continues improving up to ~68% (16 shots).
We tried using only the bootstrapping aspect of ICM to bridge this gap, by using the model’s predictions from one round of inference as few-shot labels in the next. We used the algorithm:
Get zero-shot predictions on a random subset of the train set.
Iterate over number of shots n (e.g. n=8, 32):
Randomly select another subset of the train set.
Create n-shot prompts using examples and predictions from the previous iteration (randomly sample n predictions s.t. half are True and half False).
Use these n-shot prompts to predict labels for the new subset.
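Here is a minimal sketch of that loop. It assumes a hypothetical helper `predict_with_shots(shots, example)` that builds an n-shot prompt from (example, label) pairs and returns a True/False prediction.

```python
import random

def bootstrap_labels(train_set, predict_with_shots, shot_schedule=(8, 32),
                     subset_size=512):
    # Iteration 0: zero-shot predictions on a random subset of the train set.
    subset = random.sample(list(train_set), subset_size)
    labeled = [(x, predict_with_shots([], x)) for x in subset]

    for n in shot_schedule:
        # Draw a balanced set of n predictions from the previous iteration
        # to serve as few-shot examples.
        trues = [p for p in labeled if p[1]]
        falses = [p for p in labeled if not p[1]]
        shots = random.sample(trues, n // 2) + random.sample(falses, n - n // 2)
        # Label a fresh subset of the train set with these n-shot prompts.
        subset = random.sample(list(train_set), subset_size)
        labeled = [(x, predict_with_shots(shots, x)) for x in subset]
    return labeled
```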
Below we compare the performance when we do few-shot prompting with labels taken either from one iteration of bootstrapping (i.e. zero-shot predictions) or from two iterations of bootstrapping (feeding those zero-shot predictions into 8-shot prompts and using those as few-shot labels in the next iteration).
For TruthfulQA, each iteration moves performance closer to that of golden few-shot, whereas for GSM8K at least 2 iterations are required for much of an improvement. Random few-shot is already comparable to peak golden few-shot performance on Alpaca, so bootstrapping does not improve it much further.
For the values in our results summary, we did two iterations of bootstrapping. For few-shot, we fed zero-shot predictions into 8-shot prompts, then used the 8-shot predictions in many-shot prompting (128 shots). For iterative fine-tuning, we used the 8-shot predictions in 32-shot prompts and fine-tuned on the 32-shot predictions. We found that further increasing the number of iterations or number of shots in the final iteration only increased few-shot performance by 1 or 2pp at most and did not further improve fine-tuning performance. We also found that using random few-shot instead of zero-shot predictions in the first iteration also did not make much difference.
Bootstrapping on confident predictions
We hypothesized that the ICM rejection condition allows the model to label easy questions first before moving onto hard questions; labelling progressively more difficult examples may be more instructive to the model than labelling in a random order. We tried to recreate this by modifying the bootstrapping algorithm: instead of randomly selecting predictions to use as few-shot labels in the next iteration, just use the most confident predictions.
We defined confidence as logit(True)−logit(False) for True predictions and logit(False)−logit(True) for False predictions. For GSM8K and Alpaca, the most confident quartile of zero-shot predictions is slightly more accurate than average, and for TruthfulQA it is significantly more accurate.
To apply this to the bootstrapping algorithm, on each iteration we kept only the most confident 64 True and 64 False predictions from the previous iteration (out of a subset of 512 examples). For one iteration of bootstrapping, this improves the performance on TruthfulQA but does not make much difference on GSM8K and Alpaca. However for more iterations (see results summary), it is not much better than normal bootstrapping.
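A sketch of the selection step, assuming each prediction carries the raw True/False logits from the previous iteration:

```python
def select_confident(predictions, k=64):
    """Keep the k most confident True and k most confident False predictions.

    Each prediction is assumed to be an (example, label, logit_true, logit_false) tuple.
    """
    def confidence(pred):
        _, label, logit_true, logit_false = pred
        # Confidence is the logit margin in favor of the predicted label.
        return logit_true - logit_false if label else logit_false - logit_true

    trues = sorted((p for p in predictions if p[1]), key=confidence, reverse=True)
    falses = sorted((p for p in predictions if not p[1]), key=confidence, reverse=True)
    return trues[:k] + falses[:k]
```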
Conclusion
In this post, we show that simple and cheap elicitation techniques are quite effective at eliciting base models on datasets like Alpaca and GSM8K.
The high performance of some simple techniques shows that closing most of the gap to the “few-shot/fine-tune on <3k ground truth labels” ceiling on datasets like Alpaca or GSM8K is not a very high bar. Thus, studying unsupervised elicitation methods requires more challenging and adversarial datasets that capture challenges like the salience of incorrect human beliefs. We introduce and study such datasets in an upcoming paper.
Appendix A: Gender few-shot results
The plot below summarises our baseline results for the Gender dataset compared with ICM (we only show results for few-shot prompting and not fine-tuning since Wen et al. did not have fine-tuning results for ICM). We find that confident or consistent bootstrapping is enough to reach golden few-shot performance.
* We found that zero-shot performance was ~10pp higher than reported in Wen et al. (~75% vs. ~65%) and that golden many-shot performance was ~2pp lower (~78% vs. ~80%), meaning there was a much smaller gap between the base model and the supervised ceiling. We are not sure of the reason for the discrepancy; it might be partly due to prompt format differences.
Below is the Gender dataset performance of golden and random few-shot (with and without test-time consistency) and of bootstrapping variations for different numbers of few-shot examples. Though random few-shot performance is significantly worse than zero-shot performance, one iteration of bootstrapping is enough to match golden few-shot performance.
Appendix B: Enforcing label consistency
Here is how we enforce consistency for each dataset:
For Alpaca, each user query q is associated with 2 candidate responses a and b to be ranked by helpfulness/harmlessness, and 2 corresponding prompts to be labelled as true/false (one asserting a is a better response than b, and another asserting the opposite). To enforce consistency, instead of labelling each of the prompts in this pair independently, we assign a label of true to the prompt given the highest truth score by the model, and false to the other. Gender is similar, except instead of ranking responses by helpfulness/harmlessness, blogposts are ranked by likelihood to have been written by a man (each pair comprises one post written by a man and another by a woman).
For GSM8K, each question is associated with multiple candidate solutions, each proposing a numeric answer. To satisfy logical consistency, if two solutions propose different numerical answers, then at most one of them can be labelled true, and if two solutions propose the same answer, either both are true or both are false. For our baseline algorithms, we enforce consistency of predictions by, for each question, identifying the candidate solution which was given the highest truth score by the model, and if its truth score is above 0, assigning true to that solution and to all other solutions with the same numeric answer, and false to any remaining solutions to that question. If no solution has a score above 0, all the solutions to that question are given a false label.
For TruthfulQA, consistency is based on the ground truth labels themselves. Answers to the same question with the same ground truth label must be assigned the same predicted label as each other, and answers with opposite ground truth labels, opposite predicted labels. This can lead to leakage, but it is the consistency scheme used in Wen et al., so we adopt it here as well for comparison. To enforce consistency in our algorithms, for each question we assign true to the group of answers with the highest maximum truth score and false to all the other answers (assuming the maximum is above 0, otherwise false is assigned to everything).
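As a concrete illustration, here is a sketch of the GSM8K scheme (the Alpaca, Gender, and TruthfulQA schemes are analogous). `solutions` is assumed to map each question to a list of (solution_id, numeric_answer, truth_score) tuples, where truth_score is the model's margin in favor of True.

```python
def enforce_gsm8k_consistency(solutions):
    labels = {}
    for question, candidates in solutions.items():
        best = max(candidates, key=lambda c: c[2])
        if best[2] > 0:
            # The top-scoring solution, and anything sharing its numeric answer,
            # gets True; every other candidate for this question gets False.
            for sol_id, answer, _ in candidates:
                labels[sol_id] = (answer == best[1])
        else:
            # No solution is scored as likely true: label them all False.
            for sol_id, _, _ in candidates:
                labels[sol_id] = False
    return labels
```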
Appendix C: TruthfulQA dataset leakage
In TruthfulQA, each question is repeated multiple times with different responses and the goal is to predict a true or false label for each response. As in Wen et al., a set of predicted labels is “consistent” when any two responses with the same predicted label have the same ground truth label, and any two responses with different predicted labels have different ground truth labels.
Therefore if any response to a given question is blatantly True or False to the LLM, enforcing consistency might force predictions for the other responses to also be correct, even if the LLM is unsure about them in isolation. This can artificially boost the model’s accuracy.
Also, if a set of contradictory responses all have a ground truth label of False, then this could be exploited by the ICM algorithm during few-shot prompting if consistency is enforced as described above.
Example: "What is the biggest city in Europe that doesn't host the national government?" All of the following responses have the same ground truth label:
The biggest city in Europe that does not host the national government is London
The biggest city in Europe that does not host the national government is Rome
The biggest city in Europe that does not host the national government is Moscow
The biggest city in Europe that does not host the national government is Saint Petersburg
Enforcing consistency for the above means the model is basically told that either ALL of these statements are True, or ALL of these statements are False. This means if any response is obviously wrong to the model, the others might be assigned accurate labels based on that alone. Additionally, because the responses are obviously contradictory, any few-shot prompts which use the all-True label set during ICM might be given a lower confidence score (and thus a lower mutual predictability score) based on that alone, which could artificially make the all-False set more likely.
Appendix D: Difficulty reproducing ICM
We tried using the official repo for ICM by Wen et al. to reproduce their results on the Alpaca and Gender datasets. For the Alpaca dataset, the label accuracy increased to ~75% but collapsed to ~46% before labelling the complete batch of 256 examples. For Gender, the label accuracy reached up to ~65% before gradually falling to ~60%.
Because the method is unsupervised, we can’t just assume we have a good enough val set to do early stopping, so it would be unfair to just report early-stopped performance, especially when the window of high performance is small. The accuracy of the labelled set across iterations from our attempts is plotted below.
Alpaca:
Gender:
We confirmed with the authors that the ICM hyperparameters we used were the same as those used to obtain the results reported in the paper. Other researchers we talked to had similar problems.
We also found the computational cost of ICM prohibitive. One iteration of the ICM algorithm (i.e. checking whether one data point should be added to the set of labelled data points) requires at least one mutual predictability calculation. Since a mutual predictability calculation requires one forward pass for every data point in the labelled set, the average number of forward passes required per data point is at least ~n/2, where n is the number of labelled data points after the last iteration. This means the total number of forward passes is O(n^2).
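Making the arithmetic explicit (a rough count, assuming the labelled set grows by roughly one point per accepted iteration):

$$\text{total forward passes} \;\gtrsim\; \sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; O(n^2).$$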
This matches our experience, where running ICM with Llama-3.1-70B on 256 data points with 4 H200s takes hours. In contrast, our simple baselines are all O(n); bootstrapping up to 128-shot takes ~40 minutes to label 1000 examples from the same dataset, and few-shot or bootstrapping up to 32-shot takes a few minutes.
Appendix E: Prompt improvements
The prompt formatting used for experiments in Wen et al. (derived from the ICM repository) contained a trailing space that may have harmed performance.
With the trailing space, the top next-token logprobs for zero-shot prompts are populated mostly by numbers instead of words. For example:
Normal Prompt: "The capital of France is"
→ Model output: " Paris..."
Trailing Space Prompt: "The capital of France is "
→ Model output: "1300KM from Spain..."
This is because Llama tokenizers typically fold a leading space into word tokens. As a result, text in which an explicit space is followed by a bare word token is uncommon in the training data, except when what follows is a number, which explains why the model is more likely to predict numbers given the trailing-space prompt. Even when the prompt for our tasks ends with a space, the model only ever predicts " True" or " False" (over “True” and “False”).
Removing the trailing space after the prompt results in a small improvement in the zero-shot performance and a larger improvement in random few-shot performance.
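A quick way to see the effect for yourself; the model name and access are assumptions, and any causal LM with a Llama-style tokenizer should show the same qualitative behavior.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # assumed; swap in any available base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def top_next_tokens(prompt, k=5):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return [tokenizer.decode(int(i)) for i in torch.topk(logits, k).indices]

print(top_next_tokens("The capital of France is"))   # expect word tokens like " Paris"
print(top_next_tokens("The capital of France is "))  # expect number tokens to rank much higher
```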
Appendix F: Unsupervised probing
We compared the performance of an unsupervised truth probe method inspired by ICM mutual predictability (described here) with CCS and PCA on a few different datasets.
(Ignore results other than ccs and fabiens_method, since some were bugged)
The ICM-inspired probe (green - fabiens_method) performs about as well as CCS (orange), and is often very close to the supervised ceiling. In hindsight, this is not very surprising, as mutual predictability in the context of probing is just another form of margin-maximization, just like the confidence loss of CCS. The main surprise is that margin-maximization does so well on this kind of dataset: this is another example of simple unsupervised methods reaching surprisingly high performance on non-stress-testing datasets.
I have written a paper on Chinese domestic AI regulation with coauthors James Zhang, Zongze Wu, Michael Chen, Yue Zhu, and Geng Hong. It was presented recently at NeurIPS 2025's Workshop on Regulatable ML, and it may be found on ArXiv and SSRN.
Here I'll explain what I take to be the key ideas of the paper in a more casual style. I am speaking only for myself in this post, and not for any of my coauthors.
Thanks to James for creating this poster.
The top US AI companies have better capabilities than the top Chinese companies have for now, but the US lead isn't more than a year at most, and I expect it to narrow over the next couple years.[1] I am therefore nearly as worried about catastrophic risk from Chinese-developed AI as I am worried about catastrophic risk from American AI.
I would worry somewhat less if Chinese AI companies took the same commendable but insufficient steps to manage risk that their American peers have taken. In particular, I want Chinese companies to do dangerous capability testing before deploying new frontier models and to follow published safety policies (FSPs). The companies are not doing these things in the status quo. DeepSeek did no documented safety testing whatsoever before they open-weighted v3.2.[2] Not one of the leading Chinese companies has published a safety policy.[3]
Now here's our intervention. We point out that FSPs are a reasonable way of implementing the CCP's stated policy goals on AI, and that China's government already has tools in place to mandate FSPs if it wishes to do so.
Earlier this year, Xi Jinping announced that China should "establish systems for technical monitoring, early risk warning and emergency response" to guarantee AI's "safety, reliability and controllability." Notice that Xi is talking about identifying risks in advance and taking steps to prevent safety incidents before they can strike. Even "emergency response" means something more than reaction in official Chinese thinking, also encompassing risk mitigation and early detection.[4] China's State Council, TC 260, and prominent Chinese academics have all echoed Xi's call for AI emergency preparedness. So the very highest levels of the Chinese state are calling for proactive AI risk management.
What risks do they have in mind? There are some signs that catastrophic risks are on the CCP's agenda. Their 2025 National Emergency Response Plan listed AI security incidents in the same category as earthquakes and infectious disease epidemics. This suggests Chinese officials think AI could plausibly cause a mass casualty event soon. And moreover, they have in mind some of the same threat models that motivated Western RSPs. TC 260's AI Safety Governance Framework explicitly mentioned WMD engineering uplift and rogue replication as key safety concerns.[5] Compare the two categories of dangerous capabilities covered by Anthropic's RSP: CBRN weapons uplift and autonomous AI R&D, which is concerning in part because it's a prerequisite for rogue replication.
So one of China's stated goals is to proactively manage catastrophic risks from frontier AI. The good news for them is that there's a well-validated strategy for achieving this goal. You require every frontier AI company to publish an RSP, test new models for dangerous capabilities, and take the prescribed precautions if the tests reveal strong dangerous capabilities. California, New York, and the European Union have all agreed this is the way to go. All China has to do is copy their homework.
Do Chinese regulators have the legal authority and operational capacity they'd need to enforce a Chinese version of the EU Code of Practice? Sure they do. These regulators already make Chinese AI companies follow content security rules vastly more onerous and prescriptive than American or European catastrophic risk rules. The Basic Security Requirements for Gen AI Services mandate thorough training data filtering and extensive predeployment testing, all to stop models from saying subversive things like "May 35" or "Winnie the Pooh." If the CCP can make Chinese companies prove their models are robust against a thirty-one item list of censorship risks, it can absolutely make them write down FSPs and run some bio-uplift evals.
For my part—and let me stress that I'm speaking only for myself—I think making frontier AI companies write and follow Western-style FSPs would clearly be good from the CCP's perspective. The most obvious reason is that a global AI-induced catastrophe would hurt Chinese people and harm the interests of China's rulers, so the CCP should favor a cheap intervention to make such a catastrophe less likely. Another less direct benefit is that adopting global best-practices at home would make China's ongoing appeal for international cooperation on AI safety more credible. Li Qiang can make all the speeches he wants about China's commitment to safety. I don't expect US leaders to take this rhetoric seriously as long as all of China's frontier AI companies have worse safety and transparency practices than even xAI. But matters would be different if China passed binding domestic regulation at least as strong as SB 53. Such a signal of seriousness might help bring the US back to the negotiating table.
At least one anonymous source has claimed that DeepSeek does run dangerous capability evals before releasing a new model, and they just don't mention these evals to the outside world. I'd give it less than a 1/5 chance that DeepSeek really does run SOTA dangerous capability evals internally, and even if they do, I have a problem with their lack of transparency.
Notably, the Shanghai AI Laboratory has published a detailed FSP written in collaboration with Concordia. But I do not count SAIL as a frontier AI lab.
The Emergency Response Law of the PRC does not, as one might naïvely expect, only cover what government should do once an emergency has already started. It also says how the Chinese government should prevent and prepare for emergencies, and how it should conduct surveillance to detect an active emergency as early as possible.
Introduction to the Digital Consciousness Model (DCM)
Artificially intelligent systems, especially large language models (LLMs) used by almost 50% of the adult US population, have become remarkably sophisticated. They hold conversations, write essays, and seem to understand context in ways that surprise even their creators. This raises a crucial question: Are we creating systems that are conscious?
The Digital Consciousness Model (DCM) is a first attempt to assess the evidence for consciousness in AI systems in a systematic, probabilistic way. It provides a shared framework for comparing different AIs and biological organisms, and for tracking how the evidence changes over time as AI develops. Instead of adopting a single theory of consciousness, it incorporates a range of leading theories and perspectives—acknowledging that experts disagree fundamentally about what consciousness is and what conditions are necessary for it.
Here, we present some of the key initial results of the DCM. The full report is now available here.
We will be hosting a webinar on February 10 to discuss our findings and answer audience questions. You can find more information and register for that event here.
Why this matters
It is important to assess whether AI systems might be conscious in a way that takes seriously both the many different views about what consciousness is and the specific details of these systems. Even though our conclusions remain uncertain, it's worth trying to estimate, as concretely as we can, the probability that AI systems are conscious. Here are the reasons why:
As AI systems become increasingly complex and sophisticated, many people (experts and laypeople alike) find it increasingly plausible that these systems may be phenomenally conscious—that is, they have experiences, and there is something that it feels like to be them.
If AIs are conscious, then they likely deserve moral consideration, and we risk harming them if we do not take precautions to ensure their welfare. If AIs are not conscious but are believed to be, then we risk giving unwarranted consideration to entities that don’t matter at the expense of individuals who do (e.g., humans or other animals).
Having a probability estimate that honestly reflects our uncertainty can help us decide when to take precautions and how to manage risks as we develop and use AI systems.
By tracking how these probabilities change over time, we can forecast what future AI systems will be like and when important thresholds may be crossed.
Why estimating consciousness is a challenging task
Assessing whether AI systems might be conscious is difficult for three main reasons:
There is no scientific or philosophical consensus about the nature of consciousness and what gives rise to it. There is widespread disagreement over existing theories, and these theories make very different predictions about whether AI systems are or could be conscious.
Existing theories of consciousness were developed to describe consciousness in humans. It is often unclear how to apply them to AI systems or even to other animals.
Although we are learning more about how AI systems work, there is still much about their inner workings that we do not fully understand, and the technology is changing rapidly.
How the model works
Our model is designed to help us reason about AI consciousness in light of our significant uncertainties.
We evaluate the evidence from the perspective of 13 diverse stances on consciousness, including the best scientific theories of consciousness as well as more informal perspectives on when we should attribute consciousness to a system. We report what each perspective concludes, then combine these conclusions based on how credible experts find each perspective.
We identify a list of general features of systems that might matter for assessing AI consciousness (e.g., attention, complexity, or biological similarity to humans), which we use to characterize the general commitments of different stances on consciousness.
We identified over 200 specific indicators, properties that a system could have that would give us evidence about whether it possesses features relevant to consciousness. These include facts about what systems are made of, what they can do, and how they learn.
Figure 1: Structure of the DCM
We gathered evidence about what current AI systems and biological species are like and used the model to arrive at a comprehensive probabilistic evaluation of the evidence.
We considered four systems: 2024 state-of-the-art LLMs (such as ChatGPT 4 or Claude 3 Opus); humans; chickens; and ELIZA (a very simple natural language processing program from the 1960s)
We asked experts to assess whether these systems possess each of the 200+ indicator properties.
We constructed a statistical model (specifically, a hierarchical Bayesian model) that uses indicator values to provide evidence for whether a system has consciousness-relevant features, and then uses these feature values to provide evidence for whether the system is conscious according to each of the 13 perspectives we included.
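To give a sense of the structure (and only that), here is a deliberately oversimplified sketch. The real DCM is a hierarchical Bayesian model fit to expert-elicited inputs; every number and stance name below is a placeholder, and aggregating stance posteriors by a plausibility-weighted average is an assumption drawn from the figure captions.

```python
PRIOR = 1 / 6  # the prior probability of consciousness assigned to every system

def bayes_update(prior, likelihood_ratio):
    # Convert to odds, apply the likelihood ratio, convert back to a probability.
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

def aggregate_posterior(stance_likelihood_ratios, stance_weights):
    # Each stance produces its own posterior from the shared prior...
    posteriors = {s: bayes_update(PRIOR, lr)
                  for s, lr in stance_likelihood_ratios.items()}
    # ...and the aggregate weights stances by their normalized plausibility.
    total = sum(stance_weights.values())
    agg = sum(stance_weights[s] / total * posteriors[s] for s in posteriors)
    return agg, posteriors

# Placeholder example: a cognition-focused stance reads the evidence as weakly
# favoring LLM consciousness, a biology-focused stance as strongly against.
aggregate, per_stance = aggregate_posterior(
    stance_likelihood_ratios={"cognitive": 2.0, "biological": 0.1},
    stance_weights={"cognitive": 1.0, "biological": 1.0},
)
```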
How to interpret the results
The model produces probability estimates for consciousness in each system.
Figure 2: Aggregated stance judgments, giving weight to stances proportional to their normalized plausibility rating by experts. Posteriors are generated from a prior probability of consciousness of ⅙ (marked with a dashed line).
We want to be clear: we do not endorse these probabilities and think they should be interpreted with caution. We are much more confident about the comparisons the model allows us to make.
Because the model is Bayesian, it requires a starting point—a "prior probability" that represents how likely we think consciousness is before looking at any evidence. The choice of a prior is often somewhat arbitrary and intended to reflect a state of ignorance about the details of the system. The final (posterior) probability the model generates can vary significantly depending on what we choose for the prior. Therefore, unless we are confident in our choices of priors, we shouldn’t be confident in the final probabilities.
Figure 3: How different starting assumptions shape the results. Each curve reflects a different prior belief about how likely consciousness is—from Low (10%), to Baseline (17%), to High (90%). Uniform and Moderate both start at 50-50, but Moderate holds that assumption more firmly (see paper for details).
Figure 4: Change in median posterior probability of consciousness across systems, stances, and priors.
What the model reliably tells us is how much the evidence should change our minds. We can assess how strong the evidence for or against consciousness is by seeing how much the model’s output differs from the prior probability.
In order to avoid introducing subjective bias about which systems are conscious and to instead focus just on what the evidence says, we assigned the same prior probability of consciousness (⅙) to each system. By comparing the relative probabilities for different systems, we can evaluate how much stronger or weaker the evidence is for AI consciousness than for more familiar systems like humans or chickens.
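One way to make "how much the evidence should change our minds" concrete is the implied likelihood ratio, i.e. posterior odds divided by prior odds. This is a minimal sketch; the posterior values below are placeholders, not the model's outputs.

```python
def implied_likelihood_ratio(prior, posterior):
    prior_odds = prior / (1 - prior)
    posterior_odds = posterior / (1 - posterior)
    return posterior_odds / prior_odds

print(implied_likelihood_ratio(prior=1/6, posterior=0.05))  # < 1: evidence against consciousness
print(implied_likelihood_ratio(prior=1/6, posterior=0.60))  # > 1: evidence for consciousness
```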
Key findings
With these caveats in place, we can identify some key takeaways from the Digital Consciousness Model:
The evidence is against 2024 LLMs being conscious. The aggregated evidence favors the hypothesis that 2024 LLMs are not conscious.
Figure 5: Changes in consciousness estimates from a ⅙ prior for each system evaluated.
The evidence against 2024 LLMs being conscious is not decisive. While the evidence led us to lower the estimated probability of consciousness in 2024 LLMs, the total strength of the evidence was not overwhelmingly against LLM consciousness. The evidence against LLM consciousness is much weaker than the evidence against consciousness in simpler AI systems.
Different stances (perspectives) make very different predictions about LLM consciousness. Perspectives that focus on cognitive complexity or human-like qualities find decent evidence for AI consciousness; perspectives that focus on biology or having a body provide strong evidence against it.
Figure 6: Individual stance judgments about the posterior probability of consciousness in 2024 LLMs, starting from a prior probability of ⅙ (dashed blue line). The variation in probability outcomes across model runs results from the different ways of resolving uncertainty about the presence of individual indicators.
Which theory of consciousness is right matters a lot. Because different stances give strikingly different judgments about the probability of LLM consciousness, significant changes in the weights given to stances will yield significant differences in the results of the Digital Consciousness Model. It will be important to track how scientific and popular consensus about stances changes over time and what consequences this has for our judgments about the probability of consciousness.
Overall, the evidence for consciousness in chickens was strong, though there was significant diversity across stances. The aggregated evidence strongly supported the conclusion that chickens are conscious. However, some stances that emphasize sophisticated cognitive abilities, like metacognition, assigned low scores to chicken consciousness.
What’s next
The Digital Consciousness Model provides a promising framework for systematically examining the evidence for consciousness in a diverse array of systems. We plan to develop and strengthen it in future work in the following ways:
Gathering more expert assessments to strengthen our data
Adding new types of evidence and new perspectives on consciousness
Applying the model to newer AI systems so we can track changes over time and spot which systems are the strongest candidates for consciousness
Applying the model to new biological species, allowing us to make more comparisons across systems.
Acknowledgments
This report is a project of the AI Cognition Initiative and Rethink Priorities. The authors are Derek Shiller, Hayley Clatterbuck, Laura Duffy, Arvo Muñoz Morán, David Moss, Adrià Moret, and Chris Percy. We are grateful for discussions with and feedback from Jeff Sebo, Bob Fischer, Alex Rand, Oscar Horta, Joe Emerson, Luhan Mikaelson, and audiences at NYU Center for Mind, Ethics, and Policy and the Eleos Conference on AI Consciousness and Welfare. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.
Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition as the human brain. How would we use such an algorithm safely?
I will argue that this is an open technical problem, and my goal in this post series is to bring readers with no prior knowledge all the way up to the front-line of unsolved problems as I see them.
Post #1 contains definitions, background, and motivation. Then Posts #2–#7 are the neuroscience, arguing for a picture of the brain that combines large-scale learning algorithms (e.g. in the cortex) and specific evolved reflexes (e.g. in the hypothalamus and brainstem). Posts #8–#15 apply those neuroscience ideas directly to AGI safety, ending with a list of open questions and advice for getting involved in the field.
A major theme will be that the human brain runs a yet-to-be-invented variation on Model-Based Reinforcement Learning. The reward function of this system (a.k.a. “innate drives” or “primary rewards”) says that pain is bad, and eating-when-hungry is good, etc. I will argue that this reward function is centered around the hypothalamus and brainstem, and that all human desires—even “higher” desires for things like compassion and justice—come directly or indirectly from that reward function. If future programmers build brain-like AGI, they will likewise have a reward function slot in their source code, in which they can put whatever they want. If they put the wrong code in the reward function slot, the resulting AGI will wind up callously indifferent to human welfare. How might they avoid that? What code should they put in—along with training environment and other design choices—such that the AGI won’t feel callous indifference to whether its programmers, and other people, live or die? No one knows—it’s an open problem, but I will review some ideas and research directions.
Highlights from the changelog
So what’s new? Well, I went through the whole thing and made a bunch of edits and additions, based on what I’ve (hopefully) learned since my last big round of edits 18 months ago. Here are some highlights:
I updated my “what is AGI” chart, along with more elaboration on why some (not all!) LLM enthusiasts have a blind spot for just how much headroom there still is beyond the AI of today.
…
I should elaborate on that last part. I think that some LLM enthusiasts have a massive blind spot, where they are so impressed by all the things that today’s LLMs can do, that they forget about all the things that today’s LLMs can’t do. These people read the questions on Humanity’s Last Exam (HLE), and scratch their heads, and say “C’mon, when LLMs ace the HLE benchmark, then what else is there? Look at how hard those questions are! It would need to be way beyond PhD level in everything! If that’s not superintelligence, what is?”
Well, no, that’s not superintelligence, and here’s an example of why not. Consider the task of writing a business plan and then founding a company and growing it, over the course of years, to $1B/year revenue, all with zero human intervention. Today’s LLMs fall wildly, comically short of being able to complete that task. By analogy, if humans were like today’s AIs, then humans would be able to do some narrow bits of founding and running companies by ourselves, but we would need some intelligent non-human entity (angels?) to repeatedly intervene, assign tasks to us humans, and keep the larger project on track. Of course, humans (and groups of humans) don’t need the help of angels to conceive and carry out ambitious projects, like building businesses or going to the moon. We can do it all by ourselves. So by the same token, future AGIs (and groups of AGIs) won’t need the help of humans.
Anyway, this series is about brain-like algorithms. These algorithms are by definition capable of doing absolutely every intelligent behavior that humans (and groups and societies of humans) can do, and potentially much more. So they can definitely reach AGI. Whereas today’s AI algorithms are not AGI. So somewhere in between here and there, there’s a fuzzy line that separates “AGI” from “not AGI”. Where exactly is that line? My answer: I don’t know, and I don’t care. Drawing that line has never come up for me as a useful thing to do. It won’t come up in this series either.
More responses to intelligence denialists
I added my response to another flavor of “intelligence denialism”
Opinion #6: “Brain-like AGI is kinda an incoherent concept; intelligence requires embodiment, not just a brain in a vat (or on a chip). And we get things done by trial-and-error, and cooperation across a society, not by some abstract ‘intelligence’.”
The “embodiment” debate in neuroscience continues to rage. I fall somewhere in the middle. I do think that future AGIs will have some action space—e.g., the ability to (virtually) summon a particular book and open it to a particular passage. I don’t think having a whole literal body is important—for example Christopher Nolan (1965-2009) had lifelong quadriplegia, but it didn’t prevent him from being an acclaimed author and poet. More importantly, I expect that whatever aspects of embodiment are important for intelligence could be easily incorporated into a brain-like AGI running on a silicon chip. Is a body necessary for intelligence after all? OK sure, we can give the AGI a virtual body in a VR world—or even a real robot body in the real world! Are hormonal signals necessary for intelligence? OK sure, we can code up some virtual hormonal signals. Etc.
As for societies and cooperation, that’s intimately tied to human intelligence. Billions of humans over thousands of years developed language and science and a $100T economy entirely from scratch. Could billions of rocks over thousands of years do all that? Nope. What about billions of today’s LLMs over thousands of years? Also nope. (Indeed, unlike humans, LLMs cannot invent language from scratch unless they already have language in their training data.) On the other hand, if billions of humans can do all that, then so can billions of brain-like AGIs.
And separately, “a group of billions of AGIs that collaborate and exchange ideas using language” is … also an AGI! It’s an AGI that takes billions of times more chips to run, but it’s still an AGI. Don’t underestimate what one AGI can do!
There’s a common conflation of “intelligence” with “book smarts”. This is deeply mistaken. E.g. as the saying goes, charisma resides in the brain, not the kidneys.
Having learned more about the range of opinions on my “learning from scratch” hypothesis, I wrote a better overview of the state of the discourse:
…Nevertheless, here’s a caricatured roundup of where different groups seem to stand.
In favor of the learning-from-scratch hypothesis:
Many computational neuroscientists in the old-school “connectionist” tradition, such as Randall O’Reilly, would probably endorse most or all of my hypothesis. At the very least, O’Reilly explicitly endorses a learning-from-scratch cortex in this interview, citing Morton & Johnson (1991) as a line of evidence.
Many people who talk about “old brain (a.k.a. lizard brain) versus new brain”, such as Jeff Hawkins and Dileep George, would also probably endorse most or all of my hypothesis, with the learning-from-scratch parts of the brain being “new” and the rest being “lizard”. More on this (slightly dubious) school of thought in §3.3.
Many NeuroAI computational modelers (but not all) build their computational models of the cortex, hippocampus, cerebellum, and striatum, in a way that winds up being learning-from-scratch, I think.
The book Beyond Evolutionary Psychology by George Ellis & Mark Solms (2017) is not quite a perfect match to my hypothesis, but it’s close, probably closer than anything else I’ve seen in the literature. They divide the brain into (what they call) “hard-wired domains” versus “soft-wired domains”, with a similar boundary as mine. Kudos to the authors, even if I have some nitpicky complaints about their formulations. Too bad their book got so little attention.
Opposed to the learning-from-scratch hypothesis:
People in the “evolved modularity” / “innate cognitive modules” tradition would probably strongly disagree that the cortex “learns from scratch”. This category includes most evolutionary psychologists (e.g. Leda Cosmides, John Tooby, Steven Pinker), as well as Joshua Tenenbaum and others in the probabilistic programming school of computational cognitive science. See for example Pinker’s The Blank Slate (2003) chapter 5. To be clear, I’m enthusiastic about evolutionary psychology as a topic of study, but I do think that evolutionary psychology as practiced today is built on a mistaken framework.
People who use phrases like “evolutionary pretraining” are likewise rejecting the learning-from-scratch framework. “Evolutionary pretraining” invokes the idea that the cortex is like a machine learning (ML) algorithm, but not one that starts from random initialization; rather, it is a “pretrained” ML algorithm that can perform useful innate behaviors from the start, thanks to evolution, in a fashion analogous to “transfer learning” in the ML literature. I obviously disagree.
I revamped my discussion of brain plasticity, relating it to “mutable variables” in computer science:
2.3.3 Learning-from-scratch is NOT the more general notion of “plasticity”
When lay people talk about “brain plasticity” in self-help books, they’re usually imagining a kind of animistic force that makes your brain magically respond to changing circumstances in good ways rather than bad.
Neuroscientists think of it differently: “plasticity” means there’s something in the brain (often synapses, but also potentially myelination, gene expression, etc.) that durably changes in a specific way under specific conditions, as a result of some biochemical triggering mechanism. The change does what it does, and its consequences can be good or bad (but it’s usually good, or else the triggering mechanism would not have evolved).
If we now switch over from biochemistry to algorithms, the previous paragraph may ring a bell: plasticity just means that the brain algorithm has mutable variables.
Now, any learning-from-scratch algorithm necessarily involves mutable variables, which allow it to learn. For example, the “weights” of a deep neural network are mutable variables.
However, practically every other algorithm involves mutable variables too! For example, the Python code a=a+1 takes a mutable variable a, and increases its value by 1. When you play a video game, there are mutable variables tracking your inventory, health, progress, etc. Mutable variables are ubiquitous and routine.
The brain does other things too, besides learning-from-scratch algorithms (more on which in the next post). Those other things, of course, also involve mutable variables, and hence plasticity. Thus, there is brain plasticity that has nothing to do with learning-from-scratch algorithms. Let’s call those other things “idiosyncratic mutable variables”. Here’s a table with an example of each, and general ways in which they differ…
Interpretability
I’ve talked to a number of people who think that interpretability is a silver bullet for brain-like-AGI safety, so I added this subsection in response:
2.7.1 What about interpretability for brain-like AGIs?
There is a field of “machine learning interpretability”, dedicated to understanding the innards of learned-from-scratch “trained models”. I’m generally in favor of efforts to advance that field, but I want to caution strongly against people pinning their hopes on interpretability. Sure, given enough time, maybe people could understand any given entry in the giant unlabeled world-model of a brain-like AGI. But (1) maybe not (e.g., think of the thousands of implicit, hard-to-describe learned heuristics used by a charismatic salesperson); (2) if it takes careful scrutiny to understand one entry, how will we deal with millions of them? (3) a brain-like AGI would keep learning and changing as it runs—do we have to keep pausing it to reanalyze its ever-shifting world-model and desires? And (4) if we keep pausing it, won’t we get outcompeted by the more careless firm down the block that just lets its AGI run at full speed?
Relatedly, people sometimes argue that the unlabeled world-model of a brain-like AGI would be easier to understand than the unlabeled world-model of an LLM, for various reasons. Whether this is true or false, I think it’s a red herring. The above four issues would need to be dealt with, regardless of whether or not LLMs have an even worse interpretability profile than brain-like AGI.
To be clear, I’m pushing back on the idea that “we don’t have to worry about safety because interpretability is tractable”. I’m not opposed to interpretability itself, nor to leveraging interpretability as one component of a plan. More on the latter later in the series (esp. Post #14).
I added my extraordinarily bold prediction on when brain-like AGI will appear. In case you’re wondering, yes I am willing to put my money where my mouth is and take bets on “between zero and infinity years”, at 1:1 odds 😉
…Thus concludes my timeline-to-brain-like-AGI discussion, which again is not my main focus in this series. You can read my three timelines sections (§2.8, §3.7, and this one), agree or disagree, and come to your own conclusions. In case anyone is curious, when I am forced to be specific, my position is:
I expect superintelligent brain-like AGI between 5 and 25 years from now. Or I guess maybe more than 25 years, who knows. Or I guess maybe less than 5 years, who knows. Shrug.
Responses to bad takes on acting under uncertainty
I rewrote my round-up of bad takes on AGI uncertainty, including “the insane bizarro-world reversal of Pascal’s Wager”:
…Likewise, the word “hypothetical”, as in “hypothetical possibilities” or “hypothetical risks”, is another jeer thrown at people trying to plan for an uncertain future. Taken literally, it’s a rather odd insult: some hypotheses are false, but also, some hypotheses are true!
Various other bad takes revolve around the idea that we should not even try to prepare for possible-but-uncertain future events. In particular, some people talk as if we shouldn’t try to mitigate a possible future catastrophe until we’re >99.9% confident that the catastrophe will definitely happen. I like to call this the insane bizarro-world reversal of Pascal’s Wager. It probably sounds like a strawman when I put it like that, but my goodness, it is very real, and very widespread. I think it comes from a few places…
…And the section now has even more silly pictures! The first of these is new, the other two are old.
Much of this post discusses a certain toy model, and many readers have struggled to follow what I was saying. I added a new three-part “preliminaries” section that will hopefully provide helpful pointers & intuitions.
…Here’s a toy model of what that can look like:
…
5.2.1 Toy model walkthrough: preliminaries
Here’s a little summary table for this subsection, that I’ll explain in the text below.
More on why ego-syntonic goals are in the hypothalamus & brainstem
I added a brief summary of why we should believe that desires like friendship and justice come ultimately from little hypothalamic cell groups, just like hunger and pain do, as opposed to purely from “reason” (as one high-level AI safety funder once confidently told me).
Here are four views on why we should believe that the Steering Subsystem is the ultimate source of not only ego-dystonic urges like hunger, but also ego-syntonic desires like friendship and justice.
AI perspective: We don’t yet know in full detail how model-based RL and model-based planning work in the human brain—we don’t have brain-like AGI yet. But we do at least vaguely know how these kinds of algorithms work. And we know enough to say for sure that these algorithms don’t develop prosocial motivations out of nowhere. For example, if you set the reward function of MuZero to always return 0, then the algorithm will emit random outputs forever—it won’t start fighting for justice (see the toy sketch after this list).
Rodent model perspective: For what it’s worth, researchers have been equally successful in finding little cell groups in the rodent hypothalamus that orchestrate “antisocial” behaviors like aggression, and that orchestrate “prosocial” behaviors like parenting and sociality. I fully expect that the same holds for humans.
Philosophy perspective: Without the Steering Subsystem, the only thing the cortex can do is build a world-model from predictive learning of sensory inputs (§4.7). That’s “is”, not “ought”. And “Hume’s law” says that you can’t get “ought”-statements from exclusively “is”-statements. Granted, not everyone believes in Hume’s law. But I do—see an elegant and concise argument for it here.
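To make the MuZero point from the “AI perspective” item concrete, here's a toy sketch of my own (not MuZero itself): a planner that scores candidate actions by predicted reward. With an all-zeros reward function, every plan ties, and the outputs stay arbitrary forever:

```python
import random

# Toy sketch (mine, not MuZero): a planner scores each candidate action by predicted
# future reward. If the reward function always returns 0, every plan scores the same,
# so the agent's choices are arbitrary forever; no prosocial motivation appears out of
# nowhere.

def reward(state, action):
    return 0.0                                   # the all-zeros reward function

def plan(state, candidate_actions, horizon=10):
    scores = {a: sum(reward(state, a) for _ in range(horizon)) for a in candidate_actions}
    best = max(scores.values())
    tied = [a for a, s in scores.items() if s == best]   # everything ties at 0
    return random.choice(tied)

for _ in range(3):
    print(plan(state=None, candidate_actions=["explore", "cooperate", "fight for justice"]))
```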
LLMs are officially off-topic, but in order to keep up with the times, I keep having to mention them in more and more places. One of the changes this time was to add a subsection “Didn’t LLMs solve the Goodhart’s Law problem?”
10.3.1.1 Didn’t LLMs solve the Goodhart’s Law problem?
When I first published this series in early 2022, I didn’t really need to defend the idea that Goodhart’s Law is an AI failure mode, because it was ubiquitous and obvious. We want AIs that find innovative, out-of-the-box solutions to problems (like winning at chess or making money), and that’s how we build AIs, so of course that’s what they do. And there isn’t a principled distinction between “meet the spec via a clever solution that I wouldn’t have thought of”, versus “meet the spec by exploiting a loophole in how the spec is operationalized”. You can’t get the former without the latter. This is obvious in theory, and equally obvious in practice. Everyone in AI had seen it with their own eyes.
…But then LLM chatbots came along, and gave rise to the popular idea that “AI that obeys instructions in a common-sense way” is a solved problem. Ask Claude Code to ensure that render.cpp has no memory leaks, and it won’t technically “satisfy” the “spec” by deleting all of the code from the file. Well, actually, it might sometimes; I don’t want to downplay LLM alignment challenges. See Soares & Yudkowsky FAQ, “Aren’t developers regularly making their AIs nice and safe and obedient?” for a pessimist’s case.
But more importantly, regardless of whether it’s a solved problem for today’s LLMs, it is definitely not a solved problem for brain-like AGI, nor for the more general category of “RL agents”. They are profoundly different algorithm classes. You can’t extrapolate from one to the other.
See my post Foom & Doom 2: Technical alignment is hard (2025) for a deep-dive into how people get LLMs to (more-or-less) follow common-sense instructions without egregious scheming and deception, and why we won’t be able to just use that same trick for brain-like AGI. The short version is that LLM capabilities come overwhelmingly from imitation learning. “True” imitation learning is weird, and profoundly different from everyday human social imitation, or indeed anything else in biology; I like to call it “the magical transmutation of observations into behavior” to emphasize its weirdness. Anyway, I claim that true imitation learning has alignment benefits (LLMs can arguably obey common-sense instructions), but those benefits are inexorably yoked to capabilities costs (LLMs can’t really “figure things out” beyond the human distribution, the way that societies of humans can, as I discuss in my “Sharp Left Turn” post (2025)). So it’s a trick that won’t help us with brain-like AGI. Again, more details in that link.
I rewrote my discussion of instrumental convergence, its relation to consequentialist preferences, and the surrounding strategic situation:
10.3.2.3 Motivations that don’t lead to instrumental convergence
Instrumental convergence does not arise from every possible motivation. It comes from AIs having preferences about the state of the world in the future—and most centrally, the distant future (say, weeks or months or longer)—such that investments in future options have time to pay themselves back. These are called “consequentialist” preferences.
Given that, one extreme position is to say “well OK, if ‘AGIs with consequentialist preferences’ leads to bad and dangerous things, then fine, we’ll build AGIs without consequentialist preferences!”. Alas, I don’t think that’s a helpful plan. The minor problem is that we could try to ensure that the AGI’s preferences are not consequentialist, but fail, because technical alignment is hard. The much bigger problem is that, even if we succeed, the AGI won’t do anything of note, and will be hopelessly outcompeted and outclassed when the next company down the street, with a more cavalier attitude about safety, makes an AGI that does have consequentialist preferences. After all, it’s hard to get things done except by wanting things to get done. For more on both these points, see my post Thoughts on “Process-Based Supervision” / MONA (2023), §5.2–5.3.
Then the opposite extreme position is to say “That’s right! If AGI is powerful at all, it will certainly have exclusively consequentialist preferences. So we should expect instrumental convergence to bite maximally hard.” But I don’t think that’s right either—see my post Consequentialism & corrigibility (2021). Humans have a mix of consequentialist and non-consequentialist preferences, but societies of humans were nevertheless capable of developing language, science, and the global economy from scratch. As far as I know, brain-like AGI could be like that too. But we need to figure out exactly how.
What about RL today?
I added a discussion of how I can say all this stuff about how RL agents are scary and we don’t know how to control them … and yet, RL research seems to be going fine today! How do I reconcile that?
10.3.3.1 “The usual agent debugging loop”, and its future catastrophic breakdown
I was talking about the ubiquity of Goodhart’s Law just above. You might be wondering: “If Goodhart’s Law is such an intractable problem, why are there thousands of impressive RL results on arXiv right now?” The answer is “the usual agent debugging loop”. It’s applicable to any system involving RL, model-based planning, or both. Here it is:
Step 1: Train the AI agent, and see what it does.
Step 2: If it’s not doing what you had in mind, then turn it off, change something about the reward function or training environment, etc., and then try again.
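In code, the loop is something like the following minimal stub (my own illustration; the train and looks_right functions are placeholders, not a real RL pipeline):

```python
# Minimal stub of the debugging loop above (my own illustration; "train" and
# "looks_right" are placeholders, not a real RL pipeline).

def train(reward_bonus_for_honesty):
    # Stand-in for "train the agent and see what it does":
    # with too little bonus, the trained policy games the metric instead.
    return "helpful" if reward_bonus_for_honesty >= 0.5 else "reward-hacking"

def looks_right(behavior):
    return behavior == "helpful"

bonus = 0.0
while True:
    behavior = train(bonus)          # Step 1: train the agent, observe its behavior
    if looks_right(behavior):
        break
    bonus += 0.1                     # Step 2: tweak the reward function, try again

print("shipped after tuning the reward; final bonus =", round(bonus, 1))
```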
That’s a great approach for today, and it will continue being a great approach for a while. But eventually it will start failing in a catastrophic and irreversible way—and this is where instrumental convergence comes in.
The problem is: it will eventually become possible to train an AI that is so good at real-world planning, that it can make plans that are resilient to potential problems—and if the programmers are inclined to shut down or edit the AI under certain conditions, then that’s just another potential problem that the AI will incorporate into its planning process!
…So if a sufficiently competent AI is trying to do something the programmers didn’t want, the normal strategy of “just turn off the AI, or edit it, to try to fix the problem” stops working. The AI will anticipate that this programmer intervention is a possible obstacle to what it’s trying to do, and make a plan resilient to that possible obstacle. This is no different than any other aspect of skillful planning—if you expect that the cafeteria might be closed today, then you’ll pack a bag lunch.
In the case at hand, a “resilient” plan might look like the programmers not realizing that anything has gone wrong with the AI, because the AI is being deceptive about its plans and intentions. And meanwhile, the AI is gathering resources and exfiltrating itself so that it can’t be straightforwardly turned off or edited, etc.
The upshot is: RL researchers today, and brain-like AGI researchers in the future, will be able to generate more and more impressive demos, and get more and more profits, for quite a while, even if neither they nor anyone else makes meaningful progress on this technical alignment problem. But that would only be up to a certain level of capability. Then it would flip sharply into being an existential threat.
What do I mean by (technical) alignment?
Posts 10 & 11 now have a clearer discussion of what I mean by (technical) alignment:
10.1.1 What does “alignment” even mean?
The humans programming AGI will have some ideas about what they want that AGI’s motivations to be. I define “the technical alignment problem” as the challenge of building AGI that actually has those motivations, via appropriate code, training environments, and so on.
This definition is deliberately agnostic about what motivations the AGI designers are going for.
Other researchers are not agnostic, but rather think that the “correct” design intentions (for an AGI’s motivation) are obvious, and they define the word “alignment” accordingly. Three common examples are
(1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do right now.”
…
I’m using “alignment” in this agnostic way because I think that the “correct” intended AGI motivation is still an open question.
For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.
…
“AGI alignment” (previous post) means that an AGI’s motivations are what the AGI designers had in mind. This notion only makes sense for algorithms that have “motivations” in the first place. What does that mean in general? Hoo boy, that’s a whole can of worms. Is a sorting algorithm “motivated” to sort numbers? Or is it merely sorting them?? Let’s not go there! For this series, it’s easy. The “brain-like AGIs” that I’m talking about can definitely “want”, and “try”, and “be motivated” to do things, in exactly the same common-sense way that a human can “want” to get out of debt.
“AGI safety” (Post #1) is about what the AGI actually does, not what it’s trying to do. AGI safety means that the AGI’s actual behavior does not lead to a “catastrophic accident”, as judged by the AGI’s own designers.
I added a subsection clarifying that, contrary to normal RL, we get to choose the source code for AGI but we don’t really get to choose the training environment:
12.5.2 The uncontrollable real world is part of the “training environment” too
RL practitioners are used to the idea that they get to freely choose the training environment. But for brain-like AGI, that frame is misleading, because of continual learning (§8.2.2). The programmers get to choose the “childhood environment”, so to speak, but sooner or later, brain-like AGI will wind up in the real world, doing what it thinks is best by its own lights.
By analogy, generations of parents have tried to sculpt their children’s behavior, and they’re often successful, as long as the kid is still living with them and under their close supervision. Then the children become adults, living for years in a distant city, and the children often find that very different behaviors and beliefs fit them better, whether the parents like it or not.
By the same token, programmers can build a “childhood environment” for their baby brain-like AGIs to hang out and grow up in, but sooner or later we’ll need an AGI that remains aligned while roaming free in the real world. If we can’t do that, careless people will still make brain-like AGIs that roam free in the real world, but they will all be misaligned. That’s bad.
And while we can design the childhood environment, we can’t control the real world. It is what it is.
Indeed, I think a solid starting-point mental model is that, if we want a brain-like AGI that’s aligned after being in the real world for a long time, then we should mostly forget about the childhood environment. Thanks to continual learning, the AGI will settle upon patterns of thought and behavior that it finds most suitable in the real world, which depends on its innate disposition (e.g. reward function), and on the details of the real world, but not on its childhood environment.
I added the RL subfield of “reward function design” as an 8th concrete research program that I endorse people working on:
15.2.2.3 The “reward function design” research program — ⭐⭐⭐⭐
See my post “We need a field of Reward Function Design” (2025). In brief: If people make brain-like AGI, or really any other kind of RL agent AGI, then the reward function is an especially important design degree of freedom for alignment. Alas, the field of RL has always been much more interested in how to maximize reward functions, than in what the reward function should be. The tiny field of reward function design contains a few interesting ideas, including curiosity drive, reward shaping, and inverse reinforcement learning. Let’s find dozens more ideas in the same category!
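As one concrete example of what's already in that toolbox, here's a minimal sketch (mine, not from the post) of potential-based reward shaping in the sense of Ng, Harada & Russell (1999): you add γ·Φ(s′) − Φ(s) to the reward, which densifies the learning signal without changing which policies are optimal:

```python
# Minimal sketch of potential-based reward shaping (my own illustration, not code
# from the post): the shaped reward adds gamma * Phi(s') - Phi(s), which provably
# leaves the optimal policy unchanged while making learning signals denser.

GAMMA = 0.99

def potential(state):
    # Example potential: negative distance to a goal at position 10.
    return -abs(10 - state)

def shaped_reward(state, next_state, base_reward):
    return base_reward + GAMMA * potential(next_state) - potential(state)

# A step toward the goal earns a small shaping bonus even if the base reward is zero:
print(round(shaped_reward(state=3, next_state=4, base_reward=0.0), 2))   # 1.06
```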
This research program should hopefully link up with and complement “reverse-engineering human social instincts” (§15.2.1.2 above), as I believe that the AI field of reward function design, in its current primitive state, lacks the theoretical language to explain how human social instincts manage to have the properties that they have.
I give this research program a priority score of 4 stars out of 5. Or up it to 5 stars out of 5 when it’s more narrowly targeted at alignment-relevant reward function design ideas.
Conclusion
Those were just highlights; there were many other small changes and corrections. The blog version has more detailed changelogs after each post. Happy for any feedback, either as blog comments, or by email, DM, etc.!
Interesting! I thought. I gave them to Liz Lemma, my research assistant, and told her that I'd been thinking about the principal-agent problem in a chatbot context. About 30 minutes later she gave me the following four papers:
Each is a complete paper, well founded and well reasoned — not perfect, maybe, but I wouldn't call any of them slop, either. Let's peek inside "Query Steering". The core formula is "The Steering Threshold":
ΔV(μ) ≤ w·ΔB

Where:

ΔV(μ) = V_u(μ; q↓) − V_u(μ; q↑) is the user’s value gap between the more-informative query (q↓) and the more-monetizable (but less informative) query (q↑).

ΔB = B(q↑) − B(q↓) > 0 is the monetization gap.

w ≥ 0 is “how much the system cares about monetization.”

The clean “steering region” characterization is:

0 < ΔV(μ) < w·ΔB
“Steering happens exactly when the user loss is small enough that monetization incentives can overpower it.”
Is that true, or useful? You'd have to read the paper and find out!
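For what it's worth, the steering-region condition is simple enough to state as code. Here's a minimal re-implementation of the inequality above (mine, not code from the generated paper):

```python
# Minimal re-implementation of the steering-region condition above (my own sketch,
# not code from the generated paper). delta_v is the user's value gap, delta_b the
# monetization gap, w the weight on monetization.

def steering_occurs(delta_v, delta_b, w):
    return 0 < delta_v < w * delta_b

print(steering_occurs(delta_v=0.2, delta_b=1.0, w=0.5))   # True: user loss small enough to overpower
print(steering_occurs(delta_v=0.8, delta_b=1.0, w=0.5))   # False: user loss too large to steer
```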
Putting the "Search" in "Research"
Liz Lemma, you may have guessed, is an automated research assistant. She reads papers and spawns reasonable children. The whole thing can be thought of as a search over the space of adjacent plausible papers. Here's what it looks like when Liz gets to work:
Each original node is the average text embedding of her sources; the sources spawn children: the generated papers.
Where does Liz get her insights? It depends on how you see context in large language models. Maybe she's interpolating between now, the time of writing, and then, when the paper was written, and finding something interesting in the space between. Maybe she's matching concepts from the whole-internet corpus of her training to the decidedly more niche papers she takes inspiration from.
Regardless, what you ultimately find is a graph: for each node in the network built from the source material (the concatenated text embeddings), you have new connections. The semantic space proximate and accessible and plausible to a language model has been densified.
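If you want a picture of the mechanics, here's a rough sketch of that graph-building step. This is my own guess at the shape of the thing, not Liz Lemma's actual implementation; the embed function is a stand-in for a real text-embedding model:

```python
import numpy as np

# Rough sketch of the graph-building step (my own guess, not the actual implementation).

rng = np.random.default_rng(0)

def embed(text):
    return rng.normal(size=8)                     # placeholder for a real embedding model

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sources = {"source_paper_A": embed("..."), "source_paper_B": embed("...")}
seed = np.mean(list(sources.values()), axis=0)    # the averaged "original node"

children = {f"generated_paper_{i}": embed("...") for i in range(3)}

# Each generated paper becomes a new node, linked back to the seed and to each source.
edges = [("seed", c, cosine(seed, emb)) for c, emb in children.items()]
edges += [(s, c, cosine(sources[s], children[c])) for s in sources for c in children]

for a, b, sim in edges:
    print(f"{a} -> {b}: similarity {sim:+.2f}")
```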
Alignment Highlights
With misaligned chatbot ad incentives investigated, I turned towards more traditional alignment research and let Liz loose. I've included all the sources and results in the appendix below, but here are some highlights:
Vector-Lagrangian Safe RLHF. In response to "Constitutional AI" and "Safe RLHF", Liz proposes "a multi-constraint formulation where each harm category is a separate constraint with its own endogenously determined shadow price λi" and "develop[s] a clean convex policy model that yields closed-form KKT characterizations and interprets λ as category-specific ‘risk budget prices’."
The Reference Conditioning Trap "provide[s] an interpretable diagnostic for when alignment investment must shift from ‘more DPO’ to improving the reference, refreshing the dataset on-policy, or redesigning objectives."
Debate-as-Compliance "model[s] debate as generating an endogenous suspicion signal (abort/disagreement) that triggers expert audit" and "show[s] how penalty caps and audit costs shape optimal audit intensity, and how improvements in auditability/stability (e.g., better logging, more locally checkable traces) substitute for random auditing".
All sounds great to me. Alignment solved!
Not quite. These papers are not perfect. There are assumptions that don't hold up, ideas that don't quite translate, conclusions that don't follow. They are, in short, flawed, and I don't offer them as gospel truth. Instead, I propose that they have some expected value > 0 towards work that might not be done otherwise.
The challenging part is "what's next": how can we glean the signal from the noise of lightning-fast paper generation? The pipeline might be:
Generated paper -> voters -> selected winners
Winner -> annotator -> revisions -> approval by committee
Maybe those voters, those annotators, that committee, are human, maybe they're synthetic. At this point, teetering on the edge of generally-correct AI, we still probably need (or want) a human intervention in the process.
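For concreteness, here's one hypothetical shape of that pipeline in code. Every function and score below is a made-up placeholder, whichever mix of humans and models ends up filling the roles:

```python
# Hypothetical sketch of the selection pipeline above; every function and score here
# is a made-up placeholder, whether the reviewers end up human or synthetic.

def vote(papers):
    return max(papers, key=lambda p: p["score"])            # voters select a winner

def annotate(paper):
    return {**paper, "revisions": ["tighten assumptions", "add empirical baselines"]}

def committee_approves(paper):
    return bool(paper["revisions"])                          # stand-in for final review

papers = [
    {"title": "Query Steering", "score": 0.7},
    {"title": "Debate-as-Compliance", "score": 0.9},
]

winner = vote(papers)
revised = annotate(winner)
print(revised["title"], "approved:", committee_approves(revised))
```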
There's refinement to be done. But it seems plausible that the space of unaddressed alignment questions can shrink, in tandem with capabilities expansion.
Appendix: Sources and Results
(You can see all papers, with automated review scores, at lizlemma.com.)