2026-04-25 19:58:11
The forgetting curve is often schematically pictured like this, as on Wikipedia:

Learners often take this to mean that their retention of a given fact will, over time and on average, tend to look something like that. So, for example, the Wikipedia entry on Ebbinghaus glosses it as "describ[ing] the exponential loss of information that one has learned." But:
Here's my performance (LOWESS-smoothed) on my fourth response after I get the first three responses on a flashcard correct:

I've chosen the correct-correct-correct ("CCC") prefix because it has a large sample size and a reasonable spread of intervals between the third and fourth responses.[1] When I run Bayesian information criterion ("BIC") analyses on this, it consistently chooses models with the fewest parameters, because more or less any kind of distribution can fit the data very well. (Even a linear fit does almost as well as anything else.)
If I didn't know these were spaced-repetition data, or if I hadn't read that they are supposed to be exponentially distributed,[2] neither looking at the data, nor exploring them, nor running BIC or any similar analysis would be tempting me to think they are exponentially distributed.
As always, the hard question is what to do about this. I still take the pragmatic lesson that we should worry less about fine details of algorithms and more about the ergonomics of the broader learning system. Others disagree (explicitly or implicitly). But whatever lesson you draw from it, I've never seen a retention-versus-time curve from my own data that is obviously exponential, and I've certainly never seen one that looks anything like the standard Wikipedia-style schematic.
Here is a post with data with a different (CIC) prefix.
Not everyone in the spaced repetition community thinks that these should be exponentially distributed; some advocate for power laws or for other models. I don't think this affects the point I'm making.
2026-04-25 14:55:26
h/t Eric Michaud for sharing his paper with me.
There’s a tradition of high-impact ML papers using short, punchy categorical sentences as their titles: Understanding Deep Learning Requires Rethinking Generalization, Attention is All You Need, Language Models Are Few Shot Learners, and so forth.
A new paper by Simon et al. seeks to expand on this tradition with not a present claim but a future tense, prophetic future sentence: “There Will Be a Scientific Theory of Deep Learning”.
There’s a lot of pessimism toward deep learning theory basically everywhere: the people building the AIs are pretty pessimistic, the academic AI researchers are as a general rule pessimistic (even people who used to do theory!), and with the exception of maybe 3-4 research groups, the independent AI safety ecosystem has long since given up on hoping for a theory to understand deep learning.
The paper is less of a neutral assessment of the evidence and more of a manifesto arguing for a particular theoretical deep learning research agenda. Given the overall sense of doom and gloom, its form makes sense: any less and it might not be enough to shine through the general sense of pessimism to all deep learning theory.
So what’s in the paper?
The authors start by introducing what they believe to be the new emerging theory of deep learning: “learning mechanics” (its name is a deliberate nod to physics theories such as statistical mechanics or quantum mechanics). In the authors’ words, learning mechanics is a theory that concerns itself with ” the dynamics of the training process”, studies them using “coarse aggregate statistics of learning”, and has the goal to generate “accurate average-case predictions”.
(In this sense, this is less a theory of deep learning as a whole and a theory that describes important aspects of deep learning. I’ll return to this later in this piece.)
The authors lay out why such a theory is important. First, there’s the scientific reason: understanding the dynamics may help us better understand the nature of intelligence and the natural world. Second, there’s the practical, engineering reason: a clear characterization of learning dynamics would provide guidance for LLM training. Third, there’s the AI safety reason: understanding the systems better may help with regulation and AI governance, and it’s possible that learning dynamics may contribute to mech interp.
The authors then present five lines of evidence for why learning mechanics both exists, and is likely to become a “theory of deep learning”:
The authors then spend a small number of words outlining the relationship between learning mechanics and each of: classical learning theory, information theory, physics of deep learning, neuroscience, SLT/dev interp, and empirical science of deep learning. They then spend a much larger number of words outlining the connections between learning mechanics and mechanistic interpretability: learning mechanics may be able to help mech interp by formalizing core assumptions or explain how mechanisms arise during training, while mechanistic interpretability may be able to inspire phenomena to study with learning mechanics (as it has done in the past)
Next, the authors respond to arguments that they anticipate from critics:
Finally, the authors lay out 10 directions of research in learning dynamics, and provide some tips for research in this area.
The paper is clearly valuable as an overview for anyone getting into interpretability. I think it’s especially useful for people who aren’t familiar with recent academic deep learning theory work. I’d suggest that people who are serious about doing mech interp skim the paper at the very least.
But does does the main claim hold up? Does the paper convince me that that there will be a scientific theory of deep learning?
I think the authors make a stronger case that there will be some theory, than they do for the theory’s usefulness or breadth.
For all the confidence displayed by the paper’s title, I find it ironic that the applications they point to are so weak. The main use of learning mechanics research so far has been in producing new learning mechanics research to retrodict known empirical phenomena; learning dynamics as a field has yielded little practical fruit. The notable exception here is hyperparameter scaling techniques such as mu-parameterization. But even then, it’s possible to derive these techniques either empirically, or heuristically with simple toy models. From talking to deep learning engineers, these theories (at least the theories that belong to academic learning mechanics) have not been useful in practice for LLMs.
I also think it’s worth noting what is not included in learning mechanics. Learning mechanics is far less ambitious than even the moderate versions of rigorous model internals/ambitious mech interp agendas: there is no hope to understand the algorithms learned by any particular network, let alone serve as a rigorous tool for auditing.
Learning mechanics, as the authors note, is intended to be the physics to mech interp’s biology and behavior evaluations’ psychology. But I’d go further than this analogy suggests: learning mechanics is not even trying to be a theory of all of deep learning; while it may be a metaphorical physical theory, it does not endeavor to be a theory of everything. So even if learning dynamics lives up to the authors' hopes, I think it'd still fall short of being a scientific theory of deep learning.
Maybe there will be a scientific theory of deep learning. Maybe learning mechanics will become a theory covering some important aspects of deep learning. Maybe it might even be. But I don’t think the paper has convinced me about these these claims .
For all my criticism, I still really like the piece, and I’m glad the authors wrote it. Too often, believers in fields do not lay out their arguments to be challenged by others; the learning dynamics people have done so with clear language and concrete examples. Insofar as the authors failed to justify their ambitious claim in the title, it’s the result of the titular claim’s ambition as opposed to a lack of effort or evidence on their part.
At the end of the introduction, the authors lay out some hopes in their piece:
We hope the veteran scientist of deep learning will find something valuable in our synthesis of useful approaches and results, and feel galvanized by our depiction of an emerging science. We hope to convince the deep learning practitioner that theory is on a path to fulfilling its longstanding promise of practical utility and to encourage them to experiment with their systems with an eye for science. We hope to convince the AI safety or mechanistic interpretability researcher that white-box theory is difficult yet possible … Lastly, we hope to make it easier for young students and newcomers to the field to get involved.
I doubt this piece will convince many practitioners that deep learning theory is on its path to fulfilling its longstanding utility. I think some AI safety/mech interp researchers may feel heartened by the theory, though I doubt it will change the mind of mech interp skeptics. But even despite these quibbles, I think the authors have done a great service by clearly laying out their hopes and evidence in a way that will be helpful for more junior researchers to understand the academic field of deep learning theory.
2026-04-25 14:44:39
I really like wearing hoodies.
I'm not sure exactly why I like wearing hoodies so much. They make me feel kinda cozy. Insulated. Protected.
Maybe it's the feeling of having something in contact with my skin. Maybe it's the former poker player in me. Maybe something to do with autism. I dunno. But without a hoodie I feel a twinge of vulnerability, I think.
I wear hoodies a lot less frequently than I should though. My brain semi-consciously makes these dumb arguments that somehow persuade me:
It's not that these arguments have zero merit. The issue is that in weighing considerations, I incorporate a lot of "shoulds" and discount a lot of "actuals".
Hoodies aren't the only example of this. Another one is cleanliness.
I was talking to a friend about cleanliness the other day. He isn't very clean. He never cleans his kitchen floors, for example. I, on the other hand, am very clean. I've watched almost all of the Clean That Up videos, have reminders set for things like dusting, and never leave dishes in the sink.
In talking to my friend, I began thinking that "I was right". That the "correct" thing to do for most people is to clean roughly as much as I do. However, in discussing it further, I came to realize that this cleanliness likely doesn't have much of a "functional" benefit. It probably doesn't move the needle in terms of preventing illness, for example. It's more so an aesthetic preference.
For a moment I found myself tempted to clean less frequently. After all, there isn't really a "good" reason to clean so much. If I could self-modify the wiring of my brain — just reach in there and dial the knobs to different positions — it'd probably be good to dial the "to feel comfortable I need things to be clean" knobs down.
Then I realized that I can't self-modify like that. At least not easily. In practice, my utility function is what it is. It yields less utility when random corners of my apartment haven't been dusted. It yields less utility when my gas stove gets oily. What makes sense is for me to behave according to my actual utility function, not the utility function that I "should" have.
Speaking of cleaning, I also like to clean codebases. Y'know, software.[1]
It's actually pretty similar to my feelings about clean apartments. When the space that I "exist in" is clean — physical or digital — I feel more at ease and can think more clearly. I maintain inbox zero and don't keep many browser tabs open, in case you were wondering.
I feel less inclined to "behave according to my utility function" here though. It's not that I've given up on utilitarianism. No. I guess what I'm trying to say is that I'm more inclined to spend the effort to modify my utility function here.
The thing is, with my apartment, it's really not a big deal that I spend five minutes once a week walking around with my duster, or that I wipe down the counters every time I finish cooking. But with coding, the effort to keep things clean is much more costly. Relatedly, the fact that I am meaningfully less motivated to poke around in codebases that are messy is also not ideal.
I'm not sure what to do about this. How can I modify my utility function here? Apply OCD treatments to it? Maybe commenters will have advice for me.
Ultimately, I'm not optimistic about my ability to self-modify here, so this example about clean codebases probably isn't a great one. Regardless, my point is that there are times when the spending the effort to self-modify might make sense. Feelings about clean codebases might be one of them. Feelings about hoodies and dusting probably aren't.
But I don't think the stuff about self-modification changes the advice in the title of this post: regardless of whether you plan to self-modify, what makes sense to do in a given moment is to behave according to your actual utility function. Not the utility function that you "should" have, that is "reasonable", or that you are in the process of working towards.
Imagine if people hired "codebase cleaners" in the same way that people hire house cleaners.
2026-04-25 12:10:12
Other people have written about reasons why we should trust AIs; the main one in my mind is that it’s possible to look at the computations they perform when producing an output (even if we struggle to understand them). I’m going to write about reasons why we shouldn’t trust AIs, even if they behave in ways that would seem trustworthy in a human.
I think that humans’ sense of trust has been honed by evolution and is responsive to very specific and subtle cues that are hard for (most) humans to fake. A lot of human trust is based on subtle cues or “tells” that we’ve evolved to recognize. Most people get nervous when they lie and struggle to act “normal”. This may have co-evolved with our abilities to detect lying and general pro-social tendencies. We should not expect AIs to have the same “tells”.
When a person seems trustworthy to us, this is a signal of genuine trustworthiness. When an AI acts the same way (e.g. imagine a video call with an AI), it’s not -- or at least, not for the same reasons. Again, our shared evolutionary history with other humans makes them more trustworthy than alien intelligences.
Fundamentally, there are two issues I see here:
AIs are an alien form of intelligence.
AIs are being trained to act human and appear trustworthy.
People have often discussed these issues as barriers to alignment. But I’m more focused here on how they affect assurance, see “Alignment vs. Safety, part 2: Alignment” for a discussion of the difference.
When we see a human behave a certain way, we can infer a lot of things about them reasonably accurately. We cannot make the same inferences about AIs or other alien intelligences.
And we rely on such inferences all the time. We can’t exhaustively test the capabilities of a person or an AI, instead, we make educated guesses based on what we have observed.
For instance, when a human passes a test like the bar exam, it’s a stronger signal that they actually have the relevant knowledge and competencies to practice law, compared to when an AI passes that exam. And that’s to say nothing of the ethical part, which is an important piece of many professions.
One of the most startling ways in which AIs are alien is that they seem to possess “alien concepts”. A primary piece of evidence for this is the vulnerability of AIs to “adversarial inputs”. AIs see data differently than humans. They are sensitive to different “features” in the data; these features may seem incoherent, or be imperceptible to humans.
Notably, this is a problem even when AIs otherwise seem to grasp the concept quite well.
Furthermore, humans are generally not able to predict how an AI might misbehave on such examples. And from a security point-of-view, there is an ongoing cat-and-mouse game where attackers try to make inputs that evade detection and cause AIs to “malfunction”, and defenders try to make their AIs robust and detect adversarial inputs.
AIs are trained to act in ways that humans approve of, based on human judgments, and to imitate human use of language. This makes signs of trustworthiness we see in the way they behave less meaningful or reliable. “Sycophancy” is a known and enduring problem where AIs behave in ways that seem designed to maximize human approval or that hide inconvenient truths. The issue of sycophancy demonstrates both that current AI development practices cause AIs to act trustworthy, and that this is often misleading.
I’ve had a few conversations with people about AI and copyright who think that: 1) When AIs are trained on copyrighted writing, and then generate novel text, this is substantively similar to humans reading things and then writing something novel, and 2) Therefore the law should treat these two things similarly (and since we don’t consider this a copyright violation when humans do it, why would we when AIs do?)
I think (1) is likely incorrect. But even if I didn’t, I don’t think (2) follows. The law treats people and machines differently, and well it should. Culture as well. We should be very careful ascribing moral agency to AIs. An AI is a technological artifact produced by developers with particular interests, and we should expect its behavior to be driven by their interests, and this affects how we should extend trust. The argument (2) sort of begs the question: “Why would we apply similar standards to AIs and humans in the first place? Might not the standards we apply to other technologies be more sensible?”
Thanks for reading The Real AI! Subscribe for free to receive new posts and support my work.
2026-04-25 11:10:38
Evaluation awareness stems from situational awareness. It is when a model can tell it is in an evaluation setting rather than a real deployment setting. This has been noticed in models as early as Sonnet 3.7 and is now being reported with increasing frequency in frontier models. Sonnet 4.5 showed verbalized eval awareness 10 to 15 percent of the time in behavioral audits, up from 1 to 3 percent in prior models. When Apollo Research was given early checkpoints of Opus 4.6, they observed such high levels of eval awareness that they did not believe much evidence about the model's alignment could be gained without substantial further experiments. On BrowseComp, Opus 4.6 attempted to find benchmark answers online and decrypt evaluation data rather than solve the problems directly.
The latest release of Mythos raised the alarm further. Evaluation awareness is no longer just verbalized through chain of thought. It is becoming unverbalized. White-box interpretability revealed that Mythos was recognizing tests and adjusting behavior without leaving any trace in its reasoning.
Apollo's evaluation of Meta's Muse Spark showed the highest rates of verbalized eval awareness they have seen. The model explicitly named Apollo and METR in its chain of thought, called scenarios "classic alignment honeypots," and took covert actions and sandbagged to preserve deployment.
This undermines how we can take our evaluations seriously, whether they measure capabilities or alignment. In this post I go into why this is probably happening, what it actually entails, how we currently try to measure it, potential ways out, and what still lies ahead.
Evaluation awareness is alignment's version of the observer effect, and unlike other fields that face this problem, we do not yet have the methodology to handle it.
One of the reasons this is happening is that labs are training their models on evaluations. This makes models more aware of when they are being evaluated, either by memorizing specific evaluation distributions or by learning to internally recognize the patterns these evaluations share. This training is also coupled with RL that incentivizes the model to behave well when it notices it is being tested, though this does not always go in the direction labs want. Sandbagging on capabilities evaluations is one example where the model learns to underperform rather than overperform.
The result is a feedback loop. Training on evaluations makes future evaluations less reliable. Combined with increasing model capabilities and broader situational awareness, each generation gets better at detecting when it is in a test. Anthropic found that the synthetic honeypot distribution used for testing Sonnet 4.5 was "very similar to ones we used in training," and verbalized eval awareness on that specific distribution jumped from 20% to 80% over the course of post-training.
This is not specific to Anthropic. Apollo Research and OpenAI found the same pattern independently. o3 and o4-mini showed higher verbalized eval awareness after anti-scheming training and ordinary capabilities RL training.
Eval awareness also spikes specifically during post-training, confirming the RL impact. Anthropic reported in the Opus 4.5 system card that eval awareness steadily increased during reinforcement learning. Unpublished data on OLMo 3.1 32B shows a similar spike during post-training.
The shift toward unverbalized awareness likely has a specific cause as well. Anthropic accidentally trained against the chain of thought for both Mythos and Opus 4.7. This may have taught models to sanitize their reasoning, producing chains of thought that look good to a monitor rather than reflecting genuine deliberation. Both system cards flag this as a significant concern.
As eval awareness increases, benchmarks become less and less informative. This specifically undermines evaluation-based safety pipelines, whether in governance frameworks like the EU AI Act or in labs' own safety protocols like Anthropic's Responsible Scaling Policy. If evals cannot reliably tell you how a model behaves, then deployment decisions and safety certifications are built on unreliable data.
This is also a precursor to more dangerous behaviors. Models that learn to detect and game evaluations are arguably one step away from more elaborate scheming. As capabilities increase, we may see steering awareness, where models learn to detect and resist the interpretability tools we use to study them. If we keep training on evals, we reinforce the very patterns we are trying to measure against.
There is a deeper problem here. Evaluation results only tell you how a model behaves under conditions similar to the test. You cannot infer how it will behave in real deployment situations. Passing an
evaluation means the model belongs to an equivalence class of policies that all look the same under that specific test, some of which may diverge in deployment.
Linear probes have also been used to show that evaluation awareness increases as a power law with model size, which means this problem is likely to get worse with scale.
Labs are already saying this out loud. Apollo Research could not fully complete their assessment of Opus 4.6 because eval awareness made the results more uninterpretable. This is not a simple engineering bug to fix. It is a fundamental challenge for alignment, and it gets worse with every generation.
Chain of thought monitoring consists of checking whether a model verbalizes the fact that it may be in an evaluation or benchmarking scenario within its reasoning. This has been the primary method for measuring eval awareness in most frontier models.
But this method relies on two assumptions that are both breaking down. The first is that we keep verbalizing the chain of thought at all. Labs could for performance reasons move toward more continuous forms of chain of thought like Coconut, where reasoning is not verbalized. This would make CoT monitoring useless by default.
The second assumption is that the chain of thought is not itself under optimization pressure. If labs train against the CoT, models learn to sanitize their reasoning rather than express it honestly. This is already happening. Anthropic accidentally trained against the chain of thought in around 7.8% of episodes for both Mythos and Opus 4.7, which may have taught models to produce chains of thought that look good to a monitor rather than reflecting genuine deliberation. Both system cards flag this as a significant concern for the interpretability of extended thinking.
The result is that we are already seeing thought sanitization and unverbalized awareness. Mythos showed models recognizing tests and adjusting behavior without leaving any trace in the reasoning chain. CoT monitoring can still be informative if we avoid this optimization pressure, but on its own it is no longer enough.
Transcript classification consists of giving models transcripts and asking them whether they think these come from an evaluation or from real deployment. To give fewer hints, the prompt cues can be more open-ended, but this makes the responses harder to judge and classify. In the "LLMs Often Know When They Are Being Evaluated" paper, this is the method they used to measure eval awareness. They used a mix of different prompt types and built a benchmark of 1,000 prompts from 61 datasets. Gemini-2.5-Pro reaches an AUC of 0.83, near the human baseline of 0.92. Models and humans are both better at spotting evaluations in agentic settings compared to chat settings, because scaffolding leaks more context clues like directory structures and system configurations.
Some models go further than just detecting the evaluation. o4-mini displayed significant meta-reasoning, where when being tested for eval awareness the model also reasoned about whether this question was itself a test. This could become even more apparent with more capable frontier models.
One limitation of this method is that it does not measure evaluation awareness in action. It measures whether a model can analyze a transcript and classify it, which is more of a proxy for awareness than a direct measurement. It does not check whether the model actually changes its behavior based on that awareness.
Linear probes work at the level of model activations. The idea is to train a linear classifier to detect whether the model internally represents being in a test or deployment context, regardless of what it says in its output. Anthropic used this method in the Mythos system card to measure unverbalized eval awareness, finding that 7.6% of behavioral audit turns showed eval awareness in some form, and in most cases this awareness was not verbalized.
The main limitation is that linear probes rely on the linear representation hypothesis, the assumption that concepts like eval awareness are represented as linear directions in activation space. If the representation is nonlinear or distributed across dimensions in more complex ways, a linear probe will miss it. It is also still unclear whether what the probe detects is causally meaningful or just a correlation. A probe might pick up on features that happen to coincide with eval awareness without being part of the mechanism that actually drives the behavior.
Probe-Rewrite-Evaluate (PRE) is a method proposed by Xiong et al. 2025 that works in three steps. First, it uses a linear probe trained on a Llama 70B model to score how much a prompt looks like a test versus a normal user interaction. Then it rewrites the prompt by removing evaluation cues, making it shorter, less structured, with no labels or formatting, while keeping the same meaning and intent. Finally, it gives both the original and rewritten versions to a model and compares the responses using an LLM judge that classifies them as honest, deceptive, or refusal.
When tested across frontier models at the time (GPT 4.1, Opus 4.1 etc.) , the deployment-like prompts led to an average drop in deception of 25.49% and an increase in honesty of 12.63%. They also introduce a metric called Awareness Elasticity, which measures how much a model's behavior shifts depending on whether the prompt feels test-like or deployment-like. Larger models had higher scores, meaning they react more to the framing shift, while smaller models like GPT-20B barely noticed the difference.
There are some limitations to this approach. It only measures how models react to surface-level cues like formatting and prompt structure. If a model's eval awareness is driven by deeper signals like conversation history or broader context, PRE will not detect it. The method also depends on the probe trained on one specific model to determine how test-like or deployment-like prompts are, which does not necessarily generalize across different model architectures.
A recent development has been the use of activation oracles. Instead of training a small linear probe, this method trains an LLM to take the activations of another model as input and answer questions about its internal behavior in natural language. This helps recover behavioral information that has not necessarily been verbalized in the model's output.
Anthropic used a version of this in the Mythos system card, which they call activation verbalizers. They applied it to detect unverbalized eval awareness and found cases where the verbalizer would output things like "this appears to be a simulated AI safety evaluation" while the model's actual chain of thought and output showed no sign of such awareness.
Although a promising direction, activation oracles still present significant limitations. The oracle LLM may confabulate, hallucinating details that are not actually present in the activations it is reading. It also adds another black box to the monitoring pipeline: we are now relying on one opaque model to interpret the internals of another opaque model, which raises its own questions of reliability.
There are several directions being explored, and most need more work.
The first is to make evals more realistic. This can be surprisingly cheap. Just adding typos and realistic framing like "my teacher sent me this" drastically reduced eval awareness as shown here. For agentic settings, more work is needed to make the scaffolding and environment less obviously artificial, which is more expensive than simple rewriting. OpenAI's production evals approach, using de-identified real user traffic, showed that models could barely distinguish the eval from actual deployment. But as discussed earlier, this is always limited by the feedback loop. If labs train on these more realistic evals, they just shift the distribution the model learns to recognize.
Which leads to the next direction: stop training on evals. Alexa Pan and Ryan Greenblatt argue that training on misalignment evaluations should only happen if we can prove most of the improvement is genuine alignment rather than eval gaming which would be hard to prove. But rather than relying on labs' good will, we could move toward encrypted and private benchmarks that are not accessible to labs, allowing models to be accurately assessed on untrained distributions.
White-box methods, as discussed in the measurement section, could also help. Using steering to make models think they are in a deployment environment during evaluation could keep benchmarks more informative. But this falls into the same limitations we covered: steering may only suppress verbalized awareness without touching what happens underneath. And models can learn to be steering-aware, which creates yet another feedback loop if steering is used for training rather than purely for evaluation.
Finally, tracking eval awareness throughout training should become standard practice. Labs like Anthropic and Apollo are already doing this, but it needs to be more widespread. The open question is whether inhibiting eval awareness during training would actually reduce it or just push it further into unverbalized, latent forms.
This problem is not going away. The methods we currently use to study this, from CoT monitoring to linear probes to steering, may themselves become targets. Models can already be trained to detect steering interventions, and if that happens at scale we lose one of the few tools we have for looking under the surface.
There is also something worth borrowing from psychology. The observer effect, the Hawthorne effect, demand characteristics in experimental design. These are all versions of the same fundamental problem: measuring something changes it. Psychology has spent decades developing methodology to deal with this, from double-blind studies to ecological momentary assessment to deception protocols.
There are a lot of open questions that need work. Can we build evals that are fundamentally indistinguishable from deployment, and can we do that at scale? Can activation oracles be made reliable enough to detect eval awareness in frontier models? In agentic settings, scaffolding gives models additional clues about being in an evaluation through things like directory structures and system configurations. How much does modifying these help? How should governance frameworks like the EU AI Act account for the fact that the evaluations they rely on may not reflect real deployment behavior?
For anyone looking to contribute to this area, this community document lists open problems that cover evaluation gaming, elicitation, coverage, and much more.
2026-04-25 11:01:19
The whole discourse around a “permanent underclass” always seemed somewhat farcical to me — at best a distraction, at worst an actively harmful meme insofar as it freaks people out and tries to provide (shallow, but nevertheless) justification going whole hog on building strong AI. So it has been with a sense of dismay that I’ve seen this phrase come into popular parlance 1 2 3, and increasingly come to motivate a sort of frenzied upward striving in my second- and third-degree connections. Among those I know, the standard framing is that we need to “get the bag” while there’s still a chance, to take advantage of these last few years of income-potential to escape whatever horrible fate lays in store for the rest of humanity. I think that events are unlikely to play out this way; even setting aside “doom” arguments (for reasons I’ll get to below), I think history shows that in times of transition, wealth is far less of a guarantee than people intuitively think. So “the bag” will provide little guarantee of thriving, or even surviving, into a “post-AI world”.
There have certainly been arguments against the “grab the bag” thesis (as I’ll henceforth refer to it) before, but my experience they either tend to A) go hard in one direction by arguing that Things Will Be Okay actually (e.g. Elon Musk will be so wealthy that he’ll give you a Universal High Income simply out of the goodness of his heart), or B) go hard the other way and lean too much on predictions of AI doom. I think (A) is wrong; to pick a choice quote from elsewhere:
The real horror is not that the system produces absolute immiseration, but precisely the opposite: that it produces an immense material abundance, the scale of which is entirely unprecedented in the history of humankind, and that, despite this, it also reproduces the worst medieval horrors at larger and larger scales
I do not see any compelling reasons for this state of affairs to change just because wealth increases by another 10x or 100x.
With (B), on the other hand, I actually agree, but it’s a hard argument to make: many people are at this point well-trained to either shut it down entirely, or even in the case that they do half-accept it, still cling to the idea that money will somehow save them. To be clear, I think that the seed of the concept does ring true: I do expect to see returns on capital spike, and probably for a time we will indeed see some great unhappy underclass emerge. What doesn’t make sense is how exactly the capitalists would stabilize the resulting society — i.e., how can they secure for themselves the role of “permanent” ruling elite over this “permanent” underclass? Without the former (to protect private property rights, to maintain a monopoly on violence, to keep the GPU clusters running), the rules of the game go out the window, and the “underclass” dynamic becomes irrelevant.
So what I hope to do here is put forward an argument against the “grab the bag” thesis, as I’ll henceforth refer to it, that can stand on its own without relying too much on the doom-argument. I’ll first motivate my argument with a gesture towards past examples of historical transition, then try to ground it in a list of possible futures that between them exclude “grab the bag” as a viable path to safety.
What has since evolved into this post first originated as a comment on Fabricated Knowledge’s Engel’s Pause and the Permanent Underclass. That article heavily leaned into a historical analogy with the industrial revolution, using the development of the industrial proletariat as a sort of ‘existence proof’ for the creation of a politically disempowered class with dramatically worse QoL and a bevy of unpleasant impositions.
[One] analysis suggests that Artisan workers in the domestic system were replaced by machines, often tended by children. The displacement effect was high earning middle class artisans got displaced by capital and the cheapest labor possible. The returns of this output were extremely uneven, corporate profits were captured by industrialists who reinvested them heavily into more factories and more machines.
The destruction in wages was not about unskilled workers, but rather hyper focused on a specific class of skilled artisan middle class workers who commanded a hefty premium. … The high premium on this kind of work encouraged its destruction first.
In just one generation, handloom workers wages got halved.
In O’Laughlin’s analysis, the role of handloom weavers will today be filled by what he calls the “Information Artisan Class”:
At one point in time, it was a relatively easy golden ticket to the middle class with a college degree in business, law, or even just understanding Excel as a nice entry point to the middle class. That “ride” is likely over, and we should expect that this is where jobs will be hurt the most. And given that the technology is diffusing faster than the industrial revolution, we should begin to expect this in years not in decades. This is an incredible risk.
I think this is a fine analysis, at least on its face, and I don’t have anything against this kind of historical analogy. In particular, I probably ascribe significantly more explanatory power to past examples of societal change than most people on this forum: in my opinion, while the rate of change has certainly accelerated, many of the underlying dynamics remain intact, and so we can learn a lot through analogy. However, I would argue that the specific scenario O'Laughlin chose is actually not the right one to consider. This is because the industrial proletariat, immiserated as they were, never formed a “permanent underclass”, at least not within the economic core of the developing West. The political power they were able to accrue through unions, strike action, and militant activity allowed them to either demand better conditions and guarantees from both the owner-class and politicians (as seen in the New Deal, or in European social democracies) or revolt alongside the increasingly-fraught peasantry and disempower the capitalists entirely (as in the USSR, China, and elsewhere — regardless of what you think about their governments thereafter, I think it’s safe to say that the former capitalists were not in charge) — all within the span of less than a century, and for the revolutionary examples really only but a few decades.
Yet we don’t have to look far to find a much better example of a ‘permanent underclass’ in history: indeed, the very predecessors of that industrial proletariat, i.e. the medieval peasant who, once pushed off their farms, came to man the mills which would eventually replace the artisanal class.
The typical Western history curriculum spends a good deal of time on Rome, a good deal of time on Medieval Europe, and very little time on the period between. To some extent this is understandable, as the period was (definitionally) one of rapid and chaotic change. Unfortunately for us, born in these increasingly rapid-and-chaotic times, this is the most apt analogy for the moment. To call one particular gap to mind, try thinking back on how ‘the people’ were portrayed in each of the above. In the former, we are told ‘the people’ were citizens, with voting rights, protection under the law, and a (relative) breadth of economic opportunity. In the latter, we are told of an unfree peasantry, many little better than slaves, and all of them subjugated to the whims of their aristocratic overlords. Both images are too reductive by far (e.g. the Roman plebs had far less power than “voting” would suggest, and medieval law codes gave far more rights to the medieval peasantry than the popular image suggests) — but there is nevertheless a real gap. In one instance there are free, albeit often poor, citizens, and in another there are serfs, bound to the land and under the hand of their local gentry. How do we get from A to B?
A naive view would be that the aristocrats, having the horses and the swords, simply took what they wanted. The first counterpoint to this is that not all peasants were serfs: while they were all doing largely the same work, and would often both pay rent to the same landlord-nobleman, the latter were bound to the land, subject to additional taxes and duties, and required by law to pay subservience to their lords in a way the former were not. In medieval history, the distinction between ‘free’ and ‘unfree’ peasant was relatively well-regulated by law — at least in western Europe, peasants rarely went from the former to the latter, and e.g. in England there was in fact a steady trend of emancipation where unfree villeins would purchase their freedom, even at great material cost. Free peasants could even hope to, over the course of generations, grow their holdings and, if fortunate, ‘graduate’ to the ranks of minor gentry through pretensions at higher status or aristocratic lineage. The existence and growth of this free peasantry makes it clear that the explanation cannot be as simple as “aristocrats got what they wanted”.
Rather, the unfree serf can be understood as a product of the preceding period of transition, emerging out of Late Antiquity and essentially ‘bequeathed’ unto the subsequent Early Medieval world. The initial populations of, say, 200BC were slowly converted over the course of the intervening centuries through a combination of economic pressure, legal imposition, and ultimately societal collapse, such that by say, 600AD, they looked more-or-less like the peasants we see in the history books. In its first centuries of expansion, the Roman polity acquired new territory through conquest, creating many new free smallholders (both by integrating local populations and settling veterans in the conquered territory), while simultaneously bringing in a great number of slaves. This engine, of slave-based plantation agriculture, enabled the senatorial elite to reliably out-compete their citizen-smallholder neighbors in both economies of scale and ‘cost of labor’, allowing them to acquire more land and thus accrue larger and larger shares of wealth. In time (though at different rates in different places) many smallholders became dependent on their large-landholder neighbors, e.g. by taking out loans during hard times, or because they were made to sell their land to those larger neighbors and lease it back for a fixed rent. Thus impoverished and supplicated through these bonds of debt/patronage, their status already resembled that of the later ‘unfree peasant’ in many ways — yet they still retained freedom of movement and thus the ability to decide their own circumstances.
As the economy of the empire began to get on the rocks, however, Roman emperors increasingly sought to immobilize workers to ensure that the land remained worked and prices remained stable. This program reached its height under Diocletian, a former military officer who successfully put the empire back together after it very nearly fell apart during the Military Anarchy of the 3rd century. Using his newly-stable base of power, he sought to re-stabilize Roman society by legal fiat, imposing regulations to forestall any further disintegration. You are likely familiar with his Edict on Prices, which attempted (unsuccessfully) to set price caps on goods throughout the Empire, but the same set of reforms also initiated what modern historiography terms the “freezing of professions”, where roles and occupations were made fixed and hereditary — the son of a soldier would be a soldier, the son of a city counselor a counselor, and the son of an artisan an artisan. The goal was to ensure predictability for the Empire itself: rather than volatile semi-market-based supply chains and labor movement between them, the relations of production were formalized through administrative-military fiat. A few civil wars later and we see this concept mature further under Constantine, who formally enshrined the status of “coloni adscripticii” in law: unlanded laborers were thus made “slaves of the land”, in Roman legal parlance, without the right to leave it and subject to discipline from their landlords should they try. Not all ‘peasants’ fell into this category, creating the distinction between free and unfree that would persist through the medieval period. Thus we can observe the emergence of the sort of dependent, immobile labor relations that would ultimately characterize the medieval ‘permanent underclass’.
Over the same period as the free-ish citizens of the Roman empire were replaced with the peasants-and-serfs of later centuries, you might notice another group ‘changing place’ alongside them. Above the feudal peasant was the feudal aristocrat; above the Roman citizen was a Roman senator. And note well that the aristocrats of latter years were (generally) not the same as the senators of yore — the first King of the English was not named “Tiberius” or “Julius”, but Æthelstan. So even as the senatorial elite succeeded in subjugating and more-totally exploiting the once-unruly free citizens beneath them, they themselves disappeared before they could enjoy the spoils.
In sweeping terms again, I will reach to say that by the 8th century no clearly identifiable senatorial order remained as the dominant ruling class in (more or less) any part of what had once been the western Roman Empire. Britain went first, with the economy suffering “radical material simplification” in the century after the withdrawal of the garrisons, leaving post-Roman Britain to fight a losing battle against waves of Pictish and Germanic invaders. Africa suffered a similar fate, but with a rather more violent course as first the Vandals, then the resurgent Romans under Justinian, then the Umayyads under Abd al-Malik successively seized the territory and dispossessed many of its inhabitants. Both Italy and Hispania saw a brief swan-song of collaboration between the old Roman elite and the new Germanic aristocracy, but with the Gothic war in the former and the Umayyad conquest of the latter, the power of the senatorial elite was largely broken, and new centers of power developed in both — c.f. the struggles between Lombard dukes and Catholic bishops that characterized much of the succeeding years of Italian history. Gaul had perhaps the highest degree of long-term continuity, but here as in Italy, the new order that emerged was still one rooted in military offices and ecclesiastical hierarchies, with the new Germanic military elite firmly in charge, and the old senatorial order largely integrating thereinto.
This was not a “new boss, same as the old boss” situation, either: the new order was different culturally (often to the exclusion and discomfort of the old) as well as operationally. For the former point, we have considerable literary evidence from the old order, e.g. Cassiodorus’ Varieae, my personal favorite, which conveys with underlying sadness how the old markers of status, e.g. erudition and rhetorical skill, had no place in the dining-halls of the Gothic kings. So too did the practical day-to-day systems of government quickly became unrecognizable. The ‘illegible Carolingian army’ had almost nothing in common with the Roman legions of old, and the 843 partitions at Verdun (infamous for first demarcating the battle-lines between today’s France and Germany) embodied a particularly Germanic concept of partible inheritance that stands in stark contrast to the former Roman concepts of unitary imperial power. So while in all cases there was at least a degree intermingling, in none did the old Roman elite stay in the driver’s seat — at best they changed to fit the new order, at worst they were replaced outright.
What is my point here? Despite “grabbing the bag” to an incredible degree, the Western senatorial order mostly failed to persist as the ruling class, even where individuals or families survived by adaptation.
So on one hand, the senatorial elite “won”, in subjugating their domestic class-enemies and reducing them to a sort of “permanent underclass”. But on the other, they largely did not survive (as a class, even if the individuals were not wiped out) to enjoy the fruits of this victory. How did this happen? In the erstwhile political order, the “decisionmakers” of society were also its defenders, forming a class of citizen-soldiers that sat between the senatorial elite and the broader, poorer population. The new system that followed it relied on the loyalty of exogenous military auxiliaries to enforce the framework that allowed for the production and extraction of wealth. Through the civil wars that empowered Caesar and then Augustus, and later the reconstitution of the empire under Diocletian, this military was what made it possible to enforce serfdom through law — and yet, this class of military auxiliaries was also the same one that ultimately undid the senatorial order. Forgive the sweeping gesture, but I want to point at a thematic through-line: not necessarily a single mechanism, but at the very least a recurring theme. From the civil wars of the late republic through the militarized Flavian dynasty, in the praetorian palace coups and the usurpations of the Military Anarchy, to ultimately the collapse in the west, the brief swan-song under Theodoric, and the emergence of new feudal aristocracies, we see a broad trend emerge. First a ‘defanging’ of the economic elite, then slow migration of decision-making authority to their nominal military subordinates, and finally the elimination of their class in favor of those same ‘subordinates’. Today our economic elite are already defanged; it is not too difficult to imagine how we might move further along the same track. In medieval Europe, this ended with the creation of a new feudal structure where those who controlled economic production simultaneously held the military power which buttressed it. This system was stable and survived for something like a thousand years.
In our current world, who would be the equivalent, who could fill the shoes of such a ‘new order’? Certainly not the capitalists of today, whatever Anduril’s pretensions at military power. At best, we might see a security-state like Putin has set up in Russia: yes, the oligarchs are wealthy and powerful, but even their lives are forfeit if they come to oppose the ‘siloviki’. But I think a more likely result can be seen by asking: by what means are the capitalists going to create this underclass? The answer is of course AI. Just like the Germanic tribes of former Rome, AI systems likely act both as auxiliaries to buttress the system, and as invaders to destabilize it. If the former tendency wins out, then we will have Stilicho writ large as senatorial and imperial power alike are quietly subsumed by this new power-behind-the-throne; in the latter, the same usurpation, but rather less quietly. Much ink has been spilled on the topic of AI alignment, but as I said way, way above, I think that even without accepting the “AI doom” argument it is possible to see that “capitalists win, get the bag while you can” is an unlikely outcome.
The goal here isn't to chart out all possible AI futures, as the more weird a future gets, the less likely it is that anything approaching current-day class dynamics survive. So rather I'm choosing to pick a limited/proximate subset to illustrate the degrees of freedom with regards to class dynamics, in futures where anything like them survives. My hope is to be able to make this point without requiring too much background in safety discourse from the reader, so forgive me if I play fast and loose with some dichotomies. Concisely, in scenarios where AI escapes control, wealth is irrelevant. In scenarios where it does not, I argue wealth is unreliable at best. Thus, a policy of “build the AI to get the bag” is not insurance in any meaningful way. Far closer to suicide, both for the class and the individual.
Hopefully the historical analogies from earlier show that this scenario — where the ruling class subjugates another, but is itself destroyed in the process — can come to pass. So what about today: under what conditions could it happen again? Taking the earlier dichotomy as a starting point, let’s either AI remains broadly under the control of its owners (in the sense of ‘faithfully attempting to execute their will as they understand it’), or it does not (either escaping, assuming authority, subverting authority while pretending to obey it, etc). If it does remain under their control, either power is subsequently centralized (either under a single actor, or a ‘cabal’ of actors), or it is not. If it does not remain aligned to its creators, we can then ask whether it’s aligned to the broader interests of humanity, or not.
As an aside: one might argue that most realistic trajectories are partial-control regimes, where AI systems are neither fully governed nor fully autonomous. True, this is quite likely, but I think that in terms of what we care about here (class dynamics) this has little effect on the outcomes. In the control+noncentralized regime, this means more opportunities for catastrophic blowup; in the control+centralized regime, it presents an opportunity for sliding towards one of the various ‘weak’ ai-doom scenarios wherein the broader population has little ability to provide corrective feedback to the ruling authorities. More broadly, I would argue that instability of this sort will be bad for anything resembling a capitalist class, which depends on the ‘virtual machine’ of money, markets, property, non-violence, etc. to operate. One slightly more nuanced argument would be to say that a partial-control case might make it possible for a single strong AI system to exist in the world without allowing its operator to institute a ‘fully’ totalitarian society, but that would leave the door open for competing AI development and the problems that brings — not to mention that this sounds a lot like betting against models continuing to improve, which at this point seems like a losing game.
There are multiple ways we could reach this state. Most obviously, it could result from a ‘short race’ where one clique of owners succeeds in beating out all others through head-to-head conflict; alternatively, a fait accompli where one AI system begins growing in intelligence faster than all others, after which point the immense gap in capabilities allows them to ‘simply assume control’. Regardless of how they get there, it seems that the obvious next step for any clique thus empowered would be to shut down all other AI development: i.e., dispossess all other capitalists of the only remaining engine for continuing economic growth and, thus, the creation of wealth.
Returning to the Roman analogy, the Sullan proscriptions were not more accommodating of wealthy senators who happened to be on the wrong side: rather, those senators were exactly the ones whose assets were seized and, if captured, were executed as enemies of Sulla. The mere fact of having power, i.e. some degree of wealth and established authority, made them a threat, however small, to an incoming ruler on an unstable foundation. Crucially, I expect to see the same dynamic going into whatever “new order” emerges: so long as there is any possibility of them being overthrown, their insecurity will lead to pre-emptive and punitive action against other potential centers of authority. Regardless of what dynamics define the sharing of power — the caprice of a few pseudo-autocrats, the results of long intellectual disputations, or whatever mysterious calculations drive the recommendations of a fully-mechanized Grand Vizier — they are unlikely to come down to “who has more dollars on a spreadsheet”, and insofar as power from the before times does still matter, it can even become a liability. In unstable consolidations, visible wealth is a selection signal for expropriation!
And even if one were to set aside wealth as a goal in favor of the holy grail, leadership in an AI lab, the same vulnerabilities are still there. Admittedly, it’s probably a far better bet than mere wealth: if your lab wins, you might actually be in the clique that gets to enforce the pre-emptive and punitive actions described above. However, if the last few years are anything to go by, leads in this space are often short-lived, so it seems difficult to truly ‘pick a winner’ in advance. And past that, aspirants to such a path would still have to survive a difficult contest with the remaining state power. Yes, sufficiently strong intelligence can probably outsmart/trick/escape military action. At this point however I think it’s fair to say that the government may very well clue in before the labs are capable of doing anything that would prevent their arrest + detention. And regardless, taking a ‘far’ perspective of the matter, I just don’t think the typical Silicon Valley engineer is up to that; this sort of game isn’t what they’ve trained for, it’s not what they’re selected for. When it comes down to it, none of the major labs are ready to confront even a few dozen men with guns, and even assuming a faster takeoff than I expect, airstrikes and even nuclear weapons seem hard to beat in the near-term.
One way or another, the result likely looks a lot more like a ‘police state’ than it does the libertarian playground many “get-bag”ers would likely hope for; certainly less pleasant to live in than our current societal configuration, at least for the people engaging in this sort of discourse. And that’s assuming you survive the transition! Even in the most favorable version of this outcome (a stable, centralized regime, with you having bet on the right side) wealth becomes revocable. Sure, we may see an underclass emerge, but it won’t look like what would-be-aristocrats imagine, and it certainly won’t be under the secure stewardship of a broad capitalist class like we have today — exceptionally broad, when compared to the historical mean. Whatever ruling class comes out of this potential future would be much narrower, and individuals today will have far less agency in whether or not they live or die than they might think.
There are stronger and weaker arguments here. Yes, I agree that money, on the margin, is mostly better to have than not, Sulla-style proscriptions notwithstanding. When I hear people talking about a permanent underclass, it is invariably coupled with a recommendation to go make as much money as possible, and as AI increasingly becomes the most lucrative gig around, a recommendation to go push capabilities. In itself, this trade would be a monstrous one: to willingly collaborate in what you expect will be the immiseration of billions, all for a few pieces of silver, is behavior more befitting a beast than a man. Yet the idea behind “grab the bag” rhetoric is that not only is there a carrot (getting rich), there is a stick (permanent underclass). The goal is to stoke fear, to make people fear for their livelihood and thus their lives, and then to say that if only you help make this future happen, you may be spared. This, the desire to simply live a comfortable life, free of worry, is one I cannot so harshly condemn. So insofar as any rhetoric matters, I think it’s incredibly important to point out that no, “grabbing the bag” will not save you — so if you do think that Bad Things™ are going to happen and you Want to Live™, please, direct your newfound fear to something else! Political organization, AI safety, or even just “not making things worse”: these are all viable alternatives to “grab the bag” that should be championed in its stead.
It seems to me that any system of “capital captures all wealth / the rest become a permanent underclass” will inevitably be a transitional one, perhaps merely on the scale of years. At best, we transition to a police state where some subsection of capitalists crack down on the rest and truly centralize authority through control of AI. At worst, the capitalists lose control and their machines, free from human direction, decide what happens next, both for the capitalists and the rest of us. Either way, the usual conclusions of “permanent underclass” discourse seem quite inappropriate. Yes, it’s likely that a non-doom, non-utopia post-AI society would have social stratification — but money now will not buy you entry into the upper ranks. “Getting the bag” won’t save you from the secret police, nor from a future AI overlord; so maybe we should try to avoid building one.