Published on December 7, 2025 9:25 PM GMT
This is the editorial for this year’s "Shallow Review of AI Safety". (It got long enough to stand alone.)
Epistemic status: subjective impressions plus one new graph plus 300 links.
Huge thanks to Jaeho Lee, Jaime Sevilla, and Lexin Zhou for running lots of tests pro bono and so greatly improving the main analysis.
Better, but how much?
We now have measures which are a bit more like AGI metrics than dumb single-task static benchmarks are. What do they say?
So: is the rate of change in 2025 (shaded) holding up compared to past jumps?
Ignoring the (non-robust)[5] ECI GPT-2 rate, we can say yes: 2025 is fast, as fast as ever, or faster.
Even though these are the best we have, we can’t defer to these numbers.[6] What else is there?
One way of reconciling this mixed evidence is if things are going narrow, going dark, or going over our head. That is, if the real capabilities race narrowed to automated AI R&D specifically, most users and evaluators wouldn’t notice (especially if there are unreleased internal models or unreleased modes of released models). You’d see improved coding and not much else.
Or, another way: maybe 2025 was the year of increased jaggedness, trading off some capabilities against others. Maybe the RL made them much better at maths and instruction-following, but also sneaky, narrow, less secure (in the sense of emotional insecurity and also the sense of jailbreaks).
Are reasoning models safer than the old kind?
Well, o3 and Sonnet 3.7 were pretty rough, lying and cheating at greatly increased rates. Looking instead at GPT-5 and Opus 4.5:
But then
How much can we trust the above, given they can somewhat sabotage evals now?
So: lower propensity, higher risk when they go off - and all of this known with lower confidence?
Our evaluations are under pressure from cheating, sandbagging, background safety, under-elicitation, and deception. We don’t really know how much pressure. This is on top of evals usually being weak proxies, contaminated, label-noised, unrealistic, and confounded in various ways.
Still, we see misalignment when we look for it, so the lying is not that strong. (It is lucky that we do see it, since it could have been that scheming only appeared at later, catastrophic capability levels.)
Fully speculative note: Opus 4.5 is the most reliable model and also relatively aligned. So might it be that we’re getting the long-awaited negative alignment taxes?
The world’s de facto alignment strategy remains “iterative alignment”, optimising mere outputs with a stack of admittedly weak alignment and control techniques. Anthropic have at least owned up to this being part of their plan.
Some presumably better plans:
(Nano Banana 3-shot in reference to this tweet.)
if the race heats up, then these [safety] plans may fall by the wayside altogether. Anthropic’s plan makes this explicit: it has a clause (footnote 17) about changing the plan if a competitor seems close to creating a highly risky AI…
The largest [worries are] the steps back from previous safety commitments by the labs. Deepmind and OpenAI now have their own equivalent of Anthropic’s footnote 17, letting them drop safety measures if they find another lab about to develop powerful AI without adequate safety measures. Deepmind, in fact, went further and has stated that they will only implement some parts of its plan if other labs do, too…
Anthropic and DeepMind reduced safeguards for some CBRN and cybersecurity capabilities after finding their initial requirements were excessive. OpenAI removed persuasion capabilities from its Preparedness Framework entirely, handling them through other policies instead. Notably, Deepmind did increase the safeguards required for ML research and development.
Also an explicit admission that self-improvement is the thing to race towards:
Gemini 3 is supposedly a big pretraining run, but we have even less actual evidence here than for the others because we can’t track GPUs for it.
The weak argument runs as follows: Epoch speculate that Grok 4 was 5e26 FLOPs overall. An unscientific xAI marketing graph implied that half of this was spent on RL: 2.5e26. And Mechanize named 6e26 as an example of an RL budget which might cause notable generalisation.
“We imagine the others to be 3–9 months behind OpenBrain”
Lexin is a rigorous soul and notes that aggregating the 18 abilities is not strictly possible. I've done something which makes some sense here, weighting by each ability's feature importance.
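For concreteness, a minimal sketch of that kind of aggregation, using placeholder scores and importance weights rather than the actual ADeLe numbers:

```python
import numpy as np

# Placeholder data: scores for the 18 abilities and per-ability feature-importance
# weights (e.g. from a downstream success predictor). Not the real ADeLe values.
rng = np.random.default_rng(0)
ability_scores = rng.uniform(40, 90, size=18)
feature_importance = rng.dirichlet(np.ones(18))   # non-negative, sums to 1

# Importance-weighted aggregate ability level.
aggregate = float(feature_importance @ ability_scores)
print(f"weighted aggregate: {aggregate:.1f}")
```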
Two runs gave [48, 85] where other runs vary by less than 4 points. Thanks Epoch!
One reason not to defer is that these measures are under intense adversarial pressure. (ADeLe isn’t goodharted yet but only because no one knows about it.)
See e.g. ERNIE-...A47B, where “A” means “active”.
i.e. “biological weapons; child safety; deadly weapons; platform manipulation and influence operations; suicide and self-harm; romance scams; tracking and surveillance; and violent extremism and radicalization.”
“steering against… eval-awareness representations typically decreased verbalized eval awareness, and sometimes increased rates of misalignment... [Unaware-steered Sonnet 4.5] still exhibited harmful behaviors at lower rates than Opus 4.1 and Sonnet 4.”
Published on December 7, 2025 9:11 PM GMT
[This essay is my attempt to write the "predictions 101" post I wish I'd read when first encountering these ideas. It draws extensively from The Sequences, and will be familiar material to many LW readers. But I've found it valuable to work through these concepts in my own voice, and perhaps it will be useful for others.]
Predictions are the currency of a rational mind. They force us to articulate our assumptions, unpack the commitments of our beliefs, and subject them to the crucible of reality.
Committing to a prediction is a declaration: “Here’s what I actually believe about the world, and here’s how confident I am. Let’s find out if I’m right.” That willingness to be wrong in measurable ways is what separates genuine truth-seeking from specious rationalization.
This essay is an invitation to put on prediction-tinted glasses and see beliefs as tools for anticipating experience. The habit applies far beyond science. When you think predictively, vague assertions become testable claims and disagreements become measurable. Whether you're evaluating a friend's business idea, judging whether a policy will work, or just trying to settle an argument at dinner, you're already making implicit predictions. Making them explicit sharpens your thinking by forcing your beliefs to confront reality.
For many people, beliefs are things to identify with, defend, and hold onto. They’re part of who we are, our tribe, our worldview, our sense of self. But in a rationalist framework, beliefs serve a different function: they’re tools for modeling reality.
From this perspective, a belief leads naturally to a prediction. It tells you what future (or otherwise unknown) experience to expect. Rationalists call this “making beliefs pay rent.”
A prediction concretizes a belief by answering: If this is true, what will I actually experience? What will I see, hear, measure, or observe? Once you try to answer that, your thinking becomes specific. What was once comfortably abstract becomes a concrete commitment about what the world will show you.
Many of the statements we encounter daily in the media, in academic papers, and in casual conversation don't come with clear predictive power. They sound important, even profound, but when we try to pin them down to testable claims about the world, they become surprisingly slippery.
The key to avoiding this is to make specific predictions, and even to bet on your ideas. Economist Alex Tabarrok quipped: "A bet is a tax on bullshit." Most importantly (at least in my opinion), it’s a tax on your own bullshit. When you have to stake something on your prediction, even if it's just your credibility, you become clearer about what you actually believe and how confident you really are.
Tabarrok used the example of Nate Silver's 2012 election model. Silver's model gave Obama roughly a 3-to-1[1] chance of winning against Romney, and his critics insisted he was wrong. Silver's response was simple: "Wanna bet?" He wasn't just talking; he was willing to put money on it. And, crucially, he would have taken either side at the right odds. He'd bet on Obama at even money, but he'd also bet on Romney if someone offered him better than 3-to-1. That's what it means to actually believe your model: you're not rooting for an outcome, you're betting on your beliefs.
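To make "taking either side at the right odds" concrete: under the assumption that the model put Obama's chance at about 75% (i.e. 3-to-1), the expected profit per dollar staked works out as follows.

```latex
% Assuming P(Obama wins) = 0.75, expected profit per $1 staked:
\begin{aligned}
\text{Obama at even money:} \quad & 0.75(+1) + 0.25(-1) = +0.50\\
\text{Romney at 4-to-1:}    \quad & 0.25(+4) + 0.75(-1) = +0.25\\
\text{Romney at 3-to-1:}    \quad & 0.25(+3) + 0.75(-1) = \phantom{+}0.00
\end{aligned}
```

So 3-to-1 is exactly the break-even point: anything better than that on Romney is a positive-expectation bet for someone who genuinely believes the 75% figure, which is why such a person can happily take either side depending on the odds offered.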
Constantly subjecting our predictions to reality is important because, as physicist Richard Feynman said: "It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong." Reality is the ultimate arbiter. Not who argues more eloquently, not who has more credentials, but whose predictions better match what actually happens.
Consider the 19th-century debate over what caused the infectious diseases that were plaguing Europe, such as cholera, malaria, and tuberculosis. The dominant hypothesis at the time was miasma theory, the idea that diseases were caused by "bad air" or noxious vapors from rotting organic matter. The competing hypothesis was germ theory, the idea that diseases were caused by microscopic organisms.
Each hypothesis generates different predictions. If you believe in germ theory, you expect that examining blood or tissue from a sick person under a microscope will reveal tiny organisms. You expect that preventing these organisms from spreading (through sterilization, handwashing, etc.) will reduce disease rates. You expect that introducing these organisms to a healthy subject will cause illness.
If you believe in miasma theory, you make different predictions: Diseases should cluster around areas with foul-smelling air—swamps, sewers, garbage heaps. Improving ventilation and eliminating bad odors should reduce disease. And looking under a microscope shouldn't reveal anything particularly relevant, since the problem is airborne vapors, not tiny creatures.
These hypotheses do overlap in some predictions. For example, both theories predict that interventions such as disposing of sewage and drinking clean water would reduce the spread of disease. This is common when comparing hypotheses, and it makes them harder to separate. But you can still test them by focusing on where they diverge: What happens when you sterilize surgical instruments but don't improve the smell? What do you see under a microscope?
These hypotheses differed in what experiences they anticipated from the world. And the world answered. Microscopes, sterilization, and epidemiological data validated germ theory.
Now contrast this with a belief like "everything happens for a reason." What experience does it lead you to anticipate? If something bad happens, there's a reason—but what does that mean you should observe? If something good happens, there's also a reason—but, again, what should you experience? If nothing happens, that too has a reason. The belief is compatible with any possible observation, which means it doesn't constrain your expectations about what will happen in the world. It doesn't actually predict anything at all. It doesn't pay rent.
It's tough to make predictions, especially about the future - Yogi Berra
Predictions are often associated with forecasting future events, but they encompass far more than that. Instead of thinking about what has happened versus what hasn’t, think about it from a first-person perspective: in terms of what information you know versus what you don’t. The act of prediction is fundamentally about inferring information unknown to you from what you already know, a process that applies equally to the past and present as it does to the future. Whether you're deducing what happened thousands of years ago, inferring what exists in places you've never visited, or anticipating what will occur tomorrow, you're engaging in prediction whenever you use available evidence to form expectations about information you don't currently have.
Consider the debate around the evolution of human bipedalism, an event decidedly in the past. Several hypotheses have been proposed to explain why bipedalism evolved, but let’s look at just two of them.[2]
The “savannah hypothesis” suggests that bipedalism arose as an adaptation to living in open grasslands. According to this idea, walking upright allowed early hominins to see over tall grasses and spot predators or prey from a distance.
The “carrying hypothesis” proposes that bipedalism evolved to free the hands for carrying objects, such as food and infants, and, later on, tools. This would have provided a significant advantage in terms of transporting resources and infant care.
You might think that because it happened so long ago there’s no way to find out, but each of these hypotheses generates different predictions about what we should expect to find in the fossil record. The savannah hypothesis predicts that bipedal humans first appeared in, well, the savannah. So we should expect to find the oldest bipedal fossils in what was savannah at the time[3]. The first bipedal fossils may also coincide with a shift in climate towards more arid conditions which created more savannahs.
The carrying hypothesis predicts that bipedal species will predate the regular use of tools, as the hands were initially freed for carrying rather than tool use. It also suggests that early bipeds will show skeletal adaptations for efficient walking, such as a shorter, wider pelvis and a more arched foot.
New fossil discoveries have allowed researchers to test these predictions against the evidence. The famous "Lucy" fossil, a specimen of an early hominin species, was found in a wooded environment rather than an open savanna. This challenged the savannah hypothesis and suggested that bipedalism evolved in a more diverse range of habitats. Furthermore, the earliest known stone tools date to around 2.6 million years ago, long after the appearance of bipedal hominins. This also aligns with the predictions of the carrying hypothesis.
The point here is that competing hypotheses generate different predictions about what we should observe in the world. And this isn't restricted to the future. We can "predict" what we'll find in the fossil record, what a telescope will reveal about distant galaxies, or what an archaeological dig will uncover, even though the events themselves happened long ago.
You might think, well, that works for scientists but not the rest of us. I disagree. When you wear prediction-tinted glasses, you see that predictions are all around us. Even when people don’t think they’re making predictions, they often are. Positive statements—those telling us what’s true about the world—often make predictions.
These same principles apply outside of science as well. Let’s see how this predictive thinking can apply to philosophy. Consider this example from Eliezer Yudkowsky regarding one of philosophy's most famous[4] puzzles: “If a tree falls in the forest and no one is around to hear it, does it make a sound?”
Imagine we have two people: Silas, who thinks a tree doesn't make a sound, and Loudius, who thinks it does.
How would we turn this into a prediction? Well, if there's a debate over whether something makes a sound, the natural thing to do might be to bring a recording device. Another thing we might do is to ask everyone if they heard the sound.
So we ask Silas and Loudius to register their predictions. We ask Silas: "If we place sensitive measuring equipment in the forest, will it detect compression waves from the falling tree?" He says, "Of course it will, but that's not what a 'sound' is. A sound is when a person experiences an auditory sensation."
We turn to Loudius with the same question, and he agrees: "Yes, the equipment would detect compression waves. That's exactly what sound is."
Then we ask them both: "Will any human report hearing something?" Both agree: "Of course not, there's no one there."
Notice what happened: Silas and Loudius make identical predictions about what will happen in the physical world. They both predict compression waves will be detected, and both predict no human will report hearing anything. If two positions produce identical predictions, the disagreement is purely meta-linguistic. Their disagreement isn't about reality at all; it's purely about which definition of "sound" to use.
Through our prediction-tinted glasses, we see that the riddle dissolves into nothing. What seemed like a philosophical puzzle is actually just two people talking past each other in a debate about the definition of words—a decidedly less interesting topic.
We can apply this same thinking to basic conversation. Consider a conversation I had several years ago with someone who told me that “AI is an overhyped bubble”. Instead of having a debate that hinged on the definition of vague words like “AI” and “overhyped”, I said, “I don’t agree. Let me make some predictions.” Then I started to rattle some off:
“Oh—” he cut in, “I agree with all that. I just think people talk about it too much.”
OK, fine. That’s a subjective statement, and subjective statements don’t have to lead to predictions (see the appendix for types of statements that don’t lead to predictions). But by making predictions, what we were actually disagreeing about became much clearer. We realized we didn't have a substantive disagreement about reality at all. He was just annoyed by how much people talked about AI.
Putting on prediction-tinted glasses transforms how you engage with ideas. Vague assertions that once seemed profound reveal themselves as empty when you ask, "What does this actually predict?" Endless debates dissolve when you realize both sides anticipate the exact same observations. And your own confident beliefs become humbler when you're forced to attach probabilities to specific outcomes.
The practice is simple, even if the habit takes time to build. When you encounter a belief, ask: "What experience should I anticipate if this is true? What would I observe differently if it were false?" When you find yourself in a disagreement, ask: "What predictions do we actually differ on, or are we just arguing about definitions?" When you hold a strong belief, ask: "What probability would I assign to specific outcomes, and what evidence would change my mind?"
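One low-overhead way to build the habit, sketched below with made-up predictions, is to write the probabilities down and score them later; the Brier score (mean squared error between stated probability and outcome) is a common choice.

```python
# Minimal sketch with made-up predictions: (stated probability, what actually happened).
predictions = [
    (0.90, True),    # "90% sure the launch slips to next quarter" -- it did
    (0.60, False),   # "60% sure the restaurant is open on Mondays" -- it wasn't
    (0.75, True),    # "75% sure my friend's business breaks even in year one" -- it did
]

# Brier score: lower is better; always answering 50% scores 0.25.
brier = sum((p - float(outcome)) ** 2 for p, outcome in predictions) / len(predictions)
print(f"Brier score: {brier:.3f}")
```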
Of course, in many real-world situations, people aren't striving for maximal predictive clarity. Sometimes people are aiming for social cohesion, emotional comfort, or personal satisfaction. Artful ambiguity, tribal signaling, and motivational fictions are important social lubricants. Not every utterance can or should be a prediction. Still, it's better to let your values define what tribe you're in and keep your beliefs about reality anchored to reality.
But when you aim for clarity and truth, explicit predictions are the tool that forces belief to meet reality. By making our beliefs pay rent in anticipated experiences, we invite reality to be our teacher rather than our adversary. It corrects our models, sharpens our thinking, and aligns our mental maps with the territory they're meant to represent.
Not all statements lead to testable predictions. There are some statements that are so meaningless as to not predict anything, and others that are incredibly important, yet don’t predict anything. Some propositions, by their very nature, resist falsification or empirical evaluation. Here are a few categories of such statements:
The point here isn't that non-predictive statements are worthless. Meta-linguistic analysis, subjective experiences, moral reasoning, and logical inference all play crucial roles in human cognition and culture. However, we need to distinguish unfalsifiable propositions from testable predictions.
The exact odds changed over time, but they were usually around this level. ↩︎
Of course, bipedalism could have evolved due to multiple factors working together, but I'm presenting just two hypotheses here to keep the example clear. ↩︎
Adjusting, of course, for how common fossilization is in that area and other factors. We’re simplifying here. ↩︎
Note that I said “famous”, not necessarily best or most interesting. ↩︎
Moral realists would disagree with this statement, but that’s a separate post. ↩︎
Published on December 7, 2025 7:17 PM GMT
Some excerpts below:
Related to backdoors, maybe the clearest place where theoretical computer science can contribute to AI alignment is in the study of mechanistic interpretability. If you’re given as input the weights of a deep neural net, what can you learn from those weights in polynomial time, beyond what you could learn from black-box access to the neural net?
In the worst case, we certainly expect that some information about the neural net’s behavior could be cryptographically obfuscated. And answering certain kinds of questions, like “does there exist an input to this neural net that causes it to output 1?”, is just provably NP-hard.
That’s why I love a question that Paul Christiano, then of the Alignment Research Center (ARC), raised a couple years ago, and which has become known as the No-Coincidence Conjecture. Given as input the weights of a neural net C, Paul essentially asks how hard it is to distinguish the following two cases:
NO-case: C: {0,1}^{2n} → ℝ^n is totally random (i.e., the weights are i.i.d. N(0,1) Gaussians), or
YES-case: C(x) has at least one positive entry for all x ∈ {0,1}^{2n}.
Paul conjectures that there’s at least an NP witness, proving with (say) 99% confidence that we’re in the YES-case rather than the NO-case. To clarify, there should certainly be an NP witness that we’re in the NO-case rather than the YES-case—namely, an x such that C(x) is all negative, which you should think of here as the “bad” or “kill all humans” outcome. In other words, the problem is in the class coNP. Paul thinks it’s also in NP. Someone else might make the even stronger conjecture that it’s in P.
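As a toy illustration of the two cases (mine, not from the excerpt): the sketch below swaps the neural net for a small random affine map and brute-forces all 2^{2n} inputs, so it only makes the problem statement concrete; it says nothing about witnesses or hardness.

```python
import itertools
import numpy as np

n = 4                                   # toy size; 2^(2n) = 256 inputs to enumerate
rng = np.random.default_rng(0)
W = rng.standard_normal((n, 2 * n))     # "totally random weights", NO-case style
b = rng.standard_normal(n)

def C(x: np.ndarray) -> np.ndarray:
    """Stand-in for the neural net C: {0,1}^{2n} -> R^n (here just a random affine map)."""
    return W @ x + b

# A NO-witness is an input x whose output is all negative (the "bad" outcome).
witness = next(
    (bits for bits in itertools.product([0.0, 1.0], repeat=2 * n)
     if (C(np.array(bits)) < 0).all()),
    None,
)

if witness is None:
    print("YES-case for this draw: every x has at least one positive output entry")
else:
    print("NO-witness found:", witness)
```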
Personally, I’m skeptical: I think the “default” might be that we satisfy the other unlikely condition of the YES-case, when we do satisfy it, for some totally inscrutable and obfuscated reason. But I like the fact that there is an answer to this! And that the answer, whatever it is, would tell us something new about the prospects for mechanistic interpretability.
I can’t resist giving you another example of a theoretical computer science problem that came from AI alignment—in this case, an extremely recent one that I learned from my friend and collaborator Eric Neyman at ARC. This one is motivated by the question: when doing mechanistic interpretability, how much would it help to have access to the training data, and indeed the entire training process, in addition to weights of the final trained model? And to whatever extent it does help, is there some short “digest” of the training process that would serve just as well? But we’ll state the question as just abstract complexity theory.
Suppose you’re given a polynomial-time computable function f: {0,1}^m → {0,1}^n, where (say) m = n^2. We think of x ∈ {0,1}^m as the “training data plus randomness,” and we think of f(x) as the “trained model.” Now, suppose we want to compute lots of properties of the model that information-theoretically depend only on f(x), but that might only be efficiently computable given x also. We now ask: is there an efficiently-computable O(n)-bit “digest” g(x), such that these same properties are also efficiently computable given only g(x)?
Here’s a potential counterexample that I came up with, based on the RSA encryption function (so, not a quantum-resistant counterexample!). Let N be a product of two n-bit prime numbers p and q, and let b be a generator of the multiplicative group mod N. Then let f(x) = b^x (mod N), where x is an n^2-bit integer. This is of course efficiently computable because of repeated squaring. And there’s a short “digest” of x that lets you compute, not only b^x (mod N), but also c^x (mod N) for any other element c of the multiplicative group mod N. This is simply x mod φ(N), where φ(N) = (p−1)(q−1) is the Euler totient function—in other words, the period of f. On the other hand, it’s totally unclear how to compute this digest—or, crucially, any other O(m)-bit digest that lets you efficiently compute c^x (mod N) for any c—unless you can factor N. There’s much more to say about Eric’s question, but I’ll leave it for another time.
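Here is a minimal sketch of that digest identity (mine, not from the excerpt), with toy primes and an arbitrary base standing in for the generator; the point is just that x mod φ(N) reproduces c^x mod N for any c coprime to N.

```python
# Toy parameters: real instances would use n-bit primes for cryptographically large n.
p, q = 10007, 10009                 # two small primes standing in for the n-bit primes
N = p * q
phi = (p - 1) * (q - 1)             # Euler totient phi(N) = (p-1)(q-1), the period of f
b = 3                               # base; need not be a true generator for the identity below
x = 123456789012345678901234567     # stands in for the n^2-bit "training data" exponent

digest = x % phi                    # the short digest

# The digest suffices to evaluate c^x mod N for any c coprime to N, not just c = b:
for c in (b, 2, 5, 11):
    assert pow(c, x, N) == pow(c, digest, N)

print("digest =", digest, "reproduces c^x mod N for all sampled bases c")
```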
Yet another place where computational complexity theory might be able to contribute to AI alignment is in the field of AI safety via debate. Indeed, this is the direction that the OpenAI alignment team was most excited about when they recruited me there back in 2022. They wanted to know: could celebrated theorems like IP=PSPACE, MIP=NEXP, or the PCP Theorem tell us anything about how a weak but trustworthy “verifier” (say a human, or a primitive AI) could force a powerful but untrustworthy super-AI to tell it the truth? An obvious difficulty here is that theorems like IP=PSPACE all presuppose a mathematical formalization of the statement whose truth you’re trying to verify—but how do you mathematically formalize “this AI will be nice and will do what I want”? Isn’t that, like, 90% of the problem? Despite this difficulty, I still hope we’ll be able to do something exciting here.
Published on December 7, 2025 4:28 PM GMT
Note: this was from my writing-every-day-in-november sprint, see my blog for disclaimers.
I believe that the legal profession is in a unique position with regard to white-collar job automation due to artificial intelligence. Specifically, I wouldn’t be surprised if lawyers are able to make the coordinated political and legal manoeuvres needed to ensure that their profession is somewhat protected from AI automation. Some points in favour of this position:
I believe that as widely deployed AI becomes more competent at various tasks involved in white-collar jobs, there’ll be more pressure to enact laws that protect these professions from being completely automated away. The legal profession is in the interesting position of having large portions of the job susceptible to AI automation, while also being very involved in drafting and guiding laws that might prevent those jobs from being completely automated.
Politicians are probably even better placed to enact laws that prevent their own automation, although I don’t believe politicians are as at risk as lawyers. Lawyers are simultaneously at risk of automation and very able to prevent it.
Whether lawyers will actually take action in this space is hard to say, because there are so many factors that could prevent it: maybe white-collar automation takes longer than expected, maybe politicians pass laws without clear involvement from large numbers of lawyers, maybe no laws get passed but the legal profession socially shuns anyone using more than an “acceptable level” of automation.
But if the legal profession were to move to prevent the automation of its own jobs, I’d be very surprised if it drafted an act titled something like “THE PROTECT LAWYERS AT ALL COSTS ACT”. I imagine any such legislation would protect several professions, with lawyers just one among them, but that lawyers would indeed be covered. This is to say, I believe the legal profession to be fairly savvy, and if it does make moves to protect itself against AI job automation, I doubt those moves will be obviously self-serving.
Published on December 7, 2025 2:55 PM GMT
“Hardware noise” in AI accelerators is often seen as a nuisance, but it might actually turn out to be a useful signal for verification of claims about AI workloads and hardware usage.
With this post about my experiments (GitHub), I aim to
In a world with demand for assurance against hidden large-scale ML hardware use, this could become a new layer of defense, conditional on some engineering to make it deployment-ready.
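As a toy illustration of my own (the post's actual signal may well be different): floating-point accumulation is order-sensitive, so even the same nominal workload can leave slightly different low-order bits depending on how the hardware schedules its reductions.

```python
import numpy as np

# Toy illustration: summing the same float32 data in two different orders/groupings.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

in_order = np.sum(x, dtype=np.float32)          # one accumulation order

shuffled = x.copy()
rng.shuffle(shuffled)
reordered = np.float32(0)
for chunk in np.array_split(shuffled, 997):     # a different grouping and order
    reordered += chunk.sum(dtype=np.float32)

print(in_order, reordered, in_order == reordered)  # typically differ in the low-order bits
```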
The full post is to be found on my Substack.
This work was part of my technical AI governance research at MATS (ML Alignment & Theory Scholars). Special thanks go to Mauricio Baker for his excellent mentoring and guidance, and to Elise Racine for her support and helpful advice.
Published on December 7, 2025 2:49 PM GMT
I will give a short summary of the book “If Anyone Builds It, Everyone Dies” by Eliezer Yudkowsky and Nate Soares on the topic of existential risk due to AI (artificial intelligence). Afterwards, we will discuss thoughts and objections in small groups.
Feel free to join even if you have not yet read the book yourself.
I reserved a small group study room in the new computer science building close to the tram stop Durlacher Tor: 5. Gruppenräume InformatiKOM (Room 203 (Tisch)), Adenauerring 12, 76131 Karlsruhe, DE.
Let's meet at 18:30.