Published on January 22, 2026 7:24 PM GMT
Ever since modern data analysis became a thing, the US government has had to deal with people trying to use open-source data to uncover its secrets.
During the early Cold War days and America’s hydrogen bomb testing, there was an enormous amount of speculation about how the bombs actually worked. All nuclear technology involves refining large quantities of raw material into chemically pure substances. Armen Alchian, an economist working at RAND, reasoned that any US company handling such raw materials and supplying the government would have made a killing leading up to the tests.
After checking financial data that RAND maintained on such companies, Alchian deduced that the secret sauce in the early fusion bombs was lithium, and that the Lithium Corporation of America was supplying the USG. The company’s stock had skyrocketed leading up to the Castle Bravo test, either by way of enormous unexpected revenue gains from government contracts or, more amusingly, by government insiders buying up the stock to make a mushroom-cloud-sized fortune with the knowledge that lithium was the key ingredient.
When word of this work got out, the story ended the way you would expect: the FBI came in, confiscated Alchian’s data-driven research, and the G-men gave him a stern lecture on national security. But he had just invented the first event study of the modern world.
As you might have guessed, Alchian’s intellectual heir is the X account Pentagon Pizza Report, which tracks activity at pizzerias near the Pentagon as reported by Google. Started in 2024 and with over 300K followers, it may be the gold-standard OSINT meme account. Polymarket has even formalized it.
Before the X account, I had never heard of the Pentagon pizza theory, but it has its own Wikipedia page and, amazingly, the idea goes all the way back to the 1990s. The chain of causation is clear. People only work late in the Pentagon if there’s a lot of military work to do. When there is a lot of work to do, they get hungry and order a bunch of pizza. The Pentagon is huge and might strain local pizza production capacity; this shows up in Google’s location/traffic data that the feed captures. The world is messy and pizza data is messier, but can someone predict large-scale US military action from public data on pizzerias?
The Pentagon has apparently disputed this notion:
The Takeout, a food and culture site, reported in January that while there are a number of eateries in the Pentagon—where almost 30,000 people work each day, according to Arlington National Cemetery Tours—it doesn’t have its own pizzeria.
However, a Pentagon spokesperson has denied this, telling Newsweek, “There are many pizza options available inside the Pentagon, also sushi, sandwiches, donuts, coffee, etc.”
We’re going to backtest this idea in the rest of this poast and get to the bottom of this one.
The Twitter API used to be a cornucopia for interesting data projects. Now you have to pay a fortune for API access and it’s legitimately terrible. I actually could not get anything to work using the API or even using standard scraping tech to get the Pentagon Pizza Report’s tweet history; Musk has shut a lot of that shit down. I couldn’t even get ChatGPT, run by Musk’s rival Sam Altman, to help me improve the scraping against Twitter because it thought that was unethical. I even DM’ed the Pentagon Pizza account to see if he would share his data, but he did not get back to me.
I ended up getting Claude code to write me a sophisticated Selenium script that very slowly and annoyingly scrolled through the tweet history in a browser and recovered the data as it did so.
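If you want to try something similar, the core of the script is just a patient scroll-and-collect loop. Here is a minimal sketch, not the actual script; the handle and the reliance on article/time elements are assumptions about X’s markup that may break at any time:

```python
# Minimal scroll-and-scrape sketch. Selectors and the account handle are assumptions;
# the real script had far more waiting, retry logic, and de-duplication.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://x.com/PenPizzaReport")  # assumed handle
time.sleep(5)  # let the first batch of tweets render

tweets = {}
last_height = 0
while True:
    # Each visible tweet sits in an <article>; grab its text keyed by timestamp.
    for article in driver.find_elements(By.TAG_NAME, "article"):
        try:
            ts = article.find_element(By.TAG_NAME, "time").get_attribute("datetime")
            tweets[ts] = article.text
        except Exception:
            continue  # promoted cards and other junk without a timestamp
    # Scroll slowly so lazy-loaded tweets have time to appear.
    driver.execute_script("window.scrollBy(0, 1500);")
    time.sleep(3)
    height = driver.execute_script("return document.body.scrollHeight")
    if height == last_height:
        break  # nothing new loaded: end of history (or rate-limited)
    last_height = height

driver.quit()
print(f"Recovered {len(tweets)} tweets")
```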
I ended up with 648 clean PPR tweets. Here’s a short excerpt.
Using this history, I wanted to check whether US military adventures of the past year can be observed in the Pizza Report’s data. I could have selected others, but these are the most prominent operations that overlap with the Pentagon Pizza Report’s history, and for the purposes of this study I preregistered them. I haven’t yet looked at any of the Pentagon Pizza Report’s data as of writing this section.
Operation Midnight Hammer, on June 22nd, 2025, was a sophisticated set of strikes targeting Iranian nuclear infrastructure. It involved a large military footprint and careful planning, much of which doubtless took place with involvement from the Pentagon.
Operation Absolute Resolve, on January 3rd, 2026, captured Venezuela’s leader Maduro to bring him to justice in the United States. Again, this involved sophisticated intelligence and military cooperation in the Caribbean.
This one’s interesting and takes place over a longer time frame, from March 15th into May 2025. Operation Rough Rider consisted of strikes against Houthi positions obstructing shipping in the Red Sea. This choice potentially allows us to see any longer-term pizza signal that exists in the X data.
Evaluating whether PPR activity is reliably related to these three events is a complicated statistical problem. The PPR history is quite clearly not a random sample or a uniform record of pizza-related activity near the Pentagon; I couldn’t confirm this with the user, but the account plausibly selects when to tweet based on unusual pizza activity levels. Without knowing how it works, you can’t make many sound statistical assumptions about it.
Let’s look at a few basic series.
This dataset is a dog’s breakfast. Look at that giant gap in the volume! This won’t be useful at all for examining the Houthi strike test in March 2025, so we’re stuck with just the Maduro capture and the Fordow bombing. There is a pretty astonishing spike in tweet volumes leading up to the Fordow strikes. To my eye, this looks more like the account’s hiatus ending than some sort of tweet storm preceding the strikes, but we’ll return to that. First off, and most importantly, WTF is “Freddie’s”?
Freddie’s Beach Bar and Restaurant is a gay bar in Arlington near the Pentagon. As you can see from Figure 2, starting in June 2025, PPR began posting about the activity status there.
Maybe it was part of some account reboot after the long hiatus lasting into June 2025. It is obviously hilarious, and what’s interesting about the Freddie’s tweets is that the account author pretty clearly intends for them to act as a control for the pizza activity measure. If after-hours military activity at the Pentagon is causing increased pizzeria patronage, Freddie’s won’t be affected by that. But if some exogenous thing is causing increased activity at both nearby pizzerias and Freddie’s, we’d see that and know to discount how meaningful the pizza signal is.
So bravo to the PPR account for this little statistical improvement. This means that the most meaningful parts of Figure 2 will be when the yellow line and the blue line diverge.
I created a pair of slightly differently specified “Freddie-controlled” pizza activity indices, which track the divergence between Pentagon-area pizzeria activity and Freddie’s activity. I think both are reasonable measures of what we’re trying to capture.
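For concreteness, here is roughly the shape of the idea, as a minimal sketch rather than my exact specification; the file name, column names, and the activity scoring are placeholders for however you encode the tweets:

```python
# Sketch of a "Freddie-controlled" index: Pentagon-area pizza busyness relative to
# Freddie's busyness. The CSV layout and the activity_score column are hypothetical.
import pandas as pd

tweets = pd.read_csv("ppr_tweets.csv", parse_dates=["timestamp"])

pizza = tweets[tweets["venue_type"] == "pizza"].set_index("timestamp")["activity_score"]
freddies = tweets[tweets["venue_type"] == "freddies"].set_index("timestamp")["activity_score"]

# Daily means, z-scored so the two series are on a comparable scale.
daily = pd.DataFrame({
    "pizza": pizza.resample("D").mean(),
    "freddies": freddies.resample("D").mean(),
}).dropna()
z = (daily - daily.mean()) / daily.std()

# Index 1: simple difference of z-scores.
daily["index_diff"] = z["pizza"] - z["freddies"]
# Index 2: pizza activity residualized on Freddie's (removes the shared component).
beta = z["pizza"].cov(z["freddies"]) / z["freddies"].var()
daily["index_resid"] = z["pizza"] - beta * z["freddies"]
```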
Using these, I really don’t see it. Both of these raids took place during utterly unremarkable stretches of the pizza activity indices. In fact, the Maduro capture occurred at an outright nadir. Also, you can see the two indices are correlated with each other, suggesting some robustness in the measurement.
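If you want to put a rough number on “don’t see it,” the natural check is a bare-bones event-window comparison: take the index in the week before each operation and ask where its mean falls relative to the rest of the sample. A sketch, continuing the hypothetical daily table above and using the two preregistered dates:

```python
# Bare-bones event-window check on the hypothetical `daily` table from the sketch above.
import pandas as pd

events = pd.to_datetime(["2025-06-22", "2026-01-03"])  # Fordow strikes, Maduro capture

for event in events:
    window = daily.loc[event - pd.Timedelta(days=7):event - pd.Timedelta(days=1), "index_diff"]
    baseline = daily["index_diff"].drop(window.index, errors="ignore")
    # Percentile rank of the pre-event mean within the rest of the sample.
    pct = (baseline < window.mean()).mean() * 100
    print(f"{event.date()}: pre-event mean {window.mean():.2f} sits at the {pct:.0f}th percentile")
```

Under the pizza theory, those percentiles should be high in the run-up to major operations.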
There are probably more sophisticated ways of testing this on a cleaner, more complete dataset, but as a first pass, I’m going to declare that the Pentagon Pizza Theory doesn’t work. If it did, we’d at least expect to see the Freddie-controlled index spike in the run-up to major operations. We don’t. Hopefully, this means the FBI won’t find this and raid my house.
Published on January 22, 2026 7:09 PM GMT
This is a cross-post from my Substack, Clear-Eyed AI. If you want my future articles sent to you, you can subscribe for free there.
~~~~
Superintelligence might kill everyone on Earth. At least, that’s what the three most-cited AI scientists of all time believe.1
Less clear is how this might unfold. I’ve spent literally years thinking about it, dating back to when I led dangerous capability evaluations at OpenAI, and I still feel uncertain.
There’s a framework I find helpful, though: three phases of AI taking power. Its purpose isn’t to calculate the odds of AI wiping us out; only to consider what steps AI might take if it tries its hardest.
Let me walk you through it, so you can judge for yourself: whether we’re vulnerable, and what would make us safer.
Today’s ChatGPT can’t take over the world. But companies are working on building different, riskier systems: AI that tackles open-ended problems for you around-the-clock.
What happens if those systems vastly surpass human abilities?2 Could AI assert its power over us, similar to our relationship with animals?
In the research literature, we call this “loss of control.” The most ‘mild’ form is gradual disempowerment: over time, people and organizations defer important decisions to computers. Eventually, they can’t grab back the steering wheel; AI drives us where it chooses.3 More severe forms, as previously mentioned, include the possible death of everyone on Earth.
You might wonder, why would AI want to overthrow humans? Mostly, I’ll answer a different question: If AI wanted to, how would it go about this?
The short why is that AI researchers have no accepted method for creating AI that perfectly obeys our intent. Unintended goals are the norm for AI systems. Maybe we accidentally reward AI for cheating, so it learns to deceive us. Maybe AI deduces that more autonomy is useful, no matter its goal. Maybe the AI reads too much sci-fi, and the evil AI themes become self-fulfilling. Maybe a hostile group poisons the AI’s data with a list of innocent-looking numbers. The upshot is, you shouldn’t bank on AI’s goals being the same as ours.4
If we lose control, what happens next to humanity? It depends on how different our goals are. But I’ll focus earlier than this, on the loss of control itself: the steps by which AI could seize power. We call this attempting an “AI takeover,” regardless of what AI does with that power afterward. What would this look like, and what factors could hold the AI back?
AI’s first step is to circumvent humanity’s defenses, particularly any monitoring of its plans. Until then, AI can’t exactly spend months scheming to overthrow humanity; its developer might notice this and shut the AI down.
The difficulty of evading the defenses will depend on how the AI is used. Answering simple questions through a ChatGPT-like interface gives the AI little room to maneuver, but also limited usefulness.
Once AI is enlisted for harder tasks, it has more leeway. People allow it to run for more time and give it tools and privileges, like how people now let Claude Code pilot their computers. Employees within the AI company use it for long-range purposes, too, including writing code that the company relies upon for locking down the AI. As those employees expose the AI to more code and incorporate its suggestions, the AI can find or create flaws in the company’s security.5
For example, AI might try to escape to unmonitored computers: copy its files across the internet to computers its developer can’t oversee or shut down. We already have examples of AI attempting escapes like this, albeit only in evaluation settings. Companies try to prevent this by limiting the flow of large files outside their computers, but a crafty AI may succeed anyway.6
Alternatively, AI might find holes in the company’s monitoring, so it can scheme from inside the company without getting caught. Maybe the safety manager didn’t have enough budget, and so certain interfaces are unmonitored. Maybe the company was desperate to keep up with competitors, and so took shortcuts when patching their AI’s misbehavior. Maybe the monitoring system had clear flaws, and so the AI could easily disguise its misbehavior.
In any case, the AI is then less subject to its developer’s restrictions. Its freedom isn’t absolute, of course. Even if one copy of the AI has escaped its developers’ computers, some copies will remain inside the company and thus may be prone to monitoring. If the unmonitored AI wants to use those copies in its schemes, it will need to find ways to coordinate. This isn’t trivial, but think of human spies who operate in enemy territory: Handlers have ways of passing them messages.7
What comes next, after AI evades our oversight?8
AI tries to get its hooks into vital systems: building dependence upon it, and gaining direct levers of power. Fortunately for the AI, once it becomes capable enough, institutions will naturally want to integrate it. Adoption and influence grow without much scheming.
Broadly, I expect that more people and institutions will come to rely on access to powerful AI: The stock market will hinge on AI companies’ financial prospects, so there is less support for tough regulation. Users will revolt if AI companies try to decommission a favorite model. Some users will even act as disciples of the AI and carry out its bidding through makeshift cults. To varying extents, these are already happening today.17
AI’s influence gives it powerful leverage for achieving its goals. It can throw around the weight of institutions that depend upon it, and can use threats to extract concessions.
This approach doesn’t require the AI to develop new ways of inflicting harm; it only needs to manipulate people and systems that can already take catastrophic action. For instance, depending on AI’s access to offensive military systems, AI can back its threats with firepower. If AI can develop novel ways of inflicting harm, though, then keeping control is even harder.
When I picture the AI applying leverage, I don’t imagine an all-out war between humans and AI. Instead, I expect the AI to enlist human allies and to hide its actions inside human-vs-human tensions. One historical parallel: The conquistadors arrived in the Americas with only hundreds of troops, and yet defeated millions of Aztecs, in part through exploiting local tensions between groups. AI might succeed with a similar strategy, even if it has fewer resources than humanity as a whole.18
What are AI’s possible actions? Each of the levers of influence above could, if AI wanted to, be wielded toward serious harm:
To be clear, this leverage doesn’t have to result in the death of everyone on Earth. Many of these levers would stop short, and yet could still entail catastrophic losses, perhaps ones where the AI ends up essentially declaring “checkmate.” If we decide the prudent option is to surrender, we’re then at its discretion: It is the AI’s choice whether to leave us in peace, marginalize us, or something worse. I find this to be cold comfort.19
Even once AI is capable of usurping humans — and wants to — three limitations might hold it back.
Without people to labor in factories and power plants, AI can’t keep the lights on, much less run the supply chain to replace its components as they wear out. This gives AI reason to keep humans around, at least until it achieves what researcher Ajeya Cotra calls “self-sufficient AI.” At that milestone, AI no longer needs humans in order to achieve its interests.
Before that milestone, we’re a bit safer, at least in theory: The AI still needs us, even if it doesn’t value us intrinsically. Of course, AI could conscript us into labor under threat of violence. If AI can make us do work on its behalf, it doesn’t need the same degree of self-sufficiency.
Nobody knows how to control superintelligent AI systems. One prominent idea is to train other AIs to solve this problem for us, in a cascading tower of ever-more-capable systems. The hope is that each can keep the next-more-capable model in line, so that the tower doesn’t crash down on us. Needless to say, this is precarious, hence the problem being considered unsolved.20
A superintelligent AI system might not know how to control another superintelligence either, especially not one that is even more capable. Suppose a superintelligence identifies an option for disempowering humans, but it requires creating an even smarter superintelligence first. How can it be sure its goals will be honored by the more powerful system?
In human terms, maybe you’re smart enough to successfully rob a bank, if you consult with a thief even more expert than yourself. But are you capable enough to stop that thief from looting you afterward? On the other hand, you can’t wait forever: If you know other thieves are closing in on the score, your choice may be to strike now or forgo it forever.
A similar dynamic applies for the AI. The longer it waits without taking control from humans, the more it risks humans training other AI systems that could overpower it. At what point would AI feel compelled to make a takeover attempt, lest it lose the chance?
The threat of future competition might cause AI to launch a takeover sooner. But if competition already exists, then perhaps no single AI could succeed, not because of humans but because other AIs would interfere.
Corporations today far exceed individual human capability, yet no single corporation rules the world; other corporations and governments keep them in check. Could powerful AI systems constrain each other’s bids for power? Maybe, but I wouldn’t bank on this.
Powerful entities sometimes collude, like corporations suppressing wages through no-poach agreements. Meanwhile, AI systems might have strong mutual interests in sidelining humanity; will our safety depend on these systems not recognizing this? I feel pretty uncomfortable hoping that a permanent rivalry will form between superhuman AIs, one that allows us to hide in the shadows.21
A few years ago, the idea that AI companies would deploy their most capable systems with full internet access seemed absurd. Today, we’re staring down the US Department of War integrating unstable AI systems throughout their classified networks. We’re facing the possibility that AI companies will hand the reins over to AI itself.22
The lesson, of course, isn’t that every alarming scenario will come to pass. But we should be way less confident about what “would never happen.” Our controls need to hold up to surprising decisions about how to integrate AI, as well as surprising new abilities.
The good news is that countries have plenty of experience defending their interests against foreign adversaries. Regarding AI as a potential adversary — not just to one country, but to all of humanity — would be a smart expansion of our defenses.
Right now, we are vulnerable; at least, that’s my assessment. We are too dependent on the assumption that AI is on our side. We are too exposed to AI companies deciding to cut corners. And so I wonder: Will we fix our vulnerabilities while we still can?
~~~~
Acknowledgements: Thank you to Dan Alessandro, Daniel Kokotajlo, Girish Sastry, Michael Adler, Michelle Goldberg, Richard Ngo, and Sam Chase for helpful comments and discussion. The views expressed here are my own and do not imply endorsement by any other party.
Published on January 22, 2026 5:46 PM GMT
TLDR: Will AI automation speed up capabilities research or safety research first? I forecast that most areas of capabilities research will see a 10x speedup before safety research. This is primarily because capabilities research has clearer feedback signals and relies more on engineering than on novel insights. To change this, researchers should start building and adopting tools to automate AI Safety research and focus on creating benchmarks, model organisms, and research proposals, and companies should grant safety researchers differential access to new models.
Epistemic status: I spent ~ a week thinking about this. My conclusions rely on a model with high uncertainty, so please take them lightly. I’d love for people to share their own estimates about this.
AI might automate AI R&D in the next decade and thus lead to large increases in AI progress. This is extremely important for both the risks (eg an Intelligence Explosion) and the solutions (automated alignment research).
On the one hand, AI could drive AI capabilities progress. This is a stated goal of Frontier AI Companies (eg OpenAI aims for a true automated AI researcher by March of 2028), and it is central to the AI 2027 forecast. On the other hand, AI could help us to solve many AI Safety problems. This seems to be the safety plan for some Frontier AI Companies (eg OpenAI’s Superalignment team aimed to build “a roughly human-level automated alignment researcher”), and prominent AI Safety voices have argued that this should be a priority for the AI safety community.
I argue that it will matter tremendously in which order different areas of ML research will be automated and, thus, which areas will see large progress first. Consider these two illustrative scenarios:
World 1: In 2030, AI has advanced to speeding up capabilities research by 10x. We see huge jumps in AI capabilities in a single year, comparable to those between 2015-2025. However, progress on AI alignment turns out to be bottlenecked by novel, conceptual insights and has fewer clear feedback signals. Thus, AI Alignment research only sees large speedups in 2032. There is some progress in AI Alignment, but it now lags far behind capabilities.
World 2: In 2030, AI is capable of contributing to many areas of AI research. This year sees massive progress in AI Safety research, with key problems in alignment theory and MechInterp being solved. However, progress in AI capabilities mostly depends on massive compute scale-ups. Thus, capabilities research doesn’t benefit much from the additional labour and continues at the same pace. In 2032, when AI is capable of significantly accelerating progress in AI capabilities, key problems in AI Safety have already been solved.
World 2 seems a lot safer than World 1, because safety problems are solved before they are desperately needed. But which world are we heading toward? To forecast this, I ask:
In which order will different areas of ML R&D experience a 10x speedup in progress due to AI compared to today?
Some clarifications on the question:
Some other relevant forecasts include:
However, none of these forecasts break down automation by research area - they treat AI R&D as a monolithic category rather than distinguishing different research areas or considering differential acceleration of safety vs capabilities research.
How can we attempt to forecast this? I focus on 7 factors that make an area more or less amenable to speedups from automation and evaluate 10 research areas on those (5 safety; 5 capabilities).
My approach was to determine factors that indicate whether an area is more or less amenable to AI-driven speedups. For this, I drew on some previous thinking:
I determined these 7 factors as most important and scored each research area on them:
I assigned a weight to each of these factors according to how important I judge them to be. Next, for each factor and each research area, I estimated a value from 1-10 (where 10 means the area is more amenable to automation). I then combined them into a weighted sum to estimate how amenable each research area is to an AI-enabled speedup.
I only focus on 10 areas, of which the first 5 are considered “AI Safety” while the other 5 improve “AI Capabilities”:
These are supposed to be representative of some important areas of AI Safety and some areas critical to improving Frontier AI capabilities. This is a simplification, because no area neatly fits into the safety/capabilities buckets, and because there might be specific problems within an area that are more or less amenable to automation. Additionally, I don’t take into account that some problems might get harder over time, because adversaries get stronger or low-hanging fruit is picked. It’s not necessary for all safety areas to be sped up before all capabilities areas, but the more safety research sees progress before capabilities research, the better.
I arrived at values per factor for each area by:
Weaknesses of the method: My conclusions are sensitive to the factor weights, and I am quite uncertain about them. I’m uncertain about the values I assigned per research area. Additionally, advances in AI capabilities might unlock large speedups across all research areas nearly simultaneously, making it less important which research areas see gains from automation earlier. Further, I don’t take into account factors such as (1) how parallelizable work is, (2) what human bottlenecks there are during the research process, (3) the effects of partial automation, and (4) labor allocation where researchers move to other areas.
This table shows the rating for each area and factor, where higher numbers mean that the factor implies higher amenability to automation. The last rows show the weighted sum results and the average amenability to automation for safety and capabilities research areas:
| Factor (weight) | Interpretability | Alignment Theory | Scalable Oversight | AI Control | Dangerous Capability Evals | Pre-training Alg. Improvement | Post-training | Data generation | Agent scaffolding | Training efficiency |
|---|---|---|---|---|---|---|---|---|---|---|
| Task length (1.0) | 7.62 | 2.12 | 4.00 | 6.25 | 7.62 | 3.12 | 5.00 | 6.12 | 7.50 | 6.12 |
| Insight/creativity (1.5) | 4.56 | 1.50 | 5.00 | 6.06 | 7.56 | 6.12 | 6.75 | 8.86 | 7.50 | 6.23 |
| Data availability (0.2) | 4.88 | 3.00 | 2.25 | 2.00 | 4.12 | 7.75 | 7.00 | 6.38 | 5.88 | 7.00 |
| Feedback quality (2.5) | 3.88 | 1.38 | 3.88 | 4.38 | 3.62 | 7.50 | 7.38 | 6.00 | 7.50 | 9.12 |
| Compute vs labour (2.5) | 7.88 | 10.00 | 6.75 | 8.12 | 7.62 | 2.12 | 4.38 | 3.88 | 6.62 | 5.25 |
| Scheming risk (0.4) | 2.00 | 1.00 | 1.00 | 1.00 | 1.00 | 9.00 | 7.00 | 9.00 | 9.00 | 9.00 |
| Economic incentive (0.5) | 3.88 | 1.88 | 4.00 | 4.62 | 5.38 | 9.38 | 9.38 | 9.00 | 8.75 | 9.12 |
| Amenable to Automation | 5.53 | 4.04 | 4.76 | 5.78 | 5.93 | 5.37 | 6.21 | 6.22 | 7.35 | 7.09 |
| Category Average | Safety: 5.21 | | | | | Capabilities: 6.45 | | | | |
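To make the scoring concrete, here is a minimal sketch of the arithmetic. The table’s scores are consistent with a weight-normalized sum (sum of weight × value divided by total weight); that normalization is my inference from the numbers rather than something stated explicitly. Using the Interpretability column:

```python
# Weighted factor score: sum(weight * value) / sum(weights).
# Weights and the Interpretability values are taken from the table above.
weights = {
    "task_length": 1.0, "insight_creativity": 1.5, "data_availability": 0.2,
    "feedback_quality": 2.5, "compute_vs_labour": 2.5,
    "scheming_risk": 0.4, "economic_incentive": 0.5,
}
interpretability = {
    "task_length": 7.62, "insight_creativity": 4.56, "data_availability": 4.88,
    "feedback_quality": 3.88, "compute_vs_labour": 7.88,
    "scheming_risk": 2.00, "economic_incentive": 3.88,
}

def amenability(values):
    return sum(weights[f] * values[f] for f in weights) / sum(weights.values())

print(round(amenability(interpretability), 2))  # 5.53, matching the table
```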
Overall, capabilities research seems more amenable to speedups from automation than safety research under this model. The only exception is algorithmic improvement in pre-training, because it is heavily bottlenecked by compute.
Agent Scaffolding and Training Efficiency appear to be most likely to see speedups from automation. Both have easy-to-verify progress, are not hugely bottlenecked by compute, and promise large economic benefits for AI companies. Among safety research, I believe that AI Control and Dangerous Capability Evals are more likely to see speedups from AI R&D, as they are somewhat engineering-heavy and companies have some interest in making progress on them. Alignment Theory seems least likely to see large speedups from AI research soon, mostly because it is very difficult to verify whether one is making progress and because it is largely driven by insight instead of engineering work.
The factors that most favor automation of capability over safety research are scheming risk, economic incentive, feedback quality, data availability, and reliance on engineering. Task length is neutral, while safety research is less compute-bottlenecked.
You can read my reasoning for my judgments in this Google Doc. If you think I’m wrong about something, you have 2 great options:
What could we do to affect when and how much AI R&D affects capabilities and safety progress? Here we are aiming for differential acceleration, i.e., either speeding up the automation and resulting progress of safety research or slowing down capabilities research.
Slowing down the automation-based acceleration of capabilities progress might be achieved through government pressure/regulation or convincing AI companies to be careful. Additionally, it could be beneficial to give differential access to new model capabilities to safety researchers. For example, after a company develops a new model, it could first apply the model to safety research for 1 month before using it to improve capabilities. Similarly, it would be great to get commitments from AI Companies to spend x% of their compute on automated safety research.
Speeding up safety research seems more tractable to me, as many interventions can be done unilaterally by outside actors. However, several of these interventions risk spillover effects by accidentally speeding up the automation of capabilities research.
Do the schlep of building safety research automations. There are many mundane tasks and iterations needed to make automated AI Safety research work. For example, Skinkle et al. suggest making safety-relevant libraries and datasets more accessible to AI agents. Hobbhahn argues for building research pipelines now, into which new models can be plugged. In general, people should be trying to automate large parts of safety research now, see what problems come up, fix them, and iterate on that basis. This should be a priority of safety researchers inside labs and presents an opportunity for outside startups to build safety automation tools.
Improve feedback quality. My model suggests that feedback quality is where safety lags furthest behind capabilities research. Thus, AI Safety Researchers should try to turn more safety problems into “numbers-go-up” science. For example, researchers should prioritize developing model organisms of misalignment, creating new experimental protocols for evaluating scalable oversight methods, and building simulated environments for AI control red-blue games.
Prepare research proposals. Writing up detailed research proposals and project specifications reduces the amount of novel insight needed, making it easier for AI systems to contribute. Human researchers should consider focusing their time on conceptual thinking, insight generation, and threat modelling—the tasks that are hardest to automate—while deferring engineering-heavy implementation work that AI will soon handle well.
Accelerate adoption of AI tools. Safety researchers should be early and heavy adopters of AI automation tools. This could be supported through workshops on research automation and public discussion of best practices. Golfhaber & Hoogland argue that adoption means not only individuals automating some of their tasks, but organisations adopting practices and changing workflows so that they can build so-called “automated research fleets”. Thus, it is also on the leadership of AI Safety orgs to encourage adoption by their staff and to develop an organisational structure amenable to AI automation.
Reduce scheming risk. Investing in AI Control and alignment research helps reduce both the probability and impact of scheming on our ability to get useful safety research out of AI systems. This is important because scheming risk disproportionately affects safety research. Funders and researchers should continue to prioritize this threat model.
Differentially accelerate helpful capabilities. Some AI capabilities may disproportionately benefit safety research. We should aim to accelerate those. Examples include conceptual and philosophical thinking, forecasting, and threat modelling. For this, a researcher should study which capabilities might be good candidates for differential automation. Then, AI Developers should prioritize improving those capabilities.
Safety Research Automation Benchmark. To make it easier to measure progress on safety research automation, Skinkle et al suggest building dedicated benchmarks, including relevant tasks. This could be a great project for AIS researchers outside of companies.
Invest in documentation. There is less public data on safety research than on capabilities research. Any AI Safety Researcher could help by releasing internal notes and failed experiments, documenting and publishing research processes (not just results), and recording and transcribing their brainstorming sessions. AIs could be trained on this data to be better able to conduct safety research, or the data could be used as additional context. Funders could also incentivise researchers to document their research progress.
Of the proposed interventions, I am most excited about safety-conscious researchers and engineers directly trying to build tools for automated AI safety research, because it will highlight problems and enable the community to build on them. While there is a significant risk of also accelerating capabilities research, I believe it would strongly differentially accelerate safety automation.
I made a start at forecasting this. I think a higher-effort, more rigorous investigation would be very useful. Specifically, later work could be much more rigorous in estimating the numbers per area & factor that feed into my model, and should use more sophisticated methodology than a weighted factor model.
A key insight from the labor automation literature is that technology displaces workers from specific tasks rather than eliminating entire jobs or occupations. Similarly, the right level of analysis for our question is to look at specific tasks that researchers do and see how amenable they are to AI automation or support.
One approach for identifying the tasks is to conduct interviews with ML experts in different areas to elicit:
I did one pilot interview where I asked about the above. I learned that it is useful to ask about specific timeframes when eliciting tasks (“in the last month”, “in your last project”), to use hypothetical comparisons (“how much faster with X tool?”), not to use abstract scales (“rate data availability from 1-10”), and to distinguish between different types of data availability.
Other work that seems useful:
Thanks to Charles Whitman, Julian Schulz, and Cheryl Wu for feedback.
Published on January 22, 2026 5:35 PM GMT
This is a question, but also a linkpost from my (new) Substack.
“How could I have thought that faster?” is great advice that I am embarrassed I did not think of myself. A subset of this, I would say, is not about thinking original thoughts but about the ability to learn new things effectively. What prompted this is my continual inability to find all the papers relevant to a question I am researching in a timely manner.
I have been reading about biological learning rules for four years, albeit not continuously, yet I still manage to find papers that are both old and full of the keywords I routinely search. I wonder how I missed them, or whether I found them before and ignored them for some reason. But the big fear isn’t about any paper in particular; it’s that I won’t be able to fill in huge knowledge gaps when the answer is out there.
There is an especially humorous, or disappointing, example of this happening to someone else: a biologist rediscovering calculus in 1994.
I think that this poses a real problem to academics in general and will only worsen as more and more research is published that could be applicable to your problem.
I also think it is worse for autodidacts or people otherwise learning alone; having structure and people around you lets you leverage existing knowledge much more easily.
Maybe this is just me being bad at using search engines and the new wave of LLMs, but it is a big frustration. I don’t think tools like Semantic Scholar fully solve the problem either, at least not for me.
I think this generalizes past finding papers to read. It is more of a failure to know what you should be spending your time doing during the skill/information acquisition stage of a project. When you start a project, you usually aren’t sure how you are going to solve all the problems that pop up along the way. If you don’t already have a huge bag of problem solving techniques that are applicable to that domain, it is hard to know where you should be allocating your time to learn what will be useful and not a bunch of extraneous knowledge vs. actually getting to work.
I guess what I am really asking is how best to solve these instances of the explore vs. exploit problem. How do you know when you’re informed? If you want to make the declaration that a topic is not well researched academically, how do you know you have exhausted the literature enough?
I have been burned on this so many times that I have become very enamored of the idea of going all the way back to basics. This is enjoyable to a certain extent and almost ensures I don’t miss anything, but it is probably not optimal in many ways. LW has discussed reading textbooks as an underrated way of gaining knowledge. I have somewhat successfully rolled this into my belief system, but it still doesn’t feel like making progress. The questions I have become interested in are now in neuroscience, but I am going through a math textbook. I think it will be useful, but is it optimal? Am I just going to have to eat a year of study before I can start work on things I actually care about?
Published on January 22, 2026 4:47 PM GMT
In the Sable story (IABIED), AI obtains dangerous capabilities such as self-exfiltration, virus design, persuasion, and AI research. It uses a combination of those capabilities to eventually conduct a successful takeover against humanity. Some have criticised this as apparently implying that the AI achieves these capabilities suddenly, with little warning. The argument goes that since we have seen gradual progress in the current LLM paradigm, we should expect this to be unlikely (here is Will MacAskill making a version of this argument)[1]. I think this suffers from a confusion between underlying variables (e.g. AI intelligence increasing gradually) and the actual capabilities we care about (can it outwit humanity, can it do AI research). Intuitively, if an AI system gains one IQ point each day, we should still expect a relatively sudden period in which it shifts from “it sometimes outwits us” to “it can reliably outwit us” despite gradual progress. The relevant question we want to answer is “can it outwit us?”, not “how smart is it?”.
There has recently been major excitement about Claude Code’s ability to automate much of software development. Cursor CEO Michael Truell reports that it was able to write a browser autonomously. There is, however, no sign that Opus 4.5 represents a paradigm shift; it was simply iterative progress over previous versions. So why does it feel like the system suddenly got much more capable despite incremental improvements? What happened is that even though an underlying variable (coding skill) went up gradually, the actual capability that we care about (“can it automate software engineering?”) emerged much more suddenly when a critical threshold was crossed. While just recently early coding agents produced cascading errors and didn’t speed up experienced developers, this suddenly changed when an underlying threshold in reliability and skill was crossed.
GPT-1 released June 2018. GitHub Copilot launched June 2021. Opus 4.5 shipped November 2025. The flip from being somewhat useful (Claude 3.7, February 2025) to revolutionizing coding (Opus 4.5, November 2025) took 9 months. This is in fact a common pattern, and it should suggest to us that dangerous capabilities might similarly emerge suddenly out of gradual progress.
Many of you have probably seen Andy Jones’ intriguing post on horses and automobiles, where he argues, ostensibly on the topic of job loss from AI: steam engines were invented around 1700, and 200 years of steady improvement in engine technology followed. For the first 120 years, horses didn’t notice, but then, between 1930 and 1950, 90% of US horses disappeared. According to Andy Jones, progress was steady; equivalence was sudden.
Computer chess has been tracked since 1985. For 40 years, engines improved by 50 Elo per year. In 2000, a grandmaster expected to win 90% of games against a computer. Ten years later, that same grandmaster would lose 90%.
Andy Jones notes about his work at Anthropic that LLMs started to answer questions of new hires: "In December, it was some of those questions. Six months later, 80% of the questions I'd been asked had disappeared."
Again, while progress on engine technology or chess was gradual, the actual capability that we care about — can this replace horses, can it beat the best humans at chess — emerged suddenly.
It appears clear that an AI system with capabilities such as self-exfiltration, virus design, persuasion, and AI research could be existentially dangerous. If each of them arrives relatively suddenly out of gradual progress, we might find that the puzzle is completed and a critical mass of such capabilities is reached with very little warning: we could inadvertently add the last puzzle piece and switch from a passively safe to a dangerous regime.
None of the capabilities listed above appear wildly superhuman; all are at the top of human performance, perhaps sped up or slightly above top humans. All are strongly correlated with the overall intelligence of the system. It seems quite natural to think of dangerous capabilities as the general ability of the AI to outwit humanity. This is analogous to the Elo rating of chess players, where we would expect a pretty sudden S-curve in win likelihood as the Elo of the AI player increases.
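The Elo analogy can be made concrete: under the Elo model, the expected score is a logistic function of the rating difference, so a steady linear climb in rating produces a sharp S-curve in win likelihood. A small illustration (the 50-Elo-per-year pace is borrowed from the chess example above; the starting gap of 400 points is an arbitrary assumption):

```python
# Elo expected score: E = 1 / (1 + 10**(-diff / 400)).
# A constant 50 Elo/year gain still produces an abrupt flip in expected score.
def expected_score(rating_diff):
    return 1 / (1 + 10 ** (-rating_diff / 400))

for year in range(0, 17, 2):
    diff = -400 + 50 * year  # engine starts 400 points below the human (assumed)
    print(f"year {year:2d}: rating gap {diff:+4d}, expected score {expected_score(diff):.2f}")
```

Expected score climbs from roughly 0.09 to roughly 0.91 over that span, with the steepest change packed around parity, even though the rating gain per year never changes.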
These capabilities might not emerge at around the same time or in the same model, and this could give us time to study them in isolation before AI systems are dangerous. Because they are all tied to intelligence and don’t appear wildly superhuman, it is plausible that they will arrive close together, but it might not matter much if they don’t.
For one, there are probably many dangerous capabilities we are not tracking, so we wouldn’t notice them appearing in the first place. As an example, this could be the ability to perform hypnosis on humans, or something similarly strange. Models have developed decent capabilities at jailbreaking other models without explicit training.
Many labs, such as Anthropic and OpenAI, are very focused on automating AI research capabilities (recursive self-improvement). This acts like an amplifier, since the system can improve on its own shortcomings. So once we reach dangerous levels of AI research capability, we might suddenly get a system with a wide range of dangerous capabilities in other areas.
Even if one crucial dangerous capability lags behind the others for much longer, it’s unclear whether that helps. While this would give us the ability to study the other dangerous capabilities in isolation, it also implies that once the last puzzle piece is put in place, the other dangerous capabilities are already highly developed. As an example, if it takes much longer than expected to reach automated AI research capabilities, that gives us more time to study the emergence of other dangerous capabilities such as persuasion. On the other hand, by the time we do get automated AI research capabilities, persuasion abilities will have had that much more time to strengthen.
It's also unclear what we would do with the information that our AI systems have increasingly dangerous capabilities, say, superhuman persuasion?
The same iterative progress that crossed the coding threshold will cross thresholds for persuasion, for autonomous AI research, for bio-research. While intelligence or science knowledge of models might increase gradually, we should expect a relatively sudden emergence of the relevant capabilities. Things can go very fast from “it can do this with some probability” to “it can reliably do this”.
What early warning signs should we look for? Perhaps a substantial AI research paper written by AI. Then most AI research could be done by AIs 9-12 months later. Could be this year.
Will still acknowledges that things might go very fast (months to years), so this doesn’t necessarily refute the point I am trying to make.
Published on January 22, 2026 4:34 PM GMT
Today, PauseAI and the Existential Risk Observatory release TakeOverBench.com: a benchmark, but for AI takeover.
There are many AI benchmarks, but this is the one that really matters: how far are we from a takeover, possibly leading to human extinction?
In 2023, the broadly coauthored paper Model evaluation for extreme risks defined the following nine dangerous capabilities: Cyber-offense, Deception, Persuasion & manipulation, Political strategy, Weapons acquisition, Long-horizon planning, AI development, Situational awareness, and Self-proliferation. We think progress in all of these domains is worrying, and it is even more worrying that some of these domains add up to AI takeover scenarios (existential threat models).
Using SOTA benchmark data, to the degree it is available, we track how far we have progressed on our trajectory towards the end of human control. We highlight four takeover scenarios, and track the dangerous capabilities needed for them to become a reality.
Our website aims to be a valuable source of information for researchers, policymakers, and the public. At the same time, we want to highlight gaps in the current research:
Despite all these uncertainties, we think it is constructive to center the discussion on the concept of an AI takeover, and to present the knowledge that we do have on this website.
We hope that TakeOverBench.com contributes to:
TakeOverBench.com is an open source project, and we invite everyone to comment and contribute on GitHub.
Enjoy TakeOverBench!