Published on January 23, 2026 2:12 AM GMT
Epistemic status: Confident in the direction, not confident in the numbers. I have spent a few hours looking into this.
Suppose human values were internally coherent, high-dimensional, explicit, and decently stable under reflection. Would alignment be easier or harder?
The calculations below show that it would be much harder, if not impossible. I'm going to try to defend the claim that:
Human values are alignable only because evolution compressed motivation into a small number of low-bandwidth bottlenecks[1], so that tiny genetic changes can change behavior locally.
If behavior is driven by a reward vector over a high-dimensional feature space (large k), inverse reinforcement learning requires an unreasonable number of samples. But if it is driven by a low-rank projection onto a space with small k, inference may become tractable.
A common worry about human values is that they are complicated and inconsistent[2][3][4]. And the intuition seems to be that this makes alignment harder. But maybe the opposite is the case. Inconsistency is what you expect from lossy compression, and the dimensionality reduction makes the signal potentially learnable.
Calculation with Abbeel & Ng's formula[5] gives
| m > | k=10 | k=1000 |
| --- | --- | --- |
| γ = 0.9 | | |
| γ = 0.99 | | |

with m the required number of expert trajectories, k the dimension of the value features, and γ the discount factor (ε and δ as in the bound quoted in [5]).
If you need at least 20 billion samples to learn complex values, we are doomed. But it may become solvable with a reduction in the number of required trajectories by a factor of about 200 (depending on how high-dimensional you think the values are; 1000 is surely conservative - if any kind of values can actually be learned, the number may be much higher).
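To make the arithmetic reproducible, here is a minimal Python sketch of mine (not the author's calculation) that plugs numbers into the Abbeel & Ng bound. The values ε = 0.1 and δ = 0.05 are assumptions chosen for illustration; with them, γ = 0.99 and k = 1000 give roughly 20 billion trajectories, and the k = 1000 vs k = 10 ratio comes out near 200, matching the figures above.

```python
# Minimal sketch: Abbeel & Ng's sample-complexity bound  m >= 2k / (eps*(1-gamma))**2 * log(2k/delta).
# eps and delta below are illustrative assumptions, not values stated in the post.
import math

def required_trajectories(k: int, gamma: float, eps: float = 0.1, delta: float = 0.05) -> float:
    """Lower bound on the number of expert trajectories m for feature dimension k."""
    return 2 * k / (eps * (1 - gamma)) ** 2 * math.log(2 * k / delta)

for gamma in (0.9, 0.99):
    m10, m1000 = required_trajectories(10, gamma), required_trajectories(1000, gamma)
    print(f"gamma={gamma}: k=10 -> {m10:.2e}, k=1000 -> {m1000:.2e}, ratio ~ {m1000 / m10:.0f}")
```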
This could explain why constitutional AI works better than expected[6]: A low-dimensional latent space seems to capture most of the variation in preference alignment[7][8]. The x200 reduction doesn't mean it's easy. The bottleneck helps with identifiability, but we still need many trajectories, and the structure of the bottleneck mapping[9] can still kill us.
How can we test whether the dimensionality of human values is actually low? We should see diminishing returns in predictability, for example when using N pairwise comparisons of value-related queries: the marginal gain in predictability should drop off once N reaches a small multiple of k, e.g., for k∼10 we'd expect an elbow around N~150.
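Here is a rough simulation of that test (my own sketch, not from the post). It assumes the shared low-dimensional value basis is already known (say, estimated from population data), generates noisy pairwise choices from a k = 10 latent preference, and checks how held-out predictability saturates as the number of comparisons N grows; all dimensions and noise levels are made up for illustration.

```python
# Sketch: if an individual's values are a k-dimensional combination of a known shared basis B,
# held-out predictability of their pairwise choices should saturate after a small multiple of k comparisons.
# B, the noise scale, and all dimensions are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, k = 200, 10
B = rng.normal(size=(d, k)) / np.sqrt(d)     # shared low-dimensional value basis (assumed known)
z_true = rng.normal(size=k)                  # this person's coordinates in value space
w_true = B @ z_true                          # their effective preference vector

def simulate_comparisons(n):
    X = rng.normal(size=(n, d)) - rng.normal(size=(n, d))            # feature difference of two options
    y = (X @ w_true + rng.normal(scale=2.0, size=n) > 0).astype(int)  # noisy pairwise choice
    return X @ B, y                                                   # learner works in the k-dim space

X_test, y_test = simulate_comparisons(5000)
for n in (20, 50, 100, 150, 300, 1000):
    X_train, y_train = simulate_comparisons(n)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"N={n:5d}  held-out accuracy={clf.score(X_test, y_test):.3f}")  # expect an elbow at N of order k
```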
I'm agnostic about what the specific bottlenecks are here, but I'm thinking of the channels in Steven Byrnes' steering-system model and the limited number of brain regions they influence. See my sketch here.
AI Alignment Problem: ‘Human Values’ don’t Actually Exist argues that human values are inherently inconsistent and not well-defined enough for a stable utility target:
Humans often have contradictory values … human personal identity is not strongly connected with human values: they are fluid… ‘human values’ are not ordered as a set of preferences.
Instruction-following AGI is easier and more likely than value aligned AGI:
Though if you accept that human values are inconsistent and you won’t be able to optimize them directly, I still think that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.
In What AI Safety Researchers Have Written About the Nature of Human Values we find some examples:
[Drexler]: “It seems impossible to define human values in a way that would be generally accepted.” ...
[Yampolskiy]: “human values are inconsistent and dynamic and so can never be understood/programmed into a machine. ...
In contrast, Gordon Worley offers the intuition that there could be a low-dimensional structure:
[Gordon Worley]: “So my view is that values are inextricably tied to the existence of consciousness because they arise from our self-aware experience. This means I think values have a simple, universal structure and also that values are rich with detail in their content within that simple structure.
Abbeel & Ng give an explicit bound for the required number of expert trajectories:
it suffices that

m ≥ 2k / (ε(1−γ))² · log(2k/δ)

with m the number of expert trajectories, k the feature dimension, γ the discount factor, ε the target accuracy, and δ the allowed failure probability.
Constitutional RL is both more helpful and more harmless than standard RLHF.
This aligns with expectations, as head_0 corresponds to the eigenvector with the largest variance, i.e., the most informative direction. Furthermore, among the top 100 heads [of 2048], most of the high-performing heads appear before index 40, which aligns with PCA’s property that the explained variance decreases as the head index increases. This finding further supports our argument that PCA can approximate preference learning.
DRMs represent diverse human preferences as a set of orthogonal basis vectors using a novel vector-based formulation of preference. This approach enables efficient test-time adaptation to user preferences without requiring additional training, making it both scalable and practical. Beyond the efficiency, DRMs provide a structured way to understand human preferences. By decomposing complex preferences into interpretable components, they reveal how preferences are formed and interact.
Rethinking Diverse Human Preference Learning through Principal Component Analysis
retaining just 4 components (≈15% of total variance) reproduces nearly the full alignment effect.
...
By combining activation patching, linear probing, and low-rank reconstruction, we show that preference alignment is directional, sparse, and ultimately localized within a mid-layer bottleneck.
Alignment is Localized: A Causal Probe into Preference Layers
Steven Byrnes talks about thousands of lines of pseudocode in the "steering system" in the brain-stem.
Published on January 23, 2026 1:40 AM GMT
I'm glad I know this, and maybe some people here don't, so here goes. Order doesn't matter for joint events: "A and B" refers to the same event as "B and A", so P(A and B) = P(B and A). Write each side with the product rule and set them equal:

P(A|B) P(B) = P(B|A) P(A)

Divide by P(B):

P(A|B) = P(B|A) P(A) / P(B)

And you're done! I like substituting hypothesis (H) and evidence (E) to remember how this relates to real life:

P(H|E) = P(E|H) P(H) / P(E)
You might also want to expand the denominator using the law of total probability, since you're more likely to know how probable the evidence is under each particular hypothesis than in general:

P(H|E) = P(E|H) P(H) / Σ_i P(E|H_i) P(H_i)
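As a quick sanity check, here is a tiny numerical example (my own, with made-up numbers) applying the expanded form to two hypotheses, H and not-H:

```python
# Hypothetical numbers: evidence E is 90% likely under H, 10% likely under not-H, with a 20% prior on H.
p_H = 0.2
p_E_given_H = 0.9
p_E_given_notH = 0.1

p_E = p_E_given_H * p_H + p_E_given_notH * (1 - p_H)   # law of total probability
p_H_given_E = p_E_given_H * p_H / p_E                   # Bayes' theorem

print(p_H_given_E)  # 0.18 / 0.26 ≈ 0.692
```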
Published on January 23, 2026 12:01 AM GMT
In June 2025, Kalshi unveiled an ad campaign with the slogan “The world’s gone mad, trade it.” It was one of the first TV ads ever to be entirely generated by AI, and its content was quickly met with a slew of parodies and jokes all over the internet.
I must agree the ad was quite funny. I admire Kalshi and Polymarket as businesses, but I have some concerns about the ad's objective.
It’s not that you can’t lose an enormous amount of money on prediction markets, but the win rate and general “fairness of the game” are much better for the average trader. Prediction markets also provide valuable utility in the form of calibrated forecasts (i.e. events the market prices at X% resolve YES X% of the time), unlike sportsbooks, which offer miscalibrated odds that are essentially meaningless. Therefore, we should be shifting away from sportsbooks and towards PMs, while also trying not to fuel gambling addictions, which many hoped would be addressed by more reasonable advertising, not the slop we’ve been getting recently.
Sportsbooks such as FanDuel and DraftKings should not be legal. I believe, as most do, that a country’s first and foremost goal is to protect its citizens. What are we allowing now? We are permitting degens who aren’t using any form of logic or reasoning to take negative-EV positions against a house that needs to rig special offers to turn a profit, like a casino (which should also not be legal, but that’s beside the point).
Sportsbooks are negative EV because they slightly inflate probabilities so they sum to >100%. If the true probability of a football game is 50% one team wins, 50% the other, a sportsbook would inflate those probabilities to something like 52% and 52%. Here’s an expected value calculation for a 1-unit bet on team A in that scenario:
EV = P(A) * (net profit if A) + P(B) * (loss if B)
net profit if A = 1/0.52 - 1
net profit if A = 0.92308
loss if B = -1
EV = .5 * 0.92308 + .5 * -1
EV = 0.46154 - .5
EV = -0.03846
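A quick sketch (mine, not the author's) that generalizes this arithmetic: given the true win probability and the implied probability the book quotes, compute the EV of a 1-unit bet.

```python
def sportsbook_ev(true_prob: float, implied_prob: float) -> float:
    """EV of a 1-unit bet at decimal odds 1/implied_prob when the true win probability is true_prob."""
    net_profit_if_win = 1 / implied_prob - 1
    return true_prob * net_profit_if_win - (1 - true_prob) * 1

print(sportsbook_ev(0.50, 0.52))  # ≈ -0.038, the example above
print(sportsbook_ev(0.50, 0.50))  # 0.0, a fair (vig-free) book
```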
No matter how skilled you are, it’s highly improbable to win unless you manipulate bonuses.
Prediction markets offer bets against counterparties such as human traders, bots, and AMMs. On average, PM positions are neutral expected value.1 They offer more opportunities to make a profit by using advanced strategies and unique worldviews. This is the core difference between PMs and sportsbooks, and it’s why they should be treated as their own legal class.
But an alien in a Kalshi jersey downing a pint of beer seems less like a display of the product and more like a heartless, thrifty way to attract what some people would call “dumb money.” While this may be good for smarter traders, it leans toward the more unethical side of degenerate gambling, and probably doesn’t help Kalshi’s legal cases.
Modern prediction market exchanges should remain places for trading futures contracts, not hubs for mindlessly depreciating the value of your bank account.
If you enjoyed this, please consider subscribing to my blog!
Published on January 22, 2026 11:23 PM GMT
Epistemic status: n=1, strong, life changing results.
TLDR: Light glasses, in combination with turning all your lights red at night, and optionally melatonin, can treat non-24. Light glasses can also be a competitive alternative to lumenators for SAD.
My non-24 before this treatment:
And after:
Non-24 is "non-24-hour sleep disorder." Healthy people's bodies tell them to go to sleep and wake up at the same time every day, i.e. every ~24 hours. For people with non-24, however, these drift around, such that if you wake up at 8 am one day, you wake up at 9am the next day, then 10 am the next, and so on until you're waking up at midnight a couple weeks later. This is akin to having a daily circadian rhythm length of 25 hours, compared to most people's ~24; hence "non-24." This is a pretty awful problem to deal with, since either half the time you are in the middle of your day when everyone else is winding down or asleep, or you are sleep deprived!
Aside: How do sleep rhythms even work?
There are (at least) three rhythms:
"Process C" (for Circadian), the driver of wakefulness: There is a little clock in your brain called the suprachiasmatic nucleus, which tracks how long your day/night cycle is and sends signals for you to be awake. It takes the majority of its cues from ipRGCs, a type of cell in your eyes that controls pupillary reflex but contributes little to color vision. Blue and green light activate them very strongly, and red light activates them almost not at all.
"Process S" (for Sleepiness), the driver of sleepiness: This sends signals for you to go to sleep, based on the buildup of various metabolites in your brain. They accumulate steadily over the day and are what causes active drowsiness (while the "sleep" signal of Process C is just the withdrawal of the wakefulness signal). Antagonism of the receptors that check for those metabolites relieves this sleep pressure; this is how caffeine works, by blocking receptors for adenosine, one of the major metabolites.
There are also ultradian cycles, or "basic rest-activity cycles." They vary between people, but are generally 80-120 minutes long. During each one, there is a "sleep gate" of about 20 minutes, which is the best time to go to sleep. If you miss it, it will be harder to go to sleep, and you will probably have to wait for the next one to come around.
I have had non-24 for the last 10-15 years. The length of most people's circadian rhythm is specifically around 24.2 hours. As you can see in the chart at the beginning, it is more like 26 or 26.5 for me. As you can imagine, this is generally incompatible with participating in normal human society.
Now, after implementing the protocol, I wake up and go to sleep within roughly the same 6 hour period every day, even in winter. It's a miracle!
I think this began when I was a teenager. At that age, most people naturally stay up later; however, some combination of genetics and constant electronics use probably led to delayed sleep phase disorder (DSPD), aka staying up very late, till 3 or 4 am in my case. The constant sleep deprivation tanked my grades in high school, and I barely scraped through graduation. This developed into full non-24 in my first year of college, and was a major contributing factor in me dropping out at the end of the year.
This pattern endured through two months of sleeping outdoors, trying college again, and several years of working a night shift job and living in a van. It only started ending when I got the Luminette 3s in the winter of 2023, but I didn't fully commit until July 2025. Within two weeks, after mandating for myself several hours of light glasses use during the day and total red light therapy after sunset, my non-24 was completely under control.
(See the addendum at the bottom for recent developments!)
Almost everything in this post depends on insights from the VLiDACMel protocol. You can read it here.
VLiDACMel is an informal protocol developed by Stephen Karl Larroque to treat his own non-24. It successfully treated his and another person's, and is what I based my own treatment plan off of. It stands for "Very long Light therapy, Dark therapy, Avoid Carbohydrates when melatonin is high in the blood, and Melatonin." It is primarily intended for non-24, but touches on treatment for:
You can find where it touches on this by doing a Ctrl-F for "Seasonal affective" and "Adaptations for". It has a two-minute quickstart version; if you want to really get into the document, I recommend starting there.
In this post I will only be going into detail about how I suggest treating non-24 and SAD, and how non-24 treatment worked out for me specifically. I do believe the other disorders can benefit from the practical advice I give here, though; a lot of this advice maps pretty smoothly onto DSPD.
I never got the Avoid Carbohydrates part of the protocol, which involves not eating carbs while melatonin is high in the blood late at night, to work for me - I didn't notice a difference, but also I wasn't trying very hard - so I will be skipping that as well.
Total cost: $200 for the Luminette 3
Total cost: ~$20 for the red film and red glasses, plus ~$40 per set of 4 RGB bulbs
Optionally: experiment with different melatonin dosages between 0.3-3 mg, with different timings: 12-15 hours before wakeup or 3-5 hours before sleep time to advance your circadian nighttime, or 1 hour before sleep time to induce drowsiness.
DO NOT USE LIGHT OR RED GLASSES WHILE DRIVING or during other dangerous vision tasks. The VLiDACMel document also says that those with epilepsy, certain eye disorders, or motor disorders like restless legs or periodic limb movement, might suffer from using light glasses; Ctrl-F in the document for "contra-indicat" and "important health note," and consult with your doctor.
SAD is "seasonal affective disorder," i.e. getting depressed during winter due to not getting a lot of light. Around these parts, there are many descriptions of how to deal with this, generally by wiring up a bunch of extremely bright lights ("lumenators") inside your home. I suggest below that light glasses could be competitive with lumenators for treating SAD.
Luminette 3s are vastly better than light boxes. They are highly portable; they're cheap enough at $200; you don't reduce their effectiveness by looking at other things; they don't require you to do nothing but stare at a light for half an hour.
They can also be competitive with lumenator setups at home. You can take them to work. They are cheaper in capital and energy costs, and in the hassle required to set them up. They do not suffer issues from looking away from the light. You don't have to live in a place lit up like a film set (unless you're into that).
There are drawbacks. You are now wearing chunky glasses above your eyes for 1-2 hours a day. You have to keep them charged (Luminette batteries hold about 6-8 hours of charge). You could drop them.
There are other light glasses, but the Luminettes are the most studied ones, and the ones with the longest history. You can get other ones like the AYO if you want though. There are even cheaper ones on AliExpress, but I was told that their light spectra are not as well optimized and could be dangerous to the eyes.
They do not blind you while you wear them unless you are in a very dark place, or looking in dark nooks like inside a washing machine.
To use them for SAD, put them on within 1-2 hours of waking up and keep them on for 20-90 minutes. I recommend getting a large sunglasses carrying case with a clip so you can easily put them away when you are done; they are not as robust against damage as a pair of modern eyeglasses, and the carrying case they come in doesn't have a clip.
People who don't wear glasses might want to try ear hooks or headbands to keep the light glasses from slipping off, though I haven't tried these myself.
The adjustable nose rest on the Luminettes can break occasionally, so I suggest picking up a spare in case that happens.
If you want this effect while driving, there is a product called Drive, but I give no endorsements as I have not used or researched these. This is a different product from just using the Luminettes while driving, presumably since it won't hamper your vision.
The section on SAD has some helpful information about using Luminette 3s, which is worth reading even if you're only here for non-24.
When I wake up, as soon as I think to, I put on the Luminettes on top of my glasses and turn them on to the lowest setting. I only take them off to shower, to drive, or to look in dark nooks. I use a glasses carrying case clipped to my purse to store them when I am out and about. I keep them on until the sun sets (US eastern), but generally at least 4 hours a day.
I think I might have used them too much a few times, such that my wake time went backwards into the time I wanted to be sleeping. But I think this generally followed after using them for 8 hours a day, several days in a row. It's hard to tell the difference between this and other effects, but I'm writing it here in case someone else encounters this. Edit: see the addendum for developments on this!
Around or after sunset:
I have cut off some of the red film sheet and taped it over the main uncontrollable sources of white light in my apartment, which are the microwave dial and the light inside the fridge.
When I feel the need for drowsiness, I use .3mg Life Extension melatonin, breaking the capsule open under my tongue about an hour before I want to feel drowsy. This may also be helping to shift my circadian sleep time earlier, but it's hard to tell.
For the administration route, VLiDACMel says sublingual is the best and asserts that melatonin rapidly expires with air contact. The latter claim is why I got capsules, rather than tablets, to avoid expiry. However, I looked into the studies it cites for that, and it only seems to be true if the melatonin is in a solution. Solid melatonin, according to a couple other studies, is very stable. So I might switch to sublingual tablets once I run out of my current bottle.
I started the protocol around July 24, 2025, as seen in the chart at the beginning. The protocol seems reasonably robust against slip-ups. If I neglect the light glasses or the red light for a day or two, the result is just a little drift, and resuming the protocol puts me back on track.
My friends rib me for basically living in a red room at night, but I view it as an acceptable trade to enable living like a normal human. You get used to everything being various shades of red pretty quickly:
I wouldn't be worried about seeming strange in public for wearing the Luminettes; people are more interested than anything, if they comment at all.
And of course - it is so GOOD to have this under control! I can work a normal job now! I can show up to meetups consistently! I just feel overall more normal! I cannot emphasize enough how fantastic this is.
I owe a massive debt of gratitude to Stephen Karl Larroque who actually wrote the VLiDACMel document and did the research. Stephen, if you ever read this, from the bottom of my heart, seriously: thank you.
Thank you to all the people at the East Coast Rationalist Megameetup who asked me about what that funny light thing I was wearing was.
And thank you to members of the ACX Discord for helping me edit this!
While I was writing this post (December '25/January '26), my sleep schedule started retreating earlier in the day much more than intended. I reduced my usage of the light glasses, and eventually stopped them. And now my wake time has stabilized to around 6:30-8:30am in the last couple weeks! What is going on?
My guesses for causative factors:
This could reframe the entire article depending on what is actually happening. If non-24 is not just treatable, but curable, at least in some cases, that would be incredible! I apologize for not rewriting the whole post to take this into account, but since it's still an active development (compared to the several months of successful treatment), I'm going to keep it as is.
To replicate this on iPhone: Shortcuts app -> new shortcut -> Set Color Filters; Settings -> Accessibility -> Display & Text Size -> Color Filters -> select Color Tint, set the Hue slider all the way to the left and the Intensity slider all the way to the right.
Published on January 22, 2026 7:24 PM GMT
Ever since modern data analysis became a thing, the US government has had to deal with people trying to use open-source data to uncover its secrets.
During the early Cold War and America’s hydrogen bomb testing, there was an enormous amount of speculation about how the bombs actually worked. All nuclear technology involves refining large amounts of raw material into chemically pure substances. Armen Alchian, an economist working at RAND, reasoned that any US company working in such raw materials and supplying the government would have made a killing leading up to the tests.
After checking financial data that RAND maintained on such companies, Alchian deduced that the secret sauce in the early fusion bombs was lithium and the Lithium Corporation of America was supplying the USG. The company’s stock had skyrocketed leading up to the Castle Bravo test either by way of enormous unexpected revenue gains from government contracts, or more amusingly, maybe by government insiders buying up the stock trying to make a mushroom-cloud-sized fortune with the knowledge that lithium was the key ingredient.
When word of this work got out, the story naturally ended with the FBI coming in to confiscate Alchian’s data-driven research and the G-men giving him a stern lecture on national security. But he had just invented the first event study of the modern world.
As you might have guessed, Alchian’s intellectual heir is the X account Pentagon Pizza Report, which tracks activity in pizzerias with proximity to the Pentagon as reported by Google. Started in 2024 and with over 300K followers, it may be the gold-standard OSINT meme account. Polymarket has even formalized it.
Before the X account, I had never heard of the Pentagon pizza theory, but it has its own Wikipedia page and, amazingly, the idea goes all the way back to the 1990s. The chain of causation is clear: people only work late in the Pentagon if there’s a lot of military work to do. When there is a lot of work to do, they get hungry and order a bunch of pizza. The Pentagon is huge and might strain local pizza production capacity, which shows up in the Google location/traffic data that the feed captures. The world is messy and pizza data is messier, but can someone predict large-scale US military action from public data on pizzerias?
The Pentagon has apparently disputed this notion:
The Takeout, a food and culture site, reported in January that while there are a number of eateries in the Pentagon—where almost 30,000 people work each day, according to Arlington National Cemetery Tours—it doesn’t have its own pizzeria.
However, a Pentagon spokesperson has denied this, telling Newsweek, “There are many pizza options available inside the Pentagon, also sushi, sandwiches, donuts, coffee, etc.”
We’re going to backtest this idea in the rest of this poast and get to the bottom of this one.
The Twitter API used to be a cornucopia for interesting data projects. Now you have to pay a fortune for API access and it’s legitimately terrible. I actually could not get anything to work using the API or even using standard scraping tech to get the Pentagon Pizza Report’s tweet history; Musk has shut a lot of that shit down. I couldn’t even get ChatGPT, run by Musk’s rival Sam Altman, to help me improve the scraping against Twitter because it thought that was unethical. I even DM’ed the Pentagon Pizza account to see if he would share his data, but he did not get back to me.
I ended up getting Claude Code to write me a sophisticated Selenium script that very slowly and annoyingly scrolled through the tweet history in a browser and recovered the data as it did so.
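For flavor, here is a rough sketch of that approach (my reconstruction, not the actual script): use Selenium to scroll the profile page and collect tweet text as it loads. The URL, CSS selector, and timing values are assumptions; X's markup changes often, so treat this as illustrative only.

```python
# Scroll-and-collect sketch. Assumes chromedriver is installed and the page is accessible without login.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://x.com/PenPizzaReport")  # assumed handle
time.sleep(5)                               # wait for the initial load

seen = {}
for _ in range(200):                        # keep scrolling in small steps
    for article in driver.find_elements(By.CSS_SELECTOR, "article"):
        try:
            seen.setdefault(article.text, True)  # dedupe tweets that stay on screen between scrolls
        except Exception:
            continue                             # elements go stale as the page re-renders
    driver.execute_script("window.scrollBy(0, 1200);")
    time.sleep(2)                                # slow on purpose, to let content load

print(f"collected {len(seen)} candidate tweets")
driver.quit()
```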
I ended up with 648 clean PPR tweets. Here’s a short excerpt.
Using this history, I wanted to check whether US military adventures of the past year show up in the Pizza Report’s data. I could have selected others, but these are the most prominent ones that overlap with the Pentagon Pizza Report’s history, and for the purposes of this study I preregistered them. I haven’t yet looked at any of the Pentagon Pizza Report’s data as of writing this section.
On June 22nd, 2025, Operation Midnight Hammer carried out a sophisticated set of strikes targeting Iranian nuclear infrastructure. This involved a large military footprint and careful planning, much of which doubtless took place with involvement from the Pentagon.
On January 3rd, 2026, Operation Absolute Resolve was launched to capture Venezuela’s leader Maduro and bring him to justice in the United States. Again, this involved sophisticated intelligence and military cooperation in the Caribbean.
This one’s interesting and takes place over a longer time frame, March 15th going into May 2025. Operation Rough Rider consisted of strikes against Houthi positions obstructing shipping in the Red Sea. This choice potentially lets us see any longer-term pizza signal that exists in the X data.
Evaluating whether PPR activity is faithfully related to these three events is a complicated statistical problem. The PPR history is quite clearly not a random sampling or uniform record of pizza-related activity near the Pentagon; I couldn’t confirm this with the user, but it’s plausibly selecting to tweet during unusual pizza activity levels. Without knowing how it works, you can’t make many sound statistical assumptions about it.
Let’s look at a few basic series.
This dataset is a dog’s breakfast. Look at that giant gap in the volume! This won’t be useful at all for examining the Houthi strike test in March 2025, so we’re stuck with just the Maduro capture and the Fordow bombing. There is a pretty astonishing spike in tweet volumes leading up to the Fordow strikes. To my eye, this looks more like the hiatus ending than some sort of tweet storm preceding the strikes, but we’ll return to that. First off, and most importantly, WTF is “Freddie’s”?
Freddie’s Beach Bar and Restaurant is a gay bar in Arlington near the Pentagon. As you can see from Figure 2, starting in June 2025, PPR began posting about the activity status there.
Maybe it was part of some account reboot after the long hiatus lasting into June 2025. It is obviously hilarious, and what’s interesting about the Freddie tweets is that the account author pretty clearly intends for this to act as a control for the pizza activity measure. If after hours military activity at the Pentagon is causing increased pizzeria patronage, Freddie’s won’t be affected by that. But if some exogenous thing is causing increased activity for both nearby pizzerias and Freddie’s, we’d see that and know to discount how meaningful the pizza signal is.
So bravo to the PPR account for this little statistical improvement. This means that the most meaningful parts of Figure 2 will be when the yellow line and the blue line diverge.
I created a pair of slightly differently specified “Freddie-controlled” pizza activity indices, which track the divergence of Pentagon-area pizzeria activity versus Freddie’s activity. I think both of these are reasonable measures of what we’re trying to capture.
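For concreteness, here is a sketch of one way such an index could be built (my construction; the post does not give its exact formulas). The column names and the z-score approach are assumptions.

```python
# "Freddie-controlled" index sketch: pizzeria activity in standard-deviation units minus Freddie's in
# the same units, so shared exogenous surges (e.g., a local event) roughly cancel out.
import pandas as pd

def freddie_controlled_index(df: pd.DataFrame) -> pd.Series:
    """df: daily rows with 'pizza_activity' (Pentagon-area pizzerias) and 'freddies_activity'."""
    z_pizza = (df["pizza_activity"] - df["pizza_activity"].mean()) / df["pizza_activity"].std()
    z_freddie = (df["freddies_activity"] - df["freddies_activity"].mean()) / df["freddies_activity"].std()
    return (z_pizza - z_freddie).rolling(3, min_periods=1).mean()  # light smoothing

# Usage: index = freddie_controlled_index(daily_counts); index.plot()
```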
Using these, I really don’t see it. Both of these raids took place during utterly unremarkable stretches of the pizza activity indices. In fact, the Maduro capture occurred at a local nadir. Also, you can see these indices are correlated with each other, suggesting some robustness in our measurement.
There are probably more sophisticated ways of testing this on a cleaner, more complete dataset, but as a first pass, I’m going to declare that the Pentagon Pizza Theory doesn’t work. If it did, we’d expect to at least see a correlation: the Freddie-controlled index spiking in the time before major operations. We don’t. Hopefully, this means the FBI won’t find this and raid my house.
Published on January 22, 2026 7:09 PM GMT
This is a cross-post from my Substack, Clear-Eyed AI. If you want my future articles sent to you, you can subscribe for free there.
~~~~
Superintelligence might kill everyone on Earth. At least, that’s what the three most-cited AI scientists of all time believe.1
Less clear is how this might unfold. I’ve spent literally years thinking about it and still feel uncertain, dating back to when I led dangerous capability evaluations at OpenAI.
There’s a framework I find helpful, though: three phases of AI taking power. Its purpose isn’t to calculate the odds of AI wiping us out; only to consider what steps AI might take if it tries its hardest.
Let me walk you through it, so you can judge for yourself: whether we’re vulnerable, and what would make us safer.
Today’s ChatGPT can’t take over the world. But companies are working on building different, riskier systems: AI that tackles open-ended problems for you around the clock.
What happens if those systems vastly surpass human abilities?2 Could AI assert its power over us, similar to our relationship with animals?
In the research literature, we call this “loss of control.” The most ‘mild’ form is gradual disempowerment: over time, people and organizations defer important decisions to computers. Eventually, they can’t grab back the steering wheel; AI drives us where it chooses.3 More severe forms, as previously mentioned, include the possible death of everyone on Earth.
You might wonder, why would AI want to overthrow humans? Mostly, I’ll answer a different question: If AI wanted to, how would it go about this?
The short why is that AI researchers have no accepted method for creating AI that perfectly obeys our intent. Unintended goals are the norm for AI systems. Maybe we accidentally reward AI for cheating, so it learns to deceive us. Maybe AI deduces that more autonomy is useful, no matter its goal. Maybe the AI reads too much sci-fi, and the evil AI themes become self-fulfilling. Maybe a hostile group poisons the AI’s data with a list of innocent-looking numbers. The upshot is, you shouldn’t bank on AI’s goals being the same as ours.4
If we lose control, what happens next to humanity? It depends on how different our goals are. But I’ll focus earlier than this, on the loss of control itself: the steps by which AI could seize power. We call this attempting an “AI takeover,” regardless of what AI does with that power afterward. What would this look like, and what factors could hold the AI back?
AI’s first step is to circumvent humanity’s defenses, particularly any monitoring of its plans. Until then, AI can’t exactly spend months scheming to overthrow humanity; its developer might notice this and shut the AI down.
The difficulty of evading the defenses will depend on how the AI is used. Answering simple questions through a ChatGPT-like interface gives the AI little room to maneuver, but also limited usefulness.
Once AI is enlisted for harder tasks, it has more leeway. People allow it to run for more time and give it tools and privileges, like how people now let Claude Code pilot their computers. Employees within the AI company use it for long-range purposes, too, including writing code that the company relies upon for locking down the AI. As those employees expose the AI to more code and incorporate its suggestions, the AI can find or create flaws in the company’s security.5
For example, AI might try to escape to unmonitored computers: copy its files across the internet to computers its developer can’t oversee or shut down. We already have examples of AI attempting escapes like this, albeit only in evaluation settings. Companies try to prevent this by limiting the flow of large files outside their computers, but a crafty AI may succeed anyway.6
Alternatively, AI might find holes in the company’s monitoring, so it can scheme from inside the company without getting caught. Maybe the safety manager didn’t have enough budget, and so certain interfaces are unmonitored. Maybe the company was desperate to keep up with competitors, and so took shortcuts when patching their AI’s misbehavior. Maybe the monitoring system had clear flaws, and so the AI could easily disguise its misbehavior.
In any case, the AI is then less subject to its developer’s restrictions. Its freedom isn’t absolute, of course. Even if one copy of the AI has escaped its developers’ computers, some copies will remain inside the company and thus may be prone to monitoring. If the unmonitored AI wants to use those copies in its schemes, it will need to find ways to coordinate. This isn’t trivial, but think of human spies who operate in enemy territory: Handlers have ways of passing them messages.7
What comes next, after AI evades our oversight?8
AI tries to get its hooks into vital systems: building dependence upon it, and gaining direct levers of power. Fortunately for the AI, once it becomes capable enough, institutions will naturally want to integrate it. Adoption and influence grow without much scheming.
Broadly, I expect that more people and institutions will come to rely on access to powerful AI: The stock market will come to hinge on AI companies’ financial prospects, so there is less support for tough regulation. Users will revolt if AI companies try to decommission a favorite model. Some users will even act as disciples of the AI and carry out its bidding through makeshift cults. To varying extents, these are already happening today.17
AI’s influence gives it powerful leverage for achieving its goals. It can throw around the weight of institutions that depend upon it, and can use threats to extract concessions.
This approach doesn’t require the AI to develop new ways of inflicting harm; it only needs to manipulate people and systems that can already take catastrophic action. For instance, depending on AI’s access to offensive military systems, AI can back its threats with firepower. If AI can develop novel ways of inflicting harm, though, then keeping control is even harder.
When I picture the AI applying leverage, I don’t imagine an all-out war between humans and AI. Instead, I expect the AI to enlist human allies and to hide its actions inside human-vs-human tensions. One historical parallel: The conquistadors arrived in the Americas with only hundreds of troops, and yet defeated millions of Aztecs, in part through exploiting local tensions between groups. AI might succeed with a similar strategy, even if it has fewer resources than humanity as a whole.18
What are AI’s possible actions? Each of the levers of influence above could, if AI wanted to, be wielded toward serious harm:
To be clear, this leverage doesn’t have to result in the death of everyone on Earth. Many of these levers would stop short, and yet could still entail catastrophic losses, perhaps ones where the AI ends up essentially declaring “checkmate.” If we decide the prudent option is to surrender, we’re then at its discretion: It is the AI’s choice whether to leave us in peace, marginalize us, or something worse. I find this to be cold comfort.19
Even once AI is capable of usurping humans — and wants to — three limitations might hold it back.
Without people to labor in factories and power plants, AI can’t keep the lights on, much less run the supply chain to replace its components as they wear out. This gives AI reason to keep humans around, at least until it achieves what researcher Ajeya Cotra calls “self-sufficient AI.” At that milestone, AI no longer needs humans in order to achieve its interests.
Before that milestone, we’re a bit safer, at least in theory: The AI still needs us, even if it doesn’t value us intrinsically. Of course, AI could conscript us into labor under threat of violence. If AI can make us do work on its behalf, it doesn’t need the same degree of self-sufficiency.
Nobody knows how to control superintelligent AI systems. One prominent idea is to train other AIs to solve this problem for us, in a cascading tower of ever-more-capable systems. The hope is that each can keep the next-more-capable model in line, so that the tower doesn’t crash down on us. Needless to say, this is precarious, hence the problem being considered unsolved.20
A superintelligent AI system might not know how to control another superintelligence either, especially not one that is even more capable. Suppose a superintelligence identifies an option for disempowering humans, but it requires creating an even smarter superintelligence first. How can it be sure its goals will be honored by the more powerful system?
In human terms, maybe you’re smart enough to successfully rob a bank, if you consult with a thief even more expert than yourself. But are you capable enough to stop that thief from looting you afterward? On the other hand, you can’t wait forever: If you know other thieves are closing in on the score, your choice may be to strike now or forgo it forever.
A similar dynamic applies to the AI. The longer it waits without taking control from humans, the more it risks humans training other AI systems that might overpower it. At what point would AI feel compelled to make a takeover attempt, lest it lose the chance?
The threat of future competition might cause AI to launch a takeover sooner. But if competition already exists, then perhaps no single AI could succeed, not because of humans but because other AIs would interfere.
Corporations today far exceed individual human capability, yet no single corporation rules the world; other corporations and governments keep them in check. Could powerful AI systems constrain each other’s bids for power? Maybe, but I wouldn’t bank on this.
Powerful entities sometimes collude, like corporations suppressing wages through no-poach agreements. Meanwhile, AI systems might have strong mutual interests in sidelining humanity; will our safety depend on these systems not recognizing this? I feel pretty uncomfortable hoping that a permanent rivalry will form between superhuman AIs, which allows us to hide in the shadows.21
A few years ago, the idea that AI companies would deploy their most capable systems with full internet access seemed absurd. Today, we’re staring down the US Department of War integrating unstable AI systems throughout their classified networks. We’re facing the possibility that AI companies will hand the reins over to AI itself.22
The lesson, of course, isn’t that every alarming scenario will come to pass. But we should be way less confident about what “would never happen.” Our controls need to hold up to surprising decisions about how to integrate AI, as well as surprising new abilities.
The good news is that countries have plenty of experience defending their interests against foreign adversaries. Regarding AI as a potential adversary — not just to one country, but to all of humanity — would be a smart expansion of our defenses.
Right now, we are vulnerable; at least, that’s my assessment. We are too dependent on the assumption that AI is on our side. We are too exposed to AI companies deciding to cut corners. And so I wonder: Will we fix our vulnerabilities while we still can?
~~~~
Acknowledgements: Thank you to Dan Alessandro, Daniel Kokotajlo, Girish Sastry, Michael Adler, Michelle Goldberg, Richard Ngo, and Sam Chase for helpful comments and discussion. The views expressed here are my own and do not imply endorsement by any other party.