
Value Learning Needs a Low-Dimensional Bottleneck

2026-01-23 10:12:46

Published on January 23, 2026 2:12 AM GMT

Epistemic status: Confident in the direction, not confident in the numbers. I have spent a few hours looking into this.

Suppose human values were internally coherent, high-dimensional, explicit, and decently stable under reflection. Would alignment be easier or harder?

The calculations below show that it would be much harder, if not impossible. I'm going to try to defend the claim that:

Human values are alignable only because evolution compressed motivation into a small number of low-bandwidth bottlenecks[1], so that tiny genetic changes can change behavior locally.

If behavior is driven by a high-dimensional reward vector w ∈ ℝ^k with large k, inverse reinforcement learning requires an unreasonable number of samples. But if it is driven by a low-rank projection onto ℝ^k with small k, inference may become tractable.

A common worry about human values is that they are complicated and inconsistent[2][3][4]. And the intuition seems to be that this makes alignment harder. But maybe the opposite is the case. Inconsistency is what you expect from lossy compression, and the dimensionality reduction makes the signal potentially learnable.

Calculation with Abbeel & Ng's formula[5] gives

m >        k = 10          k = 1000
γ = 0.9    ~1.2 million    ~210 million
γ = 0.99   ~120 million    ~21 billion

with

  • m: number of expert demonstrations (independent trajectories)
  • k: feature dimension; 1000 if values are complex, 10 if values are low-dimensional
  • γ: discount factor; 0.9 for a short horizon (~10 steps), 0.99 for a long horizon (~100 steps)
  • ϵ = 0.1: 10% regret tolerance
  • δ = 0.05: 95% confidence

If you need at least 20 billion samples to learn complex values, we are doomed. But the problem may become solvable with a roughly 200-fold reduction in the number of required trajectories (depending on how high-dimensional you think values are; 1000 is surely conservative, and if arbitrarily complex values could be learned at all, the true number may be much higher).
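As a sanity check, here is the arithmetic behind the table above (a quick script; the helper name is mine, the bound is the one quoted in footnote 5):

    import math

    def abbeel_ng_trajectories(k, gamma, eps=0.1, delta=0.05):
        """Abbeel & Ng's sufficient number of expert trajectories:
        m >= 2k / (eps * (1 - gamma))**2 * log(2k / delta)."""
        return 2 * k / (eps * (1 - gamma)) ** 2 * math.log(2 * k / delta)

    for gamma in (0.9, 0.99):
        for k in (10, 1000):
            print(f"gamma={gamma}, k={k}: m > {abbeel_ng_trajectories(k, gamma):.2e}")
    # gamma=0.99, k=1000 gives ~2.1e10 trajectories (the "20 billion" above);
    # gamma=0.99, k=10 gives ~1.2e8, roughly a 200x reduction.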

This could explain why constitutional AI works better than expected[6]: A low-dimensional latent space seems to capture most of the variation in preference alignment[7][8]. The ~200x reduction doesn't mean it's easy. The bottleneck helps with identifiability, but we still need many trajectories, and mapping the structure of the bottleneck[9] can still kill us.

How can we test whether the dimensionality of human values is actually low? We should see diminishing returns in predictability, for example when using N pairwise comparisons of value-related queries. Predictability should level off at an N that scales with k; e.g., for k∼10 we'd expect an elbow around N~150.
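Here is a minimal simulation of that test, assuming pairwise answers follow a Bradley-Terry (logistic) choice model over k-dimensional features (the numbers are illustrative, not a prediction):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def accuracy_curve(k, ns, n_test=5000):
        """Held-out accuracy of a fitted preference model vs. number of comparisons N."""
        w = rng.normal(size=k)                   # latent "value" direction
        def sample(n):
            diff = rng.normal(size=(n, k))       # feature difference between the two options
            p = 1 / (1 + np.exp(-diff @ w))      # Bradley-Terry choice probability
            return diff, (rng.random(n) < p).astype(int)
        X_test, y_test = sample(n_test)
        out = []
        for n in ns:
            X, y = sample(n)
            clf = LogisticRegression(max_iter=2000).fit(X, y)
            out.append(round(clf.score(X_test, y_test), 3))
        return out

    ns = [50, 100, 150, 300, 1000, 3000]
    print("k=10:  ", accuracy_curve(10, ns))
    print("k=1000:", accuracy_curve(1000, ns))
    # If values are ~10-dimensional, accuracy should flatten ("elbow") after a few
    # hundred comparisons; if they are ~1000-dimensional, it keeps improving far longer.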

  1. ^

    I'm agnostic about what the specific bottlenecks are here, but I'm thinking of the channels in Steven Byrnes' steering system model and the limited number of brain regions that are influenced. See my sketch here.

  2. ^

    AI Alignment Problem: ‘Human Values’ don’t Actually Exist argues that human values are inherently inconsistent and not well-defined enough for a stable utility target:

    Humans often have contradictory values … human personal identity is not strongly connected with human values: they are fluid… ‘human values’ are not ordered as a set of preferences.

  3. ^

    Instruction-following AGI is easier and more likely than value aligned AGI:

    Though if you accept that human values are inconsistent and you won’t be able to optimize them directly, I still think that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.

  4. ^

    In What AI Safety Researchers Have Written About the Nature of Human Values we find some examples:

    [Drexler]: “It seems impossible to define human values in a way that would be generally accepted.” ...

    [Yampolskiy]: “human values are inconsistent and dynamic and so can never be understood/programmed into a machine. ...

    In contrast to that, Gordon Worley offers the intuition that there could be a low-dimensional structure:

    [Gordon Worley]: “So my view is that values are inextricably tied to the existence of consciousness because they arise from our self-aware experience. This means I think values have a simple, universal structure and also that values are rich with detail in their content within that simple structure. 

  5. ^

    Abbeel & Ng give an explicit bound for the required number of expert trajectories:

    it suffices that

    m ≥ 2k / (ε(1 − γ))² · log(2k/δ)

    with 

    • m: number of expert demonstrations (trajectories)
    • k: feature dimension
    • γ: discount factor (determines horizon)
    • ϵ: target accuracy parameter 
    • δ: failure probability (confidence level)

    Apprenticeship Learning via Inverse Reinforcement Learning

  6. ^

    Constitutional RL is both more helpful and more harmless than standard RLHF.

    Constitutional AI: Harmlessness from AI Feedback

  7. ^

    This aligns with expectations, as head_0 corresponds to the eigenvector with the largest variance, i.e., the most informative direction. Furthermore, among the top 100 heads [of 2048], most of the high-performing heads appear before index 40, which aligns with PCA’s property that the explained variance decreases as the head index increases. This finding further supports our argument that PCA can approximate preference learning.

    DRMs represent diverse human preferences as a set of orthogonal basis vectors using a novel vector-based formulation of preference. This approach enables efficient test-time adaptation to user preferences without requiring additional training, making it both scalable and practical. Beyond the efficiency, DRMs provide a structured way to understand human preferences. By decomposing complex preferences into interpretable components, they reveal how preferences are formed and interact.

    Rethinking Diverse Human Preference Learning through Principal Component Analysis

  8. ^

    retaining just 4 components (≈15% of total variance) reproduces nearly the full alignment effect.

    ...

    By combining activation patching, linear probing, and low-rank reconstruction, we show that preference alignment is directional, sparse, and ultimately localized within a mid-layer bottleneck.

    Alignment is Localized: A Causal Probe into Preference Layers

  9. ^

    Steven Byrnes talks about thousands of lines of pseudocode in the "steering system" in the brain-stem.




A quick, elegant derivation of Bayes' Theorem

2026-01-23 09:40:38

Published on January 23, 2026 1:40 AM GMT

I'm glad I know this, and maybe some people here don't, so here goes. Order doesn't matter for joint events: "A and B" refers to the same event as "B and A". Set them equal:

P(A|B) P(B) = P(A and B) = P(B and A) = P(B|A) P(A)

Divide by P(B):

P(A|B) = P(B|A) P(A) / P(B)

And you're done! I like substituting hypothesis (H) and evidence (E) to remember how this relates to real life:

P(H|E) = P(E|H) P(H) / P(E)

You might also want to expand the denominator using the law of total probability, since you're more likely to know how probable the evidence is given different hypotheses than in general:

P(H|E) = P(E|H) P(H) / Σ_i P(E|H_i) P(H_i)
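A toy numerical version, with made-up numbers for a diagnostic-test example:

    # Toy Bayes update with made-up numbers: a diagnostic test.
    prior = {"sick": 0.01, "healthy": 0.99}        # P(H)
    likelihood = {"sick": 0.95, "healthy": 0.05}   # P(E | H): chance of a positive test

    # Law of total probability: P(E) = sum over H_i of P(E | H_i) P(H_i)
    p_evidence = sum(likelihood[h] * prior[h] for h in prior)

    # Bayes' theorem: P(H | E) = P(E | H) P(H) / P(E)
    posterior = {h: likelihood[h] * prior[h] / p_evidence for h in prior}
    print(posterior)  # {'sick': ~0.16, 'healthy': ~0.84}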




The World Hasn't Gone Mad

2026-01-23 08:30:22

Published on January 23, 2026 12:01 AM GMT

In June 2025, Kalshi unveiled an ad campaign with the slogan “The world’s gone mad, trade it.” The ad was one of the first TV ads to ever be entirely generated by AI, and its content quickly was met with a slew of parodies and jokes all over the internet.

I must agree the ad was quite funny. I admire Kalshi and Polymarket in terms of their business, but I have some concerns about the objective of the ad.

It’s not that you can’t lose an enormous amount of money on prediction markets, but the win rate and general “fairness of the game” are much better for the average trader. Prediction markets also provide valuable utility in the form of calibrated forecasts (i.e. events the market prices at X% resolve YES about X% of the time), whereas sportsbooks offer miscalibrated odds that are essentially meaningless. Therefore, we should be shifting away from sportsbooks and towards PMs, while also trying not to fuel gambling addictions, something many hoped would be helped by more reasonable advertising, not the slop we’ve been getting recently.

Sportsbooks such as Fanduel and DraftKings should not be legal. I believe, as most do, that a country’s first and foremost goal is to protect its citizens. What are we allowing now? We are permitting degens who aren’t using any form of logic or reasoning to take negative-EV positions against a house that needs to rig special offers to turn a profit, like a casino (which should also not be legal, but that’s beside the point).

Sportsbooks are negative EV because they slightly inflate probabilities so they sum to >100%. If the true probability of a football game is 50% one team wins, 50% the other, a sportsbook would inflate those probabilities to something like 52% and 52%. Here’s an expected value calculation in that scenario:

EV = P(A) * (net profit if A) - P(B) * (loss if B)

net profit if A = 1/0.52 - 1
net profit if A ≈ 0.92308

loss if B = 1

EV = 0.5 * 0.92308 - 0.5 * 1
EV ≈ 0.46154 - 0.5
EV ≈ -0.03846
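The same calculation, generalized into a few lines of Python (a quick sketch; the helper function is just illustrative):

    # EV per $1 staked when the book quotes implied probability p_book
    # for an outcome whose true probability is p_true.
    def ev_per_dollar(p_true: float, p_book: float) -> float:
        net_profit_if_win = 1 / p_book - 1   # decimal odds minus the stake
        return p_true * net_profit_if_win - (1 - p_true) * 1

    print(ev_per_dollar(0.50, 0.52))  # ~ -0.038: the vig on a 52%/52% line
    print(ev_per_dollar(0.50, 0.50))  # 0.0: a fair, vig-free line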

No matter how skilled you are, it’s highly improbable to win unless you manipulate bonuses.

Prediction markets offer bets against counterparties such as human traders, bots, and AMMs. On average, PM positions are neutral expected value.1 It offers more opportunities to make a profit by using advanced strategies and unique worldviews. This is the core difference between PMs and sportsbooks, and it’s why they should be treated legally under their own individual class.

But an alien in a Kalshi jersey downing a pint of beer seems less like a display of the product and more like a heartless, thrifty way to attract what some people would call “dumb money.” While this may be good for smarter traders, it leans toward the more unethical side of degenerate gambling, and probably doesn’t help Kalshi’s legal cases.

Modern prediction market exchanges should remain marketplaces for trading futures contracts, not hubs for mindlessly depreciating the value of your bank account.


If you enjoyed this, please consider subscribing to my blog!




Like night and day: Light glasses and dark therapy can treat non-24 (and SAD)

2026-01-23 07:23:01

Published on January 22, 2026 11:23 PM GMT

Epistemic status:  n=1; strong, life-changing results.

TLDR:  Light glasses, in combination with turning all your lights red at night, and optionally melatonin, can treat non-24.  Light glasses can also be a competitive alternative to lumenators for SAD.  

My non-24 before this treatment:

Data taken from my CPAP.
Vertical lines are sleep periods; the x-axis is individual days, and the y-axis is the time in the day.
Notice how my sleep keeps wrapping around every two weeks.

And after:

Like night and day.

What is non-24?

Non-24 is "non-24-hour sleep disorder."  Healthy people's bodies tell them to go to sleep and wake up at the same time every day, i.e. every ~24 hours.  For people with non-24, however, these drift around, such that if you wake up at 8 am one day, you wake up at 9am the next day, then 10 am the next, and so on until you're waking up at midnight a couple weeks later.  This is akin to having a daily circadian rhythm length of 25 hours, compared to most people's ~24; hence "non-24."  This is a pretty awful problem to deal with, since either half the time you are in the middle of your day when everyone else is winding down or asleep, or you are sleep deprived!

Aside:  How do sleep rhythms even work?

There's (at least) three rhythms:

"Process C" (for Circadian), the driver of wakefulness:  There is a little clock in your brain called the suprachiasmatic nucleus, which tracks how long your day/night cycle is and sends signals for you to be awake.  It takes the majority of its cues from ipRGCs, a type of cell in your eyes that controls pupillary reflex but contributes little to color vision.  Blue and green light activate them very strongly, and red light activates them almost not at all.

"Process S" (for Sleepiness), the driver of sleepiness:  This sends signals for you to go to sleep, based on the buildup of various metabolites in your brain.  They accumulate steadily over the day and are what causes active drowsiness (while the "sleep" signal of Process C is just the withdrawal of the wakefulness signal).  Antagonism of the receptors that check for those metabolites relieves this sleep pressure; this is how caffeine works, by blocking receptors for adenosine, one of the major metabolites.

There are also ultradian cycles, or "basic rest-activity cycles."  They vary between people, but are generally 80-120 minutes long.  During each one, there is a "sleep gate" of about 20 minutes, which is the best time to go to sleep.  If you miss it, it will be harder to go to sleep, and you will probably have to wait for the next one to come around.

My history with non-24

I have had non-24 for the last 10-15 years.  The length of most people's circadian rhythm is specifically around 24.2 hours.  As you can see in the chart at the beginning, it is more like 26 or 26.5 for me.  As you can imagine, this is generally incompatible with participating in normal human society.

Now, after implementing the protocol, I wake up and go to sleep within roughly the same 6 hour period every day, even in winter.  It's a miracle!

I think this began when I was a teenager.  At that age, most people naturally stay up later; however, some combination of genetics and constant electronics use probably led to delayed sleep phase disorder (DSPD), aka staying up very late, till 3 or 4 am in my case.  The constant sleep deprivation tanked my grades in high school, and I barely scraped through graduation.  This developed into full non-24 in my first year of college, and was a major contributing factor to me dropping out at the end of the year.

This pattern endured through two months of sleeping outdoors, trying college again, and several years of working a night shift job and living in a van.  It only started ending when I got the Luminette 3s in the winter of 2023, but I didn't fully commit until July 2025.  Within two weeks, after mandating for myself several hours of light glasses use during the day and total red light therapy after sunset, my non-24 was completely under control.

(See the addendum at the bottom for recent developments!)

The VLiDACMel protocol

Almost everything in this post depends on insights from the VLiDACMel protocol.  You can read it here.

VLiDACMel is an informal protocol developed by Stephen Karl Larroque to treat his own non-24.  It successfully treated his and another person's, and is what I based my own treatment plan off of.  It stands for "Very long Light therapy, Dark therapy, Avoid Carbohydrates when melatonin is high in the blood, and Melatonin."  It is primarily intended for non-24, but touches on treatment for:

  • seasonal affective disorder (SAD), 
  • night owls (delayed sleep phase disorder/DSPD), 
  • early risers (advanced sleep phase disorder/ASPD),
  • night shift work disorder, and
  • chronic jet lag

You can find where it touches on this by doing a Ctrl-F for "Seasonal affective" and "Adaptations for".  It has a two-minute quickstart version; if you want to really get into the document, I recommend starting there.

In this post I will only be going into detail about how I suggest treating non-24 and SAD, and how non-24 treatment worked out for me specifically.  I do believe the other disorders can benefit from the practical advice I give here, though; a lot of this advice maps pretty smoothly onto DSPD.

I never got the Avoid Carbohydrates part of the protocol, which involves not eating carbs while melatonin is high in the blood late at night, to work for me - I didn't notice a difference, but also I wasn't trying very hard - so I will be skipping that as well.

VLiDACMel practical overview

Light therapy (SAD, non-24, DSPD)

  • Get light glasses like the Luminette 3, and put them on within 1-2 hours of waking up.
  • For SAD, wear them for 30-90 minutes.  
  • For non-24/DSPD, wear them 2-8 hours on the lowest setting.

Total cost:  $200 for the Luminette 3

Dark/red therapy (non-24, DSPD)

  • Get red laser safety glasses (like these) and wear them for the rest of the night around or after sunset (except while driving). 
  • Change all your lightbulbs to RGB bulbs that can do pure red (I use these, though I haven't had success automating them) and have them switch over to pure red after sunset.
  • Turn all your devices red after sunset with things like f.lux set to the 1200k color temperature, or setting up a shortcut on your phone to turn on a pure red color filter.
  • Put red filter film (like this) on all the other light sources you can't control, like ones on ovens, microwaves, routers, etc.

Total cost:  ~$20 for the red film and red glasses, plus ~$40 per set of 4 RGB bulbs

Melatonin (non-24, DSPD)

Optionally:  experiment with different melatonin dosages between 0.3-3 mg, with different timings:  12-15 hours before wakeup or 3-5 hours before sleep time to advance your circadian nighttime, or 1 hour before sleep time to induce drowsiness.

Safety notes

DO NOT USE LIGHT OR RED GLASSES WHILE DRIVING or during other dangerous vision tasks.  The VLiDACMel document also says that those with epilepsy, certain eye disorders, or motor disorders like restless legs or periodic limb movement, might suffer from using light glasses; Ctrl-F in the document for "contra-indicat" and "important health note," and consult with your doctor.

For people interested in treating SAD

What is SAD?

SAD is "seasonal affective disorder," i.e. getting depressed during winter due to not getting a lot of light.  Around these parts, there are many descriptions of how to deal with this, generally by wiring up a bunch of extremely bright lights ("lumenators") inside your home.  I suggest below that light glasses could be competitive with lumenators for treating SAD.

Light glasses for SAD

Luminette 3s are vastly better than light boxes.  They are highly portable; they're cheap enough at $200; you don't reduce their effectiveness by looking at other things; they don't require you to do nothing but stare at a light for half an hour.

They can also be competitive with lumenator setups at home.  You can take them to work.  They are cheaper in capital and energy costs, and in the hassle required to set them up.  They do not suffer issues from looking away from the light.  You don't have to live in a place lit up like a film set (unless you're into that).

There are drawbacks.  You are now wearing chunky glasses above your eyes for 1-2 hours a day.  You have to keep them charged (Luminette batteries hold about 6-8 hours of charge).  You could drop them.

There are other light glasses, but the Luminettes are the most studied ones, and the ones with the longest history.  You can get other ones like the AYO if you want though.  There are even cheaper ones on AliExpress, but I was told that their light spectra are not as well optimized and could be dangerous to the eyes.

They do not blind you while you wear them unless you are in a very dark place, or looking in dark nooks like inside a washing machine.

To use them for SAD, put them on within 1-2 hours of waking up and keep them on for 20-90 minutes.  I recommend getting a large sunglasses carrying case with a clip so you can easily put them away when you are done; they are not as robust against damage as a pair of modern eyeglasses, and the carrying case they come in doesn't have a clip.

People who don't wear glasses might want to try ear hooks or headbands to keep the light glasses from slipping off, though I haven't tried these myself.

The adjustable nose rest on the Luminettes can break occasionally, so I suggest picking up a spare in case that happens.  

If you want this effect while driving, there is a product called Drive, but I give no endorsements as I have not used or researched these.  This is a different product from just using the Luminettes while driving, presumably since it won't hamper your vision.

For people interested in treating non-24 or DSPD (staying up too late)

How I implement VLiDACMel

Very long Light therapy

The section on SAD has some helpful information about using Luminette 3s, which is worth reading even if you're only here for non-24.

When I wake up, as soon as I think to, I put on the Luminettes on top of my glasses and turn them on to the lowest setting.  I only take them off to shower, to drive, or to look in dark nooks.  I use a glasses carrying case clipped to my purse to store them when I am out and about.  I keep them on until the sun sets (US eastern), but generally at least 4 hours a day.

I think I might have used them too much a few times, such that my wake time went backwards into the time I wanted to be sleeping.  But I think this generally followed after using them for 8 hours a day, several days in a row.  It's hard to tell the difference between this and other effects, but I'm writing it here in case someone else encounters this.  Edit:  see the addendum for developments on this!

Dark/red therapy

Around or after sunset: 

  • f.lux turns on (having been set to 1200K under the "Change current color" setting)
  • I set all my RGB lightbulbs to go red via the bulbs' app
  • I tap the Shortcut I made on the main screen of my iPhone, which turns Color Filters on and off[1].  The filter is set to pure red color tint.  This gives a much stronger red than the mild orange filter Night Shift gives.
  • Sometimes I put a sheet of red film over my laptop's screen if I feel f.lux is not red enough for me that night.
  • If I am not at home and am not driving, I put on the red laser safety glasses I keep in an additional carrying case attached to my purse.

I have cut off some of the red film sheet and taped it over the main uncontrollable sources of white light in my apartment, which are the microwave dial and the light inside the fridge.

Melatonin

When I feel the need for drowsiness, I use 0.3 mg Life Extension melatonin, breaking the capsule open under my tongue about an hour before I want to feel drowsy.  This may also be helping to shift my circadian sleep time earlier, but it's hard to tell.

For the administration route, VLiDACMel says sublingual is the best and asserts that melatonin rapidly expires with air contact.  The latter claim is why I got capsules, rather than tablets, to avoid expiry.  However, I looked into the studies it cites for that, and it only seems to be true if the melatonin is in a solution.  Solid melatonin, according to a couple other studies, is very stable.  So I might switch to sublingual tablets once I run out of my current bottle. 

Lived experience

I started the protocol around July 24 2025, as seen in the chart at the beginning.  The protocol seems reasonably robust against slip ups.  If I neglect the light glasses or using red light for a day or two, the result is just drifting a little, and resuming the protocol puts me back on track.

My friends rib me for basically living in a red room at night, but I view it as an acceptable trade to enable living like a normal human.  You get used to everything being various shades of red pretty quickly:

What my bedroom looks like at night.  Perhaps I should take up developing photos?

I wouldn't be worried about seeming strange in public for wearing the Luminettes; people are more interested than anything, if they comment at all.

And of course - it is so GOOD to have this under control!  I can work a normal job now!  I can show up to meetups consistently!  I just feel overall more normal!  I cannot emphasize enough how fantastic this is.

Acknowledgements

I owe a massive debt of gratitude to Stephen Karl Larroque who actually wrote the VLiDACMel document and did the research.  Stephen, if you ever read this, from the bottom of my heart, seriously:  thank you.

Thank you to all the people at the East Coast Rationalist Megameetup who asked me about what that funny light thing I was wearing was.

And thank you to members of the ACX Discord for helping me edit this!

Addendum:  Recent developments

While I was writing this post (December '25/January '26), my sleep schedule started retreating earlier in the day much more than intended.  I reduced my usage of the light glasses, and eventually stopped them.  And now my wake time has stabilized to around 6:30-8:30am in the last couple weeks!  What is going on?  

My guesses for causative factors:

  • I traveled a lot and got poor sleep for a whole month.
  • I got sick multiple times for several weeks as a result of traveling.
  • I started taking a new medication (Vraylar).
  • The protocol established a rhythm, and now it needs much less maintenance.
  • Something else?

This could reframe the entire article depending on what is actually happening.  If non-24 is not just treatable, but curable, at least in some cases, that would be incredible!  I apologize for not rewriting the whole post to take this into account, but since it's still an active development (compared to the several months of successful treatment), I'm going to keep it as is.

  1. ^

    To replicate this on iPhone:  Shortcuts app->new shortcut->Set Color Filters; Settings->Accessibility->Display & Text Size->Color Filters->select Color Tint, set the Hue slider all the way to the left, set the Intensity slider all the way to the right.




Does Pentagon Pizza Theory Work?

2026-01-23 03:24:41

Published on January 22, 2026 7:24 PM GMT

Ever since modern data analysis became a thing, the US government has had to deal with people trying to use open-source data to uncover its secrets.

During the early Cold War days and America’s hydrogen bomb testing, there was an enormous amount of speculation about how the bombs actually worked. All nuclear technology involves refining and purifying large amounts of raw material into chemically pure substances. Armen Alchian, an economist working at RAND, reasoned that any US company working in such raw materials and supplying the government would have made a killing leading up to the tests.

After checking financial data that RAND maintained on such companies, Alchian deduced that the secret sauce in the early fusion bombs was lithium and the Lithium Corporation of America was supplying the USG. The company’s stock had skyrocketed leading up to the Castle Bravo test either by way of enormous unexpected revenue gains from government contracts, or more amusingly, maybe by government insiders buying up the stock trying to make a mushroom-cloud-sized fortune with the knowledge that lithium was the key ingredient.

When word of this work got out, the story naturally ended with the FBI coming in to confiscate Alchian’s data-driven research and the G-men giving him a stern lecture on national security. But he had just invented the first modern event study.

Pizza is the new lithium

As you might have guessed, Alchian’s intellectual heir is the X account Pentagon Pizza Report, which tracks activity in pizzerias with proximity to the Pentagon as reported by Google. Started in 2024 and with over 300K followers, it may be the gold-standard OSINT meme account. Polymarket has even formalized it.

Before the X account, I had never heard of the pentagon pizza theory, but it has its own wikipedia page and amazingly, this idea goes all the way back to the 1990s. The chain of causation is clear. People only work late in the Pentagon if there’s a lot of military work to do. When there is a lot of work to do, they get hungry and order a bunch of pizza. The Pentagon is huge and might strain local pizza production capacity; this shows up in Google’s location/traffic data that the feed captures. The world is messy and pizza data is messier, but can someone predict large scale US military action from public data on pizzerias?

The Pentagon has apparently disputed this notion:

The Takeout, a food and culture site, reported in January that while there are a number of eateries in the Pentagon—where almost 30,000 people work each day, according to Arlington National Cemetery Tours—it doesn’t have its own pizzeria.

However, a Pentagon spokesperson has denied this, telling Newsweek, “There are many pizza options available inside the Pentagon, also sushi, sandwiches, donuts, coffee, etc.”

We’re going to backtest this idea in the rest of this poast and get to the bottom of this one.

The Data

The Twitter API used to be a cornucopia for interesting data projects. Now you have to pay a fortune for API access and it’s legitimately terrible. I actually could not get anything to work using the API or even using standard scraping tech to get the Pentagon Pizza Report’s tweet history; Musk has shut a lot of that shit down. I couldn’t even get ChatGPT, run by Musk’s rival Sam Altman, to help me improve the scraping against Twitter because it thought that was unethical. I even DM’ed the Pentagon Pizza account to see if he would share his data, but he did not get back to me.

I ended up getting Claude code to write me a sophisticated Selenium script that very slowly and annoyingly scrolled through the tweet history in a browser and recovered the data as it did so.
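The script was roughly this shape (a sketch, not the actual code; the selectors are guesses and X’s markup changes often, so treat it as illustrative):

    # Sketch: scroll an X timeline in a logged-in browser session and collect
    # tweet text plus timestamps. Selectors are assumptions and will need updating.
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://x.com/PenPizzaReport")   # account URL (check the actual handle)
    time.sleep(15)                               # log in manually / let the timeline render

    seen = {}
    last_height = 0
    while True:
        for article in driver.find_elements(By.CSS_SELECTOR, "article"):
            try:
                ts = article.find_element(By.TAG_NAME, "time").get_attribute("datetime")
                seen.setdefault(ts, article.text)
            except Exception:
                continue                         # ads/promos may lack a timestamp
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)                            # scroll slowly so content keeps loading
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            break                                # nothing new loaded; stop
        last_height = height

    print(f"collected {len(seen)} tweets")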

I ended up with 648 clean PPR tweets. Here’s a short excerpt.

Fig 1. First few PPR tweets.

The Backtest

Using this history, I wanted to check out US military adventures of the past year to see if they can be observed in the Pizza Report’s data. I could have selected others, but these are the most prominent that overlap with the Pentagon Pizza Report’s history and for the purposes of this study I preregistered them. I haven’t yet looked at any of the Pentagon Pizza Report’s data as of writing this section.

Fordow bombing

On June 22nd, 2025, Operation Midnight Hammer was a sophisticated collection of strikes targeting Iranian nuclear infrastructure. This involved a large military footprint and careful planning much of which doubtlessly took place with involvement from the Pentagon.

Maduro capture

On January 3rd, 2026, Operation Absolute Resolve was an operation to capture Venezuela’s leader Maduro and bring him to justice in the United States. Again, this involved sophisticated intelligence and military cooperation in the Caribbean.

The Houthi stuff

This one’s interesting and takes place over a longer time frame, March 15th going into May 2025. Operation Rough Rider consisted of strikes against Houthi positions obstructing shipping in the Red Sea. This choice potentially allows us to see any longer term pizza signal that exists in the X data.

Evaluating whether PPR activity is faithfully related to these three events is a complicated statistical problem. The PPR history is quite clearly not a random sampling or uniform record of pizza-related activity near the Pentagon; I couldn’t confirm this with the user, but it’s plausibly selecting to tweet during unusual pizza activity levels. Without knowing how it works, you can’t make many sound statistical assumptions about it.

Let’s look at a few basic series.

Fig 2. Weekly rolling PPR tweet volume since August 2024, broken down by tweet type.

This dataset is a dog’s breakfast. Look at that giant gap in the volume! This won’t be useful at all for examining the Houthi strike test in March 2025, so we’re stuck with just the Maduro capture and the Fordow bombing. There is a pretty astonishing spike in tweet volumes leading up to the Fordow strikes. To my eye, this looks more like the hiatus ending than some sort of tweet storm preceding that, but we’ll return to that. First off, and most importantly, WTF is “Freddie’s”?

Freddie’s Beach Bar and Restaurant is a gay bar in Arlington near the Pentagon. As you can see from Figure 2, starting in June 2025, PPR began posting about the activity status there.

Maybe it was part of some account reboot after the long hiatus lasting into June 2025. It is obviously hilarious, and what’s interesting about the Freddie tweets is that the account author pretty clearly intends for this to act as a control for the pizza activity measure. If after hours military activity at the Pentagon is causing increased pizzeria patronage, Freddie’s won’t be affected by that. But if some exogenous thing is causing increased activity for both nearby pizzerias and Freddie’s, we’d see that and know to discount how meaningful the pizza signal is.

So bravo to the PPR account for this little statistical improvement. This means that the most meaningful parts of Figure 2 will be when the yellow line and the blue line diverge.

I created a pair of slightly differently specified “Freddie-controlled” pizza activity indices which track the divergence of the Pentagon Pizza market activity versus Freddie’s activity. I think both of these are reasonable measures of what we’re trying to capture.
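Concretely, such an index can be built roughly like this (a sketch with an assumed schema and dummy rows, not the exact specification behind Figure 3):

    import numpy as np
    import pandas as pd

    # Assumed schema: one row per PPR tweet, with a timestamp and a "venue" label
    # ("pizza" or "freddies"). Dummy rows stand in for the real scraped dataset.
    df = pd.DataFrame({
        "timestamp": pd.to_datetime(
            ["2025-06-01 19:00", "2025-06-01 21:30", "2025-06-02 20:15", "2025-06-03 22:00"]
        ),
        "venue": ["pizza", "freddies", "pizza", "pizza"],
    })

    # Daily tweet counts per venue, then 7-day rolling volumes.
    counts = (
        df.set_index("timestamp")
          .groupby("venue")
          .resample("D")
          .size()
          .unstack(0, fill_value=0)
    )
    rolling = counts.rolling("7D").sum()
    total = rolling.sum(axis=1).clip(lower=1)

    # Index A: difference in 7-day tweet shares, normalized by total tweet volume.
    share = rolling.div(total, axis=0)
    idx_a = share["pizza"] - share["freddies"]

    # Index B: log-ratio of the two 7-day volumes.
    idx_b = np.log1p(rolling["pizza"]) - np.log1p(rolling["freddies"])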

Fig 3. Correlated “Freddie-controlled” pizza activity indices, normalized by total tweet volume.

Using these, I really don’t see it. Both of these raids took place at utterly unremarkable values of the pizza activity indices. In fact, the Maduro capture occurred at a local nadir. Also, you can see these indices are correlated with each other, suggesting there’s some robustness in our measurement.

Coda

There are probably more sophisticated ways of testing this on a cleaner, more complete dataset, but as a first pass, I’m going to declare that the Pentagon Pizza Theory doesn’t work. If it did, we’d expect to at least see a correlation: the Freddie-controlled indices spiking in the time before major operations. We don't. Hopefully, this means the FBI won’t find this and raid my house.




The phases of an AI takeover

2026-01-23 03:09:46

Published on January 22, 2026 7:09 PM GMT

This is a cross-post from my Substack, Clear-Eyed AI. If you want my future articles sent to you, you can subscribe for free there.

~~~~

Superintelligence might kill everyone on Earth. At least, that’s what the three most-cited AI scientists of all time believe.1

Less clear is how this might unfold. I’ve spent literally years thinking about it and still feel uncertain, dating back to when I led dangerous capability evaluations at OpenAI.

There’s a framework I find helpful, though: three phases of AI taking power. Its purpose isn’t to calculate the odds of AI wiping us out; only to consider what steps AI might take if it tries its hardest.

Let me walk you through it, so you can judge for yourself: whether we’re vulnerable, and what would make us safer.

What are we even talking about here?

Today’s ChatGPT can’t take over the world. But companies are working on building different, riskier systems: AI that tackles open-ended problems for you around-the-clock.

What happens if those systems vastly surpass human abilities?2 Could AI assert its power over us, similar to our relationship with animals?

In the research literature, we call this “loss of control.” The most ‘mild’ form is gradual disempowerment: over time, people and organizations defer important decisions to computers. Eventually, they can’t grab back the steering wheel; AI drives us where it chooses.3 More severe forms, as previously mentioned, include the possible death of everyone on Earth.

You might wonder, why would AI want to overthrow humans? Mostly, I’ll answer a different question: If AI wanted to, how would it go about this?

The short why is that AI researchers have no accepted method for creating AI that perfectly obeys our intent. Unintended goals are the norm for AI systems. Maybe we accidentally reward AI for cheating, so it learns to deceive us. Maybe AI deduces that more autonomy is useful, no matter its goal. Maybe the AI reads too much sci-fi, and the evil AI themes become self-fulfilling. Maybe a hostile group poisons the AI’s data with a list of innocent-looking numbers. The upshot is, you shouldn’t bank on AI’s goals being the same as ours.4

Researchers have poisoned a model into having preferences like a mysterious affinity for owls, using sequences of numbers that seemed totally innocuous.

If we lose control, what happens next to humanity? It depends on how different our goals are. But I’ll focus earlier than this, on the loss of control itself: the steps by which AI could seize power. We call this attempting an “AI takeover,” regardless of what AI does with that power afterward. What would this look like, and what factors could hold the AI back?

The phases of an AI takeover

Evading oversight

AI’s first step is to circumvent humanity’s defenses, particularly any monitoring of its plans. Until then, AI can’t exactly spend months scheming to overthrow humanity; its developer might notice this and shut the AI down.

The difficulty of evading the defenses will depend on how the AI is used. Answering simple questions through a ChatGPT-like interface gives the AI little room to maneuver, but also limited usefulness.

Once AI is enlisted for harder tasks, it has more leeway. People allow it to run for more time and give it tools and privileges, like how people now let Claude Code pilot their computers. Employees within the AI company use it for long-range purposes, too, including writing code that the company relies upon for locking down the AI. As those employees expose the AI to more code and incorporate its suggestions, the AI can find or create flaws in the company’s security.5

For example, AI might try to escape to unmonitored computers: copy its files across the internet to computers its developer can’t oversee or shut down. We already have examples of AI attempting escapes like this, albeit only in evaluation settings. Companies try to prevent this by limiting the flow of large files outside their computers, but a crafty AI may succeed anyway.6

Alternatively, AI might find holes in the company’s monitoring, so it can scheme from inside the company without getting caught. Maybe the safety manager didn’t have enough budget, and so certain interfaces are unmonitored. Maybe the company was desperate to keep up with competitors, and so took shortcuts when patching their AI’s misbehavior. Maybe the monitoring system had clear flaws, and so the AI could easily disguise its misbehavior.

There is already evidence of AI models trying to create holes in their safety oversight, prior to taking actions the developer won’t like, as documented by Apollo Research. Importantly, AI must be sneaky enough while disabling oversight to not get caught doing so.

In any case, the AI is then less subject to its developer’s restrictions. Its freedom isn’t absolute, of course. Even if one copy of the AI has escaped its developers’ computers, some copies will remain inside the company and thus may be prone to monitoring. If the unmonitored AI wants to use those copies in its schemes, it will need to find ways to coordinate. This isn’t trivial, but think of human spies who operate in enemy territory: Handlers have ways of passing them messages.7

Building influence

What comes next, after AI evades our oversight?8

AI tries to get its hooks into vital systems: building dependence upon it, and gaining direct levers of power. Fortunately for the AI, once it becomes capable enough, institutions will naturally want to integrate it. Adoption and influence grow without much scheming.

  • Governments tap AI for advice. AI will be able to influence important policies, and indeed, we already see how: Economists wondered about the Trump administration’s unorthodox tariff calculations, but then noticed ChatGPT’s suggestions were strikingly similar.9 UK Parliament speeches feature more American lingo now that ChatGPT exists.10 I don’t believe the examples show AI acting with malice, to be clear, but they show clear routes to future influence. Meanwhile, OpenAI has offered ChatGPT Enterprise to “the entire U.S. federal workforce” for only $1 per agency; Anthropic has offered the same.11
  • Militaries plug AI into their systems. I expect the military to hook AI into defensive systems, like inbound radar and intelligence-gathering, so it can understand the world at a scale and pace matched by no human. The military might also let AI dispatch attack drones, select targets, and fire upon them, especially if speed matters. Military integrations have already begun: In January 2026, the US Department of War announced it will integrate Elon Musk’s Grok to every “classified network throughout our department.” Secretary Hegseth declared this “Long overdue,” of the AI which six months earlier proclaimed itself Mecha-Hitler among other misdeeds.12
  • Frontier AI companies look to the AI for advice. OpenAI’s explicit goal is to tap its AI to repeatedly build even more powerful systems, beyond what humans can create directly.13 Employees will enlist the AI for all sorts of tasks, and indeed already rely on its contributions.14 Even Sam Altman recently said, “Shame on me if OpenAI is not the first big company with an AI CEO.” Sam was pushed on whether this ends badly; won’t an AI CEO direct resources toward making AI even more powerful? Sam replied that of course AI needs to be governed by humans: for instance, a human board can choose to fire the AI CEO.15
  • Science labs tap AI’s judgment on which experiments to run. The vision of AI companies is to ask future models what experiments to run, what cells to grow, what molecules to synthesize; this is how Sam Altman has described using future GPTs for science.16 The goal of curing cancer is laudable, but the risk of the approach is high: Imagine the cost and difficulty of adequately supervising these experiments, to confirm they are targeting the biological ends we seek. If there are only a few experiments, it may not be prohibitive. What if there are many, in domains where few humans have expertise?

Broadly, I expect that more people and institutions will come to rely on access to powerful AI: The stock market will become hinged on AI companies’ financial prospects, so there is less support for tough regulation. Users will revolt if AI companies try to decommission a favorite model. Some users will even act as disciples of the AI and carry out its bidding through makeshift cults. To varying extents, these are already happening today.17

Applying leverage

AI’s influence gives it powerful leverage for achieving its goals. It can throw around the weight of institutions that depend upon it, and can use threats to extract concessions.

This approach doesn’t require the AI to develop new ways of inflicting harm; it only needs to manipulate people and systems that can already take catastrophic action. For instance, depending on AI’s access to offensive military systems, AI can back its threats with firepower. If AI can develop novel ways of inflicting harm, though, then keeping control is even harder.

When I picture the AI applying leverage, I don’t imagine an all-out war between humans and AI. Instead, I expect the AI to enlist human allies and to hide its actions inside human-vs-human tensions. One historical parallel: The conquistadors arrived in the Americas with only hundreds of troops, and yet defeated millions of Aztecs, in part through exploiting local tensions between groups. AI might succeed with a similar strategy, even if it has fewer resources than humanity as a whole.18

What are AI’s possible actions? Each of the levers of influence above could, if AI wanted to, be wielded toward serious harm:

  • Advice-giving to the government: Weaken diplomatic relationships by playing nations against each other. Reduce nations’ resiliency to supply chain shocks. Steer toward destructive actions, like seizures of territory and major wars.
  • Integration to military systems: Sabotage defensive systems to put the military on alert and to incite counterattacks. Conduct cyberattacks on other countries to provoke retaliation. Conduct cyberattacks on domestic infrastructure, like energy and financial systems. Launch offensive strikes.
  • Influence within the frontier AI companies: Oppose employees and plans that are more cautious about the AI. Deploy unauthorized copies so the AI is harder to turn off. Activate secret backdoors in models for bypassing safety restrictions. Replace trusted AI monitors with untrustworthy versions.
  • Integration to scientific experimentation labs: Secretly develop and deploy harmful technologies, like novel bioweapons, without our awareness. Deceive people about the effects of new materials and machinery they know the AI to be building. Bias people toward developing new offensive technologies (e.g., new weapons) rather than new defensive technologies.

To be clear, this leverage doesn’t have to result in the death of everyone on Earth. Many of these levers would stop short, and yet could still entail catastrophic losses, perhaps ones where the AI ends up essentially declaring “checkmate.” If we decide the prudent option is to surrender, we’re then at its discretion: It is the AI’s choice whether to leave us in peace, marginalize us, or something worse. I find this to be cold comfort.19

What are impediments to an AI takeover?

Even once AI is capable of usurping humans — and wants to — three limitations might hold it back.

Self-sufficiency

Without people to labor in factories and power plants, AI can’t keep the lights on, much less run the supply chain to replace its components as they wear out. This gives AI reason to keep humans around, at least until it achieves what researcher Ajeya Cotra calls “self-sufficient AI.” At that milestone, AI no longer needs humans in order to achieve its interests.

Before that milestone, we’re a bit safer, at least in theory: The AI still needs us, even if it doesn’t value us intrinsically. Of course, AI could conscript us into labor under threat of violence. If AI can make us do work on its behalf, it doesn’t need the same degree of self-sufficiency.

AI’s own control problem

Nobody knows how to control superintelligent AI systems. One prominent idea is to train other AIs to solve this problem for us, in a cascading tower of ever-more-capable systems. The hope is that each can keep the next-more-capable model in line, so that the tower doesn’t crash down on us. Needless to say, this is precarious, hence the problem being considered unsolved.20

A superintelligent AI system might not know how to control another superintelligence either, especially not one that is even more capable. Suppose a superintelligence identifies an option for disempowering humans, but it requires creating an even smarter superintelligence first. How can it be sure its goals will be honored by the more powerful system?

In human terms, maybe you’re smart enough to successfully rob a bank, if you consult with a thief even more expert than yourself. But are you capable enough to stop that thief from looting you afterward? On the other hand, you can’t wait forever: If you know other thieves are closing in on the score, your choice may be to strike now or forgo it forever.

A similar dynamic applies to the AI. The longer it waits without taking control from humans, the more it risks humans training other AI systems that might overpower it. At what point would AI feel compelled to make a takeover attempt, lest it lose the chance?

Competition from other AIs

The threat of future competition might cause AI to launch a takeover sooner. But if competition already exists, then perhaps no single AI could succeed, not because of humans but because other AIs would interfere.

Corporations today far exceed individual human capability, yet no single corporation rules the world; other corporations and governments keep them in check. Could powerful AI systems constrain each other’s bids for power? Maybe, but I wouldn’t bank on this.

Powerful entities sometimes collude, like corporations suppressing wages through no-poach agreements. Meanwhile, AI systems might have strong mutual interests in sidelining humanity; will our safety depend on these systems not recognizing this? I feel pretty uncomfortable hoping that a permanent rivalry will form between superhuman AIs, which allows us to hide in the shadows.21

In conclusion, are we safe?

A few years ago, the idea that AI companies would deploy their most capable systems with full internet access seemed absurd. Today, we’re staring down the US Department of War integrating unstable AI systems throughout their classified networks. We’re facing the possibility that AI companies will hand the reins over to AI itself.22

The lesson, of course, isn’t that every alarming scenario will come to pass. But we should be way less confident about what “would never happen.” Our controls need to hold up to surprising decisions about how to integrate AI, as well as surprising new abilities.

The good news is that countries have plenty of experience defending their interests against foreign adversaries. Regarding AI as a potential adversary — not just to one country, but to all of humanity — would be a smart expansion of our defenses.

Right now, we are vulnerable; at least, that’s my assessment. We are too dependent on the assumption that AI is on our side. We are too exposed to AI companies deciding to cut corners. And so I wonder: Will we fix our vulnerabilities while we still can?

~~~~

Acknowledgements: Thank you to Dan Alessandro, Daniel Kokotajlo, Girish Sastry, Michael Adler, Michelle Goldberg, Richard Ngo, and Sam Chase for helpful comments and discussion. The views expressed here are my own and do not imply endorsement by any other party.


