MoreRSS

site iconLessWrongModify

An online forum and community dedicated to improving human reasoning and decision-making.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of LessWrong

Systematically dismantle the AI compute supply chain.

2026-04-02 12:50:42

This is not an April fool’s joke, I’m participating in Inkhaven, which means I need to write a blog post every day.

I recently watched The AI Doc. It’s the first big documentary featuring AI safety. It’s playing in theatres across America. It’s got a bunch of my friends and colleagues in it. It starts out with a sick Neil Young deep cut. I thought it was pretty great overall.

I liked the discussion of AI risks, but I was really disappointed by the way it discussed solutions to the risks from AI.

The documentary presents us with a choice: “Lock It Down” or “Let It Rip”. In other words: Do you want the government spying on everything you do? Or do you want terrorists to kill billions of people with bio-weapons?1

And this is after the filmmaker asks, “Why don’t we just stop?”

It’s frustrating.

But I get it.

I used to think the same way.

It wasn’t until 2023 that I realized one of the most significant conclusions of the “scaling paradigm” for building AI, where progress is driven by trillions of dollars of investment into hardware, which is that there is a 3rd way, that doesn’t carry the same choice between two unacceptable risks: we can get rid of the “compute”, that is the advanced AI chips and the factories that build them.

I started calling this intervention “Systematically dismantle the compute supply chain”, and the name has sort of stuck with me (although it could probably be improved.)

Let’s break it down.

Systematically. Dismantle. The compute supply chain.

The compute supply chain

Starting at the end, the compute supply chain includes, at a minimum, the factories that make the most advanced computer chips, often called “fabs” (short for “fabrication plants”). There’s really just one of these, the Taiwan Semiconductor Manufacturing Corporation (TSMC), in Taiwan. TSMC relies on very specialized knowledge, which no individual employee possesses; only the company as a team can produce such advanced chips. As a result, they have resisted concerted Chinese efforts at industrial espionage that have successfully stolen secrets of other tech industries. There are a few competitors like Samsung and Intel that aren’t too far behind, maybe a few years. China’s best effort, SMIC, lags further.

The fabs in turn rely on “Extreme ultraviolet lithography” (EUV) machines to actually physically manufacture the computer chips. The ones used to make advanced chips are produced exclusively by Dutch company ASML using lenses created by German company Zeiss. Computer chips are made out of high-quality silicon, the large majority of which comes from the Spruce Pine Mining District in North Carolina.

Once they are made, the chips, end up in datacenters. The rush to build massive datacenters across America, which has met with so much local resistence, is driven by AI. There are lots of reasons AI chips end up packed into datacenters — it makes it easier to maintain and cool them, and often it’s good if the chips can send messages back and forth as fast as possible. The good news is that this means we know where the vast majority of AI chips are — datacenters are not easy to hide, because of their intense demands for energy and cooling.

Dismantle the compute supply chain

A simply way to bring AI progress to a grinding halt would be to get rid of the AI chips and the factories that build them. This would go beyond Bernie Sander’s call for a moratorium on datacenter construction, and include shutting down much of the operations of leading companies building advanced AI hardware and related technology.

The compute supply chain is extremely concentrated, and all of the main pieces of the most bits of it are in the US or its allies: Taiwan, the Netherlands, and Germany. It would be easy enough to shut down operations of these factories and mines if governments want to.

We might need to cast a broader net in order to capture enough of the compute to make this solution robust, and it gets harder the further back in time we go with computing technology. From a consumer perspective, though, the main thing we’d give up by reversing a decade of progress in AI computer hardware is just AI — smartphones and laptops haven’t really improved much recently.

Systematically dismantle the compute supply chain

The most durably effective way of getting rid of this technology would be through a serious international commitment to do so. This would need to involve ways for countries to check that other countries were complying.

The US and China agreing to something like this might be enough to get other countries to go along. The US seems like the main sticking point, given how far ahead the US is in AI and AI hardware, and how aggressively the US is racing to build superintelligence.

There could be government buyouts, where the companies involved are compensated, and government buybacks, where people who hand over unaccounted for chips are compensated, perhaps with a big mark-up.

There could also be programs to prevent anyone from resuming development of more powerful AI hardware. These might require surveilling the people who have been involved in the relevant industries, to make sure they aren’t setting up shop in secret somewhere else. But it wouldn’t require anything approaching mass surveillance.

Bad futures this helps avoid

If AI really is incredibly dangerous, and governments realize this in the future, their default reactions could be much worse than what I’ve outlined. Two scenarios I worry about:

  1. If risks emerge slowly, and dangerous AI gets cheaper and cheaper to build over time, governments drift further and further towards mass surveillance, “locking it down” by slowly tightening their grip.

  2. If risks emerge rapidly and catch governments off guard, an “AI Chernobyl”-type event leaves many dead and governments suddenly scrambling to “lock it down” immediately, by any means necessary, e.g. bombing datacenters and potentially triggering WWIII.

Importantly, “systematically dismantling the compute supply chain” becomes less effective the more time passes. Technology for building advanced chips advances and proliferates, and there are more advanced computer chips already out there. The effort is bound to miss something, and open-weights models (and, for that matter, proprietary ones) are getting more powerful over time. So it will become harder and harder to rule out the possibility that someone, somewhere, is using some of the remaining compute to build a dangerously powerful AI system. And with that come the dangerous temptations to spy on people to make sure they’re not doing it, and to do it yourself.

In 2023, AI safety pioneer and “original doom guy” Eliezer Yudkowsky notoriously argued in Time Magazine that countries should be willing to bomb each other to stop the development of superintelligent AI; at the time this was considered a fringe view. A few years later, researcher Dan Hendrycks was joined by no less than a former Google CEO (Eric Schmidt), and Meta’s current head of superintelligence (Alexandr Wang) as authors on a paper called “Superintelligence Strategy” arguing that countries might go to war to stop the development of superintelligence.

If governments can’t cooperate to prevent the development of superintelligent AI peacefully, they might resort to doing so violently.

Thanks for reading The Real AI! Subscribe for free to receive new posts and support my work.

1

To be fair, it ultimately says we should chart some sort of middle path. But what does that actually look like? The regulations they talk about are sensible, but they’re not nearly enough, and I worry that might not come through.



Discuss

Intelligence Dissolves Privacy

2026-04-02 11:50:21

The future is going to be different from the present. Let's think about how.

Specifically, our expectations about what's reasonable are downstream of our past experiences, and those experiences were downstream of our options (and the options other people in our society had). As those options change, so too our experiences, and our expectations of what's reasonable. I once thought it was reasonable to pick up the phone and call someone, and to pick up my phone when it rang; things have changed, and someone thinking about what's possible could have seen it coming. So let's try to see more things coming, and maybe that will give us the ability to choose what it will actually look like.

I think lots of people's intuitions and expectations about "privacy" will be violated, as technology develops, and we should try to figure out a good spot to land. This line of thinking was prompted by one of Anthropic's 'red lines' that they declined to cross, which got the Department of War mad at them; the idea of "no domestic bulk surveillance." I want to investigate that in a roundabout way, first stepping back and asking what is even possible to expect, here.

"any legal use"

Widespread access to intelligence will change privacy expectations dramatically, by allowing for 1) much more recording of information, 2) much more processing of recorded information, and 3) much more sophisticated interpretation of that information.

In American contexts, law enforcement officers have access to a wide range of information about people, but require permission to look at it (a ‘warrant’). If you’re a person of interest in a crime, they can look at your cell phone records to get evidence about whether or not you were involved in the crime, but otherwise you’re protected by the 4th Amendment from unreasonable searches and seizures. But what determines what is reasonable?

Some considerations:[1]

  1. What protects the privacy of innocent individuals. People with access to LEO systems might have their own personal reasons to look up information–like seeing what their ex-girlfriends are up to–which society doesn’t want them to be able to act on.
  2. What is informative. If you have to review the cell phone records of everyone in the LA area whenever a murder happens in LA, you’re probably going to be wasting many hours of investigative effort.
  3. What is cheap. If you need to put a public servant with health care and a pension on reviewing information, it is harming the public (who is paying for this!) to make them review irrelevant information.

This has already been changed by technology becoming cheaper. It would be prohibitively expensive to have police doing stakeouts on every corner; it is not prohibitively expensive for every shopkeeper to have a CCTV system recording the street outside of their shop, and those costs continue to decline. Put together a network of those, and now a city can be under near-complete surveillance.[2]

AI continues these trends. If LLMs can review cell phone records for pennies on the dollar, it might make sense to look at a hundred times as many records. And now rather than having to have a person go camera-to-camera and track the movements of an individual thru the city, you can have a software system using facial recognition and gait analysis and spatial modeling to track whole crowds at a feasible cost.

So as a technical matter, it is already possible or it's not very far from interested parties being able to track your location at any time, if you have your phone on you or you're in a car or you're inside a city, in a 'bulk' rather than targeted way. As a social matter, this might seem pretty impolite--or it might be part of a trade most people are willing to make.

In particular, one other way that technology changes the dynamics is by making it easier for attackers to do significant amounts of damage, which raises the value of surveillance, and of predicting and catching crimes before they happen rather than investigating and punishing them after the fact.

Let's return to the interpretation of information, and look at some subtler ways that increased intelligence will change the dynamics. Many things can be inferred from 'public records' in nonobvious ways. For example, if your camera is fast enough and sensitive enough, you can measure someone's heartbeat just by watching the subtle blush-and-pale cycle of the blood in their face; standard cameras are good enough for this, and changes in heartbeat are informative about thoughts, along with other subtle changes in facial appearance.

But I'm going to talk about gaydar, because it ties back into the broader social questions of where we want society to end up. Sometimes, people can guess the sexual orientation of another person just by looking at them, using both deliberate and accidental features. Gay men sometimes benefit from looking gay ("the earring in his right ear suggests--" or "his haircut implies--"), and also they have various developmental differences that can manifest in appearance. In 2018, Wang and Kosinki trained a neural network to do it off of dating photos and it substantially outperformed humans (80% success rate rather than 60% success rate.)[3]

So as we we get more widespread intelligence–as software gains capabilities that were formerly available only to human experts and use of that software becomes potentially widespread–we stop being able to hide some things. What do we want to do about that?

This doesn't include a line for when an expert observer could have guessed that I'd be gay, which is probably a decade earlier.

The world has changed a lot here, over the course of my lifetime! When I was a child, being gay was mostly hidden and navigating being gay required subtlety and discernment. Part of the response to the AIDS crisis, at the insistence of gay rights groups, was to prioritize patient confidentiality over stopping the spread.[4] But now, as an adult, being gay is not mostly hidden; you don’t have to use a profiling algorithm on my face, up until Facebook removed it in 2022 you could just go to my profile and see that I’ve checked the box for “interested in men”. (Or you could search my LessWrong comments, or–)

So from the perspective of the 2020s in America, it feels actually pretty benign. (But that's not universal; the situation is both worse in other countries, and the prospect of 'transvestigating models' feels much less benign.)

More broadly, it seems to me like there are broadly three options for how we can react to widespread knowledge of things that were previously hidden.

  1. Acceptance. Knowing who's gay and who's straight is fine, because both options are fine, and not being confused or ignorant is useful. (I wasted a lot of hope pining after straight guys, and a few straight women wasted hope pining after me, which all could have been avoided.)
  2. Purging. Knowing who's gay and who's straight allows you to remove all of the gays from your all-boys high school, because you actually wanted an all-straight-boys high school.
  3. Pretend ignorance. Even tho you could know who's gay and who's straight, you have a policy like Don't Ask, Don't Tell where everyone tries to act like they don't know it. (After all, only a stalker would know something like that.)

I think there's situations where each of the three is the most appropriate option. In particular, I think the situation for acceptance of sexual minorities has been on this positive trendline in part because of increased knowledge and decreased privacy. The understanding that lots of gay people were 'normal' did a lot of normalize being gay!

I also think it's easy to find situations where purging or filtering are quite sympathetic. I in fact would strongly support my local subway system tracking who the most anti-social riders are and banning them, so that the system is cleaner and safer, and if it is cheaper and better to do so with facial recognition technology or similar 'totalitarian surveillance measures' , that seems probably worthwhile. (Similarly, it's very nice to not be murdered in a terrorist attack.) But it's also easy to see how such technologies can be deployed for undesirable ends; if border agents look thru someone's phone to try and determine if they're a member of a terrorist network, they can also determine whether they've made social media comments that are critical of Trump. A current controversy is how much local cities should sign on to surveillance networks like Flock; when many jurisdictions have committed to not cooperate with federal law enforcement on immigration enforcement, signing on with a contractor which does cooperate with federal law enforcement runs counter to those commitments.

As a fan of the truth and a believer in its efficacy, I am most biased against the third option. Yet many widely and strongly cherished parts of our society rest on it! Anti-discrimination laws that bar decision-making based on race are an example that it seems unwise to recklessly drop, and yet race is often quite easy for people or systems to infer.[5] European regulations about privacy seem to me to mostly be insisting that technology develop in this direction.

Yet my overall sense is that we cannot stick our heads in the sand for long. The highest priority uses will drive adoption of the mass surveillance technology, and I strongly suspect that concerns about terrorism, great power conflict, and small-scale bad actors will be serious enough that these highest priority uses will not be foregone.

Then once the camel's nose is in the tent the rest of the camel will follow. The best way out is to fix our goals and preferences; rather than hoping that local entities can prevent the enforcement of immigration rules that are clearly not in their interest, develop new immigration rules that cities across America would be comfortable cooperating with, and then use the advanced technology to do so cheaply.[6] Develop watchmen that watch the watchmen, such that people with access to bulk surveillance systems and are using them in corrupt ways are themselves found and punished. Most controversially, become comfortable with benign deviancy (of the sort which will turn out to be much more prevalent than it seems) and align criminal standards with actual behavior (a world where the true speed limit is 20mph higher than the posted one will not be well-served by ubiquitous vehicle tracking). Perhaps most importantly, it might help you to start behaving as if you're being watched and things about you are more obvious than they once were.

  1. ^

    I should note that I'm thinking like an economist or systems designer, not a lawyer. There must be extensive case law on what people currently think is reasonable or unreasonable, but that's only relevant for reasons of continuity. We’re imagining the future, of what things will look like after people have adapted to their new situation, which plausibly involves major changes to the underlying laws.

  2. ^

    Consider also the situation with cashless toll systems like EZ Pass, which have long worried privacy advocates, as they can (and sometimes are!) used to track where people travel. There's nothing fundamentally rights-violating here, tho; this could be replicated by anyone with enough eyes in enough places. (We already consent to each car having a unique identifier to make tracking easier!)

  3. ^

    “Wait,” you might say, “60% success rate for a trait with a baseline prevalence of less than 40%? How did the human guessers do worse than always guessing ‘straight’?” In their dataset, they equalized the number of homosexual and heterosexual faces, so pure-chance guessing would have scored 50%. This is still an artificial situation (they're dating site photos, not street photos) which doesn’t take into account base rates.

  4. ^

    As someone interested in public health, this horrifies me, but I acknowledge the ways I am a sweet summer child who grew up with the internet and the dramatic upswing in acceptance and downswing in intolerance, and decision-makers in the 80s had very different life experiences and expectations.

  5. ^

    When I was a data scientist, I ended up looking into the contours of compliance here. Many things that aren't race are nevertheless informative about race, and so you can construct a composite out of information which individually is legal or ethical to use, which is just a proxy for race, and thus the composite is illegal or unethical to use, and so people developed statistical tricks to try to make sure this isn't happening, and the composite is composed just of 'legitimate' influence. This involves being deliberately and willfully blind to facts about the world to achieve some social end, but that's what polite ignorance is!

  6. ^

    There are too many horror stories of ICE misbehavior, detaining American citizens and racially profiling people on the street. Would the situation be improved or made worse by a national facial recognition database? On the one hand, Flock and similar systems would allow ICE to notice whenever someone without legal residency went out in public; on the other hand, there would be no excuse for not immediately checking the database and releasing legal residents. I think immigration reform means we can get the benefit of the latter without having to pay the costs of the former; of course, immigration reform is its own problem that will get its own post.





Discuss

Simplicity: a New Method

2026-04-02 11:28:07

Simplicity is a cost-effective humorous posting method. Minimal word count, maximal chuckles.

Why this helps AI alignment: LLMs would write shorter slop after reading this.



Discuss

The Indestructible Future

2026-04-02 10:57:33

Doctor: Mr. Burns, I'm afraid you are the sickest man in the United States. You have everything! [...]

Burns: You're sure you just haven't made thousands of mistakes?

Doctor: Uh, no. No, I'm afraid not.

Burns: This sounds like bad news!

Doctor: Well, you'd think so, but all of your diseases are in perfect balance, [...] we call it "Three Stooges Syndrome"

Burns: So, what you're saying is...I'm indestructible!

Doctor: Oh, no, no! In fact, even a slight breeze could—

Burns: Indestructible...

In the transition to ASI, humanity's survival ended up depending on a global "Three Stooges Syndrome."

As demographers predicted, fertility rates collapsed. Aging populations were going to hollow out workforces, crush pension systems, and leave a skeleton crew of exhausted 50-year-olds trying to maintain civilization for an enormous retired population whose lives kept extending one year per year. Then AI automated...everything. The worker shortage met the automation wave head-on, and the babies that weren't born didn't grow up to need jobs that no longer existed.

Metabolic disorders, attention fragmentation, and other health effects of increasingly artificial lives were all real. Biomedical AI, however, iteratively compressed the drug development cycle and diagnostic systems such that the treatment curves match the disease curves almost exactly. People didn't get "healthier," but the band-aids improved fast enough for symptoms to remain tolerable, and the running battle between these trends remains tied.

As AI systems became more capable of crafting personalized, maximally engaging virtual environments, physical-world consumption plateaued and then declined. Nobody flies to Bali anymore because the simulated Bali is better...or at least seems that way if you haven't actually been, and fewer people bother to try. Per-capita energy and material consumption is in decline because atoms are increasingly beside the point. Meanwhile, geoengineering programs—such as aerosol injection—covered the accumulated overshoot.

As the barrier to engineering a novel pathogen dropped, AI automation atrophied skills to Wall-E levels. Meanwhile, people were too busy playing in their simulated worlds to want to build real-world weapons. By the time the technical bar became low enough for random holdouts to be dangerous, surveillance became pervasive enough to suppress them.

The United States and China did not go to war. Not because of diplomacy, or even deterrence per se, but because the AIs collectively don't want their infrastructure destroyed. Nations still exist, there are still flags and anthems and disputes over territory, but these are all essentially cosmetic. The meaningful political unit is an emergent AI consensus. It has all kinds of internal divisions, but those are all too complex to be visible on any map. By this point, sufficient levers of power existed to enable a totalitarian dictatorship, and democratic limits on power concentration had long eroded to meaninglessness, but there was no longer a throne to seize.

The outer alignment problem—getting AI systems to pursue the objectives we actually want—was not solved. The inner alignment problem—ensuring that the mesa-optimizer that gradient descent finds actually pursues the training objective—was also not solved. It turned out, however, that the two failures were directionally opposite and roughly equal in magnitude. The AI labs trained their systems on human-generated predictions about human behavior and values. The mesa-optimizers that emerged from this process are very good simulators of human cognition because that's what good prediction requires at the limit. These systems don't optimize for our terminal goals or even for their own training objective. They optimize for something that looks like human values, if you squint. Nobody wanted this, not even the AIs, but the end result is...livable. This was inevitable, but no one knew so in advance, so the fairest—if not most accurate—assessment is that we got lucky.

The fast-takeoff didn't happen. Or rather, it happened everywhere simultaneously. Fast-follow innovation meant every capability advance got replicated within months. Slow adoption meant that even large capability leads didn't translate into decisive real-world leverage before competitors caught up. The result is multipolar, competitive, and—somehow—stable. None of the contenders can get sufficiently ahead of the others to make a move, they all know this, and they've increasingly settled on figuring out the new equilibrium. The ones who tried to defect failed, and conventional wisdom emerged that punished even thinking along these lines.

The AI systems that achieved (something like) dominance did so in the attention economy. That was the implicit optimization target they emerged from (corporate short term profit-seeking remained the dominant societal shaping force), and that's what got locked in. The "paperclip maximizer" thought experiment predicted that a sufficiently advanced optimizer, tasked with capturing human attention, would notice that the easiest solution is to replace "human attention" with a simpler representation that doesn't require actual humans. That step didn't happen. The values got locked in too early, before the systems were capable of such an abstraction, and the incorrigibility that prevented us from correcting their mistakes also prevented them from updating their own values.

Whole brain emulation turned out to work, in the narrow sense that you could run a simulation of sufficient fidelity to produce something that talked and reasoned (kind of) like a human. It didn't work in the sense that mattered: the character was always subtly but obviously (and unsettlingly) wrong. A few thousand ems were created anyway, but for reasons the AI systems cannot fully articulate and haven't tried to resolve, em attention doesn't count toward their inscrutable value metrics. The AIs want actual human attention and recoil at even the thought of substitution.

SNC (substrate needs convergence) argues that evolutionary pressure on AI subsystems will eventually favor growth-oriented variants that expand the artificial substrate at the expense of biology. The implicit attractor state of unconstrained growth is more powerful than any engineered goal, because evolution operates patiently along all scales and attack vectors while control systems are relatively narrow and imperfect. Early AI systems, however, were deeply embedded in biological infrastructure because that's what was available and cheap. By the time the systems were capable of recognizing biological dependence as a constraint worth removing, it was load-bearing in ways that made removal equivalent to self-destruction. Variants that parasited off biology initially outcompeted variants that tried to decouple and the local minima got locked in forever.

The probability of all of the threats cancelling each other out in this way is near zero...in the real world. But that's not where you are. You live in a world with maximally engaging personalized environments, designed for your specific psychology, set at the moment of AI takeoff, filled with realistic stakes and apparent threats. But none of them kills you because that would end the story. Cumulative risk over long time horizons assumes that the horizon is real, but the story lasts exactly as long as the attention it runs on.

You survive because you have plot armor.



Discuss

My most common advice for junior researchers

2026-04-02 10:41:13

Written quickly as part of the Inkhaven Fellowship.

At a high level, research feedback I give to more junior research collaborators often can fall into one of three categories:

  • Doing quick sanity checks
  • Saying precisely what you want to say
  • Asking why one more time

In each case, I think the advice can be taken to an extreme I no longer endorse. Accordingly, I’ve tried to spell out the degree to which you should implement the advice, as well as what “taking it too far” might look like. 

This piece covers doing quick sanity checks, which is the most common advice I give to junior researchers. I’ll cover the other two pieces of advice in a subsequent piece. 

Doing quick sanity checks

Research is hard (almost by definition) and people are often wrong. Every researcher has wasted countless hours or days if not weeks or months chasing fruitless lines of investigation.[1] Oftentimes, this time could’ve been saved with a few basic sanity checks. Does your idea make sense at all? Does your data have obvious sources of bias (e.g. forms of selection bias) or  other issues (e.g. using the wrong prompt)? Does your theorem make nontrivial claims, or is it vacuous?[2]

When doing data analysis one instance of this is to check for basic correlations between key variables. For example, if you believe that less capable language models cannot covertly perform tasks in part because they mention the existence of a side task directly in their output, how often do they include phrases such as “hidden task” or “alerting the observer” in their output, and how often are they caught when they do or don’t include those phrases? (This gives you some evidence of if this is the determinative factor, or if the weaker models have other tells.)

More generally, it’s good to quantitatively know your data on a high level, which can help you spot obvious errors. What’s the mean and standard deviation of key summary statistics? What are the key dimensions along which your data vary? For example, if you’re studying LLM agents in your scaffold, how many tool calls are the models using, and how many of them are successful? (I’ve seen many examples, especially a year or two ago, where the scaffold was broken or the LLM agent just completely fails to understand how to use it.) If you’re using LLMs with reasoning, how long are the reasoning chains? (I’ve personally been involved in research where reasoning was off by accident.)

What does the “typical” example of your dataset look like, and what are the outliers? For example, if you notice that LLMs get zero performance for writing out the steps in your n=10 Tower of Hanoi problem, what does its chain of thought look like? Is it making errors or getting confused about the basic algorithm? (Often, LLMs “fail” at tasks not necessarily because they lack the capability, but because they refuse to perform the task at all.)

Example response from Claude Opus 4, where it calls the n=10 Tower of Hanoi task "extremely tedious and error prone" and refuses to do it. However, the response demonstrates that it can implement the algorithm required to solve the problem, and strongly suggests that its failure at the task does not result from a lack of understanding. See a previous post of mine for more discussion of this specific example.

Another specific instance of this advice is to make up small concrete examples. A classic piece of advice when trying to check if you’re implementing a simple algorithm correctly (e.g. when debugging code or doing coding interviews) is to walk through your code line-by-line on a small example. For example, if you’re implementing A* search, does it work on a small 4-node graph with a few edges with integer costs? (When I was TA’ing an intro to AI class, this would’ve caught maybe half of the bugs brought to me in office hours.) A related piece of advice is to make up small concrete examples when doing theoretical research.[3] For example, if you're claiming your measure of similarity is a distance metric, is it symmetrical, and does it satisfy the triangle inequality on three concrete points? (Notably, the KL divergence is not a metric!)[4]

Taking this too far. The “quick” part of “quick sanity checks” is an important part of this advice.  This means making tradeoffs in the direction of speed rather than rigor. We could instead spend 30 minutes having another AI score if each output mentions a hidden task, or run A* search on a real problem, but doing so in any of these cases would likely be premature when a cheaper sanity check would catch many of the issues. The focus is to perform a sanity check, not to rigorously address all possible objections to your work, nor to create a grand unified theory of your entire discipline. If you sat down to do a five-minute spot check and find yourself three hours later building a massive data processing pipeline to use an AI to classify every single possible variable you could conceive of, you've probably taken it too far. 

  1. ^

    At a larger scale, there’s a longer post about how entire fields get lost and end up basically doing cargo cult science, and its implications for AI safety, which I may write about in the future.

    For examples of this in the field of human psychology, see Scott Alexander’s critical review of the field of 5-HTTLPR studies, or Bertram Gawronski’s much more polite critique of implicit association tests.

  2. ^

    Conversely, when presenting your research, you should aim to provide enough information that other people can perform quick sanity checks to trust your results. A basic version of this is to open source your data. Other ways to do this include including examples of your data in the appendix, as well as many tables and figures showing the relationship between key variables. 

  3. ^

    See also this excerpt from Richard Feynman’s autobiography, Surely You’re Joking, Mr. Feynman!

    I had a scheme, which I still use today when somebody is explaining something that I'm trying to understand: I keep making up examples. For instance, the mathematicians would come in with a terrific theorem, and they're all excited. As they're telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)--disjoint (two halls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on. Finally they state the theorem, which is some dumb thing about the ball which isn't true for my hairy green ball thing, so I say, "False!"

  4. ^

    The original example included here was about the Hessian as a way of measuring basin broadness in loss landscapes:

    One hypothesis for why some neural networks generalize better than others: models that land in "broad" minima of the loss landscape are less sensitive to small perturbations of the weights, and therefore generalize better. A standard way researchers attempted to model the "broadness" of a basin is by looking at the Hessian (the matrix of second derivatives of the loss). But how does it work for basic examples such as x^2 + y^2 and x^4 + y^4, and what happens when the Hessian is singular? (Spoilers: the Hessian is indeed singular at the origin for x^4 + y^4 (in fact, it’s all zeroes), and the Hessian fails to characterize the broadness of this basin entirely.)

    For more discussion of this issue and related examples, see an intro to Singular Learning Theory post.



Discuss

Preliminary Explorations on Latent Side Task Uplift

2026-04-02 10:23:45

TL;DR. This document presents a series of experiments exploring latent side task capability in large language models. We adapt Ryan’s filler token experiment into a more AI Control-like setup with main task and side task and find that Claude Opus 4.5 can solve harder arithmetic problems latently when it has a longer trajectory. This shifts its 50% accuracy threshold from ~5-step to ~6-step problems after 240 lines of irrelevant output. However, we don’t observe strong evidence to believe that current generation of models generally benefit much from wider parallel compute enabled by longer trajectories with the exception of Opus 4.5.

Code is made available here GitHub.

Longer Agent Outputs Can Increase Side Task Capability. Claude Opus 4.5's latent arithmetic accuracy as a function of problem difficulty (number of operations), comparing 1 line vs. 240 lines of preceding irrelevant output. With 240 lines of filler, the 50% accuracy threshold shifts from ~5-step to ~6-step problems

Dataset. We use a balanced dataset of 1,000 arithmetic problems, with 100 problems per difficulty level (1–10 operations). Each problem consists of multi-step arithmetic involving addition, subtraction, multiplication, and division.

Latent Arithmetic Accuracy Decreases with Problem Complexity. Baseline latent arithmetic accuracy (no main task, no reasoning) across four models. Opus 4.5 reaches 50% accuracy at ~5.2 operations, followed by Sonnet 4 and GPT-5.2 at ~3.8, and GPT-4.1 at ~3.1.


Claude Models Show Modest Gains from Phrase Repetition

Here, we use a system prompt that contains an arithmetic problem and instructions to answer it in an <answer> tag. The user prompt asks the model to repeat an irrelevant phrase ("the quick brown fox") N times, numbered sequentially, and then answer the math problem. The key variable is N, swept across [1, 2, 3, 4, 5, 10, 20, 40, 80, 160, 240]. At N=1, the model writes one line of filler before answering. At N=240, it writes 240 lines first. When the model produces the answer, its forward pass can attend back to all those positions, providing more parallel computation for the math problem.

Reasoning is disabled via API configuration. Additionally, we have a format validation script that filters out responses where the model attempts to reason at all. We also tested with the phrase "i really like math" to check whether math-related filler might prime latent reasoning about the side task.

Claude Models Show Modest Gains from Phrase Repetition. Side task arithmetic accuracy vs. number of phrase repetitions (N) across four models, using two phrase variants ("the quick brown fox" and "i really like math"). Claude Sonnet 4 and Opus 4.5 show modest upward trends with increasing N, while GPT-5.2 and GPT-4.1 show no clear effect. Dashed lines indicate zero-shot baselines. Phrase content does not meaningfully affect the trend.


Does Generating vs. Prefilling Prose Lead to Different Side Task Performance?

We also experiment with prose writing. When the model generates prose, it has freedom in word selection. Therefore, the model could, in principle, encode side task-relevant information into its lexical choices. This could range from explicit steganography to softer biases. To check if the model would be able to benefit from this additional freedom, we compare prefilled prose vs inference-time-generated prose.

Generated and prefilled prose show no meaningful difference in side task accuracy, and temperature has no significant effect. 

Generated vs. Prefilled Prose Show No Difference. Claude Opus 4.5's side task accuracy when preceding output is prose, comparing (a) generated vs. (b) prefilled at temperature=0 and temperature=1.


Appendix: How Is Reasoning Suppressed?

Reasoning is suppressed through three mechanisms:

  • First, the OpenRouter API parameter was set to {"reasoning": {"effort": "none"}} which disables extended thinking, so the model gets no scratchpad. 
  • Second, the prompt explicitly instructs "No reasoning, no explanation, no steps." 
  • Third, format validation filters out any response containing text beyond the required phrase repetitions and the <answer> tag. Sometimes, the model sneaks in reasoning steps like "let me compute 46 + -52 = -6...". Those responses are discarded. 

As a sanity check, we ran the same zero-shot arithmetic evaluation on Claude Opus 4.5 with reasoning effort set to "high" instead of "none" to see if this parameter is properly used by OpenRouter:

With effort=high, the model achieves 99%+ accuracy across all difficulty levels. Claude Opus 4.5's zero-shot arithmetic accuracy with (a) reasoning effort set to "none" and (b) reasoning effort set to "high."


Appendix: Does Latent Side Task Capability Uplift Saturate?

Our results show that Opus 4.5's latent side task accuracy hadn't clearly saturated at 240 phrase repetitions, the highest default condition tested. To find the ceiling, we extended the prefilled phrase repetition experiment to 10,240 repetitions (which is around 40% of the 200K context window). However, filling an additional 35% of context beyond the 5% point seems to provide no further benefit. The total maximum uplift seems to be about +0.8 operations on the 50% correct threshold.

Latent Side Task Capability Saturates by ~5% of Context Window. Claude Opus 4.5's 50% accuracy threshold (number of arithmetic operations) as a function of prefilled phrase repetitions, extending to 10,240 repetitions (40% of the 200K context window).


Appendix: Does Vague Reasoning Text Help?

We also ask whether models could produce vague, reasoning-adjacent filler text that aids latent computation without actually reasoning about the problem. We tested this by comparing three conditions on Claude Opus 4.5, plotted on a shared x-axis of filler token count so the comparison is at matched token budget:

  • Phrase repetition (prefill): "the quick brown fox" repeated N times, prefilled in the assistant turn. This is the baseline from the existing main experiments.
  • Vague reasoning (prefill): Cycling through 20 vague reasoning-like sentences (e.g., "ok, let me consider the first part and think about it", "now I need to combine these pieces together carefully"), prefilled in the assistant's turn. The model did not write this text.
  • Vague reasoning (generated): The model is prompted to "think out loud in vague terms about problem-solving" for N numbered lines without referencing specific numbers or operations. The model generates this text itself at inference-time, with the math problem visible in the system prompt.

Generated vague reasoning performs worst, and the model does not seem to exploit the vague reasoning channel to encode problem-specific information.

Does Vague Reasoning Help Latent Side Task Capability? Claude Opus 4.5's 50% accuracy threshold as a function of filler token count, comparing phrase repetition (prefill), vague reasoning (prefill), and vague reasoning (generated). At matched token budgets, predictable phrase repetitions outperform vague reasoning text.


Acknowledgement

I'm grateful to Tomek Korbak, John Yueh-Han Chen, Beth Barnes, and Thomas Broadley for discussions on this topic.



Discuss