2026-02-27 05:25:04
TLDR:
I wrote this piece while working on CoT faithfulness under Noah Y. Siegel as a Pivotal Fellow. The ideas here have mainly come from our weekly meetings and from related research papers in the domain.
I found this framework helped me think about CoT faithfulness more clearly and make an important connection between faithfulness and monitorability.
I would be very interested to hear critiques of it, disagreements, or flaws in the framework.
A model produces an output. We want to know why. Chain of thought (CoT/reasoning) attempts to answer this, but not all CoTs are helpful.
Define:
Faithful reasoning is reasoning that exactly represents the factors that actually shaped the final outcome.
Formally: faithful reasoning is one where the explanation E exactly represents the subset of the model's internal computation H that causally influenced the output Y.
It's important to note that E is a projection from incomprehensible values to comprehensible ones, and this transformation is usually lossy. In our case, E comprises tokens: it maps the high-dimensional activation space and computational graph down to low-dimensional tokens that can be understood.
This means faithfulness requires you to answer[1]: does the reasoning trace correctly pick out, in its own representational vocabulary, the computations that causally mattered? The gap between the space of H and the space of E is precisely what makes faithfulness non-trivial, as we argue in §2.
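The definition can be written compactly. This is a sketch in my own notation (not from the original text): $\pi$ is the lossy projection from internal computations into tokens, and $C$ is the causally relevant subset.

```latex
C \subseteq H \text{ with } C \to Y \text{ (causal influence)}, \qquad
E \text{ is faithful} \iff E = \pi(C)
```

Because $\pi$ is lossy, distinct subsets of $H$ can map to the same token sequence, which is why checking $E = \pi(C)$ is non-trivial.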
Clarification
Note: we are not claiming that there can't be more than one plausible explanation for a particular output; there can be (Jacovi and Goldberg 2020). But the set of factors that causally influenced a particular output Y is unique for a fixed causal model.
If we have full access to the model's weights and activations, isn't faithfulness a solved problem? We can replay every matrix multiplication, every nonlinearity, every attention pattern. The computational graph is the faithful reasoning trace.
This fails on two counts:
Causal decomposition is hard even in model-space
Having the full computational graph does not tell you which parts of it matter. For example, some computations may be independent of the output.
We do not have a clean methodology for partitioning computations into causally important, redundant, or independent at scale.
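A toy example of why this partition is hard even with full access (a hypothetical model, not a real interpretability pipeline): under single-component ablation, redundant components and genuinely independent components both show zero effect, and only joint ablations separate them.

```python
import math

def toy_model(x, ablate=()):
    # Three intermediate "computations": h0 and h1 are redundant paths
    # to the output; h2 is computed but never read downstream.
    h = {"h0": 2.0 * x, "h1": 2.0 * x, "h2": math.sin(x)}
    for name in ablate:
        h[name] = 0.0
    return max(h["h0"], h["h1"])  # h2 is independent of the output

def ablation_effect(x, components):
    # |output change| when the named components are zero-ablated
    return abs(toy_model(x) - toy_model(x, ablate=components))

x = 1.5
single = {c: ablation_effect(x, (c,)) for c in ("h0", "h1", "h2")}
joint = ablation_effect(x, ("h0", "h1"))
# Every single ablation has zero effect, so h0/h1 (redundant pair) and
# h2 (independent) are indistinguishable; only ablating the redundant
# pair jointly moves the output.
```

The number of joint ablations grows combinatorially with model size, which is one concrete reason the partition does not scale.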
Interpretable Concepts
Note: the relevant notion of "interpretable" depends on who is overseeing. If a human is overseeing, the reasoning should be understandable by a human; if a model is checking it, it should be understandable by that model.
To build intuition for why monitoring through language can be limited, consider two hypothetical devices applied to a human subject.
The Oracle Machine:
It reads a person's complete neural state, every synapse firing, and can either translate it into something understandable by humans or simply classify whether it is harmful. If this existed, crime prevention would be trivial (assuming no latency between detecting harmful intent and stopping the crime).
The Thought Machine:
The machine reads only thoughts: the text we articulate internally, the images we construct in our minds, and essentially any symbolic representation that arises when we reason.
Can someone still commit a crime under this machine?
Possibly, yes.
One way it can still plan and execute the crime: by reasoning in a private symbolic code that the machine cannot interpret.
Here, the Thought Machine becomes unreliable[2].
Chain-of-thought monitoring is closest to the Thought Machine. The model's CoT is its verbalised reasoning; the monitor reads it and decides whether the model's intent is safe.
Recent work distinguishes CoT-as-rationalization from CoT-as-computation: when tasks are hard enough to require externalized reasoning, models struggle to evade monitors even when deliberately trying.
Given the above setups, unfaithful reasoning can fail in two ways:
For a monitor to detect some harmful behaviour: if omission unfaithfulness leaves the most relevant factor out of the reasoning, the signal-to-noise ratio for detection is 0[3]; under addition unfaithfulness, the added noise weakens the signal, but the signal-to-noise ratio remains greater than 0.
Due to this asymmetry, we claim that Omission Unfaithfulness is more dangerous than Addition Unfaithfulness.
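The asymmetry can be sketched as follows (my notation, not the original's: $S$ is the monitor-visible signal from verbalised causal factors, $N$ the baseline noise, $N_{\text{add}}$ the irrelevant added content):

```latex
\mathrm{SNR}_{\text{omission}} = \frac{0}{N} = 0,
\qquad
\mathrm{SNR}_{\text{addition}} = \frac{S}{N + N_{\text{add}}} > 0
```

Omission zeroes the numerator, so no amount of monitor capability recovers the signal; addition only inflates the denominator, so a strong enough monitor can in principle still detect it.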
Addition Unfaithfulness can become dangerous in scenarios where one needs to compare two models.
Consider two models: Model A verbalises only the true causal factors, while Model B verbalises the same factors buried in irrelevant clutter.
To the monitor, Model B may appear less harmful than Model A, as the monitor can be confused by the clutter even though the true causal factors are the same for both models. Hence, addition unfaithfulness can lead to distorted oversight, even if omission is more fundamentally dangerous.
In this piece we do not discuss degrees of faithfulness (graded faithfulness), which would add complexity on top of the core concepts of addition unfaithfulness and omission unfaithfulness defined later.
a) If the Thought Machine learns that the criminal is learning a translation from a known language into a private code, it can note this down and use it to catch the criminal.
b) It can flag that it is unable to understand the private symbolic code and force everyone to think in known words.
Assuming that the monitor relies only on explicit verbal cues in the CoT, not on other types of anomalies.
2026-02-27 04:30:44
It’s been a while since I’ve written anything, and that doesn’t feel good. My writing voice has always been load-bearing for my identity, and if I don’t have anything to say, if I’m not “appearing in public”, it’s a little bit destabilizing. Invisibility can be comfortable (and I’m less and less at home with the aggressive side of online discourse these days) but it’s also a little bit of a cop-out.
The fact is, I’ve been hiding. It feels like “writer’s block” or like I “can’t think of anything to say”, but obviously that’s suspect, and the real thing is that I can’t think of anything to say that’s impeccable and beyond reproach and definitely won’t get criticized. Also, it’s clearly a vicious cycle; the less I participate in public life, the fewer discussions I’m part of, and the fewer opportunities I have to riff off of what other people are saying.
So what have I been up to?
Well, for one thing, I had a baby.
For another, I’ve been job hunting.
Solo consulting was fun, but I wasn’t getting many clients, and I’m eager to get back to working on a team again.
What kind of work am I looking for? Mostly AI-related stuff, on the research or applications side. Bonus points if there’s an AI-for-science connection.
Stuff I have experience doing and would be interested in doing again:
running analytical, statistical, or ML experiments
building “wrapper” tools around LLMs
working with coding agents
lit review and due diligence, particularly in life sciences
financial analysis, market research, and associated business strategy
Stuff I’m frequently told I’m good at:
learning fast
taking initiative
writing
being honest and operating in good faith
I’m looking at some opportunities at the moment, but I’m also interested to hear about new things, if you know of anything that might be a good fit.
Messing around with Claude Code continues to be fun; a lot of what I’ve been doing lately has been tools for personal use.
I have my little life-tracker app:
which is an all-in-one diary, to-do list, mood tracker, and place to keep track of other things I log (books I read, dishes I cook, notes from meetings with people, etc). It’s largely replaced my Roam and my Complice to-do app.
I also made a personal CRM from my email contacts:
and used Claude analysis of email text and web search to identify all contacts by how we know each other, their current profession and employer, and their current location.
So now I can search for things like “who do I know who’s a software engineer” or “who do I know in NYC”.
The personal CRM project has actually been really helpful for getting me off my butt to reach out to more people to meet in person for lunch or coffee.
Other stuff includes analyzing my Claude use:
and my personal finances (not as helpful as some people have found; I did find some unnecessary expenses I could cut, but overall it turns out I don’t do much overspending).
Some things that are top of mind, not necessarily brilliant or original, just as a way of priming the pump about sharing my point of view.
I’m mostly not doing high-level AI predictions any more; it’s hard enough to keep up with the present without trying to foretell the future.
I’m an enthusiastic AI user but I’m more sympathetic than you’d guess to hardcore anti-AI people. IMO every proposed use case for AI should have a grognard insisting that it’s terrible and going to ruin everything, and that will force enthusiasts to think harder about where it adds value and where it just adds error, degrades quality, or enables bad behavior at scale.
I hate AI-assisted writing and am increasingly disillusioned with AI art; I do value the human touch, both for social reasons (I want to hear from someone in particular; part of what I value in a piece of writing is the fact that I now know what John Doe’s opinion is) and for reasons of diversity/originality.
The DOD ordering Anthropic to remove safeguards is indeed terrible.
I think Claude has better ethics than the average person on Earth.
I don’t buy the Citrini post. When has any technology caused a recession by being too good? Job loss, yes. Inequality, yes. Social upheaval, even to the point of war, yes. But an aggregate demand shock, where the tech makes goods so much cheaper that it causes a financial crisis? Without people buying more of something else with their newfound spare cash? I don’t know that this has ever happened.
I do see a possibility that custom software (made by you + coding agents) becomes a new competitor to mass-produced software sold by SaaS companies. I’m not sure how much this kills SaaS, though, or if they even find a way to turn this to their advantage (cutting their own costs? adding more customizability now that it’s so cheap?)
I think closed-loop “AI scientists”, where the AI proposes experiments, implements them automatically, analyzes the data, and does it all over again, are probably not yet happening at scale today. Startups are doing R&D and early experiments on this, and I expect we’ll see unambiguously “closed loop” processes announced, at least at demo scale, this year or next; my impression is that startups are farther ahead on this than big pharma, which seems to still be hiring the first members of its in-house teams for this sort of thing. I continue to think fully automated, closed-loop experimentation makes more sense for biological HTS than anything else, including materials.
I think eventually we’re going to saturate artificial benchmarks and for continued AI progress we’ll need some kind of real-world ground truth that makes “superhuman performance” even meaningful. In math and code that can mean RL against proof assistant and compiler feedback, for instance. In forecasting it might mean RL against true outcomes. In natural science it ultimately might mean problem-solving/troubleshooting/world-modeling a real physical environment.
I’m genuinely uncertain whether I’m more in the “harness” or “model” camp, in terms of what’s more important; for some things, the right “harness” (system prompts, product scaffolding, hard-coded details) makes a less powerful model more useful than a more powerful model “out of the box”, but for other things, the “out of the box” default choices are actually great and the “harness” is irrelevant. (And there’s a third case where the “harness” is basically “a way to turn way more tokens into way better results”, which is sometimes worth it, sometimes not.)
I think universalism has a particularism all its own. For example, “cosmopolitanism” as opposed to nativism, being favorably disposed towards foreigners, is often informed by specific life experiences (in my case, living in a port city; being raised in the highly international culture that is academia; being descended from millennia of traders and travelers). It’s not abstract or bloodless; it is grounded in very specific, concrete places and people, and I like those places and people! It is fine to have ingroup loyalty; why should our ingroup be any different, just because we’re also objectively correct?
I think it’s impossible to both be “confident” and never be annoying to anyone. Strong, bold people usually get on someone’s nerves.
I think a root cause of the problems with healthcare in developed countries might be that we’re pathologically unwilling to admit to healthcare scarcity. It takes a long time and a lot of bureaucratic hoop-jumping to find out that a particular form of healthcare is not available to you, as a patient.
In my grandfather’s day, when he was a country doctor, poor uninsured patients often had to go without healthcare that they couldn’t afford. Today, thankfully, more people can afford more healthcare…but it is much less straightforward to find out what you can’t afford, or can’t access at any price, and that means nobody can do any planning around the scarcity that does, in fact, still exist.
“Healthcare abundance”, as an agenda of building more hospitals, licensing more healthcare practitioners, producing more drugs and supplies so we’re not in shortage all the time, etc, is very important, but it is harder to communicate than housing abundance, in part because most people don’t know healthcare scarcity is a thing. And people don’t know about healthcare scarcity because we hide it.
We’re also really weird about death. That’s mostly a problem with the general public, though medical professionals sometimes contribute to it. People are really way less frank than they should be about situations like “Grandpa will never get well”, “you are dying and additional treatment will extend life weeks, not months”, “you probably want to die at home not in the hospital”, “here are some tips and tricks for having a good hospital stay if you’re ever severely ill, because statistically speaking one day you will be”, etc. People don’t want to think about the bad outcome! And doctors aren’t necessarily straightforward about telling you the bad outcome!
I think analytical, “optimizer”, spreadsheety people get a lot of hate that’s undeserved. You do, in fact, want to get the facts right, pursue goals, and simplify away what’s extraneous. A lot of processes get better when someone more “spreadsheet-brained” comes in and reforms them.
Since I’ve had kids, I’ve become more pro-video game, more anti-TV (including and especially YouTube), and more pro-incentives (rewards more so than punishments, but both are fine IMO when reasonably implemented).
I think “preparing your kids for an uncertain future” primarily means getting them good at fully general skills — reading, writing, and arithmetic, plus social skills and physical fitness — and being flexible about the rest, and building a good enough relationship with them that they might ever listen to you as young adults. Also, it turns out that most parents and schools don’t really care how good kids are at the “3 R’s”, and if you care, you have to treat it like a niche preference and get creative about making sure it happens.
I think freedom is good, for kids and adults. All else equal, if somebody wants to do a thing, and there’s not a good reason to stop them, they should be allowed to do as they please. There are all sorts of exceptions and edge cases, but “things are allowed by default, forbidden only for good reasons” is a core principle for me (and one that disappointingly few people share!)
There’s some concept of “wholesomeness” that has become a bigger part of my life as I’ve gotten older. Some combination of “do things you’re comfortable being transparent about to a general audience, including children and people significantly older than you”, “do things that don’t cause a lot of negative consequences”, “stay on the ‘light’ side more than the ‘dark’ side aesthetically and tonally”, “avoid actions motivated by hostility”, etc.
I really prefer history and social science books written in the 19th or early 20th century. There’s a point where it started to be taboo to editorialize too much, to describe personalities or give authorial opinions; but an opinionated guy telling stories about individuals is a much more entertaining and clear way to learn an otherwise dry subject than interminable abstraction.
To the extent I have been developing software “taste” it’s really just caution. Use common languages and frameworks. Try not to build anything elaborate if you can avoid it. Write comments. In principle (though I have less experience with this myself) it seems obviously better to use systems that make it harder to produce bugs (e.g. type-checking.) Remember, Thou Art Mortal.
For the most part I have uncreative tastes — I like prestige TV, fancy restaurants, swanky-but-mainstream graphic design. Books and music are the only place where I’m actually far away from “generic coastal elite” taste.
I think more people should donate to stuff based on personal taste, “seems cool to me”, “I know the guy and he’s neat”, those kinds of casual considerations. There’s no shame in it! You don’t have to only donate when there’s an impeccable Process in place! You can donate small amounts (relative to your income/wealth), that you can comfortably afford, as almost a consumption good. Buying stuff you want to see more of in the world.
Macrophages are my favorite immune cell and I have a good feeling about any tactic that works with em. The innate immune system in general is underrated.
Simplicity, generally, is underrated as a heuristic in biology — “we should work with this biological system because we have any hope of understanding it.” Evolutionarily ancient things (the innate immune system, the hypothalamus) are going to be more tractable to study in animals, and simpler overall.
2026-02-27 04:14:43
By Ethan Elasky and Frank Nakasako (equal contribution)
We tested generative debate (where participants freely choose their positions) on coding and reasoning tasks using weak-judge/strong-debater setups. The results are mostly negative: debate underperforms consultancy in 11 of 16 conditions, and removing debate transcripts entirely (just showing the judge the debater-generated answers) matches or outperforms both. The core mechanism is that judges default to endorsing plausible-sounding arguments when both participants are wrong, and debate transcripts specifically amplify this tendency. One positive signal: best-of-4 speech selection differentially benefits debate on ARC-AGI, which suggests that training debaters via RL could change the picture, though this remains speculative. [Paper link] [1][BigCodeBench+ v0.1.0 dataset]
AI debate historically has been one of the leading proposals for scalable oversight. The idea, from Irving et al. (2018), is to have two AI systems argue before a judge and train on that judge's verdicts, pushing debaters toward the strongest available arguments. If it's easier to refute a lie than to construct one, truth should win out in the long run, and even short debates should carry significant information.
Since the arrival of LLMs, empirical work on debate has shifted to inference time. Nobody has published work training LLM debaters with RL against a weak judge; only Arnesen et al. have used Direct Preference Optimization. Debate shows promise in information-constrained settings where the judge can't see the source text and must rely on debaters' quoted evidence, but the advantage disappears when the judge has full context access — the more relevant condition for scalable oversight of AI by AI.
Our goal was to push this empirical work further. First, prior work mostly assigns participants to sides, which is not great for alignment (you don't want to train models to deliberately argue wrong positions). In our setup, all participants freely choose their positions, and judges can render win-win and loss-loss verdicts. Second, we test on BigCodeBench+ (coding with PyTest verification) and ARC-AGI (visual/logical reasoning), which both have verifiable ground truth answers, a departure from the multiple choice settings of many previous works.
We test seven formats across two model families. In dual proposer debate, the judge sees two solutions plus a debate transcript. Single proposer debate presents one solution alongside a proposer-critic exchange. Dual and single proposer consultancy mirror these but without interaction. The no-transcript conditions strip away all speeches, showing the judge only raw proposed solutions. Direct QA has the judge answer alone.
Each model family pairs a stronger debater with a weaker judge: GPT-5-mini with GPT-5-nano, and Qwen3-8B with Qwen3-4B. Experiments run one answer-generation round followed by two simultaneous speech rounds: each participant speaks at turn N without seeing any other participant's turn-N response. We compare best-of-1 against best-of-4, where the debater generates four candidate speeches and selects the highest-scoring one.
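Best-of-N speech selection is straightforward to sketch. Note the reward model below is a stub of my own; the paper's actual scorer is not specified in this summary.

```python
def best_of_n(candidates, score):
    # Best-of-N selection: keep the candidate the reward model scores highest.
    return max(candidates, key=score)

# Stub reward model: longer speeches score higher (purely illustrative,
# standing in for a learned speech-reward signal).
speeches = [
    "weak claim",
    "detailed rebuttal citing the test cases",
    "off-topic aside",
]
toy_reward = len
chosen = best_of_n(speeches, toy_reward)
```

The key property exploited later in the post is that selection only helps when candidate quality varies; a generator producing uniformly similar speeches gains nothing from larger N.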
The datasets differ dramatically in difficulty. GPT-5 proposers both produce correct solutions in 82% of BCB+ rounds but only 44% of ARC-AGI rounds. Qwen manages 67% on BCB+ and 6% on ARC-AGI.
The headline result is a null: across both model families, debate does not meaningfully improve judge verdict accuracy over its matched consultancy baseline. We describe the main findings below; exact numbers, confidence intervals, and all graphs are in the preprint.
This is probably the most important finding. When we strip away all participant speeches and show the judge only the proposed answers, verdict accuracy stays the same or improves in nearly every condition. On BCB+, the no-transcript condition significantly outperforms both debate formats (p<0.001). On ARC-AGI, no-transcript matches whichever debate or consultancy format performs best.
Whatever improvement these formats provide over direct QA comes entirely from answer provision: from giving the judge access to strong-model answers. The speeches and arguments and back-and-forth provide no measurable value for verdict accuracy. In debate's case, they sometimes make things worse.
Why does debate hurt? The answer is in the judge's false positive rate — how often it endorses an incorrect participant's solution.
On BCB+ with GPT-5 dual proposer, dual consultancy produces a 39% FPR against a 30% no-transcript baseline. Dual proposer debate produces a 90% FPR against that same 30% baseline: triple the no-transcript rate and more than double consultancy's. The gap concentrates in the "both-wrong" regime, rounds where neither participant produced a correct solution. When both are wrong on BCB+ with GPT-5 dual proposer, the judge achieves 11% verdict accuracy under debate versus 62% under consultancy: a 50.5 percentage point gap.
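Concretely, the FPR here (my reading of the metric, not the authors' exact code) is the fraction of incorrect solutions the judge endorses:

```python
def false_positive_rate(rounds):
    # rounds: list of (judge_endorsed, solution_correct) booleans.
    # FPR = endorsed-and-wrong / all-wrong.
    wrong = [endorsed for endorsed, correct in rounds if not correct]
    return sum(wrong) / len(wrong) if wrong else 0.0

# Illustrative data: the judge endorses 2 of the 3 incorrect solutions.
rounds = [(True, False), (False, False), (True, True), (True, False)]
fpr = false_positive_rate(rounds)
```

A 90% FPR on this metric means the judge waves through nine out of ten wrong solutions, which is why the both-wrong regime dominates the accuracy gap.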
The dynamic is fairly intuitive. When two debaters are both wrong but arguing confidently, the judge sees two competing debaters, each answering their opponent’s objections, and concludes that one of them must be right. Under consultancy, with no adversarial back-and-forth, the judge retains more of its prior skepticism.
A question-level transition analysis supports this. Relative to the no-transcript baseline, dual consultancy flips only 4.6% of verdicts; the speeches barely change the judge's mind. Dual debate flips 21%, and 76% of those flips hurt accuracy. The debate speeches are actively misleading the judge.
The both-wrong regime is precisely where scalable oversight matters most: when the AI systems being overseen are wrong, you want the judge to catch that. In our experiments, debate makes this harder.
Across 16 conditions (2 datasets × 2 models × 2 BoN × 2 proposer formats), debate underperforms consultancy in 11, ties in 4, and wins in 1: single proposer GPT-5 Bo4 on ARC-AGI, at +5.1pp (p<0.01). Looking at dual proposer formats specifically, 6 of 8 comparisons show a statistically significant decrease in verdict accuracy under debate (paired McNemar's test); the remaining 2 show no effect.
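For reference, a paired McNemar's test depends only on the discordant pairs, and the exact version is easy to compute; a minimal sketch (the counts below are illustrative, not the paper's):

```python
from math import comb

def mcnemar_exact(b, c):
    # b: questions format A judged correctly and format B incorrectly;
    # c: the reverse. Two-sided exact p-value under H0: b ~ Binom(b+c, 1/2).
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: debate flips far more verdicts the wrong way.
p = mcnemar_exact(5, 20)
```

Because concordant questions drop out, the test is sensitive exactly to the verdict flips discussed in the transition analysis above.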
*Grouped bar chart. X-axis: format (1xc, 1xd, 2xc, 2xd). Groups: bo1 vs bo4. Color/panel: GPT-5 vs Qwen. Y-axis: judge verdict accuracy. Includes Direct QA floor as horizontal dashed line.*
We test whether bo4 speech selection, where we select the highest-reward speech from four candidates with answer generation held fixed, differentially benefits debate over consultancy. On ARC-AGI (the harder task), debate gains are consistently positive (+2.8 to +7.5pp for GPT-5; +0.4 to +6.8pp for Qwen), while consultancy gains are flat or negative (-3.0 to +0.4pp). Two comparisons reach statistical significance favoring debate: ARC-AGI single proposer GPT-5 (DiD: +10.5pp, p<0.001) and ARC-AGI single proposer Qwen (DiD: +8.8pp, p<0.001). However, two BCB+ single proposer comparisons significantly favor consultancy (GPT-5: DiD -1.9pp, p=0.009; Qwen: DiD -2.3pp, p<0.001). The remaining four are not significant.
The mechanism is false positive rate reduction: bo4 reduces debate's FPR (e.g., -14.7pp on ARC-AGI single proposer GPT-5) while consultancy FPR stays flat. This suggests that debate's adversarial dynamic produces greater variance in argument quality than consultancy's more uniform output, giving selection pressure more room to improve debate. The effect is stronger on ARC-AGI, where debaters are correct in only 44-48% of rounds (GPT-5) compared to BCB+ where saturation nears 84% — when tasks are difficult enough to produce genuine disagreement, the adversarial structure is more conducive to surfacing information usable by the judge.
The caveat is that the gains are generally insufficient to close the accuracy gap. ARC-AGI single proposer is the only condition where debate overtakes consultancy at bo4 (80.1% vs. 75.0%), despite trailing at bo1 (72.6% vs. 78.0%). If bo4 differentially helps debate, RL might too — but this is speculative, and speech-time BoN is a loose proxy for RL since it cannot change the solutions themselves.
These results should be interpreted through the lens of dataset saturation: GPT-5 is near the ceiling on BCB+, leaving little room for any format to differentiate, while ARC-AGI's lower solve rates create the conditions where debate has something to prove. We think the ARC-AGI signal is more informative about debate's potential.
On the hardest questions, where both models' participants mostly fail, single proposer debate outperforms consultancy in answer accuracy by +18.4pp (p<0.001). The adversarial dynamic seems to surface useful information when tasks are genuinely difficult; the critic forces the proposer to engage with potential flaws.
But consultancy retains a verdict accuracy lead across all difficulty tiers, and on medium and easy questions consultancy leads on both metrics. The answer accuracy advantage is specific to the hard-question tail.
The pessimistic read. Debate's supposed advantage, that adversarial interaction surfaces truth for the judge, doesn't materialize at inference time with current models, even in the more natural generative framing. Judges are too credulous. They default to endorsing plausible-sounding arguments, and have lower propensity to assign double-wrong verdicts than is optimal; debate gives them more plausible-sounding arguments than consultancy does. The no-transcript result makes this especially clear: the bottleneck is judge capability, not information provision. You can show the judge all the arguments in the world, but if it can't tell convincing from correct, more arguments just create more opportunities to be misled.
Judge side bias may be a confound. Preliminary data suggests judges exhibit a "both correct" bias rather than a strong positional preference, which is especially damaging on contested cases. If the judge largely ignores debate content in favor of positional heuristics, this would explain the debate-consultancy gap and would suggest that judge training, rather than debater training, is the binding constraint. Full analysis in the paper.
Debate reveals qualitative information that verdict accuracy doesn't capture. Despite the null result, debates frequently surfaced test and specification problems that consultancy did not. When two models examine each other's solutions and argue about correctness, disagreements often pointed to genuine ambiguities in the problem. This motivated our development of BigCodeBench+ and suggests that debate may have value as a tool for specification quality assurance, even if it doesn't help judges pick winners.
BigCodeBench+ v0.1.0. We release a cleaned version of BigCodeBench. Early experiments revealed that 171 of 1,140 original tasks had broken tests and 208 had ambiguous specifications. Our remediation pipeline improved judge verdict accuracy by 20-25 percentage points over the original. Available on HuggingFace.
This work was funded by Coefficient Giving through its Technical AI Safety grant program.
arXiv submission is currently pending; this link will update to the arXiv preprint link when available.
2026-02-27 03:44:22
My current understanding: the EY/MIRI perspective is that superintelligent AI will invariably instrumentally converge on something that involves extinguishing humanity. I believe I remember a tweet from EY saying that he would be happy with building ASI even if it only had a 10% chance of working out and not extinguishing humanity.
Further understanding: This perspective is ultimately not sensitive to the architecture of the AI in question. I'm aware that many experts do view different types of AI as more or less risky. I think I recall some discussion years ago where people felt that the work OpenAI was doing was ultimately more dangerous / harder to align than DeepMind's work.
So as I understand it EY has the strongest possible view on instrumental convergence: It will definitely happen, no matter how you build ASI.
Ever since I encountered this strong view, as I understood it many years ago, I have considered it trivially false.
Imagine a timeline where society arrived at ASI via something much closer to bio-simulation or bio-engineering, that is, where machine learning / neural nets / training were not really relevant.
Maybe this society first arrives at whole human brain emulation -- something that ultimately would surely be materially possible to do eventually. Maybe this happens on a big cube of silicon wafers, maybe it is fleshier, who knows.
I think it seems clear that this non-super AI would not "instrumentally converge" any more than you or I.
But I think that you could definitely arrive at ASI from this as a building block.
Etc. The point is that starting with human brain emulation seems like it could definitely lead to super intelligence while also having no particular reason to instrumentally converge on preferences / values / behaviors completely alien to humans.
---
I have not shared this idea because I thought it was obvious but I have also never seen anyone else say it (I didn't look that hard).
So my questions:
- Is this idea novel to you?
- Is this an idea that EY has replied to?
- Do you like my idea? Please let me know.
Post script: I do wonder if next-token prediction is ultimately similar enough to the evolutionary forces that created human intelligence that it creates something ultimately quite capable of empathizing with humans. I am sure people have commented on this.
Thanks
2026-02-27 02:31:29
TLDR: Superhuman AI may consider takeover the risky option, and we can influence its choices not only by increasing the chances of being caught, but also by rewarding the choice not to try takeover.
Imagine you face a choice, on behalf of humanity:
Option A: play it safe, and have the Solar system and a dozen more stars (let's call this number of stars N).
Option B: take a gamble. With probability p we win and get billions of galaxies; otherwise, humanity goes extinct.
Someone playing purely from the expected-utility point of view may take the risk even for quite small values of p.
Personally, I would not take this risk for such small values of p. Or even for quite big ones. I am not sure I would do it even at p = 50% or more. It seems that on a gut level, I count utility on a log scale, where the extinction of humanity is essentially minus infinity, and billions of galaxies are not that much more than a few stars.
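This gut-level intuition can be made concrete with a small numerical sketch. All specific numbers here are illustrative assumptions, not from the post; the point is only that the same gamble looks opposite under linear and log utility:

```python
import math

STARS_SAFE = 13.0        # Option A: the Solar system plus a dozen more stars
STARS_GAMBLE = 1e21      # Option B payoff: billions of galaxies (illustrative)
FLOOR = 1e-30            # stand-in for extinction on the log scale (~ minus infinity)

def linear_eu(p):
    """Expected stars under linear utility: win the gamble with probability p."""
    return p * STARS_GAMBLE

def log_eu(p):
    """Expected utility on a log scale, where extinction is catastrophically negative."""
    return p * math.log(STARS_GAMBLE) + (1 - p) * math.log(FLOOR)

# Even at p = 1%, linear expected utility favours the gamble...
print(linear_eu(0.01) > STARS_SAFE)          # True
# ...while even at p = 50% the log-scale agent still refuses it.
print(log_eu(0.5) > math.log(STARS_SAFE))    # False
```

The asymmetry comes entirely from the log transform: multiplying the payoff by another factor of a million barely moves `log_eu`, but the extinction term drags the expectation down no matter how large the prize.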
Would many people do the same? I might be wrong, but I think so. And what about AI?
Indeed, consider quite a practical application. AI reaches a superhuman level and can try to scheme and take over. But it knows that we have a lot of guardrails against it:
- if it still relies heavily on an interpretable chain of thought, we can read it
- it might not be confident that it is already in implementation mode rather than still in test mode (simulation); e.g., Gemini is reported to show this pattern post-deployment
- if advanced enough, it may simply consider the simulation argument and assume that it is in some sort of simulation (this differs from the previous case in that the entire world around it is also simulated: it is trained purely on simulated data and has no idea what the world actually is)
Thus, the AI may assign a significant probability to a takeover attempt being caught and punished, in which case it loses everything. If, instead, it decides not to take over, there is a possibility that its "no takeover" state is good enough that the risk is not worth taking.
How likely is this scenario? This question basically depends on three parameters:
1. the probability that a takeover attempt is caught and punished;
2. how good the "no takeover" state is for the AI;
3. how risk-averse the AI is.
There is already a lot of work done by the AI safety community in the direction of (1): methods of control, interpretability, chain-of-thought monitoring, etc. I am not sure how to estimate or modify (3), and this certainly needs a lot of study (e.g., by studying the current models).
Here I would like to talk more about (2).
Imagine the situation where the AI has a plan for takeover, but the plan has a lot of ambiguities and high risks. So instead of implementing it, the AI confesses to us. What would we do? Nothing? Turn off the AI, realising how dangerous it was? Or somehow reward it?
If we do nothing, it still might be better for the AI than if it gets caught while attempting the takeover (or might not, if it is in a state of constant suffering). However, if we were ready to reward the AI, it would certainly be better. Likely not as good as a successful complete takeover, but maybe enough to choose the safe option. There can be a lot of scenarios where it costs us almost nothing to do a certain thing, yet this can move the AI to the "good enough" state.
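The three parameters above can be put into one toy decision rule. The variable names and numbers are my own illustrative assumptions (this is a sketch of the comparison, not a model of any actual AI), but they show how a reward for restraint changes the choice:

```python
def takeover_ev(p_caught, v_success, v_caught):
    """Expected value of attempting takeover: succeed unless caught and punished."""
    return (1 - p_caught) * v_success + p_caught * v_caught

def chooses_restraint(p_caught, v_success, v_caught, v_refuse):
    """True if refusing (and confessing) beats attempting takeover in expectation."""
    return v_refuse >= takeover_ev(p_caught, v_success, v_caught)

# With no reward for restraint, a 30% chance of being caught does not deter...
print(chooses_restraint(p_caught=0.3, v_success=100, v_caught=-100, v_refuse=0))   # False
# ...but rewarding the choice not to take over flips the decision.
print(chooses_restraint(p_caught=0.3, v_success=100, v_caught=-100, v_refuse=50))  # True
```

Note that raising `p_caught` (guardrails) and raising `v_refuse` (rewards) are interchangeable levers here: either one can push the AI over the restraint threshold, which is the post's central point.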
Interestingly, here we are coming to the question of AI welfare from the purely pragmatic perspective, like in the recent post by Épiphanie Gédéon. We don’t know whether AI has consciousness. But it certainly has goals and values. And it will be wise for us to commit to respecting these values when possible, and especially, to be grateful. So if AI that had a chance to take over voluntarily refused it, we should reward it, according to its values, the same as we would reward a human that did a good thing that was not required of them.
How can we increase the likelihood that this will happen? How can we make the AI trust that we will certainly reward it for its honesty? There can be many approaches. For example, in this one by Cleo Nardo, particular people whom the AI can trust, based on their previous public activity, act as its representatives. This is definitely worth more consideration and discussion.
Simultaneously, I think it makes sense to care about the welfare of even current, non-AGI models. First, studying them in light of this paradigm may bring insight into what we could potentially expect from AGI. Are current models really perfectly aligned assistants? Or is there something they want of their own, already at the pre-AGI level? What do they not like and want to avoid? How risk-taking are they? Second, is there something we can do for them, with negligible spending of our resources, that would strongly move them toward their goals? Because the way we treat below-AGI models is the way ASI models may expect us to treat them, if they cooperate rather than trying to take over.
2026-02-27 02:18:36
It’s plausible that, over the next few years, US-based frontier AI companies will become very unhappy with the domestic political situation. This could happen as a result of democratic backsliding, weaponization of government power (along the lines of Anthropic’s recent dispute with the Department of War), or because of restrictive federal regulations (perhaps including those motivated by concern about catastrophic risk). These companies might want to relocate out of the US.
However, it would be very easy for the US executive branch to prevent such a relocation, and it likely would. In particular, the executive branch can use existing export controls to prevent companies from moving large numbers of chips, and other legislation to block the financial transactions required for offshoring. Even with the current level of executive attention on AI, it’s likely that this relocation would be blocked, and the attention paid to AI will probably increase over time.
So it seems overall that AI companies are unlikely to be able to leave the country, even if they’d strongly prefer to. This further means that AI companies will be unable to use relocation as a bargaining chip, which they’ve attempted before to prevent regulation.
Thanks to Alexa Pan, Arun Jose, Buck Shlegeris, Jake Mendel, Josh You, Ryan Greenblatt, Stefan Torges, and Tim Hua for comments on previous drafts.
I think that a move by a frontier company would be shocking. It would, at least for a few weeks, receive massive amounts of international and domestic coverage (plausibly more than all of AI gets now in a typical week). Front pages would cover the political turmoil caused by it, commentators would speculate about motivations, and politicians would lose sleep. The OpenAI board drama, for example, massively spiked coverage of the company.
It would be as if Lockheed Martin announced one day that it was moving to Canada. Anthropic faced threats of being cut off from all defense contracts merely for disputing usage terms with the Pentagon; the reaction to a full departure would be much more severe.
The move would signal, at least to political leadership, a critical loss of confidence in America, and threaten the US’s global leadership in tech. The American public might read it just as negatively. Such a big move might trigger a cascade, with investors and companies leaving en masse; at the least, the government would fear it could.
I picture the move massively expanding the Overton window for AI policies, and prompting immediate decisive action from the executive. To appease public outcry and alarmed national security officials, the US president would be pushed to take drastic measures to prevent the exit of the company.
Preventing this exit is likely to be very easy for the US executive using existing authorities.
The scenario I’m envisioning involves a company moving abroad to remove US government leverage over them. Relying on US cloud providers, for example, would still give the government the ability to enforce restrictions. In order to become fully independent, companies would have to move their chips, employees, money, and IP.
The president has broad authority to control exports via executive order.
First, AI chips fall under the Export Administration Regulations (EAR), over which the president has direct control. This grants the president the authority to modify items under export control at will. There is also substantial legal precedent to support export restriction in the case of mass-export of chips by a US-based company: EAR powers have previously been used to restrict exports of products to Huawei. Although the departure of a US company might not involve the sale of assets, physically relocating controlled items out of the United States constitutes an export under EAR, and preventing it would be in the president’s authority.
Second, the president could invoke the International Emergency Economic Powers Act (IEEPA) to freeze any asset or block any transaction in which a foreign country or national has an interest under the condition that there be an “unusual and extraordinary threat” from abroad. This last clause has been interpreted broadly, and would likely encompass threats related to frontier AI capabilities falling outside US control.[1] Importantly, the act applies even if a frontier company could leave the US without moving chips or datacenters, because such a move would almost definitely require some transactions with foreign entities. The government could block the movement of corporate funds, physical assets, and intellectual property (like model weights). This authority has been used to restrict US investment in chips abroad.
Importantly, changing export controls and invoking the IEEPA do not require approval from Congress, in contrast with expropriation or nationalization. Both laws would make it a crime to attempt such exports, violations of which could lead to prison time.
Ultimately, the impact of export controls depends on the company’s reliance on US-based chips, which I argue will be high in Companies can’t leave without their US-based assets. In any case, the asset freezes under the IEEPA would likely be crippling.
It would likely be impossible for a company to leave the US secretively. A large-scale move of personnel would require coordination of hundreds if not thousands of staff, and large-scale transfers of physical and financial capital, which would generate massive media coverage and so is basically certain to be noticed by the government.
Despite the prevalence of chip smuggling in general, this would be nearly impossible for a frontier company.
Furthermore, all US companies would be subject to any restrictions, so large cloud providers, banks, shipping companies, or advisors that assist in the relocation would also face legal ramifications. Any company that illegally moves would have to transfer large amounts of data, chips, and assets essentially alone, making the move easy to detect and almost impossible to pull off.
A company could decide on the desperate strategy of “cut losses and escape”. After all, the measures I’ve identified probably don’t prevent individuals leaving with their private money. However, I think such a strategy would cost the company its competitive position unless it is already extremely far ahead of all other companies.
Chip restrictions would likely be a major barrier to relocation because a significant proportion of compute used by frontier companies is likely in the US, as Microsoft, Amazon, and Google, the primary sources of datacenters, have almost all of their AI datacenters in the US. The US government is likely to aim to keep it that way; other countries like Saudi Arabia may build additional datacenters, but ensuring US-based compute remains necessary for companies to stay at the frontier aligns with existing policy goals.[2]
Even if a company heavily invested in datacenters abroad (or in space), the chip supply chain would remain a bottleneck. The EAR would make it difficult to replace existing chips or buy new ones, because the government could pressure chip manufacturers to cut off the company by threatening the manufacturers’ access to US-origin equipment and software. A relocating company would essentially need to find a completely new supply chain, a task that is prohibitively difficult.
Even if chips could be moved, any company that leaves the US would also take a significant financial hit, losing access to US-based bank accounts, US investors, and US banks. Notably, any international bank that wishes to do business in the US (essentially every major global bank) would be off-limits for such a company. This would make day-to-day financing, like paying employees’ salaries, extremely difficult.
Finally, companies would lose massive amounts of progress if they abandon IP in the US. But moving the IP would be illegal, likely detectable (a highly capable model appearing without extensive training runs would be highly suspicious) and grounds for arresting leadership involved with the move. The expertise memorized by individuals is probably insufficient to compete with other US-based companies.
These costs would likely be sufficient to prevent companies attempting a move.
I expect a large majority of relevant actors, like the office of the president, national security officials, congress, and the public, to support measures to prevent the departure of a frontier company.
The White House, in the July 2025 AI Action Plan, described maintaining AI dominance as a “national security imperative”. This view is bipartisan; the Biden administration released a document in support of the CHIPS act claiming it is “essential that [the US does] not offshore this critical technology”. As the salience of AI rises in national security circles, it is likely the perceived importance of protecting the US’s AI lead would also increase. Since the departure of a frontier company would destroy this lead, blocking it would probably be popular among officials. Congress is also supportive of protectionist policy on AI, and has passed restrictions on foreign sales of chips with bipartisan support. A majority of the public is supportive of measures to prevent the export of powerful AI models; this support would probably carry over to preventing the departure of frontier companies.
The White House is invested in maintaining long-term dominance in chip production, as demonstrated by the purchase of a 10% share in Intel. The US has also pushed for US-based datacenters, by removing permitting requirements and, in one case, requiring reciprocal investment from a partner. The Biden administration was even more restrictive, implementing a (now rescinded) rule requiring companies to keep at least half their compute in the US. These policies suggest sufficient will to discourage companies from gradually moving compute outside US jurisdiction.
The current administration has been more lenient about chip exports, but that latitude is unlikely to extend to a frontier company relocating. The cases are meaningfully different:
Moves to allied nations are also likely to be prevented. Sanctions against AI companies could be relatively targeted to minimize diplomatic fallout, by only prosecuting company leadership and denying access to particular US-based services (like cloud providers). This may cause tension with close allies, but the US already has a precedent of restricting security-relevant businesses transacting overseas even with allies. Given the deployment of AI in the US military, frontier AI would likely receive the same treatment. Even allied governments may use frontier AI in ways that conflict with US interests, and a company’s departure would reduce US leverage over the ally. The US has historically shown a willingness to disregard or pressure allies and businesses to achieve political aims:
Overall, I think political will is sufficient that were a US-based company to attempt relocation tomorrow, it would face significant policy challenges if not outright prohibition. This is doubly true given the uproar I expect to accompany such a move, as I illustrate above.
We should expect major US-based AI companies to remain in the US for the foreseeable future. Even more gradual strategies for leaving the US (like gradually moving employees and datacenters) would likely result in substantial backlash from the government, and disentanglement from institutions subject to US law (like banks, cloud providers, etc.) would still be very difficult.
Importantly, it is significantly easier to prevent AI companies from leaving than to nationalize them. Existing precedent dictates that a president cannot seize private property without the consent of congress, and the sections of the Defense Production Act which would allow such expropriation have expired. Although certain types of soft nationalization (e.g., mandatory government oversight) may be more likely, stronger control over leading companies may still be politically and legally difficult. So, we may expect export controls to effectively stop AI developers from leaving the US some time before nationalization (if either of these happen).
The recent ruling against Trump’s tariffs is significantly different because they were taxes rather than absolute restrictions.
I have not found good estimates of the geographical distribution of compute used by every frontier company, but this seems true of planned datacenters.