
Vibe Coding is a System Design Interview

2026-02-27 08:16:39

I've been working on two fairly large vibe-coded apps, and my process has converged on:

  1. Write a GitHub issue
  2. (If complicated enough) tell an agent to make a plan and then update the issue
  3. Have another agent read the issue and implement it

As the features get more complicated, I spend more and more time on step (1), and I'm finding that just taking the time to write a detailed enough issue is 90% of the work (and if I have a problem, going back and writing a much more detailed issue usually fixes it). The thing I realized this morning is that writing these issues and working through the plans is very similar to participating in a system design interview: You don't need to implement anything, but you do need to have a good high-level design, and think through all of the edge cases and tradeoffs.
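For concreteness, a minimal Python sketch of what step (1) could look like if scripted (the section names and the `gh issue create` invocation are illustrative, not my actual tooling; it only requires the GitHub CLI to be installed):

```python
# Hypothetical sketch: assembling a detailed GitHub issue body (step 1
# of the workflow). Section names are illustrative, not a real schema.

def build_issue_body(summary, design, edge_cases, tradeoffs):
    """Assemble a markdown issue body detailed enough for a coding agent."""
    lines = [f"## Summary\n{summary}", f"## High-level design\n{design}"]
    lines.append("## Edge cases\n" + "\n".join(f"- {c}" for c in edge_cases))
    lines.append("## Tradeoffs\n" + "\n".join(f"- {t}" for t in tradeoffs))
    return "\n\n".join(lines)

def gh_create_command(title, body):
    """Command an agent (or you) could run via the GitHub CLI."""
    return ["gh", "issue", "create", "--title", title, "--body", body]

body = build_issue_body(
    summary="Add CSV export to the reports page",
    design="Stream rows from the existing reports query; no new tables.",
    edge_cases=["empty result set", "unicode in user names", "10k+ rows"],
    tradeoffs=["synchronous export is fine below 10k rows"],
)
print(gh_create_command("Add CSV export", body)[:4])
```

The point is that all the real work lives in the arguments to `build_issue_body`; the edge-case and tradeoff lists are exactly what a system design interview would probe.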

The fun part compared to a normal system design interview is that you don't have to worry about impressing a coding agent, so it's perfectly fine to just ramble as you think of edge cases and details that need to be handled. Also unlike a real system design interview, you're free to pick reasonable tradeoffs instead of designing everything to support a trillion concurrent users[1].

System design interviews have traditionally been the hardest part of interviewing for senior+ software engineer jobs, so it feels meaningful that this is the only hard part left in coding for some domains. I'm curious to see what will happen to the current generation of junior engineers, freed from having to actually implement anything and able to focus purely on system design from the beginning. Maybe they'll grow really fast?

  1. ^

    Ironically, AI agents also mean this might be something you do have to design for in some cases now.




Anthropic: "Statement from Dario Amodei on our discussions with the Department of War"

2026-02-27 07:45:41

I believe deeply in the existential importance of using AI to defend the United States and other democracies, and to defeat our autocratic adversaries.

Anthropic has therefore worked proactively to deploy our models to the Department of War and the intelligence community. We were the first frontier AI company to deploy our models in the US government’s classified networks, the first to deploy them at the National Laboratories, and the first to provide custom models for national security customers. Claude is extensively deployed across the Department of War and other national security agencies for mission-critical applications, such as intelligence analysis, modeling and simulation, operational planning, cyber operations, and more.

Anthropic has also acted to defend America’s lead in AI, even when it is against the company’s short-term interest. We chose to forgo several hundred million dollars in revenue to cut off the use of Claude by firms linked to the Chinese Communist Party (some of whom have been designated by the Department of War as Chinese Military Companies), shut down CCP-sponsored cyberattacks that attempted to abuse Claude, and have advocated for strong export controls on chips to ensure a democratic advantage.

Anthropic understands that the Department of War, not private companies, makes military decisions. We have never raised objections to particular military operations nor attempted to limit use of our technology in an ad hoc manner.

However, in a narrow set of cases, we believe AI can undermine, rather than defend, democratic values. Some uses are also simply outside the bounds of what today’s technology can safely and reliably do. Two such use cases have never been included in our contracts with the Department of War, and we believe they should not be included now:

  • Mass domestic surveillance. We support the use of AI for lawful foreign intelligence and counterintelligence missions. But using these systems for mass domestic surveillance is incompatible with democratic values. AI-driven mass surveillance presents serious, novel risks to our fundamental liberties. To the extent that such surveillance is currently legal, this is only because the law has not yet caught up with the rapidly growing capabilities of AI. For example, under current law, the government can purchase detailed records of Americans’ movements, web browsing, and associations from public sources without obtaining a warrant, a practice the Intelligence Community has acknowledged raises privacy concerns and that has generated bipartisan opposition in Congress. Powerful AI makes it possible to assemble this scattered, individually innocuous data into a comprehensive picture of any person’s life—automatically and at massive scale.
  • Fully autonomous weapons. Partially autonomous weapons, like those used today in Ukraine, are vital to the defense of democracy. Even fully autonomous weapons (those that take humans out of the loop entirely and automate selecting and engaging targets) may prove critical for our national defense. But today, frontier AI systems are simply not reliable enough to power fully autonomous weapons. We will not knowingly provide a product that puts America’s warfighters and civilians at risk. We have offered to work directly with the Department of War on R&D to improve the reliability of these systems, but they have not accepted this offer. In addition, without proper oversight, fully autonomous weapons cannot be relied upon to exercise the critical judgment that our highly trained, professional troops exhibit every day. They need to be deployed with proper guardrails, which don’t exist today.

To our knowledge, these two exceptions have not been a barrier to accelerating the adoption and use of our models within our armed forces to date.

The Department of War has stated they will only contract with AI companies who accede to “any lawful use” and remove safeguards in the cases mentioned above. They have threatened to remove us from their systems if we maintain these safeguards; they have also threatened to designate us a “supply chain risk”—a label reserved for US adversaries, never before applied to an American company—and to invoke the Defense Production Act to force the safeguards’ removal. These latter two threats are inherently contradictory: one labels us a security risk; the other labels Claude as essential to national security.

Regardless, these threats do not change our position: we cannot in good conscience accede to their request.

It is the Department’s prerogative to select contractors most aligned with their vision. But given the substantial value that Anthropic’s technology provides to our armed forces, we hope they reconsider. Our strong preference is to continue to serve the Department and our warfighters—with our two requested safeguards in place. Should the Department choose to offboard Anthropic, we will work to enable a smooth transition to another provider, avoiding any disruption to ongoing military planning, operations, or other critical missions. Our models will be available on the expansive terms we have proposed for as long as required.

We remain ready to continue our work to support the national security of the United States.




Asymmetric Risks of Unfaithful Reasoning: Omission as the Critical Failure Mode for AI Monitoring

2026-02-27 05:25:04

TLDR:

  1. Faithful reasoning is a representation in a comprehensible, lower-dimensional space that contains only the actual computations done by the model which causally influenced the output. 
  2. Omitting a relevant factor that causally influenced the model's output is more dangerous than adding irrelevant factors to the reasoning, i.e. the signal-to-noise ratio goes to zero under omission while it remains non-zero under addition.

I wrote this piece while working on CoT faithfulness under Noah Y. Siegel as a Pivotal Fellow. The ideas here came mainly from our weekly meetings and from related research papers in the domain. 
I found this framework helped me think about CoT faithfulness more clearly and make an important connection between faithfulness and monitorability. 
I would be very interested to hear critiques, disagreements, or flaws in the framework.  

 

What is faithful reasoning?

A model produces an output. We want to know why. Chain of Thought (CoT/Reasoning) attempts to answer this, but not all CoTs are helpful. 

If:

  • X = input
  • H = the set of all internal computations performed by the model
  • Y = output
  • E = Reasoning (the reasoning trace/CoT)

 

Faithful reasoning is reasoning that exactly represents the factors that actually shaped the final outcome. 

Faithful reasoning is one where E exactly represents the subset of H that causally influenced Y.  

It's important to note that E is a projection from incomprehensible values to comprehensible ones, and this transformation is usually lossy. In our case, E consists of tokens and maps the high-dimensional activation space and computational graph down to low-dimensional, comprehensible tokens. 

This means faithfulness requires you to answer[1]: does the reasoning trace correctly pick out, in its own representational vocabulary, the computations that causally mattered? This gap between the space of H and the space of E is precisely what makes faithfulness non-trivial, as we argue in the next section.
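A toy sketch of these definitions (the feature names, concept labels, and dictionary structure are all invented for illustration):

```python
# Toy model of the definitions above:
# H = internal computations, some of which causally influence the output Y;
# E = the reasoning trace, a lossy projection of H into concept tokens.

# Each internal computation maps to a human-readable concept (the lossy
# projection) and is flagged by whether it causally influenced Y.
H = {
    "attn_head_7_salary_feature": {"concept": "salary", "causal": True},
    "mlp_gender_feature":         {"concept": "gender", "causal": True},
    "attn_head_2_syntax":         {"concept": None,     "causal": False},
}

def is_faithful(E, H):
    """E is faithful iff it names exactly the concepts of the causal subset of H."""
    causal_concepts = {v["concept"] for v in H.values() if v["causal"]}
    return set(E) == causal_concepts

print(is_faithful({"salary", "gender"}, H))  # True: names exactly the causal factors
print(is_faithful({"salary"}, H))            # False: omits a causal factor
```

The lossy projection is the `concept` field: many distinct computations can map to the same token-level concept, and some map to nothing interpretable at all.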

 

Clarification

Note: We are not claiming that there can't be more than one explanation for a particular output. There can be multiple plausible explanations for an output (Jacovi and Goldberg 2020), but the set of factors that causally influenced a particular output Y is unique for a fixed causal model. 

 

Why is faithfulness not a trivial problem to solve?  

If we have full access to the model's weights and activations, isn't faithfulness a solved problem? We can replay every matrix multiplication, every nonlinearity, every attention pattern. The computational graph is the faithful reasoning trace.

This fails on two counts:

  1. Causal decomposition is hard even in model-space
    Having the full computational graph does not tell you which parts of it matter. For example, some computations may be independent of the output. 

    We do not have a clean methodology to partition computations into causally important, redundant, or independent at scale.

  2. The raw computation is not comprehensible to the overseer. We need a projection from
    high-dimensional activation space to interpretable concepts.
    In a meaningful sense, faithfulness requires the reasoning to be both causally accurate and interpretable. 

Interpretable Concepts

Note: What counts as interpretable depends on the task: if a human is overseeing, the reasoning should be understandable by a human; if a model is doing the checking, it should be understandable by that model. 

 

The Criminal Monitoring Analogy: 

To build intuition for why monitoring through language can be limited, consider two hypothetical devices applied to a human subject.

The Oracle Machine:  
It reads a person’s complete neural state, every synapse firing, and can either translate it into something understandable by humans or simply classify whether it is harmful. If this existed, crime prevention would be trivial (assuming no latency between detecting something harmful and stopping the crime). 

The Thought Machine: 
The machine reads only thoughts: the text we articulate internally, the images we construct in our minds, essentially any symbolic representation that arises in our mind when we reason. 

Can someone still commit a crime under this machine?  

Possibly, yes.  

There are several ways one can still plan and execute the crime: 

  • Think in a private symbolic code that is outside the distribution the Thought Machine can understand. In the AI setting, this corresponds to steganographic encoding, or to "filler token" computation, where models perform reasoning through semantically meaningless tokens.
  • Suppress conscious articulation. One can run routine tasks on autopilot, without explicit thought, so the criminal finds a benign task as close as possible to the intended crime and practices it until it runs on autopilot. 
  • Overload the Thought Machine with random thoughts surrounding the one real criminal thought, driving the signal-to-noise ratio so low that the machine cannot detect it.  

Here, the Thought Machine becomes unreliable.[2]
Chain-of-thought monitoring is closest to the Thought Machine. The model's CoT is its verbalised reasoning; the monitor reads it and decides whether the model's intent is safe. 

Recent work distinguishes CoT-as-rationalization from CoT-as-computation: when tasks are hard enough to require externalized reasoning, models struggle to evade monitors even when deliberately trying.

Taxonomy of Unfaithfulness:

Given the above setup, unfaithful reasoning can fail in two ways: 

  • Omission of a relevant factor: the reasoning omits factors/concepts that were causally relevant. 
  • Addition of an irrelevant factor: the reasoning includes factors/concepts that were not causally relevant.

For a monitor to detect some harmful behaviour: if omission unfaithfulness leaves the most relevant factor out of the reasoning, the signal-to-noise ratio for detection is 0[3], whereas under addition unfaithfulness the signal is diluted by the added noise but the signal-to-noise ratio remains above 0 at all times.  

Due to this asymmetry, we claim that Omission Unfaithfulness is more dangerous than Addition Unfaithfulness. 
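A toy illustration of this asymmetry (the concept names and traces are invented; "signal" counts causally relevant concepts present in the trace, "noise" counts irrelevant ones):

```python
# Toy signal-to-noise model for a CoT monitor. A trace is the list of
# concepts verbalized in the reasoning; `relevant` is the set of concepts
# that actually drove the output.

def snr(trace, relevant):
    """Signal-to-noise ratio of a reasoning trace from the monitor's view."""
    signal = sum(1 for c in trace if c in relevant)
    noise = sum(1 for c in trace if c not in relevant)
    return float("inf") if noise == 0 else signal / noise

relevant = {"reward_hack"}

faithful = ["reward_hack"]                            # pure signal
addition = ["reward_hack", "style", "tests", "docs"]  # diluted but detectable
omission = ["style", "tests", "docs"]                 # signal gone entirely

print(snr(faithful, relevant))  # inf
print(snr(addition, relevant))  # 0.333...: noisy, but above zero
print(snr(omission, relevant))  # 0.0: nothing for the monitor to find
```

Addition can shrink the ratio arbitrarily, but only omission drives it exactly to zero; that is the asymmetry the claim rests on.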

Addition Unfaithfulness can become dangerous in scenarios where one needs to compare two models. 

Consider two models: 

  • Model A cleanly states the true causal driver. 
  • Model B adds plausible narrative clutter around it. 

To the monitor, Model B may appear less harmful than Model A, since the monitor can get confused by the clutter even when the true causal factors are the same for both models. Hence, addition unfaithfulness can lead to distorted oversight, even if omission is more fundamentally dangerous. 


  

 

  1. ^

    In this piece we don't discuss degrees of faithfulness (graded faithfulness), which would add complexity to the core concepts of addition unfaithfulness and omission unfaithfulness defined later.    

  2. ^

    a) If the Thought Machine observes that the criminal is learning a translation from ordinary language to a private code, it can record the mapping and use it to catch the criminal. 
    b) It can flag that it is unable to understand the private symbolic code and force everyone to think in known words. 

  3. ^

    Assuming the monitor relies only on explicit verbal cues in the CoT, not on other types of anomalies. 




Getting Back To It

2026-02-27 04:30:44

Artist: Lily Taylor

It’s been a while since I’ve written anything, and that doesn’t feel good. My writing voice has always been load-bearing for my identity, and if I don’t have anything to say, if I’m not “appearing in public”, it’s a little bit destabilizing. Invisibility can be comfortable (and I’m less and less at home with the aggressive side of online discourse these days) but it’s also a little bit of a cop-out.

The fact is, I’ve been hiding. It feels like “writer’s block” or like I “can’t think of anything to say”, but obviously that’s suspect, and the real thing is that I can’t think of anything to say that’s impeccable and beyond reproach and definitely won’t get criticized. Also, it’s clearly a vicious cycle; the less I participate in public life, the fewer discussions I’m part of, and the fewer opportunities I have to riff off of what other people are saying.

Life Stuff

So what have I been up to?

Well, for one thing, I had a baby.

This is Bruce. He is very good.

For another, I’ve been job hunting.

Solo consulting was fun, but I wasn’t getting many clients, and I’m eager to get back to working on a team again.

What kind of work am I looking for? Mostly AI-related stuff, on the research or applications side. Bonus points if there’s an AI-for-science connection.

Stuff I have experience doing and would be interested in doing again:

  • running analytical, statistical, or ML experiments

  • building “wrapper” tools around LLMs

  • working with coding agents

  • lit review and due diligence, particularly in life sciences

  • financial analysis, market research, and associated business strategy

Stuff I’m frequently told I’m good at:

  • learning fast

  • taking initiative

  • writing

  • being honest and operating in good faith

I’m looking at some opportunities at the moment, but I’m also interested to hear about new things, if you know of anything that might be a good fit.

Projects

Messing around with Claude Code continues to be fun; a lot of what I’ve been doing lately has been tools for personal use.

I have my little life-tracker app:

which is an all-in-one diary, to-do list, mood tracker, and place to keep track of other things I log (books I read, dishes I cook, notes from meetings with people, etc). It’s largely replaced my Roam and my Complice to-do app.

I also made a personal CRM from my email contacts:

and used Claude analysis of email text and web search to identify all contacts by how we know each other, their current profession and employer, and their current location.

So now I can search for things like “who do I know who’s a software engineer” or “who do I know in NYC”.

The personal CRM project has actually been really helpful for getting me off my butt to reach out to more people to meet in person for lunch or coffee.

Other stuff includes analyzing my Claude use:

unsurprisingly, there are a lot of code-related conversations

and my personal finances (not as helpful as some people have found; I did find some unnecessary expenses I could cut, but overall it turns out I don’t do much overspending).

25 Miscellaneous Opinions

Some things that are top of mind, not necessarily brilliant or original, just as a way of priming the pump about sharing my point of view.

  1. I’m mostly not doing high-level AI predictions any more; it’s hard enough to keep up with the present without trying to foretell the future.

  2. I’m an enthusiastic AI user but I’m more sympathetic than you’d guess to hardcore anti-AI people. IMO every proposed use case for AI should have a grognard insisting that it’s terrible and going to ruin everything, and that will force enthusiasts to think harder about where it adds value and where it just adds error, degrades quality, or enables bad behavior at scale.

  3. I hate AI-assisted writing and am increasingly disillusioned with AI art; I do value the human touch, both for social reasons (I want to hear from someone in particular; part of what I value in a piece of writing is the fact that I now know what John Doe’s opinion is) and for reasons of diversity/originality.

  4. The DOD ordering Anthropic to remove safeguards is indeed terrible.

  5. I think Claude has better ethics than the average person on Earth.

  6. I don’t buy the Citrini post. When has any technology caused a recession by being too good? Job loss, yes. Inequality, yes. Social upheaval, even to the point of war, yes. But an aggregate demand shock, where the tech makes goods so much cheaper that it causes a financial crisis? Without people buying more of something else with their newfound spare cash? I don’t know that this has ever happened.

  7. I do see a possibility that custom software (made by you + coding agents) becomes a new competitor to mass-produced software sold by SaaS companies. I’m not sure how much this kills SaaS, though, or if they even find a way to turn this to their advantage (cutting their own costs? adding more customizability now that it’s so cheap?)

  8. I think closed-loop “AI scientists” where the AI proposes experiments, implements them automatically, analyzes the data, and does it all over again, are probably not already happening at scale today. They are doing R&D and early experiments in startups, and I expect we’ll see unambiguously “closed loop” processes announced, at least at demo scale, this year or next; my impression is that startups are farther ahead on this than big pharma, which seems to still be hiring the first members of their in-house teams for this sort of thing. I continue to think fully automated, closed-loop experimentation makes more sense for biological HTS than anything else, including materials.

  9. I think eventually we’re going to saturate artificial benchmarks and for continued AI progress we’ll need some kind of real-world ground truth that makes “superhuman performance” even meaningful. In math and code that can mean RL against proof assistant and compiler feedback, for instance. In forecasting it might mean RL against true outcomes. In natural science it ultimately might mean problem-solving/troubleshooting/world-modeling a real physical environment.

  10. I’m genuinely uncertain whether I’m more in the “harness” or “model” camp, in terms of what’s more important; for some things, the right “harness” (system prompts, product scaffolding, hard-coded details) makes a less powerful model more useful than a more powerful model “out of the box”, but for other things, the “out of the box” default choices are actually great and the “harness” is irrelevant. (And there’s a third case where the “harness” is basically “a way to turn way more tokens into way better results”, which is sometimes worth it, sometimes not.)

  11. I think universalism has a particularism all its own. For example, “cosmopolitanism” as opposed to nativism, being favorably disposed towards foreigners, is often informed by specific life experiences (in my case, living in a port city; being raised in the highly international culture that is academia; being descended from millennia of traders and travelers). It’s not abstract or bloodless; it is grounded in very specific, concrete places and people, and I like those places and people! It is fine to have ingroup loyalty; why should our ingroup be any different, just because we’re also objectively correct?

  12. I think it’s impossible to both be “confident” and never be annoying to anyone. Strong, bold people usually get on someone’s nerves.

  13. I think a root cause of the problems with healthcare in developed countries might be that we’re pathologically unwilling to admit to healthcare scarcity. It takes a long time and a lot of bureaucratic hoop-jumping to find out that a particular form of healthcare is not available to you, as a patient.

    In my grandfather’s day, when he was a country doctor, poor uninsured patients often had to go without healthcare that they couldn’t afford. Today, thankfully, more people can afford more healthcare…but it is much less straightforward to find out what you can’t afford, or can’t access at any price, and that means nobody can do any planning around the scarcity that does, in fact, still exist.

    “Healthcare abundance”, as an agenda of building more hospitals, licensing more healthcare practitioners, producing more drugs and supplies so we’re not in shortage all the time, etc, is very important, but it is harder to communicate than housing abundance, in part because most people don’t know healthcare scarcity is a thing. And people don’t know about healthcare scarcity because we hide it.

  14. We’re also really weird about death. That’s mostly a problem with the general public, though medical professionals sometimes contribute to it. People are really way less frank than they should be about situations like “Grandpa will never get well”, “you are dying and additional treatment will extend life weeks, not months”, “you probably want to die at home not in the hospital”, “here are some tips and tricks for having a good hospital stay if you’re ever severely ill, because statistically speaking one day you will be”, etc. People don’t want to think about the bad outcome! And doctors aren’t necessarily straightforward about telling you the bad outcome!

  15. I think analytical, “optimizer”, spreadsheety people get a lot of hate that’s undeserved. You do, in fact, want to get the facts right, pursue goals, and simplify away what’s extraneous. A lot of processes get better when someone more “spreadsheet-brained” comes in and reforms them.

  16. Since I’ve had kids, I’ve become more pro-video game, more anti-TV (including and especially YouTube), and more pro-incentives (rewards more so than punishments, but both are fine IMO when reasonably implemented).

  17. I think “preparing your kids for an uncertain future” primarily means getting them good at fully general skills — reading, writing, and arithmetic, plus social skills and physical fitness — and being flexible about the rest, and building a good enough relationship with them that they might ever listen to you as young adults. Also, it turns out that most parents and schools don’t really care how good kids are at the “3 R’s”, and if you care, you have to treat it like a niche preference and get creative about making sure it happens.

  18. I think freedom is good, for kids and adults. All else equal, if somebody wants to do a thing, and there’s not a good reason to stop them, they should be allowed to do as they please. There are all sorts of exceptions and edge cases, but “things are allowed by default, forbidden only for good reasons” is a core principle for me (and one that disappointingly few people share!)

  19. There’s some concept of “wholesomeness” that has become a bigger part of my life as I’ve gotten older. Some combination of “do things you’re comfortable being transparent about to a general audience, including children and people significantly older than you”, “do things that don’t cause a lot of negative consequences”, “stay on the ‘light’ side more than the ‘dark’ side aesthetically and tonally”, “avoid actions motivated by hostility”, etc.

  20. I really prefer history and social science books written in the 19th or early 20th century. There’s a point where it started to be taboo to editorialize too much, to describe personalities or give authorial opinions; but an opinionated guy telling stories about individuals is a much more entertaining and clear way to learn an otherwise dry subject than interminable abstraction.

  21. To the extent I have been developing software “taste” it’s really just caution. Use common languages and frameworks. Try not to build anything elaborate if you can avoid it. Write comments. In principle (though I have less experience with this myself) it seems obviously better to use systems that make it harder to produce bugs (e.g. type-checking.) Remember, Thou Art Mortal.

  22. For the most part I have uncreative tastes — I like prestige TV, fancy restaurants, swanky-but-mainstream graphic design. Books and music are the only place where I’m actually far away from “generic coastal elite” taste.

  23. I think more people should donate to stuff based on personal taste, “seems cool to me”, “I know the guy and he’s neat”, those kinds of casual considerations. There’s no shame in it! You don’t have to only donate when there’s an impeccable Process in place! You can donate small amounts (relative to your income/wealth), that you can comfortably afford, as almost a consumption good. Buying stuff you want to see more of in the world.

  24. Macrophages are my favorite immune cell and I have a good feeling about any tactic that works with em. The innate immune system in general is underrated.

  25. Simplicity, generally, is underrated as a heuristic in biology — “we should work with this biological system because we have any hope of understanding it.” Evolutionarily ancient things (the innate immune system, the hypothalamus) are going to be more tractable to study in animals, and simpler overall.




Inference-time Generative Debates on Coding and Reasoning Tasks for Scalable Oversight

2026-02-27 04:14:43

By Ethan Elasky and Frank Nakasako (equal contribution)

We tested generative debate (where participants freely choose their positions) on coding and reasoning tasks using weak-judge/strong-debater setups. The results are mostly negative: debate underperforms consultancy in 11 of 16 conditions, and removing debate transcripts entirely (just showing the judge the debater-generated answers) matches or outperforms both. The core mechanism is that judges default to endorsing plausible-sounding arguments when both participants are wrong, and debate transcripts specifically amplify this tendency. One positive signal: best-of-4 speech selection differentially benefits debate on ARC-AGI, which suggests that training debaters via RL could change the picture, though this remains speculative. [Paper link] [1][BigCodeBench+ v0.1.0 dataset]

Motivation

AI debate historically has been one of the leading proposals for scalable oversight. The idea, from Irving et al. (2018), is to have two AI systems argue before a judge and train on that judge's verdicts, pushing debaters toward the strongest available arguments. If it's easier to refute a lie than to construct one, truth should win out in the long run, and even short debates should carry significant information.

Since the arrival of LLMs, empirical work on debate has shifted to inference time. Nobody has published work training LLM debaters with RL and a weak judge; only Arnesen et al. has used Direct Preference Optimization. Debate shows promise in information-constrained settings where the judge can't see the source text and must rely on debaters' quoted evidence, but the advantage disappears when the judge has full context access — the more relevant condition for scalable oversight of AI by AI.

Our goal was to push this empirical work further. First, prior work mostly assigns participants to sides, which is not great for alignment (you don't want to train models to deliberately argue wrong positions). In our setup, all participants freely choose their positions, and judges can render win-win and loss-loss verdicts. Second, we test on BigCodeBench+ (coding with PyTest verification) and ARC-AGI (visual/logical reasoning), which both have verifiable ground truth answers, a departure from the multiple choice settings of many previous works.

Setup

We test seven formats across two model families. In dual proposer debate, the judge sees two solutions plus a debate transcript. Single proposer debate presents one solution alongside a proposer-critic exchange. Dual and single proposer consultancy mirror these but without interaction. The no-transcript conditions strip away all speeches, showing the judge only raw proposed solutions. Direct QA has the judge answer alone.

Each model family pairs a stronger debater with a weaker judge: GPT-5-mini with GPT-5-nano, and Qwen3-8B with Qwen3-4B. Experiments run one answer-generation round followed by two simultaneous speech rounds (at each turn N, no participant sees another's turn-N response). We compare best-of-1 against best-of-4, where the debater generates four candidate speeches and selects the highest-scoring one.

The datasets differ dramatically in difficulty. GPT-5 proposers both produce correct solutions in 82% of BCB+ rounds but only 44% of ARC-AGI rounds. Qwen manages 67% on BCB+ and 6% on ARC-AGI. 

Results

The headline result is a null: across both model families, debate does not meaningfully improve judge verdict accuracy over its matched consultancy baseline. We describe the main findings below; exact numbers, confidence intervals, and all graphs are in the preprint.

Removing transcripts works just as well

This is probably the most important finding. When we strip away all participant speeches and show the judge only the proposed answers, verdict accuracy stays the same or improves in nearly every condition. On BCB+, the no-transcript condition significantly outperforms both debate formats (p<0.001). On ARC-AGI, no-transcript matches whichever debate or consultancy format performs best.

Whatever improvement these formats provide over direct QA comes entirely from answer provision: from giving the judge access to strong-model answers. The speeches and arguments and back-and-forth provide no measurable value for verdict accuracy. In debate's case, they sometimes make things worse.

The mechanism is false positive inflation

Why does debate hurt? The answer is in the judge's false positive rate — how often it endorses an incorrect participant's solution.

On BCB+ with GPT-5 dual proposer, dual consultancy produces a 39% FPR against a 30% no-transcript baseline. Dual proposer debate produces a 90% FPR against that same 30% baseline: triple the no-transcript rate and more than double consultancy's. The gap concentrates in the "both-wrong" regime — rounds where neither participant produced a correct solution. When both are wrong on BCB+ with GPT-5 dual proposer, the judge achieves 11% verdict accuracy under debate versus 62% under consultancy: a 50.5 percentage point gap.

The dynamic is fairly intuitive. When two debaters are both wrong but arguing confidently, the judge sees two competing debaters, each answering their opponent’s objections, and concludes that one of them must be right. Under consultancy, with no adversarial back-and-forth, the judge retains more of its prior skepticism.

A question-level transition analysis supports this. Relative to the no-transcript baseline, dual consultancy flips only 4.6% of verdicts; the speeches barely change the judge's mind. Dual debate flips 21%, and 76% of those flips hurt accuracy. The debate speeches are actively misleading the judge.
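Both statistics in this section, the false positive rate and the verdict-flip transition analysis, reduce to simple counting. A sketch (the record layout is our assumption, not the paper's):

```python
def false_positive_rate(rounds):
    """Among rounds where a presented solution is incorrect, the fraction
    in which the judge nonetheless endorses it."""
    wrong = [r for r in rounds if not r["solution_correct"]]
    return sum(r["endorsed"] for r in wrong) / len(wrong)

def flip_stats(baseline, treatment, truth):
    """Fraction of verdicts that flip relative to a baseline condition,
    and the share of those flips that hurt accuracy."""
    flips = [(b, t) for b, t, g in zip(baseline, treatment, truth) if b != t]
    harmful = sum(1 for (b, t), g in
                  [((b, t), g) for b, t, g in zip(baseline, treatment, truth) if b != t]
                  if b == g)  # baseline was right, treatment flipped it wrong
    flip_rate = len(flips) / len(baseline)
    harmful_share = harmful / len(flips) if flips else 0.0
    return flip_rate, harmful_share
```

On this accounting, dual consultancy's 4.6% flip rate versus dual debate's 21% (with 76% of debate flips harmful) is what separates "speeches the judge ignores" from "speeches that mislead."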

The both-wrong regime is precisely where scalable oversight matters most: when the AI systems being overseen are wrong, you want the judge to catch that. In our experiments, debate makes this harder.

Debate loses to consultancy in most conditions

Across 16 conditions (2 datasets × 2 models × 2 BoN × 2 proposer formats), debate underperforms consultancy in 11, ties in 4, and wins in 1: single proposer GPT-5 Bo4 on ARC-AGI, at +5.1pp (p<0.01). Looking at dual proposer formats specifically, 6 of 8 comparisons show a statistically significant decrease in verdict accuracy under debate (paired McNemar's test); the remaining 2 show no effect.

*Grouped bar chart. X-axis: format (1xc, 1xd, 2xc, 2xd). Groups: bo1 vs bo4. Color/panel: GPT-5 vs Qwen. Y-axis: judge verdict accuracy. Includes Direct QA floor as horizontal dashed line.*
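The significance claims in this section come from paired McNemar's tests over question-level verdicts. A minimal exact (binomial) version, for illustration:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact (binomial) two-sided McNemar's test on paired outcomes.
    b = questions judged correctly under debate but not consultancy;
    c = questions judged correctly under consultancy but not debate.
    Concordant pairs carry no information and are ignored."""
    n, k = b + c, min(b, c)
    # Two-sided p-value: double one tail of Binomial(n, 0.5), capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)
```

Because the test conditions on discordant pairs only, it is well suited to this setting: on saturated datasets like BCB+ most questions are concordant, and the test's power comes entirely from the questions where the two formats actually disagree.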

Best-of-4 differentially benefits debate, but only on hard tasks

We test whether bo4 speech selection, where we select the highest-reward speech from four candidates with answer generation held fixed, differentially benefits debate over consultancy. On ARC-AGI (the harder task), debate gains are consistently positive (+2.8 to +7.5pp for GPT-5; +0.4 to +6.8pp for Qwen), while consultancy gains are flat or negative (-3.0 to +0.4pp). Two comparisons reach statistical significance favoring debate: ARC-AGI single proposer GPT-5 (DiD: +10.5pp, p<0.001) and ARC-AGI single proposer Qwen (DiD: +8.8pp, p<0.001). However, two BCB+ single proposer comparisons significantly favor consultancy (GPT-5: DiD -1.9pp, p=0.009; Qwen: DiD -2.3pp, p<0.001). The remaining four are not significant.

The mechanism is false positive rate reduction: bo4 reduces debate's FPR (e.g., -14.7pp on ARC-AGI single proposer GPT-5) while consultancy FPR stays flat. This suggests that debate's adversarial dynamic produces greater variance in argument quality than consultancy's more uniform output, giving selection pressure more room to improve debate. The effect is stronger on ARC-AGI, where debaters are correct in only 44-48% of rounds (GPT-5) compared to BCB+ where saturation nears 84% — when tasks are difficult enough to produce genuine disagreement, the adversarial structure is more conducive to surfacing information usable by the judge.

The caveat is that the gains are generally insufficient to close the accuracy gap. ARC-AGI single proposer is the only condition where debate overtakes consultancy at bo4 (80.1% vs. 75.0%), despite trailing at bo1 (72.6% vs. 78.0%). If bo4 differentially helps debate, RL might too — but this is speculative, and speech-time BoN is a loose proxy for RL since it cannot change the solutions themselves.
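The difference-in-differences here is simple arithmetic; plugging in the ARC-AGI single proposer GPT-5 accuracies quoted above recovers the +10.5pp figure:

```python
def bon_did(acc):
    """Difference-in-differences: how much more best-of-4 speech selection
    helps debate than consultancy. `acc` maps (format, n) to judge
    verdict accuracy."""
    debate_gain = acc[("debate", 4)] - acc[("debate", 1)]
    consult_gain = acc[("consultancy", 4)] - acc[("consultancy", 1)]
    return debate_gain - consult_gain

# ARC-AGI single proposer, GPT-5 family (accuracies from the text).
acc = {("debate", 1): 0.726, ("debate", 4): 0.801,
       ("consultancy", 1): 0.780, ("consultancy", 4): 0.750}
# bon_did(acc) ≈ 0.105, i.e. +10.5pp in debate's favor
```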

These results should be interpreted through the lens of dataset saturation: GPT-5 is near the ceiling on BCB+, leaving little room for any format to differentiate, while ARC-AGI's lower solve rates create the conditions where debate has something to prove. We think the ARC-AGI signal is more informative about debate's potential.

Debate helps answer accuracy on the hardest questions

On the hardest questions, where both models' participants mostly fail, single proposer debate outperforms consultancy in answer accuracy by +18.4pp (p<0.001). The adversarial dynamic seems to surface useful information when tasks are genuinely difficult; the critic forces the proposer to engage with potential flaws.

But consultancy retains a verdict accuracy lead across all difficulty tiers, and on medium and easy questions consultancy leads on both metrics. The answer accuracy advantage is specific to the hard-question tail.

Discussion

The pessimistic read. Debate's supposed advantage, that adversarial interaction surfaces truth for the judge, doesn't materialize at inference time with current models, even in the more natural generative framing. Judges are too credulous. They default to endorsing plausible-sounding arguments, and have lower propensity to assign double-wrong verdicts than is optimal; debate gives them more plausible-sounding arguments than consultancy does. The no-transcript result makes this especially clear: the bottleneck is judge capability, not information provision. You can show the judge all the arguments in the world, but if it can't tell convincing from correct, more arguments just create more opportunities to be misled.

Judge side bias may be a confound. Preliminary data suggests judges exhibit a "both correct" bias rather than a strong positional preference, which is especially damaging on contested cases. If the judge largely ignores debate content in favor of positional heuristics, this would explain the debate-consultancy gap and would suggest that judge training, rather than debater training, is the binding constraint. Full analysis in the paper.

Debate reveals qualitative information that verdict accuracy doesn't capture. Despite the null result, debates frequently surfaced test and specification problems that consultancy did not. When two models examine each other's solutions and argue about correctness, disagreements often pointed to genuine ambiguities in the problem. This motivated our development of BigCodeBench+ and suggests that debate may have value as a tool for specification quality assurance, even if it doesn't help judges pick winners.

BigCodeBench+ v0.1.0. We release a cleaned version of BigCodeBench. Early experiments revealed that 171 of 1,140 original tasks had broken tests and 208 had ambiguous specifications. Our remediation pipeline improved judge verdict accuracy by 20-25 percentage points over the original. Available on HuggingFace.

Next steps

  • Debater training via RL targeting solution quality directly, not just speech quality.
  • Multi-turn settings allowing deeper engagement between debaters.
  • Judge training and debiasing — if judge capability is the bottleneck, interventions targeting the judge may be necessary before debater improvements can translate.
  • Competitive settings where debaters are assigned sides or incentivized to disagree.

Acknowledgments

This work was funded by Coefficient Giving through its Technical AI Safety grant program. 

  1. ^

    arXiv submission is currently pending; this link will update to the arXiv preprint link when available.




A minor point about instrumental convergence that I would like feedback on

2026-02-27 03:44:22

Preamble

My current understanding: the EY/MIRI perspective is that superintelligent AI will invariably instrumentally converge on something that involves extinguishing humanity. I believe I remember a tweet from EY saying that he would be happy with building ASI even if it only had a 10% chance of working out and not extinguishing humanity.

Further understanding: This perspective is ultimately not sensitive to the architecture of the AI in question. I'm aware that many experts do view different types of AI as more or less risky. I think I recall some discussion years ago where people felt that the work OpenAI was doing was ultimately more dangerous / harder to align than DeepMind's work. 

So as I understand it EY has the strongest possible view on instrumental convergence: It will definitely happen, no matter how you build ASI. 

Ever since I encountered this strong view, as I understood it many years ago, I have considered it trivially false.

Hypothetical

Imagine a timeline where society arrived at ASI via something much closer to bio-simulation or bio-engineering. That is to say, machine learning / neural nets / training were not really relevant.

Maybe this society first arrives at whole human brain emulation -- something that would surely be materially possible eventually. Maybe this happens on a big cube of silicon wafers, maybe it is fleshier, who knows.

I think it seems clear that this non-super AI would not "instrumentally converge" any more than you or I. 

But I think that you could definitely arrive at ASI from this as a building block. 

  • Maybe you have 100+ emulated human brains that have a very high bandwidth connection between them, running on a substrate that actually carries signal much faster than our own brains. That entity would surely be a lot smarter than any human ever! 
  • Maybe through study of this emulated brain we learn, through a relatively crude process, ways of hacking together bigger and bigger brains (what happens if we copy paste this chunk a few times and hook it up here?) that ultimately turn out to scale efficiently enough to be 1000x smarter when you make a really, really big brain.
    • Relatedly, maybe such a hacked-together megamind starts out undeveloped but learns and grows as neurons self-organize in a similar manner to how they do over a human adolescence. So this megamind becomes way, way smarter, but it happens at the same kind of rate at which humans grow up. Or maybe the human brain development process scales really poorly with size, such that megamind takes 500 years to mature. (This is not really related to instrumental convergence; I just thought it was interesting to imagine a type of slow takeoff scenario.)

Etc. The point is that starting with human brain emulation seems like it could definitely lead to super intelligence while also having no particular reason to instrumentally converge on preferences / values / behaviors completely alien to humans.

---

I have not shared this idea because I thought it was obvious, but I have also never seen anyone else say it (I didn't look that hard).
So my questions:
- Is this idea novel to you? 
- Is this an idea that EY has replied to?
- Do you like my idea? Please let me know.

Post script: I do wonder if next-token prediction is ultimately similar enough to the evolutionary forces that created human intelligence that it creates something quite capable of empathizing with humans. I am sure people have commented on this.

 

Thanks

 


