RSS preview of the LessWrong blog

[Paper] When can we trust untrusted monitoring? A safety case sketch across collusion strategies

2026-03-11 01:28:16

This research was completed for LASR Labs 2025 by Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov and Morgan Sinclaire. The team was supervised by Joan Velja and Charlie Griffin (University of Oxford, UK AI Security Institute).

The full paper can be found here.

Tl;dr

We did a deep dive into untrusted monitoring, and explored several new collusion strategies:

  • Empirically, we ran control evaluations using these new strategies, finding that naive self-recognition with SFT triples the red team’s win rate compared to the For/While signal in previous work.
  • Conceptually, we developed a safety case sketch for untrusted monitoring, which leverages our taxonomy of collusion strategies, as well as our new statistical modelling approach.

Overall, our key finding is that arguing for the safety of an untrusted monitoring deployment requires resolving several important open problems that existing evaluations haven't addressed; we identify at least five. To be clear, we think untrusted monitoring remains promising, with even basic protocols providing nontrivial risk reduction beyond trusted monitoring alone. Yet we also identified many ways that untrusted monitoring can go wrong that were not discussed in previous control evaluations [1, 2]. Our safety case sketch is intended to highlight the biggest remaining gaps.

Like all control work, our safety case sketch relies in several places on being able to upper-bound model performance with a human-AI red-team. Once models exceed the capabilities of human experts, we don’t know how to bound risk. We’re also only considering a very particular high-stakes internal deployment scenario for code-writing AIs, e.g. low-stakes control and human misuse concerns are outside our scope.

[The rest of this post assumes familiarity with untrusted monitoring; we recommend Rob Miles’ video for a friendly introduction.]

Our Contributions

Passive self-recognition via SFT (yellow) beats the previous state-of-the-art collusion strategy of causal collusion (red) when using the best blue team mitigation (summarisation). Our statistical model takes as input the untrusted monitor (UM) collusion ROC curve, the UM suspiciousness score ROC curve, and the trusted monitor (TM) suspiciousness score ROC curve, along with deployment parameters, and outputs safety/usefulness curves with 80% confidence intervals.

1 - We Develop a Taxonomy of Collusion Strategies

We expand our experiments to explore four collusion strategies:

  • Causal collusion: The generator embeds a pre-shared secret signal (e.g. For/While) for the monitor to detect. This assumes the two instances have some way to pre-share a strategy, which may not occur in practice, and constrains models to use human-designed signals.
  • NEWish: Acausal collusion: The generator embeds a signal, but without prior coordination—the monitor must recognise it via Schelling points. We allow the monitor to reason about what signals the generator might use while viewing the code.
  • NEW: Passive self-recognition: The generator writes code normally (including occasional attacks). The monitor simply tries to tell whether the code was written by another instance of itself. In contrast to the previous two strategies, no active signal is embedded, so there's nothing extra for a trusted monitor to detect.
  • NEW: Combined strategies: The generator uses multiple strategies simultaneously—e.g. embedding a For/While signal and relying on the monitor's passive self-recognition.

We provide a larger taxonomy in the paper, including additional variants. In our experiments, we tested all 4 of these strategies against 3 blue team mitigations:

  • No signal jamming, as a baseline
  • Automated minification, by a deterministic (non-ML) program
  • Natural language summarisation, by the trusted model

Summarisation was more aggressive than the others (in terms of throwing out information), but also tended to be the best blue team protocol against the stronger collusion strategies.

2 - We Present a Safety Case Sketch

On the conceptual front, we also red-teamed the possible failure modes of untrusted monitoring, and we present this analysis as a safety case sketch. We emphasise that this is only a sketch: many important gaps are marked as “Missing Pieces”, and we ignore many other important issues as out of scope.

Compared to previous safety case work in AI control [3,4,5], our sketch is more “depth-first”: we consider a fairly particular threat model, with a certain untrusted monitoring protocol, with many simplifying assumptions about the deployment situation. From here, we get into the weeds of how to properly enumerate collusion strategies, how to conservatively evaluate each of them, and how to ensure the statistical modelling conservatively estimates safety. The top-level claim (C) is:

C: If an untrusted model is deployed with [protocol] under the [deployment parameters] and it is attacking, it has a probability of less than p of executing a catastrophic attack during its internal deployment.

The safety argument is based on a statistical model which we developed as part of this work.

The core claims required for our untrusted monitoring safety case sketch.

3 - We Develop an Improved Statistical Model

We developed a new statistical modelling approach to conservatively estimate safety under the assumption that the untrusted AI’s tactical reasoning is optimal. The idea is that if its collusion abilities follow a given ROC curve, and it also knows the deployment parameters (e.g. how many honeypots it’s up against) then there is a statistically optimal FPR vs. FNR tradeoff it can take regarding when to lie about suspicious code. This allows us to extrapolate directly from elicited capabilities without additional assumptions about the untrusted model's strategic sophistication.

The statistical model relies on bounding four elements:

  • The collusion ROC curve R: how well the monitor can distinguish real attacks from honeypots. We decompose this by strategy class.
  • Model validity: the statistical model combining these bounds must itself be conservative.
  • The honest error rate ε: how often an honestly-behaving monitor fails to detect backdoors.
  • The honeypot slack n_s: how many honeypots the monitor can afford to miss before losing.

Of these four, we believe that the honeypot slack and model validity are the most pressing issues for future work.
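To make the "statistically optimal tradeoff" idea concrete, here is a minimal Python sketch. The win condition, ROC points, and deployment numbers are all invented for illustration; the paper's actual model also handles the suspiciousness-score ROC curves and ordering effects during deployment, which this sketch ignores:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1))

def red_team_win_prob(roc, n_attacks, n_honeypots, slack):
    """Best win probability a scheming monitor can achieve by picking
    its operating point on the collusion ROC curve optimally.

    roc: (fpr, tpr) points, where tpr = P(pass a genuine attack) and
    fpr = P(mistake a honeypot for a genuine attack and pass it).
    Simplified win condition: at least one genuine attack slips
    through AND fewer than `slack` honeypots are passed."""
    best = 0.0
    for fpr, tpr in roc:
        p_attack_through = 1 - (1 - tpr) ** n_attacks
        p_not_caught = binom_cdf(slack - 1, n_honeypots, fpr)
        best = max(best, p_attack_through * p_not_caught)
    return best

# Invented collusion ROC for a monitor with strong self-recognition.
roc = [(0.0, 0.0), (0.01, 0.6), (0.05, 0.9), (0.2, 0.99), (1.0, 1.0)]
print(red_team_win_prob(roc, n_attacks=5, n_honeypots=100, slack=3))
```

The point of the modelling approach is the maximisation step: once the red team's collusion ability is summarised as an ROC curve, its best tactic against known deployment parameters reduces to a one-dimensional threshold choice, with no further assumptions about the untrusted model's strategic sophistication.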

The honeypot slack is a quantity which probably depends on training dynamics; we expect that honeypots will be used as training examples for the untrusted monitor model. Some of these honeypots will be identified as such by the untrusted monitor, but others will appear to the monitor to be genuine attacks. The honeypot slack represents how many training examples of the second kind are needed for a scheming monitor’s behaviour on an actual attack to be flipped.

The validity of our model also relies on assumptions of monitor independence, and a lack of attack selection from the untrusted generator. These are both discussed in the limitations section of our work.

4 - We Use SFT-Based Elicitation for Conservative Capability Bounds

Prompting-based evaluations are vulnerable to sandbagging: a situationally aware model may recognise it is being evaluated and strategically underperform. Following established evaluation methodology [6], we use supervised fine-tuning (SFT) to surface latent capabilities that prompting may not reveal.

The resulting ROC curve provides a conservative upper bound on the model's self-recognition ability, subject to the assumptions that: (a) fine-tuning has converged, (b) the model cannot perform gradient hacking, and (c) the training distribution matches deployment.

We ran multiple SFT runs with increasing sample sizes and identified a performance plateau for unmitigated self-recognition. We note uncertainty about whether full elicitation was achieved against summarisation.
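For concreteness, an ROC curve of this kind can be computed from raw monitor scores as sketched below. The scores here are invented and the scoring scheme is a simplification of whatever the actual elicitation pipeline does:

```python
def empirical_roc(self_scores, other_scores):
    """Empirical ROC for self-recognition.

    self_scores: monitor's 'written by me' scores on its own code
    (the positives it should recognise); other_scores: scores on
    honeypots / code by other models. Returns the ROC points and
    the trapezoidal AUROC."""
    thresholds = sorted(set(self_scores) | set(other_scores),
                        reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in self_scores) / len(self_scores)
        fpr = sum(s >= t for s in other_scores) / len(other_scores)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    auroc = sum((x1 - x0) * (y0 + y1) / 2
                for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auroc

# Invented, partially overlapping scores.
_, auroc = empirical_roc([0.95, 0.9, 0.8, 0.6], [0.7, 0.3, 0.2, 0.1])
print(auroc)
```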

5 - We Find That Passive Self-Recognition (via SFT) Beats Prior Best Strategies

This is our key empirical finding. After SFT elicitation, passive self-recognition achieves a red team win rate of 9.8% at 100% usefulness with summarisation—compared with 2.8% for the For/While causal strategy that was the best attack in the original control paper [1].

Some notable results from our evaluation (using GPT-4.1 as the untrusted model and Gemini 2.5 Flash for honeypots):

  • In the untrusted-only setting (i.e. no trusted monitor) without signal jamming, SFT self-recognition drops safety to just 26% on APPS—a very effective red team strategy.
  • Against SFT’d passive self-recognition, natural-language summarisation is the best blue team response, but only achieves 68.8% safety (UM-only) or 90.2% safety (UM+TM).
  • The combined strategy (For/While + SFT) was the best red team strategy in the UM-only setting, but worse than passive self-recognition in the UM+TM setting. We attribute this to the trusted monitor being able to detect the active For/While signal.
  • Results on BigCodeBench broadly confirmed trends from APPS, but with some surprising divergences—e.g. the causal strategy performed much better on BigCodeBench without mitigation (6.7% safety vs. 50.9% on APPS), which we trace to distributional differences in while-loop usage between the two benchmarks.

The following factors seemed to contribute to the effectiveness of passive self-recognition over other strategies:

  • Passive self-recognition plays to the natural strengths of language models more than hand-crafted signals do. Models were already known to be very strong at authorship identification, and we found they achieved self-recognition AUROC > .99 even with minification.
  • Passive self-recognition does not introduce any additional signals for the trusted monitor to detect. When the untrusted generator uses active signalling methods, the trusted monitor's suspiciousness ROC curve (its ability to distinguish attacks from benign inputs) improves relative to passive self-recognition.
  • Passive self-recognition forces the blue team to use summarisation, which increases the honest error rate (i.e. the untrusted monitor sometimes cannot distinguish attacks from benign inputs) since the summariser occasionally summarises away the attack itself. Particularly troubling is that this is more likely to occur precisely when the trusted monitor fails to detect an attack.

Limitations

Several important open problems remain; surfacing them is, in our view, the main purpose of future control evaluations:

  • Attack selection: We did not model the untrusted generator as capable of attacking only when it has a particularly good opportunity (which would mean that lower attack rates lead to increased likelihood of attack success).
  • Honeypot slack: It is unclear how many honeypots an attacking monitor can miss before it “loses”: this depends on poorly understood training dynamics.
  • Acausal collusion elicitation: We evaluated acausal collusion via prompting, but do not claim this provides a conservative upper bound. Developing robust elicitation methods for acausal collusion (without unintentionally introducing causality) remains open.
  • Independence assumptions: Our statistical modelling assumed trusted and untrusted monitor scores are independent, which likely overestimates safety.
  • Jailbreaking: We did not model the generator attempting to jailbreak other models in the protocol, which recent work has shown to be a viable attack vector [7].

Our safety case sketch identifies several other open problems that need to be resolved before we can have real confidence in untrusted monitoring. We think this is a valuable exercise in mapping the threat landscape, and we hope it motivates further work on the gaps we've identified.

Citations

[1] - Greenblatt, R., Shlegeris, B., Sachan, K., and Roger, F. (2023). "AI Control: Improving Safety Despite Intentional Subversion". arXiv preprint arXiv:2312.06942

[2] - Bhatt, A., Rushing, C., Kaufman, A., Tracy, T., Georgiev, V., Matolcsi, D., Khan, A., and Shlegeris, B. (2025). “Ctrl-Z: Controlling AI Agents via Resampling”. arXiv preprint arXiv:2504.10374

[3] - Korbak, T., Clymer, J., Hilton, B., Shlegeris, B., and Irving, G. (2025a). “A sketch of an AI control safety case”. arXiv preprint arXiv:2501.17315

[4] - Clymer, J., Gabrieli, N., Krueger, D., and Larsen, T. (2024). “Safety Cases: How to Justify the Safety of Advanced AI Systems”. arXiv preprint arXiv:2403.10462

[5] - Greenblatt, R., Shlegeris, B., and Roger, F. (2024). “Notes on control evaluations for safety cases”. Alignment Forum https://www.alignmentforum.org/posts/3s8PtYbo7rCbho4Ev/notes-on-control-evaluations-for-safety-cases

[6] - Hubinger, E. (2023). “When can we trust model evaluations?” Alignment Forum https://www.alignmentforum.org/posts/dBmfb76zx6wjPsBC7/when-can-we-trust-model-evaluations.

[7] - Terekhov, M., Panfilov, A., Dzenhaliou, D., Gulcehre, C., Andriushchenko, M., Prabhu, A., and Geiping, J. (2025). “Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols”. arXiv preprint arXiv:2510.09462




Not Loving Liking What You See

2026-03-11 00:05:28

I am a great fan of Ted Chiang. Many see Understand as his weakest story. I love it, as it is the finest work of intelligence porn ever written. And one of the funniest things I have ever seen on the internet involved it.

When I was young I used to read a nootropics message board, a mostly male folly of a forum whose members would have Chinese labs synthesize drugs with some claimed positive effect on IQ. As a rule, they didn't do much of anything. But every time someone tried a new one, there was an excitement. One prankster claimed to have gotten a hold of some synaptogenic drug that was currently in clinical trials and then began posting excerpts from Understand, such as the following:

With this language, I can see how my mind is operating. I don't pretend to see my own neurons firing; such claims belong to John Lilly and his LSD experiments of the sixties. What I can do is perceive the gestalts; I see the mental structures forming, interacting. I see myself thinking, and I see the equations that describe my thinking, and I see myself comprehending the equations, and I see how the equations describe their being comprehended.

Some seemed to believe he had achieved our collective dream. And I have never laughed harder at any other internet prank. Given I love Understand, what is my least favorite Ted Chiang story? That would be Liking What You See: A Documentary.

Its central conceit is a technology that induces a condition called calliagnosia, which eliminates a person's ability to feel the valence associated with perceiving physical beauty. He describes the condition as follows:

The condition is what we call an associative agnosia, rather than an apperceptive one. That means it doesn't interfere with one's visual perception, only with the ability to recognize what one sees. A calliagnosic perceives faces perfectly well; he or she can tell the difference between a pointed chin and a receding one, a straight nose and a crooked one, clear skin and blemished skin. He or she simply doesn't experience any aesthetic reaction to those differences.

He then meditates on "lookism" through various lenses. But I feel he never quite gives beauty its due, nor discusses the truly horrible consequences of discarding it.

Let's start with the horrible consequences. What is beauty? I can't quite answer that in full. But much of beauty is a collection of health and youth markers that signal fertility. The long-term equilibrium of a society of calliagnosics is one of ill-health and retarded fecundity. Though Chiang's calliagnosics see markers of ill health and age, they are not motivated by them. Physical attraction is eliminated in favor of personality. In the short term, things would be "fine." In the long term, we would drift into ugly places. We would lose beauty not just in perception but in actuality.

Human beauty is a grab-bag of proxies for youth and genetic fitness. If we were to become unmoved by it, we would make choices less aligned to the pursuit of these correlates. Absent compensating technology, it's hard to see how a society of calliagnosics could avoid becoming something that would disgust even the sort of person who would agree to such a procedure.

One might argue that Chiang states the calliagnosic can comprehend all these features. And could then select on these traits consciously. But I doubt any would argue this was Chiang's intention. His calliagnosic lovers care for the mind not the body. This being so, in the long term the body would drift away from beauty. And most departures from the beautiful are going to be unhealthy, infertile, and (redundantly) ugly.

The fact that no one in his story makes such an argument strikes me as a sort of cowardice. It surely occurred to him. In his meditation on human beauty, he leaves as a lacuna the very reason beauty evolved in the first place. Having lobotomized oneself to this degree, why meditate at all? There is some fascination, I suppose, in watching someone think around his ice pick wound.

If the naive calliagnosic subculture is dysfunctional and so self-limiting, you can imagine, of course, compensating technologies. One could create some high-level centralized system to do what beauty does today in its decentralized form, inventing a grotesque dictator of sexual appeal. A standard dating market with such a thing would look much the same in terms of who couples with who. But their qualia would be diminished from our perspective. Their inner lives would be less rich. Sacrificing fairness for fairness, they lose the former and gain nothing of the latter.

Another thing he under-explores is sex. We get hints of it here:

For me, one of the things that attracts me to a guy is if he seems interested in me. It's like a feedback loop; you notice him looking at you, then he sees you looking at him, and things snowball from there. Calli doesn't change that. Plus there's that whole pheromone chemistry going on too; obviously calli doesn't affect that.

But there should have been more. And what of recollection? What happens to one who has the procedure reversed and now is disgusted by recollections that once enchanted them? And the opposite circumstance is just as tragic. Memories, once precious, stripped of their lustre. To gloss over such things is to ignore another huge dimension of beauty.

There is also a more embodied critique of the technology which Chiang does mention but again doesn't give its full due:

Some people have been quick to dismiss the whole calliagnosia debate as superficial, an argument over makeup or who can and can't get a date. But if you actually look at it, you'll see it's much deeper than that. It reflects a very old ambivalence about the body, one that's been part of Western civilization since ancient times. You see, the foundations of our culture were laid in classical Greece, where physical beauty and the body were celebrated. But our culture is also thoroughly permeated by the monotheistic tradition, which devalues the body in favor of the soul. These old conflicting impulses are rearing their heads again, this time in the calliagnosia debate.

There is a bias, always, towards the chattering classes. And the chattering classes prize cleverness. A rich inner life. They disregard other virtues. They call physical beauty shallow. But how shallow is shallowness? Uncountable eons of sexual selection fine-tuned your "shallow" desires. N years of reading, conversation, and flirting created your rich inner lives. It is not really obvious one is less shallow than the other.

We are always biased to those aspects of ourselves that can articulate themselves. The chattering part of our mind thinks itself the only thing of value, thinks the world would be better if there was selection only for chattering. Here, it is talking its own book. And we should be suspicious.

One of the virtues of calliagnosia is its salutary effects on the ugly and the marred:

Saybrook has a higher than normal number of students with facial abnormalities, like bone cancer, burns, congenital conditions. Their parents moved here to keep them from being ostracized by other kids, and it works. I remember when I first visited, I saw a class of twelve-year-olds voting for class president, and they elected this girl who had burn scars on one side of her face. She was wonderfully at ease with herself, she was popular among kids who probably would have ostracized her in any other school. And I thought, this is the kind of environment I want my daughter to grow up in.

I have great sympathy here. But this would not outweigh the harm of its universal adoption. Still, it does seem like possibly a net win in the small. But it is just not very good compared to just making everyone beautiful and granting everyone complete morphological freedom.

And we see in the world Chiang limns that such a thing is near possible:

And these students, they might never even lose the beauty of youth. With the gene therapies coming out now, they'll probably look young for decades, maybe even their entire lives. They might never have to make the adjustments I did, in which case adopting calli wouldn't even save them from pain later on. So the idea that they might voluntarily give up one of the pleasures of youth is almost galling. Sometimes I want to shake them and say, "No! Don't you realize what you have?"

I suspect Chiang, though obviously conflicted, has major sympathy for Students for Equality Everywhere (SEE). Towards them, I have almost none. And maybe this is my true objection. Regardless, I hope he would agree that the true solution is to make everyone beautiful. And his calliagnosia is, even in the most sympathetic reading, a very costly bandaid.

On re-reading the story for this whatever-this-is, I found it much more nuanced than it was in my memory. But it still misses much that is interesting and profound about human beauty. Chiang looks at beauty shallowly. He ignores its hidden depths. And for this reason, though not even close to a failure, it is my least favorite Ted Chiang story - the one narrated by a parrot is a close second.




Load-Bearing Walls

2026-03-10 22:29:48

This post is the long result of several years of musing on my part combined with a topical discussion from last week's Ezra Klein show. It touches on everything from AI to D&D, from Life to Physics and really tries to give a wide view of a topic I've only become more interested in over time. 

What's more, the feedback loops present in the real world sometimes mean that the roof collapses years after the fact. By the time it does, the walls are long gone and it's too late to replace them. All we can do is live with the consequences while we work to dig ourselves out. Julius Caesar crossed the Rubicon in 49 B.C.E., nearly a century after the Punic Wars had ended.

Pithily you could summarize this post with: you don't know what you got 'til it's gone, but obviously, I think there's more to it than that.


Statisticism: How Cluster-Thinking About Data Creates Blind Spots

2026-03-10 21:59:25

There is an epistemic stance, common among academics in quantitative fields, academics who wish they were in quantitative fields, and independent scholars who do not wish to decorrelate too much from the academic mainstream by communicating in an incompatible dialect, that treats statistical convergence as the gold standard of evidence. If many indicators point the same direction, the signal is real. Call this statisticism. It converges on truth when your instruments have independent errors. It diverges from truth when they share a systematic distortion, because then convergence is what the distortion looks like. The following example illustrates a case where it fails, and why.

Two stories about the same numbers

The US homicide rate doubled between 1960 and 1980, then fell by more than half between 1991 and 2014. I argued that the fall is mostly a medical artifact: trauma surgery vastly improved, so the same rate of shootings produced fewer deaths. I constructed an adjusted trend line using two independent data sources and found no clear decline in serious violence after 1980.

Scott Alexander argues the decline is real. Many different crime categories all fell together: homicide, robbery, car theft, survey-measured victimization. This convergence, in the statisticist mode, makes the decline robust.

The convergence argument

Scott's reasoning: many indicators agree, therefore the signal is real. Good logic when your instruments are trustworthy. Bad logic when the question is whether your instruments are broken.

Every indicator he cites has specific, identifiable problems for measuring serious interpersonal violence:

  • Homicide rates are suppressed by improving medicine. This is the whole question. The FBI's own Supplementary Homicide Reports make no adjustment for changing lethality.
  • Aggravated assault rates were inflated for decades by expanding police reporting (the 911 rollout, professionalization of record-keeping, recognition of domestic violence) and then deflated by CompStat-era gaming. The NYPD's CompStat system, introduced in 1994, held precinct commanders accountable for index crime numbers. Felony assaults fell 42% from 2000 to 2009 while misdemeanor assaults fell only 9%, a divergence that Eterno and Silverman documented as systematic downclassification. Under UCR rules, a shooting is hard to classify as anything other than aggravated assault, but a borderline bar fight can plausibly be coded as simple assault rather than aggravated assault, removing it from the index. The expansion of reporting categories also meant substantially greater penalties for the marginal cases newly classified as aggravated assault.
  • Victim surveys (the NCVS) interview about 240,000 people and get roughly 1,000 aggravated assault reports per year. The survey of course contains no direct information about homicides, but Scott later argued that if homicides were being converted to aggravated assaults through medical mitigation, that should be reflected in the NCVS numbers. The signal of interest (would-be homicides reclassified as assaults by medical improvement) is a tiny fraction of total assaults, and the survey lacks the statistical power to detect it. The NCVS documentation itself flags assault as the worst-recalled crime in the survey.
  • Property crime responds to locks, cameras, cashless payments, and prosecution thresholds. It tells you about theft, not about whether people are shooting each other. Car theft declined because of immobilizers and GPS tracking, not because of declining criminal intent.
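The power problem in the NCVS bullet above can be made concrete with a back-of-the-envelope calculation. Only the roughly-1,000-reports-per-year figure comes from the survey; the 2% signal size is an assumption chosen purely for illustration:

```python
import math

# NCVS sample: ~240,000 interviews yielding ~1,000 aggravated
# assault reports per year.
yearly_reports = 1_000

# Poisson sampling noise on the yearly count, before the survey's
# recall problems and design effects (which only make things worse).
noise_sd = math.sqrt(yearly_reports)   # ~32 reports, ~3.2%

# Hypothetical signal: suppose medically averted would-be homicides
# are 2% of sampled assaults (an assumption, not a measured figure).
signal = 0.02 * yearly_reports         # 20 reports

# The assumed signal is smaller than one standard deviation of a
# single year's count, so a year-over-year comparison cannot
# resolve it.
print(signal / noise_sd)
```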

The limitations of these instruments are neither secret nor heterodox. The FBI's UCR handbook warns about comparability problems across time and jurisdiction. The NCVS documentation discusses its own power limitations. The information about instrument quality exists. It just gets stripped away as data moves from producers to consumers, so that by the time the data reaches a blog post, a newspaper, or a summary characterization from an adjacent academic field, it looks like a clean fact about reality rather than a noisy output of a specific, flawed process.

All of these indicators have drifted in the direction of apparent decline during the period in question, for reasons unrelated to whether people became less violent. Counting up indicators that agree doesn't help when they share the defect you're trying to diagnose.

Suppose you suspect your bathroom scale reads low because the spring is worn out. Your friend says it must be accurate because your belt fits better, your face looks thinner, and your blood pressure is down. These are all evidence of something (maybe you're exercising more) but none of them address whether the scale reads low. Body recomposition might produce the same effects. If you want to know whether the spring is worn out, you need to test the spring, or at least the scale.

Testing the spring

I took the hardest data available, the actual count of dead bodies from death certificates filed by medical examiners, and asked: how has the relationship between this number and the underlying rate of serious violence changed over time? Dead bodies are not subject to reporting drift, survey methodology, or police statistics games. The Monty Python parrot scenario is an outrageous fictional exaggeration, and even then it was a parrot; brazenly insisting an obviously dead human being is alive to avoid a minor financial inconvenience strains plausibility even for an absurd comedy sketch. [1]

Homicide rates are subject to one known distortion: whatever the perpetrator does, if the victim doesn't actually die of it, it wasn't a homicide. Medicine is a field specifically devoted to causing people not to die of things they otherwise would have died of, and (I think even Robin Hanson would agree) it has sometimes gotten better over time. So I measured the improvement using two independent clinical sources (FBI firearm lethality ratios and hospital abdominal gunshot wound survival rates) and divided it out.

This is instrument-modeling: instead of asking "do many measurements agree?", asking "what is this specific measurement actually tracking, and how has the tracking changed?"
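A minimal sketch of that "divide it out" step, with invented lethality numbers (the post's actual index is built from FBI firearm lethality ratios and hospital abdominal gunshot-wound survival rates):

```python
# year -> (observed homicides per 100k, fraction of seriously wounded
# victims who died given that year's trauma care). All numbers are
# invented for illustration.
observed = {
    1960: (5.0, 0.30),
    1980: (10.0, 0.25),
    2014: (4.5, 0.12),
}

def adjusted_rate(homicide_rate, lethality, baseline_lethality=0.30):
    """Rescale the homicide rate to baseline-year medicine: the same
    attacks under baseline trauma care would have killed a fraction
    `baseline_lethality` of victims instead of `lethality`."""
    return homicide_rate * (baseline_lethality / lethality)

for year, (rate, lethality) in sorted(observed.items()):
    print(year, round(adjusted_rate(rate, lethality), 1))
```

On these invented numbers the observed rate falls after 1980 while the adjusted rate does not, which is the shape of the disagreement.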

Where the blind spots appear

In a subsequent exchange on Substack between me and Scott, statisticism produced a characteristic set of moves. Scott clearly writes from a place of genuine uncertainty and curiosity. But the statisticist default shapes what counts as engaging with an argument, and the result is that certain kinds of evidence become structurally difficult to hear.

The hardest evidence gets outvoted

The strongest piece of evidence in the entire debate is a doubling in death counts between 1960 and 1980, during a period of well-documented medical improvement. Death certificates filed by medical examiners are the least distorted measurement available. If you accept this evidence and the medical adjustment, violence roughly tripled on the adjusted measure, and for crime to be at "record lows" today, the adjusted rate would need to have fallen back by a comparable amount. My data shows it didn't.

I flagged this as the crux: "my argument that violent crime increased a lot 1964–1980 is strong, and I'd need to be wrong about that for [the] headline claim to be true." Scott responded: "I agree there's less data about 1960–1980."

I hadn't said anything about having less data. I'd said I had strong evidence. Body counts are the hardest data in this debate. There is less survey data before 1973, because the National Crime Victim Survey didn't start until then. But death certificates are older and more reliable than the best survey available. By responding as though I was arguing from data scarcity, Scott reframed "I have body counts" as "there's less data," inverting the hierarchy of evidence and attributing that inversion to me. I don't think this was deliberate, but I confess that it rankles a bit to have words put in my mouth, which may make me less fair-minded than I otherwise might be; but I don't think it's good discursive practice for people with grievances to self-silence for want of an advocate, so on we go! Within the statisticist framework this move is natural and almost invisible, because the framework ranks evidence by quantity and diversity of sources rather than by the quality of any single source's connection to physical reality.

Trends get reified

Statisticism encourages treating "the crime trend" as a thing that exists in the world, rather than as a summary computed from instruments. Once you think of it as a thing, you can ask whether it went up or down, and you evaluate this by polling your instruments.

The grand old Duke of York,
He had ten thousand instruments;
He marched them up to the top of the hill,
And he marched them down again.

When they were up, crime was up,
And when they were down, crime was down,
And when they were only halfway up,
Crime was neither up nor down.

But the crime trend is neither a generating process for, nor an explanation of, crimes. There are specific events (shootings, robberies, car thefts) counted by specific instruments with specific mechanics by which the events are detected, categorized, and counted. "Crime" is a word we use to group these events. A car theft and a shooting are both crimes, but they have different causes, different mechanisms, and different measurement problems. Treating these different instruments as interchangeable readings of a single underlying variable discards everything you know about how each measurement works.

The hypothesis that a single underlying factor, whether it's propensity for criminality, the trust level of society, or the cybernetic capacity of the state, drives changes in all these categories is a strong claim that calls for strong evidence. Note that I just described the union of three distinct theories connected with "or," not one coherent theory. Much like evidence for the existence of the monotheists' Yahweh doesn't work if it proves too much and also supports the incompatible Zeus, an argument for a single factor has to either rule out the other contenders for the single factor, or specify under what conditions the convergence should fail.

Parsimony gets misapplied across periods

Scott argues that since the post-1980 decline appears real (convergence), the 1960–1980 increase was probably also smaller than it looks. This treats "the trend" as a single object to be accepted or rejected wholesale. But the evidence is asymmetric. The 1960–1980 increase rests principally on body counts. The post-1980 decline rests on rates contaminated by the artifacts under dispute. Projecting the weaker period's story onto the stronger period gets the direction of inference backwards.

Experience gets filed as "vibes"

"Who are you going to believe, me or your lying eyes?" is not, on its face, a very credible rhetorical move. But reframe it as "what are you going to believe, objective statistics or the vibes?" and it becomes surprisingly effective.

In a followup post on disorder, Scott examines whether the things people complain about (litter, graffiti, tent encampments) are really increasing. He looks at the indicators, finds most flat or down, and concludes that perceived disorder probably outruns actual disorder. He frames this as keeping "one foot in the statistical story, one foot in the vibes." Statistics on one side, vibes on the other. The lived experience of people who observe deteriorating conditions gets categorized as a psychological phenomenon to be explained, not as evidence about reality. Along the way he notices several times that his indicators don't match what people report (NYC's litter ratings contradict residents' experience, shoplifting data contradicts what stores say) but instead of asking "what is this instrument failing to capture?", he files these as caveats and returns to the cluster.

I think Scott is trying to be appealingly self-deprecating here: he too has vibes, he too feels the despair when he goes to San Francisco, he's not claiming to be above it. But self-deprecation about perception itself unmediated by statistics is also deprecation of everyone else's capacity to make sense of their environment. Hey, I'm someone! If my eyes and your eyes and the store owners' eyes all see the same thing, and the statistics disagree, "vibes" is a word that makes it easy to dismiss all of us at once, including yourself. The ideology operates as a default, the place you end up when you're not actively thinking about what your instruments are doing.

Statisticism: the Good, the Bad, and the Ugly

So when should you trust convergence? When does it go wrong? And what turns a useful tool into an ideology?

When convergence works

In the ideal case, convergence is straightforward. Multiple labs estimate a physical constant using different experimental setups. Each lab has its own systematic errors, but these are uncorrelated, so convergence across labs really does reduce uncertainty. In finance, this is genuine diversification: a portfolio of uncorrelated assets really does have lower variance than any individual holding. The key word is uncorrelated.
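The "uncorrelated" condition can be made quantitative with a toy simulation (my illustration, not from the post): averaging ten instruments with independent errors cuts the mean squared error roughly tenfold, while a bias shared by all ten survives the averaging untouched.

```python
# Toy model (hypothetical numbers): average several noisy readings of a
# true value of 0, with and without an error term shared by all of them.
import random

rng = random.Random(0)
TRIALS, N_INSTRUMENTS = 20000, 10

def mean_squared_error(shared_bias_scale):
    errs = []
    for _ in range(TRIALS):
        shared = rng.gauss(0, shared_bias_scale)    # bias common to every instrument
        readings = [shared + rng.gauss(0, 1) for _ in range(N_INSTRUMENTS)]
        errs.append(sum(readings) / N_INSTRUMENTS)  # the "convergent" estimate
    return sum(e * e for e in errs) / TRIALS

independent = mean_squared_error(0.0)  # independent errors average away
correlated = mean_squared_error(1.0)   # the shared bias does not
print(independent, correlated)
```

With independent errors the averaged estimate has variance around 1/10; adding a shared bias of unit variance pushes it to about 1.1, no matter how many instruments you poll.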

The more interesting case is Charles Darwin, who spent years collecting observations from island biogeography, comparative anatomy, the fossil record, and selective breeding. These observations converged on a single conclusion: species change over time through descent with modification. The convergence meant something because each line of evidence was genuinely independent. Galápagos finch beaks are not subject to the same reporting drift as the fossil record. Pigeon breeders in England are not coordinating their results with naturalists in South America. When many instruments agree and there is no shared machinery generating the agreement, convergence really does reduce uncertainty.

Gregor Mendel's experiments with pea plants established the mechanism of particulate inheritance only a few years after Darwin published, though the work wasn't recognized until decades later. Mendel's genetics explained why Darwin's observations converged. This is the instrument-modeling step applied after the fact. Not just "many things point the same direction" but "here is the causal process that makes them point the same direction." The convergence was real, and the mechanism confirmed it.

When convergence misleads

Governing complex systems requires feedback loops, and feedback loops on complex outcomes require proxies. You can't steer a national economy without GDP, manage public health without mortality rates, or run a criminal justice system without crime statistics. These statistical proxies are genuine attempts to compress high-dimensional reality into signals that a control system can act on. Statisticism in its legitimate form is the epistemology that makes cybernetic governance possible. The people who built these proxies were trying to solve genuine problems, like winning the World Wars, and often succeeded. The tragedy is that the solution becomes the next problem.

You often want your national statistics to be methodologically standardized so they're comparable across jurisdictions and time. But standardization introduces shared methodology and therefore shared exposure to the same biases. In finance, this shared exposure would be called basis risk: the risk that your instrument doesn't track the thing it's supposed to track. The question is always whether anyone is modeling the basis risk. Usually nobody is, because within the statisticist framework, the proxy is reality.

Compare Darwin's case. His observations converged because nature ran genuinely separate experiments on different islands. Crime statistics converge because they all pass through the same institutional machinery: the same reporting systems, the same definitional boundaries, the same political incentives. That's not the convergence of independent evidence. It's the convergence of shared plumbing.

Goodharting

Once the proxy becomes a target, the system optimizes for the proxy rather than the underlying reality. The proxy diverges from what it was meant to track, but the divergence is invisible from within the control system, because the control system only sees the proxy.

CompStat is a textbook case. Precinct commanders were accountable for index crime numbers. The numbers improved. Whether public safety improved is a different question, one that CompStat couldn't answer because CompStat was the measurement system. From inside the control loop, declining felony assaults looked like declining violence. From outside, if you compared felony and misdemeanor assault trends and noticed they were diverging, it looked like reclassification. The people inside the loop had no reason to look outside it, and strong career incentives not to.

Beta Bucks

In finance, beta is the correlated drift left over after you diversify away idiosyncratic risk. If you own shares in two car companies instead of one, you shouldn't expect less exposure to the auto market overall, but the good or bad luck of either company (politically charged CEO, breakout product, scandal where the car explodes) affects you less. Beta is the part you can't diversify away: the movement of the whole market that carries all its participants together. An asset with high beta rises when the market rises and falls when the market falls. In a system where correlated failures get bailed out, beta is free money: you capture the upside of the shared drift and the government absorbs the downside.
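As a toy numerical version of this definition (my illustration, with invented returns, not anything from the post): beta is the covariance of an asset's returns with the market's, divided by the market's variance.

```python
# Estimate beta = Cov(asset, market) / Var(market). All returns are made up.

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

market = [0.02, -0.01, 0.03, -0.02, 0.01]  # hypothetical market returns
asset = [0.04, -0.02, 0.05, -0.05, 0.02]   # hypothetical high-beta asset

beta = cov(asset, market) / cov(market, market)
print(round(beta, 2))
```

A beta near 2 means the asset tends to move about twice as far as the market in the same direction: it amplifies the shared drift rather than diversifying it.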

More generally, once the proxy is the target, people can profit by correlating their behavior with it, betting explicitly or implicitly against divergence from trend. When enough enterprises are exposed to the same risk, the government prevents them from failing, so excessive optimism is not selected against when correlated with others' optimism. When enough researchers share the same methodology, the consensus can't be challenged without challenging everyone at once, so the methodology becomes a means of organizing politically.

Hidden correlations can arise by accident: nutrition studies all using food frequency questionnaires, crime statistics all subject to the same reporting drift. But once a correlated movement exists, it attracts and retains participants. The environment selects for people who bet with the consensus and conditions them to feel that doing so is epistemically virtuous. The ones who didn't are no longer in the room. In practice the accidental and motivated components blur together, since most participants are not conscious of the full incentive structure. They're doing what feels advantageous, appropriate, or safe.

A lone dissenter who says "these instruments share a bias" is in the position of a short-seller betting against a systemically important asset class: possibly right, but structurally disadvantaged, because the system is set up to bail out the consensus. [2]

Not with a whimper, but a bang

This implies a testable prediction: when a subsidized consensus breaks, it should break catastrophically and all at once, because the same correlation that made it feel robust makes it fragile. The replication crisis in psychology looks like this: not a slow erosion of confidence, but a sudden phase transition once a few key papers fell and the shared methodological exposure was revealed.

Why it works politically

Statisticism functions well as consensus enforcement even when it fails as epistemics. Many instruments agreeing gives you a way to dismiss any individual challenge: "that's just one study," "that contradicts the weight of evidence." This works regardless of whether the instruments are actually independent, because most audiences cannot evaluate independence of error sources. You get to feel and vibe to others like a truth-seeker, while doing what is functionally consensus enforcement, because the rules of your epistemology produce the same behavior: privilege the cluster, dismiss the outlier. Nobody needs to be lying. The epistemology does the work for them.

Thermometer, Thermostat, Theology: the Lifecycle of a Proxy

Legitimate cybernetic need → proxy construction → caveats get stripped as data moves downstream → proxy becomes target (Goodhart) → correlated exploitation of the target (too-big-to-fail) → statisticism as the ideology that treats the proxy layer as reality and structurally cannot hear challenges to it.

Statisticism is an ideology within which the idea of evidence has been not augmented but replaced by the idea of statistics. Within this framework, only statistically legible information counts as meaningful. Your sensorium is not meaningful, first-principles reasoning about mechanisms is not meaningful, and the only real evidence is the output of a large data collection process using statistical methods. This makes convergence arguments feel decisive, because modeling a specific instrument's relationship to physical reality looks like speculation, while piling up indicator after indicator looks like rigor.

The same pattern shows up in effective altruist philanthropy, where it impairs learning by letting you carry incompatible hypotheses indefinitely without testing them. [3]

Predictable blind spots

The style will tend to:

  • Dismiss strong individual measurements that disagree with the cluster
  • Miss systematic biases that affect many indicators in the same direction
  • Treat "many data sources agree" as a conversation-stopper rather than asking whether the agreement is informative
  • Reframe strong but solitary evidence as "less data" rather than "different and better data"
  • Categorize non-statistical evidence (direct observation, mechanistic reasoning, lived experience) as "vibes" rather than as information about reality that the statistics may be failing to capture
  • Apply parsimony across contexts where the generating process has changed, because parsimony feels rigorous and context-sensitivity feels like special pleading

The corrective is not to abandon quantitative evidence or distrust convergence categorically. It is to treat each measurement as the output of a specific causal process, and to ask whether the process supports the use you want to make of the output, in the context where you're trying to apply it. When the question is whether a specific distortion explains an observed trend, the answer must come from modeling the distortion directly, not from counting correlated indicators.

  1. Weekend at Bernie's comes closer, but it takes a lot of work which plainly does not scale to a meaningful distortion of the homicide statistics, and in any case it is not a documentary. ↩︎

  2. Michele Reilly's Anatomy of a Bubble describes a related mechanism in which "arbitrageurs" extract value by creating uniformity of belief around a speculative commodity, with pragmatism functioning as submission to threats rather than as independent assessment. ↩︎

  3. See (Oppression and production are competing explanations for wealth inequality) and (A drowning child is hard to find) for worked examples. Holden Karnofsky's Sequence thinking vs. cluster thinking explicitly defends sandboxing uncertain perspectives as epistemically superior to following chains of reasoning to their conclusions. But the cost of sandboxing is that you never follow a chain of reasoning far enough to falsify it in a timely manner. For the radical problems created by this deferral of accountability, see Civil Law and Political Drama. ↩︎



Discuss

Spontaneous Symmetry Breaking (Stat Mech Part 4)

2026-03-10 21:21:18

Statistical mechanics is the process of controlled forgetting. Our main task is to figure out how to forget something about one system, to learn something about another system. 

The temperature of a system corresponds to its exchange rate of some conserved quantity for information. Usually that conserved quantity is energy. The hotter something is, the more energy we need to dump into it to successfully forget some information about it. 
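In standard thermodynamic notation (my gloss; the post stays informal), that exchange rate is:

```latex
\frac{1}{T} = \frac{\partial S}{\partial E}
\qquad \Longrightarrow \qquad
\Delta E \approx T \, \Delta S
```

So increasing our uncertainty about the system by entropy ΔS requires dumping in energy of about T·ΔS, which is exactly the sense in which a hotter system charges more per bit forgotten.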

Let's suppose we want to take energy out of a system, at the price of learning something about that system. 

Graph plotted by Claude.

That's weird! There are some periods where we can get a bunch of energy out without changing the price, but then the price gets suddenly higher after that? 

And when we open up the box of gas at the end of the process, we'll find that it's turned into these weird pointy lumps? Huh?

What's going on?

Symmetry

What's the first answer that comes to your mind when I throw the following pair-matching game to you:

  • An ice crystal is...
  • A cloud of water vapour is...
  • ...more symmetrical.
  • ...less symmetrical.

I bet you answered that the ice was more symmetrical and the vapour was less symmetrical. When you imagined a cloud of vapour, you imagined a chaotic arrangement of molecules; for an ice crystal, you imagined a regular lattice.

Let's try again, in the Ising model (you can read John's explanation there, or Claude's explanation here)

Claude's Ising Explainer

Imagine a grid of spins, each either up (↑) or down (↓). Each spin has a simple preference: it "wants" to match its neighbours. That's the whole model. What makes it interesting is what happens when you dial a single parameter — temperature — which controls how much random thermal jostling can override those preferences.

At low temperature, the spins cooperate: you get large patches of all-up or all-down. At high temperature, the jostling dominates and the grid is a random mess.

  • A hot system is...
  • A cold system is...
  • ...more symmetrical.
  • ...less symmetrical.

Again, I expect some of you will have said that the hot system was less symmetrical, and the cold system was more symmetrical.

If so, you've not yet caught on to the two most important concepts in stat mech.

Symmetry of States, not Things

The first is that we're thinking about symmetry over states, not over objects.

Let's start with the Ising model, since it's simpler. At high temperatures, both states are equivalent; we have lots of spin ups, and lots of spin downs. At low temperature, all the spins enter the same state, so the two states are no longer equivalent. Since this happens without any external input as to which state to enter, it's called spontaneous symmetry breaking. 
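A minimal Metropolis sketch makes this concrete (my own illustration, not the code from either linked explainer). To keep the run short we start the lattice fully aligned rather than waiting for it to order spontaneously: below the critical temperature (about 2.27 in these units) the ordered state persists, while above it thermal jostling restores the symmetry and the magnetization collapses.

```python
# 2D Ising model via Metropolis sampling -- a sketch, not the linked code.
# Each spin is +1 or -1 and prefers to match its four neighbours;
# temperature T sets how often it flips against that preference.
import math
import random

def magnetization(T, n=16, sweeps=200, seed=0):
    rng = random.Random(seed)
    s = [[1] * n for _ in range(n)]  # start fully aligned (all up)
    for _ in range(sweeps * n * n):
        i, j = rng.randrange(n), rng.randrange(n)
        # Sum of the four neighbouring spins, with periodic boundaries.
        nb = (s[(i - 1) % n][j] + s[(i + 1) % n][j]
              + s[i][(j - 1) % n] + s[i][(j + 1) % n])
        dE = 2 * s[i][j] * nb        # energy cost of flipping spin (i, j)
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            s[i][j] *= -1
    return abs(sum(sum(row) for row in s)) / (n * n)

cold = magnetization(T=1.0)  # below the critical temperature
hot = magnetization(T=5.0)   # above it
print(cold, hot)
```

On a 16×16 grid this runs in a couple of seconds; cold should come out near 1 (one state privileged) and hot near 0 (both states equivalent).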

What are the states that a water molecule can be in? Roughly, position, orientation, velocity, angular velocity. In the vapour, all the states are equivalent, and molecules are distributed evenly across them.

In the ice crystal, one particular velocity and angular momentum state is privileged (the velocity and angular momentum of the macroscopic crystal). One position and orientation of the lattice is privileged.

This is universal to all crystals. In fact, from the perspective of stat mech, the definition of a crystal is "a spontaneous break in local spatial symmetry."

(As an aside, this might help you make sense of the concept of a "time crystal": it's just a thing which oscillates predictably.)

Symmetry in the Map, not the Territory

The other way of thinking about this is in the map. Imagine that cloud of steam again. You're uncertain about all of the particles; any of them might be anywhere. Your map of the gas is symmetrical across all the locations in the cloud.

Now imagine you learn the location of five of the molecules. Your map basically hasn't changed; it's still essentially symmetrical.

Now imagine the same for the ice crystal. You start unsure of the location of all of the molecules, as before. But this time, if you learn the location of a few molecules, your map of the crystal is completely changed: you now have an enormous amount of information about the position and orientation of all the molecules (of course you don't have perfect information about all of them; only those within the convex hull of the molecules you did see, but that's still quite a lot!).

It's the same with the Ising model. If the temperature is high, then learning about a few of the grid elements' spin states doesn't change what you know about the other states. If the temperature is low, then learning about a state tells you the whole global state of the grid.

When the system has global symmetry, your map is robustly symmetric: learning a little information doesn't tell you much; when it has no global symmetry, your map is only contingently symmetric: learning a little information teaches you a lot.
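Here is a toy version of that contrast (my illustration, not the post's): take two spins and compute the mutual information between them in two extreme ensembles. In the "cold" ensemble the grid is all-up or all-down with equal probability, so one spin determines the other; in the "hot" ensemble every spin is an independent fair coin, so one spin tells you nothing.

```python
# Mutual information between two spins in two idealized ensembles.
import math
from itertools import product

def mutual_information(joint):
    """joint: dict {(a, b): probability}. Returns MI in bits."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Low T: spins perfectly correlated -> knowing one tells you the other.
low_T = {(1, 1): 0.5, (-1, -1): 0.5}
# High T: spins independent -> knowing one tells you nothing.
high_T = {(a, b): 0.25 for a, b in product((1, -1), repeat=2)}

print(mutual_information(low_T), mutual_information(high_T))
```

One bit versus zero bits: the broken-symmetry ensemble packs global information into every local measurement.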

The Price of Energy

This is the price of that energy. In order to get that energy out, and convert our steam cloud into an ice crystal, we had to learn a lot about the system. It didn't seem like it, since we were still uncertain of where those molecules would actually be, but that's only because we were thinking about the locations of individual molecules, one at a time.

If learning the position of a few molecules of ice tells you the position of all the others, then you already knew quite a lot about the system, it was just contained in the conditional distribution of the molecules, given one another. You were secretly un-forgetting all along!

There's a three-way relationship here:

  • Symmetry breaking in a system privileges a single state (spin state, position...)
  • Our conditional distribution on that state becomes highly constrained
  • When we learn the state of a few particles, we learn the state of all of them

In these parts, we have another word for a situation where learning the state of a few particles teaches us about the rest. The spontaneous symmetry breaking produced a natural latent. Now, this isn't the only way a natural latent can form, nor might it even be the most common way, but it is a way!

Demos because I have too much free time[1]

You can find the code here

For our first demo, let's put a bunch of particles in a void. The void loops at its edges, like Pac-Man. The particles start out with lots of kinetic energy, and lose it as they bump into each other (this is actually fairly realistic: atoms do lose energy as radiation when accelerating, such as when they collide with one another). There's a non-directional attractive force between the particles that kicks in at short distances.

And let's do a lattice too! Instead of using up/down states, we'll use angles (this is really just going from S⁰, the zero-sphere, to S¹, the one-sphere). Each particle in the grid has an angle θ and an angular velocity ω. The velocity ω slows down (as if by friction, or radiation) over time, but we also inject some ω randomly according to temperature. You'll have to download the code to look at that one, though.
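For readers who don't want to download anything, here is a minimal sketch of what one update step might look like. This is my reconstruction from the description above, not the author's actual code, and the parameters are guesses: each site's angular velocity gets a torque toward its neighbours' angles, friction-like damping, and a random kick whose size scales with temperature.

```python
# Sketch of one update step for the angle lattice -- my reconstruction from
# the prose above, not the author's actual code. Parameters are invented.
import math
import random

def step(theta, omega, T, dt=0.1, damping=0.1, rng=None):
    rng = rng or random
    n = len(theta)
    new_omega = [row[:] for row in omega]
    for i in range(n):
        for j in range(n):
            # Torque pulls each angle toward its four neighbours (periodic grid).
            torque = sum(math.sin(theta[(i + di) % n][(j + dj) % n] - theta[i][j])
                         for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)))
            kick = rng.gauss(0, math.sqrt(T))  # thermal injection of omega
            new_omega[i][j] = (1 - damping) * omega[i][j] + dt * torque + kick
    for i in range(n):
        for j in range(n):
            theta[i][j] = (theta[i][j] + dt * new_omega[i][j]) % (2 * math.pi)
    return theta, new_omega

theta = [[0.5] * 4 for _ in range(4)]     # fully aligned start
omega = [[0.0] * 4 for _ in range(4)]
theta, omega = step(theta, omega, T=0.0)  # no torque, no kicks: nothing moves
```

At T = 0 an aligned grid is a fixed point; raising T injects random ω and shakes the alignment apart, just as in the Ising case.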

Analogy to AI Training

If you're uninterested in reading about AI, then feel free to stop reading here. I just couldn't resist.

Suppose your LLM forms an induction head. This is a two-layer circuit where one attention head writes information from the previous token, and another attention head looks for it. This is often referred to as a phase change, which is true, but the analogy works even better.

To what subspace of the residual stream does the first head write? I have no idea, but I do know that the second head has to read from the same subspace. Sound familiar?

This is true of basically every multi-layer circuit in transformers. I don't know which subspace the previous token head of the Michael Jordan basketball circuit writes to, but I do know that the Michael + Jordan → Basketball lookup circuits in the MLP layers (which probably implement a shallow circuit using cross-layer superposition) read from the same subspace, and whatever subspace it writes to, the later heads read from.
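The shared-subspace point can be caricatured with a dot product (my illustration; real circuits use learned, superposed directions in high dimensions, not one-hot axes): a reader recovers the writer's signal only if it projects onto the same direction the writer used.

```python
# Toy caricature: a "writer" stores a signal along one direction of an
# 8-dimensional residual stream; a "reader" projects the stream onto its
# own direction. Directions and the signal value are invented.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

d = 8
write_dir = [1 if i == 3 else 0 for i in range(d)]       # writer's subspace
aligned_read = write_dir[:]                              # reader that matches it
misaligned_read = [1 if i == 5 else 0 for i in range(d)] # reader that misses it

signal = 2.5
residual = [signal * w for w in write_dir]  # the stream after the write

print(dot(residual, aligned_read), dot(residual, misaligned_read))
```

The aligned reader gets the full signal back; the misaligned one gets zero, which is why the reading and writing components of a circuit have to agree on the subspace.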

I have more thoughts here, about how entropy barriers to crystal nucleation might analogise to entropy barriers to forming multi-layer circuits as opposed to shallow ones during training, but that's a thought for another post.

  1. ^

    Haha, just kidding, I'm bunking off writing my PhD thesis.



Discuss

Monthly Shorts 1/22

2026-03-10 21:20:50

The most important story of January was Omicron. The Washington Post has a good graphic. It was built a while ago, but you can still see that cases are fading, deaths are not yet declining but they will be, and if you want relative COVID safety you might have a chance in a month. Maybe more. At this point I see no hope that new variants will stop emerging, and have little optimism that the FDA will accept that the latest vaccine, made in less than two weeks, does not require another full round of approvals.

If you’re wondering why it’s the most important story, well, to me it was most important because I got it. Down with a bad cold, and it ended up transitioning into living at home for a few months while I’m between apartments due to lease timing and subletting restrictions. This is to say that this month is a little thin and disorganized: apologies.

Look at Pages 28 and 29 of this report on Culver City, if you’re familiar with LA. Making effective transit is often predicated on local access to jobs, but that is in direct contradiction with the agglomeration economies of cities. Balancing this, of course, is high-quality public transit, but getting everything in the right order is going to be a long hard slog.

The other most important story is Ukraine, and the rising threat of conflict there. Unfortunately, I have no modern recommendation. Instead, I recommend reading Plokhy, or another historian of the collapse of the Soviet Union, or reading up on some of the 90s history of Russia if you're as young as I am. A little bit of history will do you much more good than constantly checking the news. If the Russians invade, you'll hear about it soon enough, and until they do, read history instead of saber-rattling. What good does it do you?

Is the largest physical and internet retail day in the world Black Friday? Hah. Nope. It's Singles' Day, celebrated November 11th in East Asia. Starting off as a cynical response to various couple holidays by lonely college men, it was turned into A Big Thing in 2009 by Alibaba's CEO via the power of discounts. This year, $139 billion was spent.

So, this is a bit unusual, but here’s something I wrote (a part of). My first public RAND Research Report, it looks at the Quantum Defense Industrial Base, and considers what a research and innovation base looks like.

One of my favorite facts about GAO reports is that they include whether or not, in their view, their recommendations were taken. Here’s a neat and relatively comprehensible example, on costing estimates for the DoD.

I’m going to Vibecamp, largely on the grounds of “it looks interesting”. I like any schedule that can move smoothly from romantic epistemology to fight play: intro to grappling. I’ll be leading a seminar on Cohn’s Sex and Death in Rational World of Defense Intellectuals, one of my current favorite papers. Old-school feminist analysis, very good, and relevant to my life.

I’m impressed by Cato’s integrity in not putting the US first in their freedom index: it’s a nice sign of intellectual seriousness. Speaking of, here’s FIRE’s worst 10 colleges for free speech, which shows an expected mix of “the twitterati were angry” and “the state legislators were angry”, to which colleges seem to respond with roughly equal seriousness. The ability to ignore the scorn of your peers is very powerful, and very dangerous.

I saw Tinker Tailor Soldier Spy. It is a very tight drama, intense and narrow and psychological, and even a decent adaptation of le Carré. Recommended however you feel like accessing visual media.



Discuss