LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

Can We Secure AI With Formal Methods? November-December 2025

2025-11-29 22:10:14

Published on November 29, 2025 2:10 PM GMT

We did the rebrand! The previous thumbnail was a baseball metaphor, but it was very clearly someone getting out, not safe. I was testing all of you and each of you FAILED.

Here’s the prompt for the new thumbnail:

i’m keeping AI in a box, doing AI Confinement (like in yampolskiy 2012), using formal verification / formal methods. That’s my whole thing. I need art for my newsletter on these topics. I like the percival story from troyes/wagner and i like tolkien, but if you take from those elements put it IN SPACE like scifi. Also use german expressionist painting styles. Ok now give me some DALLE art.

So long “Progress in GSAI”. I still like the position paper that the old newsletter title was based on, but

  1. It’s very scifi and I think there’s more alpha in obvious/relatively easy/uncontroversial (but not done by default) work.

  2. The word “guarantee” doesn’t evoke “swiss cheese”.

  3. It’s time to double down on relationships between AI security and formal methods, directly, more explicitly than you can do within the framing of GSAI.

Also notice: gsai.substack.com is now a redirect to newsletter.for-all.dev. I’ll be hosting a bunch of my technical reports and comms/strategy outputs at that domain going forward (the subdomain newsletter will just point to substack). But don’t worry, the scope of the newsletter remains largely the same (excepting the pivot to be more directly and explicitly about AI security) / won’t devolve into being any more nakedly self promotional than it has been so far.

I received a grant from a Funder of Presently Undisclosed Provenance to do comms and strategy for AI security via formal methods, which means among other things that this newsletter will get a little more TLC.

Busy month. I expect things to be slow over Christmas, so after this edition I’ll see you all in 2026.

In the spirit of chivalry, I styletransferred most abstracts in this edition of the newsletter to Troyes/Cervantes style. I did not check to see if Gemini got anything wrong, but every headline is a link to arxiv or openreview which you’ll click if you’re interested.

MIRI’s treaty team posts a paper!

Excited about this. They use the word “verification” in a different context than we do: they mean it in the sense of verifying the absence of enriched uranium (GPUs) or verifying that the terms of a treaty are being abided by.

Many experts argue that premature development of artificial superintelligence (ASI) poses catastrophic risks, including the risk of human extinction from misaligned ASI, geopolitical instability, and misuse by malicious actors. This report proposes an international agreement to prevent the premature development of ASI until AI development can proceed without these risks. The agreement halts dangerous AI capabilities advancement while preserving access to current, safe AI applications.

The proposed framework centers on a coalition led by the United States and China that would restrict the scale of AI training and dangerous AI research. Due to the lack of trust between parties, verification is a key part of the agreement. Limits on the scale of AI training are operationalized by FLOP thresholds and verified through the tracking of AI chips and verification of chip use. Dangerous AI research--that which advances toward artificial superintelligence or endangers the agreement’s verifiability--is stopped via legal prohibitions and multifaceted verification.

We believe the proposal would be technically sufficient to forestall the development of ASI if implemented today, but advancements in AI capabilities or development methods could hurt its efficacy. Additionally, there does not yet exist the political will to put such an agreement in place. Despite these challenges, we hope this agreement can provide direction for AI governance research and policy.

BlueRock GPLs the specs and proofs of NOVA

Three. Great. Blog posts. The third one of interest for insight into maintenance and repair of a spec and proof codebase.

NOVA is the legendary hypervisor that was specified and proven correct at BlueRock (FKA Bedrock). I say “legendary” because as a wee lad, stalking Bedrock’s github activity, hearing rumors about C++ verification, it was one of the few Ws of industrial verification at scale that I had heard about.

Look at that B-E-A-YOOT.

A hypervisor is a part of the virtual machine stack. NOVA is a hardened one for critical systems, technically a microhypervisor.

We should teach AIs to write this stuff, cuz that looks painful to type.

We don’t talk enough about separation logic here on the newsletter. Anyways,

People are playing with Aristotle, Harmonic is hiring

$120M series C.

Hardware is an interesting product area! Looks like their business model has advanced past the “mumbling to investors about curryhoward” stage. 2025, the year of mumbling to investors about curryhoward, has come to a roar of a close. I have also mumbled about curryhoward to my dearest yall, which might mean I get bayes points for a math company starting to spin up a program synthesis product. I can’t tell how obvious that sort of claim was, or is, but I know one thing: I love getting points.

If you have Aristotle access, please test FVAPPS and report back. Be sure to append the unit tests, that’s like the hardest part of the benchmark.

If I had a nickel for every benchmark prefixed “Veri-” it’d only be four nickels but it’s still weird that it happened four times

Some of these I had no good reason not to cover earlier. Abstracts styletransferred by Gemini.

Vericoding

We do hereby present and test the largest ledger of trials yet assembled for the craft known as Vericoding—the generation of a code whose certainty is sworn upon by the very stars—from the formal scrolls of specification. This, mind you, is in stark contrast to the common, wicked Vibe Coding, which spews forth a quick but bug-ridden script, born of a mere whisper of natural tongue.

Our grand ledger contains twelve thousand five hundred and four such scrolls of specification, with three thousand and twenty-nine written in the ancient runes of Dafny, two thousand three hundred and thirty-four in the sturdy tongue of Verus/Rust, and seven thousand one hundred and forty-one in the subtle logic of Lean. Of these, a full six thousand one hundred and seventy-four are entirely new, untarnished challenges.

We find that the success rate of this noble Vericoding, when performed by the Sorcerers of Language (our off-the-shelf LLMs), stands at a meager 27% in Lean, rises to 44% in Verus/Rust, and achieves a triumphant 82% in Dafny. Alas, the addition of a common, flowery natural-tongue description does not notably sharpen their success. Furthermore, the light of these Sorcerers has illuminated the pure path of Dafny verification, raising its former success rate from a humble 68% to a glorious 96% over the past twelve moons.
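Since the abstract above was style-transferred by Gemini and not checked, a quick sanity check on the quoted dataset sizes is cheap. The sketch below only re-adds the three per-language counts as they appear in the text and compares them to the quoted total; the variable names are illustrative, not from the paper.

```python
# Counts exactly as quoted in the style-transferred abstract above.
specs_per_language = {
    "Dafny": 3029,
    "Verus/Rust": 2334,
    "Lean": 7141,
}
quoted_total = 12504

# The per-language counts should add up to the quoted total.
computed_total = sum(specs_per_language.values())
assert computed_total == quoted_total

# Share of the benchmark written in each language, as a percentage.
shares = {
    lang: round(100 * count / computed_total, 1)
    for lang, count in specs_per_language.items()
}
print(shares)
```

The counts are internally consistent, so at least this part of the style transfer did not mangle the numbers.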

Veribench

The Formal Verification of Software doth stand as a promise most bright—a potential transformation wrought by the Generative Artifice of the Mind (AI). For a Provably Correct Code would utterly banish entire legions of hidden vulnerabilities, staunch the fatal breaches of critical systems, and, perhaps, forever change the practice of software engineering through trustworthy methods of implementation.

To spur this sacred domain, we unveil VeriBench, a trial meticulously crafted for judging the strength of the Sorcerers’ Models in the end-to-end verification of the Code. This task demands the generation of complete Lean 4 incantations—the working functions, the unit tests, the Theorems of Correctness, and the Formal Proofs themselves—all drawn from humble Python reference spells or their accompanying common-tongue docstrings.

Our scrutiny of this one hundred and thirteen-task suite (comprising the tasks of HumanEval, simple drills, classical algorithms, and security snares) reveals a woeful truth: the current Frontier Sorcerers compile but a small fraction of the programs. Claude 3.7 Sonnet achieves compilation on a mere 12.5%, while the mighty LLaMA-70B cannot compel a single program to compile in the Lean 4 HumanEval subset, even after fifty attempts guided by feedback! Yet, observe the noble Self-Optimizing Trace Agent architecture, whose compilation rates approach a magnificent 60%! VeriBench thus lays the unyielding stone for developing systems capable of synthesizing provably correct, bug-free code, thus advancing the journey toward a more secure and dependable digital kingdom.
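To make the task shape concrete, here is a toy Lean 4 sketch of the four artifacts the VeriBench abstract lists: a working function, unit tests, correctness theorems, and their proofs. It is a made-up example, not a task from the benchmark; `myMax` and the theorem names are hypothetical, and it assumes a recent Lean 4 toolchain where the `omega` tactic is available.

```lean
-- Hypothetical toy task, not taken from VeriBench.
-- 1. The working function.
def myMax (a b : Nat) : Nat := if a ≤ b then b else a

-- 2. "Unit tests", checked by the kernel at compile time.
example : myMax 2 5 = 5 := by decide
example : myMax 7 3 = 7 := by decide

-- 3. and 4. Correctness theorems with their proofs:
-- the result is an upper bound of both inputs.
theorem le_myMax_left (a b : Nat) : a ≤ myMax a b := by
  unfold myMax
  split <;> omega

theorem le_myMax_right (a b : Nat) : b ≤ myMax a b := by
  unfold myMax
  split <;> omega
```

Even at this size, the theorem statements are where the judgment calls live: "is an upper bound" alone does not pin down `max`, which is the elicitation problem in miniature.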

VerifyThisBench

While the Grand Language Models (LLMs) have shown marvelous cunning in the quick generation of code, many existing trials are now easily conquered, and offer little guarantee of trustworthiness for the generated programs. To gain greater insight into the Sorcerers’ reasoning on matters of Formal Correctness, we present VerifyThisBench, a new, agonizing trial which assesses the end-to-end verification of programs from mere natural-tongue descriptions.

The models must complete a trifecta of chivalric deeds: (i) Extract the Formal Specifications, (ii) Implement the Code in a language that craves verification, and (iii) Construct the Machine-Checkable Proofs.

Our evaluation reveals that even the most vaunted of the modern models, such as o3-mini, achieve a pass rate of less than 4%, with many of their utterances failing to even compile! To divine the true source of this difficulty, we further propose VerifyThisBenchXS, a milder variant where partial implementations or proofs are benevolently supplied. Across nine distinct models and seven tools of verification, we observe a steady gain when refinement is driven by the whispers of feedback, yet the overall pass rates remain pitifully low, underscoring the vast chasms that yet divide the Sorcerers from true formal reasoning. We release this trial and its unified environment to spur on the verification powers of all future models.

VeriEquivBench

Formal Verification stands as the ultimate frontier for ensuring the veracity of the code spawned by the Grand Language Models (LLMs). Methods that co-generate the code and the formal specifications in austere formal languages, such as Dafny, can, in theory, swear upon the truth of their alignment with the user’s intent. Alas, the entire progress is stifled by the difficulty of judging the quality of the specifications themselves.

Current trials rely upon the perilous task of matching the generated work against a ground-truth specification—a manual process requiring deep expertise, which has limited existing datasets to a mere few hundred simple problems, and moreover suffers from a profound lack of reliability.

To remedy this, we introduce VeriEquivBench, a new trial featuring two thousand three hundred and eighty-nine complex algorithmic puzzles designed to expose the frailty of current models in both the generation of code and the deep formal reasoning. Our evaluative framework replaces the perilous ground-truth matching with a formally grounded metric: the Equivalence Score, and rigorously verifies the quality of the generated specifications and code. Our findings declare that the generation of formally verifiable code remains a profound challenge for the state-of-the-art Sorcerers. This underscores both the sheer difficulty of the task and the desperate need for trials like VeriEquivBench to hasten the march toward scalable and trustworthy coding agents.

From Galois’ blog

Specifications don’t exist

Should’ve been in last newsletter but slipped through the cracks.

We need words for the different pessimisms about FMxAI. I often talk about the world-spec gap or the world-spec problem (that formal methods don’t rule out sidechannel attacks). This post is about a different pessimism, the elicitation problem or the elicitation and validation problem. Someone should absolutely be funding an org to focus on elicitation and validation, it’s a turbo important part of the theory of change. Is anyone working on this?

Lean and claude code

Mike also has a technical post about vibecoding in Lean.

Pair it with these off the shelf “skills” (a claude code feature that’s “just prompts with extra steps”).

Rigorous Digital Engineering

What if proof engineering but too cheap to meter?

Oops i missed Logical Intelligence

Should’ve covered these folks a while ago. Yes, it appears their clientele is crypto/defi, but I have a generally positive attitude about life and I don’t want to set my “days since snark incident” counter back to zero, so we will ignore that and focus on the little we can ascertain about their tech and their claims.

There are two parts to this, there’s the part of why/how exactly they believe what they believe about their Lean product, and the part of how their Noa agent (which is not paywalled, you can just install it on github) fits into my strategic worldview.

Primitive screwheads: text-to-text. My boomstick: structural synthesis

Logical Intelligence is not bullish on autoregressive text-to-text as a program synthesis paradigm. Like Leni Aniva, they think tree search (starting with MCTS) will beat LLMs in the fullness of time. The interesting part, with a very paywalled model that I can’t test, is if they’re right why isn’t Harmonic (or Morph or a frontier company or anyone else) scooping them? It’s the same thing I say when I look at HOC: yes, text-to-text is an uncivilized approach to program synthesis, but we haven’t welded structural synthesis with the bitter lesson yet, and I don’t expect to see the gains until we do. If it could be any other way, then we’d be living in the GOFAI Extended Cinematic Universe instead of the Prompts Extended Cinematic Universe. I could write down some loose ideas of things you could try (to achieve the welding), but I will not because I’m unconvinced the d/acc case is actually the majority of the mass. I’m too concerned that Logical Intelligence, HOC, to some extent Leni are right about the superpower unleashed by structure-aware program synthesis and I don’t think we’re ready (as a QA/safety community, nor as a society).

Analyzing codebases for vulnerabilities

From their product page:

Ordering an external audit is both very expensive and very time-consuming. Our AI tool, Noa, delivers regular feedback on your code—minutes for smaller codebases and tens of minutes for larger ones. This lets you get near-real-time insight into the most critical potential security risks at a fraction of the cost. Noa integrates with GitHub: simply add the Noa bot to your repository, and after each pull request you can request a dashboard showing potential risks across the entire repository, along with their likelihood of exploitation and severity ratings.

I have a post coming out about this, but I think the sort of thing they’re trying to do here is an important part of the strategic outlook. Audits, cryptanalysis, cybersecurity consulting are an important area to automate if we’re going to know, with a finite proof synthesis budget, which components are the most critical to harden with proofs. To be clear, I have not used the product, I don’t have any codebases it’s a good fit for. But it’s a class of product I’m excited about, even (ugh) if it is (ew) for defi/crypto.

Announcements from the first round of Mathematics for Safe AI Opportunity Space at ARIA

Spot ole q doc somewhere on this page! Other highlights are the hardware verification team, the GFlowNet/SynthStats team, and the SFBench team.

Scalable synthesis of theorem proving challenges in formal-informal pairs

Apparently there was some twitter discourse about this paper but one of the discoursers was using a hidden profile. It’d be great to be more like a Zvi style newsletter full of twitter screenshots, but that would just require me to log on to twitter more, which like, no.

The Grand Confluence of Lean and the Scholarly Arts of Computation: A Fount of Trials for the Sorcerer’s Mind– The noble art of Formal Theorem Proving (FTP) hath risen as a cornerstone for judging the deep reasoning capabilities of the Grand Language Models (LLMs), enabling the automated verification of mathematical oaths upon a massive scale. Yet, the progress of this quest has been hindered by a scarcity of suitable archives, due to the high toll of manual curation and the lack of truly challenging dilemmas paired with verified correspondences between Formal Scroll and Informal Chronicle. We propose to tap into the wellspring of Theoretical Computer Science (TCS) as a boundless source of rigorous proof problems. Within this scholarly domain, the definitions of algorithms permit the automatic synthesis of an arbitrary number of complex Theorem-Proof pairs. We demonstrate this potent approach upon two realms of TCS: the Busy Beaver problems, which demand the proof of bounds upon a Turing Machine’s cessation of movement, and the Mixed Boolean Arithmetic problems, which entwine the logic of the mind with the rigor of number. Our framework automatically weaves these challenges, providing parallel specifications: the Formal Code (Lean4) and the Informal Narrative (Markdown), thus creating a scalable conduit for generating verified trials of proof. Scrutiny of the frontier models reveals substantial chasms in their automated theorem-proving prowess: while the champion DeepSeekProver-V2-671B achieves a noble 57.5% success rate on the Busy Beaver challenges, its strength wanes, managing only 12% on the Mixed Boolean Arithmetic puzzles. These findings illuminate the great difficulty of crafting long-form proofs, even for those problems whose computational verification is a mere trifle, thus showcasing the invaluable role of TCS realms in advancing the research of automated reasoning.

AI Resilience: cyberphysical systems

Friend of the newsletter Nora Ammann published AI Resilience a little bit ago. The section on cyberphysical systems is relevant to us: it relies on secure (formally verified) program synthesis becoming cheap and accessible. Resilience is a flavor of defensive acceleration that specifically targets the durable and structural resolution of vulnerabilities, vulnerabilities which get amplified by AI but which, if we’re diligent and hardworking, get ameliorated by AI as well.

Let’s formalize this step by step

One time a friend asked me “why not just put the proof synthesis in the reasoning trace and the thing you’re writing the proof about (say, a program) in the final output“. And I was like, “...huh”. And I got as far as adding a few credits to my runpod account before getting pulled into other things. Little did I know, at exactly that moment, this team was hard at work!

A Proposal for Safe Passage: The Formal Verification of the Grand Sorcerers’ Thoughts– The method of the Chain-of-Thought (CoT) prompting hath become the established ritual for coaxing forth the reasoning powers from the Grand Language Models (LLMs). Yet, to contain the hallucinations in these Chains—phantoms notoriously difficult to discern—the current remedial arts, such as the Process Reward Models (PRMs) or the Self-Consistency measures, operate as opaque boxes, offering no verifiable evidence for their judgments, thus perhaps limiting their true efficacy. To redress this failing, we draw inspiration from the ancient wisdom that “the gold standard for supporting a mathematical claim is to provide a proof.” We propose a retrospective, step-aware framework of Formal Verification which we title Safe. Rather than assigning arbitrary scores or marks, we strive to articulate the mathematical claims within the formal mathematical language of Lean 4 at the conclusion of each reasoning step, and further provide formal proofs to definitively identify these hallucinations. We test our framework Safe across various models and mathematical archives, demonstrating a significant enhancement in their performance, while simultaneously offering interpretable and verifiable evidence for their passage. Furthermore, we propose FormalStep as a new trial for the correctness of step-by-step theorem proving, containing 30,809 formal statements. To the best of our knowledge, our work represents the first valiant endeavor to utilize the formal mathematical language of Lean 4 for verifying the natural-tongue content generated by the LLMs, thereby aligning with the very reason these formal languages were created: to provide a robust and unshakeable foundation for the hallucination-prone proofs scribed by human hands.
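For a feel of what "articulating the claim of a reasoning step in Lean 4 and proving it" could look like, here is a toy sketch. It is not the paper's pipeline or data: the step ("3x + 4 = 19, so x = 5") and the theorem names are invented for illustration, and it assumes a recent Lean 4 toolchain where `omega` is available.

```lean
-- Hypothetical reasoning step from a chain of thought:
-- "We know 3x + 4 = 19, therefore x = 5."
theorem step_ok (x : Nat) (h : 3 * x + 4 = 19) : x = 5 := by
  omega

-- A hallucinated variant of the same step would claim x = 6.
-- That claim has no proof; instead its negation is provable,
-- which is checkable evidence that the step is wrong.
theorem step_hallucinated_refuted (x : Nat) (h : 3 * x + 4 = 19) : ¬ (x = 6) := by
  omega
```

The appeal over a scalar score is that the evidence is an artifact anyone can recheck, though the translation from natural language into the formal claim is itself unverified.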

Ulyssean website mission status: totally sick

There’s honestly no Ulyssean update in this issue, but I stumbled upon their website and loved the graphic design!

There are no benefits for paid subscriptions. A Funder of Undisclosed Provenance is backing the newsletter for 6 months.




The Joke

2025-11-29 18:51:23

Published on November 29, 2025 10:51 AM GMT

There is a joke format which I find quite fascinating. Let’s call it Philosopher vs Engineer.

It goes like this: the Philosopher raises some complicated philosophical question, while the Engineer gives a very straightforward applied answer. Some back and forth between the two ensues, but they fail to cross the inferential gap and solve the misunderstanding.

It doesn’t have to be literal philosopher and engineer, though. Other versions may include philosopher vs scientist, philosopher vs economist, human vs AI, human vs alien and so on. For instance:

One thing that I love about it is that the joke is funny regardless of whose side you are on. You can laugh at how much the engineers miss the point of the question. Or at how the philosophers are unable to see the answer that is right in front of their noses. Or you can contemplate the nature of the inability of two intelligent agents to understand each other. This is a really interesting property of a joke, quite rare in our age of polarization.

But what fascinates me the most is that this joke captures my own intellectual journey. You see, I started from a position of deep empathy for the philosopher. And now I’m much more in league with the engineer.

Let’s look at one more example. This time of philosophers being bullied by scientists.

When I first considered the question - I think I was about twelve back then - it was obvious that scientists are missing the point. Sure, if we assume that our organs of perception give us reliable information, then “looking” would be a valid justification. But how can we justify this assumption? Surely not by more looking - that would be circular reasoning. How can we justify anything in principle? What justifies the justification? And the justification of justification? And so on? Is it an infinite recursion? Or, if we are stopping at some point, therefore, leaving a certain step unjustified, how is it different from stopping on the first step and therefore not justifying anything at all?

It seemed to me that the “scientific answer” is just the first obvious step. The beginning of philosophical exploration. And if someone refuses to follow through, that must be a sign of some deep lack of curiosity.

And so one may think, as I did, that science is good at answering the first obvious question. But deeper questions lie beyond its abilities - in the realm of philosophy.

Except... philosophy isn’t good at answering these questions either. It can conceptualize the problem of induction or the Münchhausen trilemma. But those are not answers - they are road blocks on the path to one.

One may say that philosophy is good at asking questions. Except... how good are you, really, if you are stuck asking the same questions as a 12-year-old child? Oh sure, I might have been a bright one, but nevertheless when the aggregated knowledge of humankind in some discipline is on the level of some kid, it’s not a sign in favor of this discipline.

The really good questions can be asked on top of the existing answers. Thus we are pushing forward the horizon of the unknown. So if your discipline isn’t good at answering its own questions, it’s not really good at asking them either. And vice versa. After all, asking the right question is a crucial step towards getting an answer.

But the most ironic thing is that if one actually goes on a long philosophical journey in search of the answer to the question “How can we know things about the external world at all?”, then, at the end of this heroic quest, after the three-headed dragon is slain, the kingdom of reason is saved and skeptics are befriended along the way, the answer will be found written in golden letters on the diamond mural.

And this answer will be: “Pretty much by looking”.

Well, the expanded version, spoiler alert, is:

[spoiler]

  1. To be able to know things, you need your cognitive faculties to be produced by some kind of optimization process that has systematically correlated them with the external world. For example, your organs of perception, which accept signals from the world (i.e. “looking”), and your brain, which interprets and aggregates those signals into a mental model of the world, were created by evolution through natural selection, optimizing for inclusive genetic fitness in this world.
  2. You can increase your confidence that your mental model does indeed describe the world by collecting more evidence about it (i.e. “double check”). This gives you a mental meta-model describing the reliability of your own reasoning, about which you can also increase your confidence by collecting more data, and so on and so forth.
  3. This will never give you absolute certainty about anything. Even the best models sometimes fail to generalize further and predict the next data point. Then you simply discard the failed model, incorporate the new data point and therefore get a better model instead.
  4. From the inside it may even look similar to circular reasoning: after all, you can only learn about the evolution that justifies your learning ability by using the learning ability created by evolution. But this is just a map-territory confusion. The actual causal process in reality that makes our cognition engines work is straightforward and non-paradoxical. The cognition engine works even if it’s not certain about it.
  5. In fact, some uncertainty is necessary for its workings. Absolute certainty would prevent us from updating our models further, regardless of the evidence, therefore dis-correlating them from the outside world. For all our intents and purposes, the amount of certainty, meta-certainty, meta-meta-certainty, etc. that we can get is good enough.

[/spoiler]
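Points 2, 3 and 5 of the expanded version can be illustrated with a few lines of Bayesian updating. This is a minimal sketch under made-up assumptions: the function names `update` and `run` and the likelihoods 0.8 and 0.3 are invented for illustration and are not from the post.

```python
def update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(H | E) from Bayes' rule for a single observation E."""
    numerator = p_e_given_h * prior
    evidence = numerator + p_e_given_not_h * (1.0 - prior)
    return numerator / evidence


def run(prior: float, observations: list[bool],
        p_hit: float = 0.8, p_false: float = 0.3) -> float:
    """Fold a sequence of observations into a belief.

    True means "the look agreed with the model", False means it did not.
    p_hit and p_false are illustrative likelihoods, not measured values.
    """
    belief = prior
    for agrees in observations:
        if agrees:
            belief = update(belief, p_hit, p_false)
        else:
            belief = update(belief, 1.0 - p_hit, 1.0 - p_false)
    return belief


one_look = run(0.5, [True])          # point 2: looking raises confidence
five_looks = run(0.5, [True] * 5)    # point 2: double-checking raises it more
dogmatist = run(1.0, [False] * 10)   # point 5: certainty cannot be updated

print(one_look, five_looks, dogmatist)
```

With these numbers the belief climbs with each confirming look but stays below 1 (point 3), while a prior of exactly 1 ignores ten contrary observations in a row, which is the sense in which absolute certainty dis-correlates the model from the world.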

But I think “Pretty much by looking” captures it about as well as any four-word combination can.

Turns out that the “missing the point applied answer” isn’t just the beginning of the exploration. It encompasses the whole struggle, containing a much deeper wisdom. It simultaneously tells us what we should be doing to collect all the puzzle pieces and also what kind of entities can collect them in the first place.

On every step of the journey it’s crucial. To learn about all the necessary things you need to go into the world and look. Even if someone could’ve come up with the specific physical formulas for thermodynamics without ever interacting with the outside world, why would they be paid more attention than literally any other formulas, any other ideas?

The answer is not achieved by first coming up with “a priori” justifications for why we could be certain of our observation and cognition before going to observe and cognize. We were observing and reflecting on these observations all the way. And from this we’ve arrived at the answer. In hindsight, the whole notion of “pure reasoning”, free from any constraints of reality, is incoherent. Your mind is already entangled with reality - it evolved within it.

“Looking” is the starting point of the journey, the description of the journey as a whole and also its finish line. The normality and common sense to which everything has to add up.

And how else could it have been? Did we really expect that solving philosophy would invalidate the applied answers of the sciences, instead of proving them right? That it would turn out that we don’t need to look at the world to know things about it? Philosophy is the precursor of Science. Of course its results add up to it.

Somehow, this is still a controversial stance, however. Most of philosophy is going in the opposite direction, doing anything but adding up to normality. It’s constantly misled by semantic arguments, speculates about metaphysics, does modal reasoning to absurdity and then congratulates itself, staying as confused as ever.

And while this is tragic in a lot of ways, I can’t help but notice that this makes the joke only funnier.




A Harried Meeting

2025-11-29 15:57:58

Published on November 29, 2025 7:57 AM GMT

An old pub that nobody much visits. An owner who is always in a drugged-out stupor. Background music that never changes. A pub that has remained throughout war and revolution, and a single brick-wall that has not changed all that time. You are supposed to be investigating a murder. A gunshot in a distant land, far away from Revachol.

YOU – Run your fingers through your greasy hair, finish your eighth beer.

SHIVERS [Challenging: Success] – People pass in through here occasionally, and tap a pattern out on the wall. Then they aren't in the pub.

  1. What is the pattern?
  2. Where are they?
  3. What's the owner taking? Could I get some?

SHIVERS – Not in this city any longer.

  1. What is the pattern?
  2. Where are they?
  3. What's the owner taking? Could I get some?

SHIVERS – A man in a tall hat and a long coat taps the top brick, the bottom brick, then beside it to the left then the right 10 times.

INLAND EMPIRE – An ominous, foreboding feeling fills you, as you look at the wall.

  1. I want to know what happens to them. (Walk over to the wall.)
  2. I don't know that I want to continue this life that I currently have. (Walk over to the wall.)
  3. I like drinking this drink. (Ignore the thought.)

VISUAL CALCULUS — The wall has a timeless quality to it. No dust has accumulated. No pictures hang. In contrast to everything else in this dirty and dingy bar, it looks like not a day has passed since it was installed.

  1. Tap the wall. (Travel.)

A whooshing sound is heard. A room of people dressed as if for some convention. The quality of the clothing does not look very good; probably these people spend as much time in bars as you do. A boy stands in front of you, with a wooden stick in hand, as though he might make to attack you with it. Above him, inside the pub, a sign reads "The Hog’s Head Inn".

  1. "What's your name?"
  2. "What is a child doing in this pub?"
  3. "Is that stick for stirring alcohol? Do you have alcohol?"
  4. "Why are you dressed up like you're entering a wizard fancy-dress contest?"
  5. "I get the sense that you're even more unbelievably arrogant than I am. Are you aware that I'm a secret rockstar?"

YOU — "What's your name?"

BOY — "Harry. I'm here on official business. What's your name?"

  1. "Harry."
  2. "DuBois. Lieutenant Detective Harry DuBois."
  3. "You're not Harry, I'm Harry."
  4. "You look like your parents might be rich, do you think you can give me some money?"
  5. "Your outfit suggests you too could benefit from a communist revolution. Join me, comrade?"

YOU — "DuBois. Lieutenant Detective Harry DuBois."

HARRY — "Ah, I see the Ministry is continuing to track my movements, even after I explicitly instructed them not to. You know at some point your boss is going to lose his job over this."

LOGIC — This isn't the kind of way a young teenager normally talks. With their peculiar clothing, perhaps everyone here is engaged in a LARP of some kind?

AUTHORITY [Easy: Success] – It seems this little upstart doesn't respect you, even after you told him your position. We should teach him some respect.

  1. "Where am I?"
  2. "Why do you think I am part of the 'ministry'?"
  3. "What movements of yours might I need to be tracking?"
  4. "Are you and everyone here mucking about like a bunch of children?"
  5. "Would you like to learn the meaning of 'lieutenant detective'?" (Take out gun.)

YOU — "What movements of yours might I need to be tracking?"

HARRY — "I'm sure you have your answer already, detective, and anything I say won't affect your conclusions. That's one of the disadvantages of having the bottom line of your reasoning already written, before you make new observations or learn new evidence. You aren't really much of a conversation partner any longer."

  1. "Where am I?"
  2. "Why do you think I am part of the 'ministry'?"
  3. "What movements of yours might I need to be tracking?"
  4. "Are you and everyone here mucking about like a bunch of children?"
  5. "Would you like to learn the meaning of 'lieutenant detective'?" (Take out gun.)

YOU — "Where am I?"

HARRY — "Why, you're in The Hog's Head Inn of course. I wonder what magic you used to track me here that didn't let you know where you were apparating to."

  1. "Where am I?"
  2. "Why do you think I am part of the 'ministry'?"
  3. "What movements of yours might I need to be tracking?"
  4. "Are you and everyone here mucking about like a bunch of children?"
  5. "Would you like to learn the meaning of 'lieutenant detective'?" (Take out gun.)

YOU — "Would you like to learn the meaning of 'lieutenant detective'?" (Take out gun.)

HARRY — "Color me surprised! I didn't know that the Ministry knew what guns were. I guess you guys are a little better clued in to how the world works than the rest of the population of this little magical enclave. Nonetheless, it wouldn't be much use for defense here, I have an anti-combustion charm in effect to prevent fire attacks that would negate the effect of the gunpowder in the chamber of that weapon."

AUTHORITY – Perhaps we could throw the gun at him?

HALF-LIGHT – If the gun doesn't work, I'm ready to do what it takes to take him out. Hell, I'd remove my own bones to club him over the head.

  1. "Where am I?"
  2. "Why do you think I am part of the 'ministry'?"
  3. "What movements of yours might I need to be tracking?"
  4. "Are you and everyone here mucking about like a bunch of children?"
  5. "Would you like to learn the meaning of 'lieutenant detective'?" (Take out gun.)

YOU — "Why do you think I am part of the 'ministry'?"

HARRY — "You apparate into a pub just beside me. You tell me you're a detective. My best friend has just been killed. What other explanation is there than that you suspect me in her demise, however improbable it is?"

YOU — "I could just be another drunk like the rest of the people here."

HARRY — "...I guess that's possible. And this is a somewhat seedy bar, the base rate of detectives in here wouldn't be that low."

HARRY — "What kind of investigator are you, anyway?"

YOU — "A homicide detective."

HARRY — "Ah. You think I killed Hermione."

YOU — "...did you?"

HARRY — "I don't know that denying it would provide any Bayesian evidence, so I won't."

YOU — "...I should probably take you in for questioning."

HARRY — "Ugh, I am so disappointed in the quality of thinking in the modern day detectives. Do you even have a sense of the base rates for murdering your best friend? Have you done any Fermis on the question?"

YOU — "No? What's a 'fermi'?"

HARRY — "Ugh. Named after Enrico Fermi, leading physicist and architect of the nuclear age, he was famous for doing rapid calculations in his head. During the 1945 Trinity Test of a nuclear explosion, as the shockwave hit, he dropped a small piece of paper to float to the floor, and saw how far it was pushed back. He used this to estimate the strength of this explosion within a factor of two. Do you understand how impressive that is? But of course, from the perspective of reality, it's not impressive at all, we're constantly surrounded by massive amounts of evidence and we barely take any effort to deduce the logical implications of what we see. You should constantly be asking yourself to calculate base rates to the nearest order of magnitude. Come now, do an estimate. Take the number of people who have committed a murder, and compare it to the number of accused?"

  1. No, I won't play your foolish game.
  2. [LOGIC - Challenging 12] Sure, I'll give it a go.

YOU — Sure, I'll give it a go.

LOGIC [Challenging: Failure] — Well, personally I've shot 3 people in my time at the RCM, and that's unusually low for people here, the average is more like ten to twenty. So let's say the average person has shot at ten people, and killed half, then that's a decent number of kills. As for accusations, I rarely actually arrest anyone unless I have overwhelming proof, so I think very few people actually arrested have committed crimes. So I think that our suspects in custody are often actual criminals.

YOU — "I'd say most people have killed someone, and my instincts about who are suspects are approximately never wrong."

HARRY — "I, what... how can most people have killed someone? That would mean that as many people as are alive have also been killed? Do you believe the population halved in the last 50 years via murder alone?"

YOU — "Well... we have lived through a war."

HARRY — "I... well, this at least seems to be stretching the definition of murder, which is an unlawful killing. But still I think Voldemort's people hadn't been involved in that magnitude of killings. I think our disconnect may be deeper than arithmetic."

ENCYCLOPEDIA [Easy: Success] — We really have no idea who Voldemort is.

CONCEPTUALIZATION — It sounds like a good name for the leader of a heavy metal band. Maybe we should call ourselves Voldemort?

HARRY — "Anyway, if you don't mind, I am clearly grieving my friend's death, and don't need much interruption via the law if you don't have specific questions."

EMPATHY — He is just saying this to get out of the interaction, but he is also deeply sad beneath his austere outward persona.

INLAND EMPIRE — He has known pain like we have, which is especially sad for a child so young. Yet he still has life behind his eyes. Perhaps we have something to learn from him.

YOU — "Well I'm sorry about your friend kid. I'd be interested in hearing about it. But I still don't know quite where I am, and I'm looking for another drink, I have my own sorrows to drown."

HARRY — "You don't look like you know what you're doing here. I'm going to order you the alcoholic beverage that's charmed to reduce the alcohol in your system. The person I'm here to meet is late, and some part of me feels inclined to hear you out, so I'll drink some butterbeer myself."

To be continued.




Circuit discovery through chain of thought using policy gradients

2025-11-29 15:56:17

Published on November 29, 2025 7:56 AM GMT

Circuit discovery has been restricted to the single-forward-pass setting, because the algorithms to attribute changes in behavior to particular neurons / SAE features need gradients, and you can't take a gradient through the sampled chain of thought. Or... can you?

It turns out taking gradients through random discrete actions is an essential part of reinforcement learning. We can estimate the gradients of an expectation over CoTs, with respect to the features, using the score function estimator. We can combine this with integrated gradients to produce a version of EAP-IG which works through the averages of chains of thought.

Background

Circuit discovery

The task we attempt to do is circuit discovery, defined by Conmy et al. Formally, among the subgraphs $H$ of a computational DAG $G$ which represents a neural network, we want to find which subgraph $H$ is responsible for a behavior. We do this by defining a 'task loss', which compares the performance of the subgraph to the performance of the whole network. Let that loss be $L$ and then $x, x'$ be the clean and corrupted datapoints. The loss for a single pair of data points is:

$$\ell(H; x, x') = L\big(G(x),\, H(x, x')\big).$$

The overall loss of a circuit $H$ is simply the average of this loss over all datapoints and corrupted datapoints $(x, x')$.

Partial edge inclusion: introducing z

To connect this to integrated gradients, we introduce variables $z$, which control whether an edge is included in $H$ or not. The scalar $z_i$ controls whether the $i$th edge (or node) is included in $H$ or not. That is, the value $e_i$ of the $i$th edge is replaced by $z_i\, e_i + (1 - z_i)\, e_i'$, where $e_i'$ is the value the edge takes under the corrupted input.

If we set $z_i = 0$ the edge is not included, i.e., it has the value it would get under the corrupted input. If we set $z_i = 1$, then the edge has the value it gets from running the computational graph $G$ forward.

Integrated Gradients for attribution

Our first ingredient can be any gradient-based method for circuit discovery. I've chosen to focus on EAP with Integrated Gradients because it's still the circuit discovery algorithm with the best balance of simplicity and performance. You could make a CoT version of Attribution Patching as well.

To attribute behavioral loss to some configuration of edges (concrete value of $z$), we compute the gradient of the task loss with respect to $z$, which determines whether we include an edge or not: $\nabla_z \ell$. In EAP-IG, we average this for $z$ between 0 and 1, for all edges of the graph $G$ simultaneously. If we interpolate at $m$ points in between 0 and 1, the attribution for a single data point is:

$$\text{Attrib}(x, x') = \frac{1}{m} \sum_{k=1}^{m} \nabla_z\, \ell(z; x, x') \Big|_{z = k/m}$$

(with every $z_i$ set to $k/m$), for loss defined using the task loss, the full graph and the graph corrupted by $z$: $\ell(z; x, x') = L\big(G(x),\, G_z(x, x')\big)$. Notice that we average over $k = 1, \dots, m$, so we take $z = 1$, and $m - 1$ intermediate points.
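As a rough illustration of the averaging step, here is a minimal sketch on a made-up two-edge "network". Everything in it is an assumption for illustration: the weights, the clean and corrupted edge values, the squared-error task loss, and the linear interpolation between clean and corrupted values controlled by z. It is not the EAP-IG implementation, only the same Riemann-sum idea on a toy problem.

```python
# Toy integrated-gradients attribution over edge-inclusion variables z.
# All numbers below are hypothetical and chosen only for illustration.
WEIGHTS = [1.0, 2.0]   # contribution of each edge to the output
CLEAN = [1.0, 1.0]     # edge values on the clean input
CORRUPT = [0.0, 0.0]   # edge values on the corrupted input


def output(z):
    """Output when edge i is included to degree z[i] (linear interpolation)."""
    return sum(w * (z_i * c + (1.0 - z_i) * c_bad)
               for z_i, w, c, c_bad in zip(z, WEIGHTS, CLEAN, CORRUPT))


FULL_OUTPUT = output([1.0] * len(WEIGHTS))  # output of the uncorrupted graph


def loss_gradient(z):
    """Analytic gradient w.r.t. z of the toy loss (FULL_OUTPUT - output(z))**2."""
    residual = FULL_OUTPUT - output(z)
    return [-2.0 * residual * w * (c - c_bad)
            for w, c, c_bad in zip(WEIGHTS, CLEAN, CORRUPT)]


def integrated_gradient_attribution(m):
    """Average the gradient at z = k/m for k = 1..m, all edges at once."""
    total = [0.0] * len(WEIGHTS)
    for k in range(1, m + 1):
        z = [k / m] * len(WEIGHTS)
        grad = loss_gradient(z)
        total = [t + g / m for t, g in zip(total, grad)]
    return total


print(integrated_gradient_attribution(m=4))
```

In this toy setup the second edge has twice the weight of the first, so it receives twice the attribution.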

Policy gradients

Now for the second ingredient: policy gradients. Suppose I have a policy and some loss function $L$, which depends on trajectories of actions $a_{1:T}$ from the policy. The policy $\pi_\theta$ is parameterized by some parameters $\theta$. The expected loss over trajectories is:

$$J(\theta) = \mathbb{E}_{a_{1:T} \sim \pi_\theta}\big[ L(a_{1:T}) \big].$$

We'd like to take gradients $\nabla_\theta J(\theta)$. These are tricky because the loss $L$ does not depend on $\theta$ directly. Instead, it depends on $\theta$ through the distribution over actions in the trajectory, which determines the expectation of $L$.

The policy gradient theorem tells us that the gradient is the expected gradient of the log-probability of actions, weighted by how big $L$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{a_{1:T} \sim \pi_\theta}\big[ L(a_{1:T})\, \nabla_\theta \log \pi_\theta(a_{1:T}) \big].$$

Let's take this formula as given. I explain it in these two posts, but one of the keys to it is that we can swap integral and differential signs, and that $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$ by the chain rule.
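To make the estimator concrete, here is a minimal sketch on the smallest policy I could think of: a single binary action whose probability is a sigmoid of one parameter. The loss table and the value of the parameter are made up for illustration. Because there are only two actions, the exact gradient of the expected loss is available in closed form, which lets us check the sampled score-function estimate against it.

```python
import math
import random

# Hypothetical loss for each of the two possible actions.
LOSS = {0: 3.0, 1: 1.0}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def exact_gradient(theta):
    """d/dtheta of p*LOSS[1] + (1-p)*LOSS[0], with p = sigmoid(theta)."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (LOSS[1] - LOSS[0])


def score_function_gradient(theta, n_samples, seed=0):
    """Monte Carlo estimate: average of loss * grad of log-probability."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    total = 0.0
    for _ in range(n_samples):
        action = 1 if rng.random() < p else 0
        # For a Bernoulli(sigmoid(theta)) policy,
        # d/dtheta log pi(action) = action - p.
        score = action - p
        total += LOSS[action] * score
    return total / n_samples


theta = 0.3
print("exact:   ", exact_gradient(theta))
print("estimate:", score_function_gradient(theta, n_samples=100_000))
```

The estimate never differentiates through the sampled action itself. It only needs the gradient of the log-probability of whatever was sampled, which is why the same trick applies to sampled tokens.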

What do policy gradients give us?

It's worth expanding on what policy gradients are for, and why they're useful. Policy gradients give us the gradient of how the average outcome over many trajectories varies, when we vary the parameters $\theta$. It's not for a particular rollout, it's for the whole distribution. As such, any gradients that we take include the effect of the CoT on the output.

The function $L$ can be a function of any number of steps in the trajectory. It can be just of the final step (if we're looking at e.g. full CoTs and a single token answer). It can be of many steps at the end (if we're considering a CoT + whether an answer matches the truth, as rated by some other model). It can be basically anything. That's why it's the workhorse of modern LLM RL: PPO, GRPO, etc. are all based on policy gradients.

The proposed method: integrated gradient policy gradient

So if we want to attribute behavior sampled through CoTs to parts of the network, we can just use both of these simultaneously.

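We define a task loss $L$ that depends on the tokens until now, the output of the original model and the output of the new model. The behavior that we want to study (and find sub-circuits for) is thus the expectation when sampling from the corrupted subgraph:

$$\ell(z) = \mathbb{E}_{a_{1:T} \sim G_z}\Big[ L\big(a_{1:T},\, G(a_{1:T}),\, G_z(a_{1:T})\big) \Big].$$

We sample from $G_z$ autoregressively: we start with the context $a_0$ and corrupt it to find $a_0'$, that lets us compute $G_z(a_0, a_0')$ and sample $a_1$ from it; then we corrupt $a_{0:1}$ to get $a_{0:1}'$, etc. I've abbreviated this in the expression above as $a_{1:T} \sim G_z$.

Now we see how we can use both elements.

Integrated gradients: to attribute the behavior through the CoT to components $z$ of the model, we simply need to take $\nabla_z \ell(z)$ interpolated at various points for $z$ between 0 and 1. That is, we want:

$$\text{Attrib} = \frac{1}{m} \sum_{k=1}^{m} \nabla_z\, \ell(z) \Big|_{z = k/m}.$$

We've removed the dependence of Attrib on the data points $x, x'$ because we're sampling things from the model, presumably with some context. But we could average over some contexts, why not.

Policy gradients: The gradient $\nabla_z \ell(z)$ is with respect to a probability distribution. To compute it, we need to use the policy gradient theorem:

$$\nabla_z \ell(z) = \mathbb{E}_{a_{1:T} \sim G_z}\Big[ L\, \nabla_z \log \pi_z(a_{1:T}) + \nabla_z L \Big],$$

where $\pi_z(a_{1:T})$ is the probability that $G_z$ samples the chain of thought $a_{1:T}$, and the $\nabla_z L$ term covers any direct dependence of $L$ on $z$ through the new model's output.

To estimate this expectation, we sample a bunch of CoTs from $G_z$ and average their values of $L\, \nabla_z \log \pi_z(a_{1:T}) + \nabla_z L$.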

We can just plug this into the previous equation, and there we have it: attribution to circuit components over chains of thought.
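Since the post says the method has not been implemented, the following is only a toy sketch of how the two ingredients could fit together, not the author's code. Every specific is an assumption: the "model" is a two-edge logit that emits binary tokens, the edge values are constants, z interpolates linearly between clean and corrupted values, the chain of thought has two steps, and the task loss looks only at the final sampled token (so it has no direct dependence on z and only the score-function term is needed).

```python
import math
import random

# Hypothetical toy "model": two edges feed the logit of a binary token.
W = [1.0, -2.0]        # edge weights
CLEAN = [1.0, 0.25]    # edge values on the clean input
CORRUPT = [0.0, 0.0]   # edge values on the corrupted input
U = 0.7                # effect of the previous sampled token on the next logit
T = 2                  # length of the toy chain of thought


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def step_prob(z, prev_token):
    """Return P(next token = 1) and the gradient of its logit w.r.t. z."""
    logit = U * prev_token
    dlogit = []
    for z_i, w, c, c_bad in zip(z, W, CLEAN, CORRUPT):
        logit += w * (z_i * c + (1.0 - z_i) * c_bad)
        dlogit.append(w * (c - c_bad))
    return sigmoid(logit), dlogit


def loss(tokens):
    """Toy task loss: 1 unless the final token is 1. No direct z dependence."""
    return 0.0 if tokens[-1] == 1 else 1.0


def sample_with_score(z, rng):
    """Sample a CoT; also return grad_z of the log-probability of that CoT."""
    tokens, score, prev = [], [0.0] * len(z), 0
    for _ in range(T):
        p, dlogit = step_prob(z, prev)
        token = 1 if rng.random() < p else 0
        # d/dlogit log Bernoulli(token; sigmoid(logit)) = token - p
        score = [s + (token - p) * d for s, d in zip(score, dlogit)]
        tokens.append(token)
        prev = token
    return tokens, score


def policy_gradient(z, rng, n_samples):
    """Score-function estimate of grad_z E[loss] under sampling at this z."""
    grad = [0.0] * len(z)
    for _ in range(n_samples):
        tokens, score = sample_with_score(z, rng)
        value = loss(tokens)
        grad = [g + value * s / n_samples for g, s in zip(grad, score)]
    return grad


def attribution(m, n_samples, seed=0):
    """Average the policy gradient at z = k/m for k = 1..m, all edges at once."""
    rng = random.Random(seed)
    total = [0.0] * len(W)
    for k in range(1, m + 1):
        z = [k / m] * len(W)
        grad = policy_gradient(z, rng, n_samples)
        total = [t + g / m for t, g in zip(total, grad)]
    return total


print(attribution(m=5, n_samples=20_000))
```

In this toy, including the first edge raises the probability of emitting token 1 at every step, so its attribution comes out negative (it lowers the loss), while the second edge pushes the other way. A real implementation would replace the constant edge values with activations recomputed at each sampled token, which is where the tooling difficulties described in the Discussion come in.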

Discussion

This method is very flexible, because it's just the old EAP-IG, except now we can also compute gradients over probability distributions. The $z_i$ can be assigned to neurons, attention heads, SAE components, anything.

They don't even have to be constant across time. We can have separate components for the gradients at a particular time-step to study the effect of a component at that time step. The same is possible if the 'time step' moves depending on where a token falls, but I think you're missing some of the effect in that case.

I haven't implemented this. It's tricky with open-source packages because you can't just interpolate between the original and corrupted inputs in vLLM, and Huggingface only has quadratic sampling. To make it really efficient, it's also nice to be able to compute the gradients w.r.t. $z$ at every step using the same version of KV-cache attention that you computed.

I might fill this gap with open-source tooling myself, especially if I can get funding for a month to do it.




Change My Mind: The Rationalist Community is a Gift Economy

2025-11-29 15:55:54

Published on November 29, 2025 7:55 AM GMT

Anthropologists have several categories for how groups exchange goods and services. The one you're probably most familiar with is called a Market Economy, where I go to a coffee shop and give them a few dollars and they give me a cup of hot coffee. Rationalists, by and large, are fans of market economies. We just don't usually operate in one.

I. Market and Gift

Let's start with some definitions and examples in case you're unfamiliar with the genre. Allow me to describe two ways of organizing.

Someone offers you an experience you want, maybe some music. You take them up on it, which basically just involves walking over to their place and sitting down to listen; if you're not close enough you just go to the YouTube channel with all the music. You have a great time. Later you write some story and put it on the internet where anyone can read it, sending a note to a friend you wrote it for. Your friend gushes about it back to you. A little while later your friend runs a party with a bunch of food and invites you; you don't go, but you hear the musician did and had a good time.

Or there's the other way.

There's some experience you want, maybe some music. It'd bring you joy, but you can't listen without offering the musician something first. You go write some stories, and offer to let people read the stories if and only if the people give you a kind of marker. One of those people arranges a small conference venue and makes some food, selling tickets to get their own markers. They give you markers for your story, and you give markers to the musician. The musician does this a lot, it's how they get markers to go to places with people to talk to and food to eat.

Which of those sounded more like the rationalist community to you?

Hint: Here's the Bay Area Secular Solstice with music, you can go listen right now if you want. Here's one of the best rationalist stories, if not the best, which you can go read in its entirety right now if you want, or go listen to the audio version. Here's a list of places you can go hang out with rationalists; some of those have ticket prices but most don't, and many of them have free-to-the-attendee food.

Sometimes goods and services are gated by money within the rationalist community, but they mostly aren't. Even when money is a requirement it usually involves people working for free or for far below their market price. We know how to offer a good or service contingent on someone paying us a market price, including the cost of our labor. We just don't do it. 

Someone reading this is thinking of bringing up the larger Solstices, or big events like the East Coast Rationalist Megameetup (which, if you will pardon a shameless plug for a moment, is happening the weekend of December 19th, 2025, and does sell tickets) or LessOnline. Those events do gate attendance on paying. . . somewhat.

But at LessOnline many people volunteered their time in exchange for cheap or free tickets, and the exchange rate was bizarre both ways; professional software engineers spent twelve hours moving chairs and tables instead of paying the ~$500 ticket price. And the price range for Californian Bay Solstice, a price range which gets you effectively the same ticket (as in it's not like front row seats vs balcony), currently looks like this:

That's not market pricing. That's not even really price discrimination. That's a collection hat being passed around inviting gifts.

The far more common transaction around these parts is for people to just give you things.

II. What does a gift economy need?

Gift economies rely on a handful of components.

First, gift economies are most commonly seen where reputation can be tracked and remembered. Gifts between individuals work. You give me a nice pie, or let me sleep on your couch for a week when I'm traveling. I remember this. Even amid a vaguely Dunbar's number sized crowd people often can mentally track who is helping out other folks a lot.

Second, gift economies often take advantage of reciprocity. There's a very human instinct to do nice things for people who have done nice things for you. Sometimes this gets almost codified; when grandma gives you a toy for Christmas, you write her a thank you letter. But we also do nice things for people who do nice things to others. We do this all the time. The archetype here is the village priest never having to eat alone, or a questing knight being offered shelter even if they haven't slain any dragons around this town yet.

Another key fact is that gift economies rarely balance out. When I pay the pizza place twenty bucks for a pizza, that's it, we're square. If I decide to tip, I don't expect them to remember that and throw in an extra couple slices next time I order. Instead, everyone involved tries to provide a bit more than they take most of the time.

Lastly, gift economies are long term. They don't work well as a one-shot, prisoner's dilemma style. In small towns or villages where people don't leave very often, a gift economy can hum along with the goodwill you've earned coming back to you over time. If people are moving away all the time and you might not see them again, people are more likely to use market economies where that accumulated value can be taken with them when it's time to go.

Look at the rationalist community.

Local groups outside of the bay are sub-Dunbar. Many rationalists travel; even amid that crowd you start to recognize a few of the same faces from conference to conference whether the gathering is in Berkeley or Berlin. That's reputation.

For all the supposed ruthless optimization, rationalists are pretty open and warm. People keep trying to give their favourite blogger nice books or swords. I basically have to beat off offers of housing with a stick when I travel. Sure, me and Scott are unusually prominent, but it's not like offers of a couch are hard to get around here.

And oh man, do many rationalists have whole complexes about not having done enough to help others. I still think a lot of you all need to sit down with Atlas Shrugged to get nudged in a usefully more selfish direction.

It's tautological to say, but rationalists who stick around tend to stick around. Many people have been interacting for a decade or two at this point. Present rate no singularity, many people expect to be interacting with some of the newcomers for a decade to come. People may leave town (to go to Berkeley usually) but maybe they'll come back, or wind up working adjacent to you in the future.

III. Is this good?

I think it's pretty good.

This is a low friction, high trust way to be. The gift economy in these parts is taking a bit of an advantage of the generally high surpluses afoot; some folks make a lot of money in the market economy around us, there's a lot of talent and competence on display from a high percentage of people, other folks really like baking bread.

It does run into a few problems. We burn people out a bit often, and when we do we often can't replace them at prices people are happy with. We don't have as clear a signal on what's important to do as we might like, because there isn't a measurable signal coming in.

I'm not making an argument that we should change from being a gift economy. I have this buried anthropologist inside me that just wants to say out loud what's going on, because it's a useful Rosetta stone for understanding how many awesome parts of the community manage to exist in defiance of anything market shaped.

It's also useful to notice when something should shift from gift to market. There comes a scale of Solstice where the venue isn't someone's living room any more and it expects to be paid. At some stage of an organization's life and goals, it's better to swap to paying people a salary to work there.

And I think gift economies are easy to misunderstand if you're coming with a market mindset. You really can just eat the food at your local house party unless the host specifically says otherwise. It's not an explicit trade where you have to give the right sort of gift back. We do a lot of 'pass it forward' around here. If you haven't been in a gift economy, if you're used to having to earn each inch and settle the tab by the end of the night, you can relax a bit. 

(Some people should reverse advice they hear, including this advice.)

Anyone want to try and change my mind?




Epistemology of Romance, Part 1

2025-11-29 14:18:47

Published on November 29, 2025 6:18 AM GMT

The Notebook is one of the most beloved romance films of the 21st century. When I run this activity, whether it’s at a rationality workshop or Vibecamp, and I ask someone to summarize what it’s about, there’s usually (less and less as the years go by) someone who, eyes shining bright, will happily describe what a moving love story it is—all about a man, Noah, who falls in love with a woman named Allie one summer, writes her every day for a year after they're separated, reconnects with her years later, and then in old age reads their story to her in a nursing home because she has dementia, helping her remember their love until they die peacefully in each other's arms.

And none of that is wrong. That is the bulk of what happens in the movie.

But when I follow up by asking them how the relationship started (which, in a dozen instances of running this event, only one person has ever independently volunteered)... well, my favorite reaction was back in 2021, when the teenage girl’s face immediately shifted from glowing enthusiasm to a sheepish “OH… right, that was actually quite problematic, wasn’t it?”

“Problematic.” Ah, the British understatement.

What happens is this: Noah sees Allie at a carnival and asks her out. She says no. He asks why not, and she says “because I don’t want to?” Her date, who was nearby the whole time, leads her away.

Noah follows them, sees her on a ferris wheel, and climbs it until he can force himself to sit between them.

As they’re freaking out, he introduces himself and says he’d really like to take her out. The ferris wheel operator stops the ferris wheel and yells at him that the seats are designed for two. He says okay, climbs out of the seat, then hangs from the bar in front of the seat, well over 600 feet above the ground, so he can ask her out again.

She says no again.

He asks why not again.

She repeats that she doesn’t want to.

So he lets go with one hand, saying she leaves him “no choice.”

More freaking out ensues. Others start yelling at him to stop. He says “not until she says yes,” and starts to visibly struggle to keep his grip with just one hand. Someone yells at her to just say she’ll go out with him, and still clearly freaking out, she says fine, she’ll go out with him.

He then makes her repeat it multiple times until she’s yelling she wants to go out with him, and only then does he use both hands again and climb down.

This is not a minor detail. This is the inciting incident of the entire love story.

And somehow, in the cultural memory of this film, it’s often just not part of the narrative. When people describe The Notebook as romantic, I don’t think they’re explicitly endorsing that scene—most seem to just kind of forget it's there (though these days you’ll find a lot more people calling it out). The emotional beats that stick are the rain kiss, the letters, the nursing home. Not the part where the relationship began with what would, in any other context, be recognized as stalking, coercion, and emotional blackmail.

I'm not bringing this up to cancel The Notebook or to argue that enjoying it makes you a bad person. I'm bringing it up because it's a vivid example of something important: the sources we learn about romance from are not optimized for teaching us true things about healthy romances.

And if that's true of the most popular romance movie of a generation, it's worth asking: what else have we absorbed about romance, dating, and sex?

What do you think you know, and why do you think you know it?

Think about where your beliefs about dating and relationships actually came from. For most people, the answer is some combination of:

  • Media
  • Family
  • Religion and cultural norms
  • Friends and peers

And it’s worth explicitly noticing that none of these sources are optimized for truth.

Each one has its own set of values, incentives, and blind spots that warp what it says or omits about love and sex. And most of us have never really examined which parts of our romantic worldview came from where, or whether those sources should be trusted.

There may be some other area of our lives as important as this that we build on more shaky epistemic foundations, but if there is, I can’t think of it.

Media

This problem applies to some degree across books, television, games, and films, but when it comes to romantic tropes, the two most powerful media forces in most young people’s lives are Disney films and Hollywood blockbusters. While these have certainly gotten better in some ways over the years, the classics are still common canon for youngsters even outside of Western countries. Aladdin is about a boy who spends most of the film lying to his love interest, Cinderella involves a prince who falls for a woman purely on her (also magically deceptive) beauty and a single dance, and as for Beauty and the Beast… yeah.

As for Hollywood, just about every action movie has a shoehorned romance subplot that involves two people meeting and falling for each other, or a one-sided love that becomes reciprocated, in the space of a handful of conversations and scenes. And this would be fine if it was portrayed the way it happens in real life, but the framing is almost exclusively triumphant, implying a happily-ever-after. And situations like The Notebook are alarmingly common when you know what to look for. When’s the last time you watched the original Blade Runner, for example? Remember how that romance subplot was consummated?

The most straightforward reason we can’t trust movies or TV or even novels is that they aren't trying to teach you how to score a date, or how to start and maintain a healthy relationship. They're trying to entertain you. They're optimized for emotional catharsis, dramatic tension, and satisfying narrative arcs.

This creates predictable distortions:

Persistence pays off. In movies, if someone says no, that often just means "try harder" or "make a grander gesture." The protagonist who doesn't give up gets the love interest in the end. In reality, someone who won't take no for an answer is, at best, exhausting, and at worst, dangerous.

Grand gestures work. Showing up at someone's workplace with flowers, making public declarations of love, flying across the country to stop a wedding—these are dramatic, which makes them entertaining. They're also often boundary violations that put pressure on someone to respond positively in front of others. They can work in real life, but usually only when there’s already a romantic spark or mutual interest.

Conflict means passion. Couples who fight constantly are portrayed as having "chemistry." The will-they-won't-they tension, the dramatic breakups and makeups—it all reads as more romantic than two people who actually get along and communicate well. But relationships built on constant conflict are exhausting to actually live through, and tend to cause people to pretzel themselves or each other.

Love at first sight works out. Instant mutual attraction is presented not just as a normal part of courtship, which for many it is, but as destiny. Slow-building connection, developing feelings over time, choosing someone who brings out your best self rather than just excites you sexually—these are less cinematic.

“Soul Mates” are real. The idea that there's exactly one right person out there for you, and your job is to find them. Not only is this statistically absurd, it creates a conflicting framework where any relationship difficulty is evidence that maybe this isn't your soulmate, rather than a normal part of two humans building something together.

There's also a massive selection effect at play. Healthy, stable relationships between people who communicate clearly and maturely don't make for the easy conflicts that are all most writers can manage to resolve. There's no dramatic tension between two people who talk through their problems, respect each other's boundaries, and build a life together without crisis. So the relationships we see on screen are wildly unrepresentative of the relationships worth having.

To be clear, this isn’t to say that nothing you ever see in films can ever work in real life. Sometimes grand gestures do work to start a relationship, even ones that might be considered boundary pushing or risky. Many people’s sexuality does include Consensual Non-Consent dynamics.

But especially when we’re young and impressionable, which everyone is at some point, the weight of each film, book, and show is hard to overstate, and more impactful than most realize.

Oh, and I’d be remiss not to mention the impact of porn on people’s views on sex, young or old, but I expect most people are aware of that particular set of distortions, and it would take a whole other article to go in-depth. Suffice to say, if you want to know what sex is really like, from good to bad to weird, there's not much between superstimuli on one side and awkward conversations with real humans on the other.

Family

Raise your hand if you ever got dating advice from your parents.

Put your hand down if it mostly concerned how to dress or present yourself.

Put your hand down if it mostly concerned warnings about STDs and pregnancy.

Put your hand down if it mostly concerned what sorts of people you shouldn't date.

Put your hand down if it was generic encouragement like "just be yourself" or "you'll find someone when you're ready."

In my experience running this, people tend to be surprised by how few hands go up in the first place, let alone survive to the end. And of those that do, the advice is rarely about the actual skills involved: how to flirt, how to ask someone out, how to read interest, how to navigate rejection. That stuff is almost never touched on, let alone how to navigate consent beyond “no means no,” or communicate what you want in bed.

Older siblings might be more helpful here, but it depends a lot on the sibling and on the relationship people have with them, and advice that crosses gender lines is often different from advice passed down from brother to brother or from sister to sister.

For most of you, your family wasn’t trying to mislead or sabotage you. But they weren’t optimizing for truth either, even if you could trust their epistemics. They were optimizing for a mix of their own comfort, transmitting their values, and your safety.

The distortions here are different:

Sample size of one (or zero). Most people's primary model of "how relationships work" comes from watching their parents. But that's one data point, from a different generation, between two specific people whose dynamic may or may not generalize. If your parents had a bad relationship, you might have learned what to avoid but not what to aim for. If they had a good one, you might assume their specific approach is universal when it might actually just be well-suited to them.

Shame. Most parents will never teach their children how to flirt, how to express romantic interest, how to be sexually attractive, or how to navigate the actual mechanics of dating. The most you're likely to get is warnings, with the implicit message that sexuality is dangerous and needs to be contained, not that it's a normal part of life you could get better at with guidance.

This isn't because parents don't want their kids to eventually find love. It's because talking about these things feels awkward (usually for both parents and kids!), and because parents often have their own shame and hangups. In many cultures, this sort of thing just isn't conceived of as part of the parent's role. 

But the result of all this is that most people enter adulthood with zero explicit instruction in one of the most important skill sets they'll ever need. Imagine if we treated learning to drive or manage money the same way—just warnings about what not to do, encouragement to believe in yourself, and then you're on your own.

Unexamined assumptions. The advice you get from family often comes with invisible premises about gender roles, what relationships are for, what counts as success. "Happy wife, happy life" and "men are simple creatures" and "never go to sleep angry" aren't wisdom; they're cached heuristics that shut down curiosity and flatten the complexity of individual experiences.

Their fears, not yours. Parents especially tend to give advice oriented around their anxieties. Don't date that kind of person. Don't move too fast. Don't let them pressure you. Any of these might be useful wisdom for your circumstances, but they are often motivated by projection and protective fear, not a reasoned examination of the particulars of your personality and your partner’s.

Generational and cultural mismatch. Dating has changed. The scripts your parents followed—how to meet people, when to commit, what the milestones are—may not apply anymore, especially if you’re a first-generation immigrant. Their advice might be accurate for a completely different world than the one you’re living in.

For the most part, parents mean well, and want you to succeed. When it comes to marriage and children, most parents are actually quite invested in your success! But good intentions aren't the same as good information. You can't rely on them to be accurate sources, even when they're genuinely trying to help… and sometimes, their own traumas and hangups will fuel a twisted picture of reality that they then try to pass on to you “for your own good.”

Religion and Culture

If you've ever traveled internationally, or dated someone from a different cultural background, you've probably noticed how different the assumptions can be. Who pays for dates, when you meet each other's families, what physical affection is acceptable in public, or even what counts as “cheating.” It becomes obvious very quickly that what feels like "just how things are done" is actually "how things are done here." And while there may be a purpose to any particular norm, that doesn’t make it less arbitrary, or superior to alternatives.

When I ask people where they learned that sex before marriage is bad, or that men should make the first move, or that people can’t be platonic friends with their exes or the opposite gender, they often can't point to a specific conversation. It was just... in the air. Absorbed from church, or relatives, or the way people talked about others who did things differently. The rules were transmitted without ever being explicitly argued for, which makes them pretty hard to question, let alone critically examine.

As with the previous categories, we can easily see again that religious and cultural traditions around romance and sex aren't optimized for your personal happiness, let alone for empirical accuracy. They're optimized for things like social stability, reproduction, community cohesion, and whatever moral frameworks are dominant.

This doesn't mean they're worthless. Traditions that have persisted often contain some embedded wisdom—about the value of commitment, about not making major decisions in the heat of infatuation, or even just to set a coordination standard that works for 90% of people.

But the wisdom is tangled up with everything else, and it's rarely presented as "here's a useful heuristic" rather than "here's how things can work effectively," or worse, “here’s the only moral way to do this.”

Some common distortions here include:

Gender roles as divine mandate. Men lead, women follow. Men provide, women nurture. Men pursue, women are pursued. These aren't presented as one possible arrangement that works for many people—they're presented as the natural order, or God's design, or just "how it is." If your actual preferences or strengths don't match the script, you’re the one that’s assumed to be bad or wrong.

Purity frameworks. The idea that sexuality before or outside of marriage is not just inadvisable but corrupting—that it makes you damaged goods, that it's something to be ashamed of, that your "body count" determines your worth as a partner. These frameworks don't just affect behavior, they shape how people feel about themselves and others, even long after they've consciously rejected the overarching beliefs attached to them.

Marriage as institution over relationship. In many traditions, marriage isn't primarily about two people's happiness together—it's about family alliances, economic arrangements, producing children, and maintaining social order. The relationship between the actual spouses is secondary. This can show up as pressure to marry by a certain age, to marry within certain groups, or to stay married regardless of how the relationship is actually going.

Courtship scripts that don't match reality. Formal courtship, chaperoned dating, arranged introductions—these made sense in contexts where young people had limited ways to meet, where family reputation mattered, and where marriages were partly economic negotiations. If you're living in a time and place with dating apps and geographic mobility and careers that delay marriage by a decade, the old scripts might still work for some people, but won’t for everyone.

The challenge is that it's hard to extract the useful parts from the rest.

"Commitment is necessary to get through rough patches" can be genuine wisdom. "Commitment means staying no matter what" causes plenty of grievous harm. 

"Don't date just for sex" can be good advice. "Don't have sex until marriage" is disconnected from how most people find romantic fulfillment, and ignores the reality of birth control.

The tradition often doesn't distinguish between the core insight and the specific implementation, and questioning any part often results in backlash that prevents evolution or nuance.

And unlike the media, where most people understand on some level that movies aren't instruction manuals, religious and cultural teachings often come with the explicit claim that they are the correct way to live. That makes them harder to examine critically, and harder to update when your experience doesn't match what you were taught.

Friends and Peers

Finally, we reach what was, for many people, the most trusted source of info. Friends can be your most honest source of information about romance… but "most honest" is a relative term, and there are plenty of biases worth noticing.

Think back to the conversations you had about dating in middle school, or high school, or even college. How much of it was useful? How much of it do you think was honest?

If you were around teenage boys, you probably heard a lot of exaggerated conquest stories implying more experience and smoothness than is plausible. The incentive is to seem like you know what you're doing; admitting confusion or inexperience is low-status, and guys who have even a couple stories to share, even second-hand ones from older brothers or cousins, get a lot of eager ears. The result is a lot of distorted advice that becomes received wisdom, passed around with the confidence of someone who's definitely had loads of sex, trust me.

For girls, the dynamics are often more complicated. There can be a tension between "don't be slutty" and "I'm more grown up than you" that creates a weirdly tiered information environment. More experienced girls might drop hints or share selectively with those less knowledgeable than them, but there's an assortative quality to it—girls tend to form close bonds with others at similar experience levels, and information travels through the grapevine in whispers as often as boasts. You might overhear things, or pick up on hints, but you’re still not going to get a full honest picture.

Either way, the incentive is to craft an image, not to share accurate information. Neither version tells you much about how attraction actually works or what a good relationship looks like. But if that's your primary data source, you're building your model of romance on a foundation of competitive self-presentation.

Performance over accuracy. People describe the relationships they want others to believe they have. The fight that happened last night doesn't always make it into the brunch conversation. The sex that's become routine gets described as "fine." The doubt about whether this is really the right person gets swallowed. There's social pressure to present your relationship as good, your partner as great, your choices as correct. This isn't usually conscious deception—it's just what people do. And while the specific distortions change as people get older, they don't disappear entirely.

Selection effects on who shares what. The friends who talk most about their dating lives aren't a random sample. Sometimes the people with the most to say are the ones with the most drama, which skews your picture of what relationships look like. Meanwhile, the friends with the most embarrassing or uncertain stories, or the ones in stable, happy relationships, might rarely mention them at all.

Validation over truth. Sometimes what you want from a friend is someone to tell you you're right, and sometimes what a friend wants is to support you, not challenge you. These incentives align toward telling you what you want to hear. "You deserve better" is easier to say than "I think you're being unreasonable." "He sounds like a jerk" requires less courage than "Have you considered that you might be part of the problem?" Friends willing to say the hard thing are rare and precious.

Limited sample size, confidently generalized. By the end of high school, your friend group as a whole has had a handful of relationships in total, if that. And from that tiny sample, they'll generalize freely. "It works every time," "That never works," "When you know, you just know." These generalizations feel authoritative because they're delivered with confidence by people you trust, but they're drawn from a minuscule and non-representative slice of the population.

The blind leading the blind. If you're in your twenties trying to figure out dating, your friends are... also mostly in their twenties trying to figure out dating. You're all working from the same limited experience and the same polluted information sources. Pooling your confusion doesn't magically produce clarity. Sometimes it just produces a shared set of misconceptions that feel true because everyone agrees on them.

All that said, honest friends who are willing to share their unfiltered, vulnerable truths are invaluable… they're just rare. And you have to actively cultivate the conditions for that honesty: making it safe to be vulnerable, rewarding candor instead of punishing it, and being willing to hear things that go against accepted wisdom.

If you're lucky enough to have older friends, or friends who've been through more, or friends who think differently than you do, pay attention to them. They're your best chance at escaping the echo chamber. But even then, remember: they're still just individuals, with their own filters and biases and limited data sets.

Now What?

If you've made it this far and this is the first time you've considered these things, you might be feeling a bit adrift. I've just argued that the four main sources of information for romance and sex are all unreliable in different ways. Media wants to entertain you. Your family wants to protect you and pass on their values. Religion and society want conformity and predictability. Your friends want you to like and respect them. None of them are optimizing for truth, and information from each comes filtered through a different set of incentives, blind spots, and distortions.

So where does that leave you?

The good news is that there are some decent resources out there, if you know where to look. The Gottman Institute has decades of research on what actually predicts relationship success and failure. Aella has done fascinating survey work on sexual preferences and experiences. Books like He's Just Not That Into You and Models: Attract Women Through Honesty cut through some of the romantic mythology with practical, grounded advice.

But if you want to be an active participant in your own knowledge, and have the tools to evaluate information others pass along… well, better general epistemics is a good starting point. But navigating the territory of romance and sex in particular is complicated enough that it deserves its own article—especially because it necessarily leaves the realm of purely descriptive epistemic analysis, and takes a step toward more normative assertions and personal analysis.

I'm not the first to notice that we're failing, as a society, to provide good answers to these questions. There are researchers from various fields trying to provide answers, some rigorously, and some… less so. There are also communities and influencers trying to create a fifth source of romantic advice, independent from the traditional sources, each with their own biases and incentives.

In Part 2, I'll go deeper on good sources like the Gottmans and Aella, as well as corrosive movements like The Red Pill and influencers like Andrew Tate—what makes them so appealing, and the progressive blind spots that make them more appealing than they should be. And finally, I'll make the case for what I think is the most crucial part of good romantic epistemics overall: learning to have real, honest, vulnerable conversations with actual humans about what they want and experience.

Until then, consider this an invitation to start noticing. When you have a belief about how dating works, or what men want, or what women want, or what a healthy relationship looks like—ask yourself where that belief came from. You might be surprised how often the answer is "a movie I saw when I was fourteen" or "something my mom used to tell me" or "I don't actually know."

That noticing is the first step. Without it, you're more likely to just end up running scripts written by others… hopefully not Hollywood’s.


