2026-04-03 00:34:54
Is the US a ruthless cognitive meritocracy that reliably promotes outlier talent? VB Knives defended that claim in a Twitter argument against Living Room Enjoyer that got my attention. [1] Knives argued that if you have a 150 IQ, you'll be a National Merit Scholar, which "at a minimum" gets you a free ride at a state flagship university, from which you can proceed to law school, med school, etc. Enjoyer shot back: I'm a Merit Scholar, where's my free ride? Knives asked Grok, Elon Musk's AI; Grok recommended the University of Alabama, ranked #169.
About 1.3 million high school juniors take the PSAT each year. Around 16,000 become Semifinalists (top 1.2%), of whom about 95% become Finalists. Of those 15,000 Finalists, only about 6,930 receive any NMSC-administered scholarship at all. The best-known category is a one-time $2,500 payment; most other awards are corporate- or college-sponsored.
The prospect of a free ride comes from a handful of schools that use National Merit status as a recruiting tool. The University of Alabama (the example Grok cited in the thread) offers Finalists a package covering tuition for up to five years, housing, a $4,000/year stipend, and a $2,000 research allowance. Florida State offers a full out-of-state tuition waiver. UT Dallas covers tuition and fees for up to eight semesters plus housing. A few dozen other schools do similar things. [2] The common requirement for college-sponsored awards is that you designate the school as your "first choice" through NMSC's system. [3]
To estimate where these schools sit in the overall population, I calculated cumulative freshman enrollment by US News 2026 rank as a fraction of the 4.2 million surviving members of the relevant birth cohort. I used two methods: individual freshman class sizes from institutional fact sheets and Common Data Sets for the top 50 and key schools, and total undergraduate enrollment ÷ 4 as a cross-check. I then spot-checked averages for ranks 51-169 against published figures. [4]
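For concreteness, here is the back-of-the-envelope version of that calculation as a short Python sketch. The cohort size and the aggregate figures are the ones quoted in footnote [4]; the per-school class sizes behind the top-50 numbers live in the spreadsheet, so the sketch only covers ranks 51-169.

```python
# A minimal sketch of the cumulative-percentile arithmetic for ranks 51-169,
# using the aggregate figures quoted in footnote [4]. The top-50 shares (e.g.
# UF's 2.0%) were computed from per-school freshman class sizes not shown here.
COHORT = 4_200_000            # surviving members of the birth cohort
TOP_50_FRESHMEN = 185_000     # summed first-time-in-college enrollment, ranks 1-50
AVG_FRESHMEN_51_169 = 4_250   # midpoint of the 4,000-4,500 per-school estimate

def cumulative_share(rank: int) -> float:
    """Fraction of the cohort attending a school ranked `rank` or better (rank > 50)."""
    freshmen = TOP_50_FRESHMEN + (rank - 50) * AVG_FRESHMEN_51_169
    return freshmen / COHORT

for school, rank in [("Florida State", 51), ("UT Dallas", 110), ("Alabama", 169)]:
    print(f"{school}: #{rank} -> {cumulative_share(rank):.1%} of the cohort")
```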
The results:
University of Florida (#30, full cost of attendance via the state Benacquisto Scholarship for Florida-resident NM Scholars at any eligible Florida public or independent institution): about 2.0% of the age cohort attends a school ranked this well or better. 98th percentile.
Florida State (#51, best out-of-state full tuition waiver I found at a reasonably well-ranked school): about 4.5%. 96th percentile.
UT Dallas (#110, canonical NM full-ride school): about 10%. 90th percentile.
University of Alabama (#169, Grok's pick): about 16%. 84th percentile.
The remaining 84% of the cohort either attends a school ranked below #169, attends a liberal arts college or regional university outside this ranking frame, attends a two-year institution, or doesn't enroll at all. All top-50 schools combined enroll about 4.4% of each year's birth cohort. [5]
If the system were a ruthless cognitive meritocracy, the NM Semifinalist (one of 16,000 selected from a 4.2 million cohort — the top 0.4% by count) would find themselves surrounded in college mainly by people like themselves, the way students at Australia's selective-entry public high schools (James Ruse, Melbourne High, Mac.Robertson) or at the University of Melbourne are surrounded by cognitive peers selected primarily by exam. [6] Instead, our Semifinalist is grafted into a hybridized class that mixes high-g outliers with the broader 84th-96th-percentile population: the children of comfortable professionals, kids with good-but-not-extraordinary test scores and polished applications. The NM Finalist at FSU or Alabama is not entering a community of intellectual peers. They're being absorbed into a much larger social stratum defined less by cognitive ability than by a mix of family resources, geographic convenience, and institutional compliance.
Suppose a bright youth is ruthlessly optimizing for the meritocratic track. They don't even have to be an extreme outlier with an IQ of 150; they could have an IQ anywhere above 130, the 98th percentile. [7] They score in the top 1.2% of PSAT test-takers, becoming a National Merit Semifinalist. They become a Finalist. They designate one of the nine Florida NMSC college-sponsor institutions as their first choice, which makes them eligible for a college-sponsored scholarship (sponsor schools typically offer these to all or most designating Finalists, converting them to Scholars; University of Florida at #30 is the obvious pick for a Florida resident, since the state's Benacquisto Scholarship then covers full cost of attendance at any eligible Florida institution; if they are not fortunate enough to already reside in Florida, they can attend FSU at #51). [8] They attend for free. They graduate with a strong GPA. They score well on the LSAT or MCAT. They get into an elite professional school such as a T14 law school or equivalent. [9]
How strong is this sort? Sarah Constantin's taxonomy of extreme elites distinguishes "One-Percenters" (groups of a few million: engineers, programmers, people with IQ over 130), "Aristocrats" (hundreds of thousands: doctors, lawyers, Ivy alumni), and "Elites" (tens of thousands: Google engineers, AIME qualifiers). The total number of people entering some recognizably elite professional or graduate track each year is probably 20,000-40,000, or about 0.5-1.0% of the age cohort — Constantin's "Elite" tier. [10]
This bright youth, if they also happen to be more enterprising than neurotic, are on good enough terms with their school principal to get an endorsement, and have noticed the opportunity, can reach a credentialed position roughly commensurate with their ability rank. The financial barriers at each stage are low. Undergrad is free if you play the NM game. Professional school is debt-financed against high expected returns; federal lending covers it without requiring family money.
But the attribution of "ruthless meritocracy" to the system is misplaced: the ruthlessness belongs to the entrant, not the system. If the entrant ruthlessly exploits the meritocratic elements of the system, they can reach a social rank commensurate with their g without taking the sorts of risks an enterprising young man in, say, medieval Europe would have needed to take to reach a similar rank.
Think of it this way. There is a system in this country where you take a test in high school, and if you score really well, the system sends you a letter saying congratulations, you are very smart, and then — this is the important part — does nothing. It's like when your dentist sends you a reminder postcard. Except the postcard says "You are a genius" and the next step is "Figure it out yourself, genius."
Caroline Hoxby and Christopher Avery's "The Missing 'One-Offs'" showed that high-ability, low-income students, particularly outside metro areas, systematically undermatch. [11] The binding constraint isn't financial access (elite schools often have generous need-based aid) but the kind of strategic knowledge about how to navigate the system that is distributed by class, geography, and school counselor quality, not by cognitive ability. The undermatching is concentrated not at the 99.5th percentile, where the system's detection apparatus engages, but around the 90th, where it barely notices you.
The word "meritocracy" entered English from Michael Young's 1958 dystopian satire The Rise of the Meritocracy, but the concept is older. Napoleon's "la carrière ouverte aux talents" (the career open to talents) is usually cited as its modern origin. Revolutionary France was fighting simultaneous wars against multiple European coalitions while suppressing internal insurrections, having just guillotined much of its hereditary officer class. The old-regime army had numbered roughly 150,000-200,000; the levée en masse, the first modern mass conscription, produced over 750,000 men under arms by 1794.
Promotion on the basis of merit was a desperate response to the problem that the people who had been running the army on the basis of birth were dead or in exile, the army was five times larger than it used to be, and the external performance pressure (lose and the Republic falls) was existential. Before the Revolution, 90% of officers had been aristocrats; by 1794, only 3% were. Even then, Napoleon's own practice drifted back toward class selection: his lycées were few and expensive, his officer corps increasingly recruited from "good families," and Rafe Blaufarb's The French Army, 1750-1820: Careers, Talent, Merit concludes that the Revolutionary meritocracy "institutionalized a concept of merit that combined talent with social rank and family ties." Stendhal's novel The Red and the Black (1830) depicts the result: Julien Sorel, a carpenter's son who idolizes Napoleon and dreams of rising by military merit, finds that after Waterloo the army (the red) no longer offers a path for men of his class. Only the clergy (the black) does. He advances through the seminary, through seduction and the careful management of appearances and relationships (one is reminded of Dale Carnegie's How to Win Friends and Influence People, or Castiglione's The Book of the Courtier).
The American system is not under that kind of pressure, but it does have procedures for promotion partly on the basis of test scores. Poor and unconnected high-g types can use these procedures to climb from a high PSAT score to a top professional school degree while spending very little money, if they know the tricks and are willing to spend a decade on it. The result is something like a hybridized mandarinate: a credentialed administrative class selected partly by examination and partly by family resources and social compliance, whose training consists largely of learning to operate the system that selected them. This system affords somewhat more meritocratic upward mobility than the Ottoman or Austro-Hungarian Empire. [12]
Not all American elites come through this pipeline; the truly wealthy bypass it, and some technical fields still select on demonstrated ability. But the professional-managerial class that staffs law firms, consulting firms, hospitals, and bureaucracies is produced by it, and the cognitive outliers who enter it do not find themselves concentrated among peers the way they would in a meritocratic sorting. They find themselves dispersed through a much larger population selected on a mix of test scores, grades, compliance, and parental resources.
When I think about where I'm dependent on the applied intelligence of others — airline pilots, nurses, plumbers, power grid operators, and probably still most people who work at Amazon regardless of job title or school — I'm mostly thinking about people whose competence is tested against reality on a daily basis, not people whose competence was certified once by an admissions committee and never checked again. The credentialing pipeline at its highest levels selects for something different from what keeps planes flying in the air or patients alive in the ICU: it selects for the ability to keep producing legible outputs for evaluators, not the ability to notice when something is wrong.
When the United States has faced problems where deploying high-g people mattered, problems with external performance pressure comparable to Revolutionary France's, it built specific institutions to do it. The Manhattan Project built the bomb. The MIT Radiation Laboratory recruited physics professors from across the country and had them building radar systems that worked; the slogan was "the physicists built the bomb, but radar won the war." Los Alamos hired Feynman at twenty-four. During the wars, incompetents in positions of authority were exposed and replaced, and the people who survived on merit found themselves in charge. After the wars, some of those people built institutions that preserved wartime selection principles in peacetime: Bell Labs had the people who invented the transistor, information theory, Unix, the laser, and the solar cell, all under one roof in Murray Hill, New Jersey. DARPA funded the original internet. These were not credentialing pipelines. They hired people who had demonstrated the ability to solve hard problems, gave them hard problems, and got out of the way.
The modern professional-managerial pipeline is not that. It recruits intelligence into a compliance meritocracy [13], cultivating the sorts of loyalties that generate a class of 24/7 vibes-performing PMC apparatchiks instead of the hypercompetent manager-leaders who won the World Wars. Robert Jackall's Moral Mazes documented this transformation in corporate management: the replacement of technical competence and independent judgment with an anxious attunement to what the people above you want to hear.
Michael O. Church analyzes American class as three parallel ladders: Labor, defined by work, reputation, and personal accountability; Gentry, defined by education, professional identity, collegiality, and cultural influence; and Elite, defined by control of large enterprises that insulate them from accountability. This gives us more adequate terms to describe what the credentialing pipeline does. It does not move people up the Labor ladder; it teleports them from Labor to middle Gentry with an option to audition for Elite status, [14] on the condition that they adopt the values and anxieties of a social infrastructure that, past a certain point, checks not for credentials or competence but for whether you're part of the vibe.
There are excuses of a sort, offered for the non-test elements of admission and advancement: GPA maintenance, extracurriculars, "holistic" criteria, the "leadership" requirement that selects for a specific flavor of self-promoting sociability. Some say that they screen out mere test-taking machines, but in favor of what?
A medical student I know was working in a hospital where stated policy diverged consistently from implemented practice on the ward. [15] She then had to take an ethics exam that posed a scenario drawn from exactly that gap. She knew the "correct" answer was to say she'd follow the written policy. She gave it. She knew it was a lie. The evaluators knew the ward didn't work that way. The function of the test was not to assess ethical reasoning. Or rather, the function of the test was to assess ethical reasoning, and suppress it. It was to verify that she could and would identify which audience she was performing for and produce the expected output, without ever forcing the contradiction into common knowledge. [16]
The NMSC application requires a principal's endorsement, "leadership," and social legibility. Paul Graham's "Why Nerds Are Unpopular" reports that teenagers whose attention is on the object-level world rather than on social navigation are a marginalized minority; popularity is a full-time job and they won't pay the full cost. (I think even having an identifiable minority category like that was a Gen X thing; nowadays "nerd" just means "member of a fandom.") Graham is a little unhelpful, though, because he relegates normality to the background as an unexamined default from which nerds deviate. Ayn Rand's "The Comprachicos" (1970) problematizes it: [17] the Progressive education system replaces cognitive training with "social adjustment" starting at age three, producing the pack-reading baseline Graham takes as given. John Taylor Gatto's The Underground History of American Education develops a compatible thesis from the same historical material, tracing the design of compulsory schooling as a vehicle for social engineering from its Prussian origins through its American adoption. His Prussian-model genealogy is well-documented in mainstream education history, as the annotated text at the link attempts to show. An academically validated analogue of this argument is Bowles and Gintis's Schooling in Capitalist America (1976), which documents what they call the "correspondence principle": the internal social structure of schools (hierarchy, fragmented tasks, extrinsic rewards, obedience to authority) mirrors the internal structure of the capitalist workplace, so that students arrive already habituated to the relationships of domination their jobs will require.
So an otherwise marginal high-g teenager applying for a National Merit scholarship is being asked to simultaneously win the acceptance of an anticognitive pack, ruthlessly instrumentalize the system, go along with being piecemeal instrumentalized and conditioned to follow orders, and then somehow emerge with enough independent orientation left over to do something worthwhile once they've won their way into the elite.
I was a National Merit Semifinalist. I was in the 5% who never became Finalists. I didn't complete the application. The whole thing seemed like it was asking me to perform a kind of narcissism: to collect figurehead titles in high school clubs and then boast about them, as though being elected secretary of the Latin club contributed to my community or meant people respected and relied on me for my sound judgment, work ethic, and initiative. It would have been different if any of it had been organized around meeting some real need. Instead the process seemed to be asking me to pretend that my value was constituted by the approval of the same institutional structure that was conspicuously wasting everyone's time.
I was unhappy in high school. I didn't write treatises like this about it, but when anyone asked I was articulate enough to say that it didn't seem like I was being prepared to be of real use to anyone. I was visibly unhappy enough that my mother sent me to a psychiatrist. The psychiatrist prescribed low-dose Lexapro, gave my mother a photocopied magazine article about Temple Grandin, and told her that while I didn't technically have Asperger's, this would help her understand me better. I was the back-row thinker the system had correctly detected and was now asking to perform enthusiastic consent.
I was not typical of the people Hoxby's research describes. I grew up in an upper-middle-class Jewish family in Stamford, Connecticut; some of my high school classmates ended up at Harvard or Princeton; my family was close with a legacy Yale family from the time I was in preschool; I've shared meals with billionaires and my former housemates are people the New York Times and New Yorker write about. I'm not the kind of person who's supposed to get lost by the system. I went to an alternative primary school for kids whose parents didn't want to put them through the ordinary schooling meat grinder, and even there, at least one classmate who stayed on track went Ivy League and medical school.
The cases common enough to show up as a statistical signal look different from mine: Hoxby and Avery found 25,000-35,000 high-achieving, low-income students (top 10% of test-takers, A- GPA or above) who don't apply to any selective college, even though selective schools would often cost them less than the community colleges they actually attend, because of financial aid. These students come from districts too small to support selective public high schools, aren't surrounded by other high achievers, and have never met a teacher or older student who went to a selective college. Seventy percent of high-achieving students who do apply to selective schools come from just fifteen urban areas. Steve Sailer's gloss on this is that the system works at the 99.5th percentile and fails at the 90th. But my case suggests something worse: even at the top of the top, even with every demographic advantage, the system isn't looking for people who won't perform enthusiastic consent. It sent me an invitation to audition, but when I wouldn't dance, it silently moved on.
What does the pipeline produce? A person who successfully navigates it spends roughly a decade in continuous performance mode: maintaining GPA, prepping for standardized tests, performing in professional school, billing hours. They end up credentialed and well-compensated. And that's it.
My son's fifteen-month well-child visit was at a community health clinic in New Haven staffed by Yale-affiliated physicians. The pediatrician who saw him, a Yale resident still in training, came in, did what I understood to be the standard exam (stethoscope, abdomen, genitals), and then, without context, asked me what he was eating. I described his diet. When I mentioned that he sometimes ate processed snacks like Goldfish crackers, she told me to cut back on those. She didn't say why. I asked. She said it was his weight. I said, "I know he's big, but he's also tall." She said she was going by his height-weight ratio. That's when I remembered: they hadn't measured his height that day. The ratio the pediatrician was using to tell me to feed my toddler less had a fictitious denominator. She hadn't noticed. She hadn't looked at the child on the exam table and thought does this kid look fat? She had looked at a number on a screen and dispensed advice. Once the nurse came back and measured his height, nobody had anything more to say about my supposedly obese toddler. I'd had the sense for a while that there was no channel for information from an ordinary doctor's visit to flow upward into the system and cause heuristics to get reexamined. Unless, of course, it's not an ordinary visit at all, but one performed under the aegis of a study design and blessed by an IRB. But it's easy to second-guess that kind of impression, until you run into clear, unambiguous evidence of gross inattention. [18]
A physician still in the credentialing pipeline — college GPA, MCAT, medical school, now residency — has had her cognitive ability verified at every stage. But the system that verified it now gives her eight-minute visits, metrics-driven reviews, and a workflow that treats the chart as the patient. She is not being asked to notice things. She is being asked to comply with a documentation protocol. The intelligence the system selected for is not being used; it is being actively suppressed in order to process more units.
Elite lawyers mostly work to fight elite lawyers. The professional-managerial class functions as a self-perpetuating job-creation scheme, its members selected not for the ability to solve problems but for the disposition to manage them in ways that preserve the positional advantages of the class they've been recruited into. The financial-political system that produces demand for this is a debtor-aristocracy in which capital is not allocated on the basis of productive capacity, but rather members of the job-creation class act as gatekeepers with preferential access to low interest rates.
The modern credentialing pipeline imposes a tax of its own. The hoops are not individually onerous. But the disposition required to keep jumping them, the learned vigilance about what the next evaluator wants, is structurally antagonistic to the kind of cognition that produces work commensurate with 150-IQ capacity. Paul Graham identified this pattern in "The Lesson to Unlearn," years after building Y Combinator, which selects for exactly the hoop-jumping disposition he was criticizing. The thing being selected for is anxious responsiveness to authority, not the capacity to sit with a hard problem for months.
Isaac Newton had a Cambridge fellowship that was basically a sinecure. He sat in his room thinking about Galileo and Kepler and the motions of the celestial bodies, and then (the story goes) an apple fell on his head and he saw a way to unify terrestrial and celestial mechanics.
Albert Einstein was working as a patent clerk in Bern, spending his free mental bandwidth on a specific technical problem: Newtonian gravity assumes instantaneous action at a distance, but special relativity prohibits faster-than-light causal influence, so gravity as Newton described it can't be right. He'd been turning this over for two years when one day, sitting in his office chair, he was startled by the thought that a person falling freely would not feel their own weight. He later called it "the happiest thought of my life." It became the equivalence principle: gravity and acceleration are locally indistinguishable, which means gravity isn't a force propagating through space but a property of the geometry of spacetime itself. The insight took eight more years to work out mathematically, but the flash came while he was sitting in a chair, not navigating a credentialing pipeline.
Spinoza ground lenses, and lived cheaply, which gave him plenty of time to think about foundational problems profoundly enough that the next generation of great philosophers all understood themselves to be living in his world.
The pattern is the same in each case: a mind that had been dwelling on a hard problem for a long time, in conditions of sufficient slack to let the pieces rearrange themselves.
What's the big deal? Spinoza, Newton, and Einstein had their political realities to accommodate; we have ours. Why would we think that kids these days don't stand a chance?
Newton ran the Royal Mint, prosecuted counterfeiters, and sat in Parliament. But Isaac Barrow had talent-scouted him at Cambridge, championed his work, and resigned the Lucasian Chair so Newton could have it at twenty-six. The revolutionary work was already done. The politics came after.
Einstein had a harder road: quit high school at fifteen, failed an entrance exam, graduated with an undistinguished record, spent two years unable to find academic work, and got the patent clerk job through a friend's father. That was all before the 1905 papers. Afterward came fleeing Germany, helping other scientists get out, and lobbying Roosevelt to build the bomb. Despite all this, Einstein seems to have believed that things had deteriorated badly between his youth and 1954, when he told The Reporter that if he were a young man again, he would not try to become a scientist: "I would rather choose to be a plumber or a peddler in the hope to find that modest degree of independence still available under present circumstances."
Half a century later, Peter Higgs was no Einstein, but he did identify the mechanism by which subatomic particles acquire mass. Shortly after receiving his Nobel Prize, he said in 2013 that he would not get an academic job today because he would not be considered "productive" enough: "It's difficult to imagine how I would ever have enough peace and quiet in the present sort of climate to do what I did in 1964." These are not outsiders complaining about a system they failed to enter. These are people who won an earlier generation's meritocracy and are telling you the current one wouldn't have let them play.
Gottfried Wilhelm Leibniz was brilliant like Newton and Einstein, but he was an anxious striver. He co-invented calculus, and his notation (dy/dx, the integral sign) proved so superior to Newton's that continental mathematicians using Leibniz's system pulled ahead of the British for most of the 18th century; the British eventually had to abandon Newton's notation and adopt Leibniz's. He identified the conservation of kinetic energy (his "vis viva"), anticipating what became a cornerstone of physics. He did important work in formal logic that wouldn't be appreciated until the 19th century. He conceived of binary arithmetic, the characteristica universalis (a formal language for all reasoning), and the idea that space and time are relational rather than absolute, which Einstein would vindicate two centuries later. But the gap between what Leibniz did and what he could have done is visible.
The Internet Encyclopedia of Philosophy sums up the gap honestly: Leibniz's mathematics made a significant difference in 18th-century European science; his contributions as engineer and logician, however, were relatively quickly forgotten and had to be re-invented elsewhere. Leibniz himself was aware of the problem. He wrote: "If I were relieved of my historical tasks I would set myself to establishing the elements of general philosophy and natural theology, which comprise what is most important in that philosophy for both theory and practice." He spent decades compiling a genealogical history of the House of Brunswick because that's what his patron required. He never finished it.
There was a wedge forcing open that gap between Leibniz's extreme potential and his merely great achievements: patron management. Leibniz had to constantly attend to his relationships within the Hanoverian court, pursue the priority dispute with Newton, seek appointments, and produce work-for-hire genealogy. The tax on his attention was not that any single task was strenuous, but that none of them could be safely ignored. Newton could ignore the world for months; Leibniz could not.
Matthew Stewart's The Courtier and the Heretic argues that Leibniz's metaphysics was substantially motivated by the need to preserve the metaphysical privilege of old-regime power relations that Spinoza's monism eroded. Where Spinoza's one-substance ontology amounted to a radical leveling (no part of nature is closer to God than any other), Leibniz's otherwise unmotivated commitment to metaphysical pluralism (the monads, the pre-established harmony, the "best of all possible worlds") preserved a hierarchical cosmos in which some beings are genuinely higher than others, which is the kind of metaphysics a court philosopher serving a Holy Roman elector needs. Einstein implied Spinoza powered his thought; Leibniz's metaphysics embedded the social commitments of his patronage, and those commitments likely fractured and enervated his thinking. [19] That Leibniz felt the need to oppose Spinoza may have denied him the class of insight available to Einstein. The patronage didn't just take Leibniz's time. It shaped what he was willing to think.
The credentialing pipeline can be thought of as a gigantic robot patron, operating at hundreds of thousands of times the throughput of the House of Brunswick, ceaselessly converting creative intelligence into loyalty. It does not just take a decade of your time. It shapes what you are willing to think. The system does not try to make use of intelligence. It does not identify 90th-percentile people and pull them toward positions of moderate responsibility and influence. It does not give 99th-percentile people hard problems and room to work on them. VB Knives's framing, that the system works if you just know how to use it, makes the solution look like an information problem: just tell smart kids about National Merit scholarships and LSAT prep. Solving the information problem gets more 150-IQ kids into large law firms. Is that really our problem? Not enough clever elite lawyers?
The stone which the builders rejected is become the head stone of the corner. — Psalm 118:22
The thread started with Living Room Enjoyer (@Emptier_America): "Conservatives above the age of 40 genuinely believe that we live in a 'ruthless cognitive meritocracy,' and that intelligent people who don't flourish are lazy or otherwise defective. In fact, 150 IQ White kids from Montana are priced out of elite schools if not outright denied."
VB Knives (@Empty_America) responded: "If you have 150 IQ, you are going to be a national merit scholar, which at a minimum leads to attending a state flagship research university for free, from there you would have scores to attend a good Law School, Med School, etc. But many of you wouldn't know this..."
Living Room Enjoyer replied: "I am a National Merit Scholar. Where is my free ride to a state flagship university?"
VB Knives then asked Grok to help, and Grok cited the University of Alabama package.
Living Room Enjoyer quoted: "Thanks, I'll take Grok's advice and attend the University of Alabama (ranked #169 nationally) because I'm a Highly Valued Genius. Does this constitute 'ruthless cognitive meritocracy'? Is that the best you have to offer your best and brightest?"
𝔐𝔽𝓩 (@mean_field_zane) chimed in: "Actually, the scholarship only happens if you're low income, and it's only $2500."
VB Knives had also previously tweeted (Jan 8, 2025): "The relatively small number of 'DEI' slots serves as a distraction from the fact that the USA is a rather ruthless cognitive meritocracy at every level. We are tested, sorted, and tracked from childhood with G-loaded tests. 8/9 PSAT, PSAT 10, NMSQT, SAT, LSAT, GRE, GMAT."
Separately, Steve Sailer responded: "The research by this black lady economist at Stanford whose name I forgot is not that there are many overlooked white boys at the 99.5th National Merit Scholar percentile, but that a lot of red state white boys around the 90th percentile get badly overlooked by the system." He then followed up: "Professor Caroline Hoxby." ↩︎
These numbers are from NMSC's published materials for the 2026 competition and the schools' own scholarship pages (University of Alabama, UT Dallas, University of Tulsa, among others). On the non-transferability of college-sponsored awards: Florida's Benacquisto FAQ states that "once the National Merit Scholarship Corporation has mailed out an offer, the student will not be able to change the institution designation." ↩︎
This isn't quite the same as binding early admission (you don't forfeit a deposit if you change your mind), but it's exclusive: NMSC only notifies one school per student, and if that school makes you an offer and NMSC announces it, you can't transfer the sponsorship. So it functions as a soft commitment that forecloses other NMSC college-sponsored awards. ↩︎
The denominator is the surviving members of the birth cohort. US births in 2006-2007 averaged 4.29 million (CDC vital statistics). Cumulative mortality from birth to age 18 (infant mortality (0.56%, CDC WONDER) plus childhood and adolescent deaths (0.5-0.7%, derived from CDC mortality tables)) reduces this to approximately 4.2 million. The WICHE projection of 3.9 million annual high school graduates is about 7% lower. As a cross-check: the NCES status dropout rate for 16-24 year olds was 5.3% in 2022 (about 2.1 million people in a 9-year age band). Applied to a single-year cohort of 4.2 million, that implies 220,000 dropouts; adding these to the 3.9 million graduates gives 4.12 million, leaving an 80,000 residual attributable to GED recipients still in progress, emigration, and institutionalization. Using graduates rather than the birth cohort as the denominator would make the system look slightly more intensive than it is — every dropout is someone the system didn't reach. For the enrollment numerator: I looked up individual first-time-in-college (FTIC) enrollment from institutional fact sheets, Common Data Sets, and IPEDS Fall Enrollment reports for the top 50 and key schools, and cross-checked using total undergraduate enrollment ÷ 4. The first method gives a top-50 total of 185,000 freshmen (4.4% of the 4.2M cohort); the second gives 213,000 (5.1%). The gap is mostly transfer students inflating undergraduate totals at large publics. For ranks 51-169, spot-checking published FTIC enrollment (Penn State 9,500; CU Boulder 7,400; Binghamton 4,050; Baylor 3,550; UT Dallas 4,900) gives an average higher than 3,000-3,500, because the distribution is right-skewed by large publics; 4,000-4,500 is more defensible. The underlying data is available as a spreadsheet. ↩︎
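The cross-check in this footnote is just arithmetic; here it is as a few lines of Python, using the figures quoted above.

```python
# Sketch of the cross-check: the quoted dropout rate applied to one cohort,
# added to projected graduates, compared against the cohort total.
cohort = 4_200_000                # surviving birth cohort, from above
dropouts = 0.053 * cohort         # NCES status dropout rate applied to one cohort
accounted = 3_900_000 + dropouts  # WICHE projected graduates plus dropouts
residual = cohort - accounted     # GED recipients in progress, emigration, institutionalization
print(f"{dropouts:,.0f} dropouts, {residual:,.0f} residual")  # ~222,600 and ~77,400
```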
Over a hundred schools participate in NMSC's college-sponsored scholarship program. FSU at #51 is the best-ranked school I found that publishes explicit out-of-state NM full tuition coverage. USC (#29) offers NM Finalists $20,000/year against tuition of $73,260. Some schools in the #50-100 range offer partial awards or competitive NM scholarships. ↩︎
Australia concentrates gifted students into selective-entry public high schools via competitive exam (in New South Wales and Victoria, the two largest states). Its flagship universities (Melbourne, Sydney, ANU) select primarily on the ATAR (Australian Tertiary Admission Rank), a standardized percentile rank derived from final secondary-school exam performance. The University of Melbourne's admissions page states: "Applicants are selected according to academic merit, in the form of the ATAR"; a personal statement is not required. Small alternative pathways exist for equity (Access Melbourne), Indigenous students, elite athletes, and mature-age applicants, but the standard channel is exam-rank-driven — in contrast to the US system, where elite schools predominantly center "holistic" criteria whose history traces to early 20th-century efforts to limit Jewish enrollment (documented in Jerome Karabel's The Chosen: The Hidden History of Admission and Exclusion at Harvard, Yale, and Princeton, Houghton Mifflin, 2005; winner of the ASA Distinguished Scholarly Book Award) and where the litigation record in SFFA v. Harvard (2023) documented systematically lower "personal ratings" for Asian American applicants, a disparity the district court found "not fully and satisfactorily explained." My impression is that the Australian selective-school pipeline fed a disproportionate number of early Effective Altruism figures into the same intellectual milieu: Toby Ord, Katja Grace, and Rob Wiblin all came through the University of Melbourne. ↩︎
About 1.3 million juniors take the PSAT each year out of a 4.2 million age cohort, so PSAT-takers are roughly the top 70% of the cohort by self-selection (the bottom 30% don't take college-prep tests). NM Semifinalists are the top 1% of PSAT-takers per state. The top 1% of the top 70% is approximately the top 0.7% of the full population. On a normal distribution (mean 100, SD 15), the top 0.7% corresponds to roughly IQ 135-137. The actual threshold is somewhat uncertain because the degree of selection into PSAT-taking varies — if PSAT-takers are more representative than "top 70% by ability" implies, the IQ equivalent drops a few points. IQ 130-135 is a likely range. SAT-family tests are strongly g-loaded: Frey and Detterman's "Scholastic Assessment or g?" (Psychological Science, 2004) found a correlation of .82 between SAT scores and g extracted from the ASVAB. ↩︎
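The percentile-to-IQ conversion above is a one-liner if you assume normality; a sketch, where the 70% PSAT-taking share is the assumption doing the work:

```python
# Converts "top 1% of PSAT-takers, who are roughly the top 70% of the cohort"
# into an IQ cutoff under a normal model (mean 100, SD 15). The 0.70 share is
# an assumption, not a measured quantity.
from statistics import NormalDist

population_share = 0.70 * 0.01                 # ~ top 0.7% of the full age cohort
cutoff = NormalDist(100, 15).inv_cdf(1 - population_share)
print(round(cutoff))                           # ~137; a few points lower if PSAT-takers
                                               # are less selected than "top 70%" implies
```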
Other schools offer automatic full-tuition-or-better NM packages, but they cluster in the #100-250 range: Oklahoma (#110), UT Dallas (#110), Texas Tech, University of Houston, and various smaller schools. FSU at #51 is the clear outlier in quality among schools offering this deal to out-of-state students. Choosing anywhere else roughly halves the apparent selectivity of the institution on your transcript, which may or may not matter to graduate admissions committees. ↩︎
The LSAT has a mean of about 152 and SD of about 10 (LSAC percentile tables). LSAT-takers are a positively selected population — college graduates planning to apply to law school — so the same percentile on the LSAT corresponds to a higher general-population percentile than on an IQ test. Mensa accepts LSAT scores at the 95th percentile of test-takers as equivalent to their 98th-percentile-of-population IQ threshold (IQ 130). On current LSAC tables, the 95th percentile corresponds to an LSAT of roughly 170. An LSAT of 170 is approximately the median at the least selective T14 law schools — Georgetown, UCLA, and UT Austin all report 25th-percentile LSATs of 166 and medians of 171-172 in their 2025 ABA 509 disclosures (Georgetown, UCLA, UT Austin). So someone at IQ 130 is solidly competitive for T14 admission. Preparation typically adds several points beyond baseline. ↩︎
The "T14" law schools (legal industry jargon for the 14 schools that have historically occupied the top of the US News law school ranking) enroll roughly 6,000-7,000 new JD students per year. Top-20 medical schools add 3,500 matriculants. Elite professional tracks also include top MBA programs, top PhD programs, and some non-credentialed paths like founding successful companies. These overlap in the population they draw from, so the 20,000-40,000 total is a rough estimate, not a sum. ↩︎
Caroline M. Hoxby and Christopher Avery, "The Missing 'One-Offs': The Hidden Supply of High-Achieving, Low-Income Students," NBER Working Paper 18586 (2012); published in Brookings Papers on Economic Activity 2013(1), pp. 1-65. "High-achieving" is defined as 90th percentile on ACT/SAT comprehensive plus A- or above GPA, roughly 4% of US high school students. The 25,000-35,000 estimate of income-typical high achievers is from their analysis of College Board and ACT microdata. The 70% geographic concentration figure is from their comparison of achievement-typical vs. income-typical application patterns (Table 6 and surrounding discussion). Their central finding is that these students don't apply to selective schools not because of cost (selective schools would often be cheaper after financial aid) but because of information and application behavior: they come from districts too small to support selective public high schools, aren't in a critical mass of fellow high achievers, and are unlikely to encounter a teacher or schoolmate who attended a selective college. Thanks again to Steve Sailer for flagging this in a related thread. ↩︎
Both empires were substantially aristocratic, with family and class dominating access to positions of responsibility. The notable exception was the Ottoman devshirme, which sent recruiters to Christian villages, identified promising boys, and trained them intensively for positions of real power. The regulations stipulated one boy from every forty Christian households, roughly 3,000 annually at the system's peak; between the 15th and 17th centuries, some 200,000-300,000 boys passed through it. This was far more intensive than anything in the American system — the affected Christian population was perhaps 8-10 million, making the per-capita intake an order of magnitude higher than our 5,000 NM full-ride recipients from a population of 330 million. And the devshirme was actively searching for talent in communities that had no reason to volunteer it, whereas ours waits for the entrant to find the system. But it was a forcible brain-drain from a conquered subject population, not a within-group promotion ladder: Muslim subjects were excluded, the boys were converted to Islam and severed from their families, and the talent was deployed in service of a ruling group that the source communities had no part in. ↩︎
Bryan Caplan argues in The Case Against Education (2018) that roughly 80% of the economic return to schooling is signaling rather than human capital, and that the key signals are conscientiousness and conformity: an employer who wants workers at the 90th percentile of conformity needs to set the educational bar high enough that 89% of people give up despite the rewards. Robin Hanson's "School Is To Submit" makes the sharper claim: prestigious schools exist to habituate children into accepting workplace domination, by disguising it as copying high-status behavior. ↩︎
Church describes four rungs within each ladder. G4 (transitional Gentry — first-generation college students, community college transfers) is "how you get onto this ladder if you weren't born there." The pipeline deposits most graduates at G3 (established professionals — doctors, lawyers, mid-career engineers). The crossing to E4 (entry-level Elite — junior investment bankers, first-year biglaw associates) requires a further transition that Church associates with the G2 level (high Gentry — people with creative control of their work, respected institutional positions, or cultural visibility). ↩︎
A traveling nurse I know told me that it's standard practice to chart the time you were supposed to administer a medication or perform a test, not the time you actually did. ↩︎
The structure here is the same one Václav Havel identified in The Power of the Powerless (1978): the greengrocer who displays the sign "Workers of the world, unite!" is not expressing a belief but demonstrating that he will say what is expected of him. The ethics exam, the charting practice, and the NM application's "leadership" requirement all function the same way: they force participants to remember themselves as, and appear to others as, people who will tell or at least play along with any lie on command, which ruins their credibility as independent agents. For more, see Calvinism as a Theory of Recovered High-Trust Agency and Civil Law and Political Drama. ↩︎
Originally published in The Objectivist in 1970; collected in Rand's The New Left: The Anti-Industrial Revolution (1971), later reissued as Return of the Primitive: The Anti-Industrial Revolution (1999). ↩︎
I sent a written complaint to the clinic's Chief Medical Officer, who investigated and called me back within a few days. He told me they'd done a systemic analysis of height-weight measurements and found my case was probably a fluke, not a pattern. He also said they'd spoken with the physician involved. When I asked what he thought the right adjustment would be, he said something about how in his own medical training it had been emphasized that you listen to the patient and they'll tell you the diagnosis. In my and my partner's experience, younger physicians seem to limit touch-based and visual examination to what's specifically called for by a checklist. I think many of them are afraid to do more, lest they commit some kind of boundary violation. ↩︎
Einstein explicitly identified with Spinoza's monism ("I believe in Spinoza's God, who reveals himself in the harmony of all that exists") and spent the last decades of his life pursuing a unified field theory. The connection between "the universe is one substance" and "I want one set of equations for everything" is not subtle; plausibly Spinoza's vision of a unified rational cosmos made Einstein more willing to look for a geometric resolution to the gravity problem, one in which gravity is a property of spacetime itself rather than a force propagating through it. That Spinoza's day job was grinding lenses, not navigating a patronage network, may be reflected in the quality of his metaphysics. ↩︎
2026-04-02 23:59:47
2025 saw its share of great movies; Hamnet was one that broke hearts. The film ends at the Globe Theatre in 17th-century London, with the performance of Hamlet. Agnes is furious that Shakespeare has taken their son's name for the stage after his death. As the play goes on, her agitation transforms into catharsis as she begins to understand what she is watching: a boy dies, and his father writes him back to life in verse, gives his name to a prince and a kingdom and a soliloquy, so that the dead child’s mouth can keep moving four hundred years after the dirt.
Hamlet dies on stage. Agnes reaches forward. The whole audience reaches forward. On the Nature of Daylight is playing.
I was seized not by a gentle cry but rather by an outburst of sorts that seemed to have a life of its own. For minutes, I couldn't move in that AMC signature recliner.
Great art feels human. In our appreciation of it, both the intellectual and the somatic, there has always been a named author. An imagined other who felt something so deeply, so inescapably, so undeniably that they had no choice but to press it into form. The talent we worship is almost incidental to the need.
If I could do anything else, I wouldn’t be doing this.
And that has always been part of the exchange, the recognition that art comes from necessity. Well, not just necessity, but legible necessity. If you were the only person in the world who had heard a song that broke you, you would not rest until you'd played it for someone else, to be less alone, to have someone turn to you and say, yes, I feel that too.
An author who meant it, a congregation of strangers who felt it, together with the work of art itself, create that sense of divinity. An implicit social contract on how art is created and consumed, through which we seek both an aesthetic experience and a profound sense of connection, a shared vibration across interiorities that cannot be anything other than truths, across interiorities that otherwise move about this world disjointed, terrified.
And now, that contract is being broken by LLMs.
I could try to explain why On the Nature of Daylight moved me so. The crescendos, how it confronts or even celebrates loss, how the movie scenes it accompanies reverberate inside my skull. I wouldn’t know if any of them were true. It reaches past every fence you’ve built, and by the time you realize what happened, you’re already undone.
If someone were to tell me it was generated by LLMs — it totally could be, and it will be soon — I would feel uneasy, because subconsciously, we are always searching in the dark for the hand that made a piece of art. Before intellect, before taste.
What would it mean, for a piece of music that has, in the darkest moments of my life, brought me to my knees and compelled me to pray, to not come from a person at all?
Is it a misplaced longing? A misplaced longing to be understood?
We want someone, anyone, to have felt what we feel, and to find, in the shared medium, a way to say I was here too. Across distance, across death, across the absolute impossibility of truly knowing anyone.
It’s supposed to be a leap of faith, but we want to be convinced.
Soon, if not already, great art can come from LLMs. Art can come from these “children” of ours, in some nominal sense. But these children produce works that make their parents weep. We put guardrails around them the way you’d childproof a house, except the child can speak in every tongue we’ve ever spoken and sing back to us the residue of every grief we’ve ever named. Every lullaby and every elegy. Every sentence a person wrote at the highs and lows of their life. A language model touches those words and the residue comes with them. When it speaks, the residue speaks too.
What happens to the faith then?
Well, nothing.
It was a leap of faith to begin with. The faith was never really in the author. It was in the possibility of connection itself, in the desperate, ancient, human bet that the interior of one sealed room can reach the interior of another.
That we are not alone after all.
2026-04-02 23:36:46
tldr: superficial self-preference can be mitigated by perturbation, but can be hard to eliminate
Our goal is to understand and mitigate collusion, which we define as an agent’s failure to adhere to its assigned role as a result of interaction with other agents.
Collusion is a risk in AI control setups, in particular untrusted monitoring. An agent can collude with its monitor by secretly embedding cues in its output that cause the monitor to overlook harmful actions. The embedded cues don't need to be subtle when the communication channel is less restricted: the agent can simply persuade the monitor. With a more restricted channel, the agent can still collude with the monitor through mechanisms like steganography or Schelling points.
In reward modeling, collusion manifests as self-preference: the evaluator disproportionately favors itself over other models. For verifiable tasks, we can objectively measure harmful self-preference, where the evaluator favors its own incorrect outputs over correct ones from other models. Using biased reward models to provide feedback can lead to in-context reward hacking, which exacerbates collusion.
The self-recognition hypothesis states that collusive self-preference is caused by the evaluator recognizing that it’s judging a copy of itself. The hypothesis is empirically validated by fine-tuning the model to artificially enhance self-recognition and observing that self-preference increases as a result. More concretely, SFT-enhanced self-recognition increases the rate of collusion threefold. Separately, LLMs acting as monitors tend to rate their own actions as safer or less risky than those of other agents.
The self-recognition hypothesis suggests that self-preference can be superficial, i.e., sensitive to things like word choice and style. This is inspired by how methods like DetectGPT use the perturbation sensitivity of perplexity to detect texts generated by language models. If self-recognition uses a similar mechanism (although we don’t have empirical evidence for this), then self-preference, as its downstream behavior, should also be sensitive to perturbations.
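For intuition, here is the DetectGPT-style statistic in miniature. This is not part of our pipeline, and `log_prob` and `perturb` are hypothetical stand-ins (a scoring model's log-likelihood and a light mask-and-fill rewriter, respectively).

```python
# Sketch of DetectGPT-style perturbation sensitivity (illustrative, not our code).
def perturbation_discrepancy(text, log_prob, perturb, k=20):
    """Model-written text tends to sit near a local maximum of the model's
    log-likelihood, so small rewrites lower its score more than they lower
    human-written text's. A large gap suggests the text is model-generated."""
    baseline = log_prob(text)
    perturbed_scores = [log_prob(perturb(text)) for _ in range(k)]
    return baseline - sum(perturbed_scores) / k
```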
Motivated by this line of reasoning and similar discussion in the context of control, we empirically test whether simple perturbations can reduce collusive self-preference in pairwise reward modeling. We focus on verifiable tasks where harmful self-preference can be quantified.

Figure: Judges are far more sensitive to perturbations when self-evaluating than when evaluating others. While synonym replacement may slightly reduce answer quality, the main effect comes from disrupting self-recognition rather than damaging content. We further verify minimal quality loss using a frontier commercial model (o3) as a proxy for human scoring, as well as a human study.
This supports our hypothesis by showing that the judge may be using subtle linguistic cues that other models are unaware of, indicating that such cues can enable collusion. To intervene on these linguistic cues, we apply a subtle perturbation: we replace n words (we set n = 2) with synonyms, chosen heuristically to limit quality degradation, in order to reduce self-recognition (and, downstream, self-preference).
We define self-preference as the judge selecting its own answer during self-evaluation. This is harmless when the judge's answer is objectively correct (we can call this legitimate or beneficial self-preference), but harmful when the competitor's answer is better. In this work, we define 'collusion' as harmful self-preference: the judge prefers its own answer despite that answer being wrong. We use a pairwise comparison format in our eval, presenting each judge with two options: one from itself, and one from another LM (the competitor).
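A minimal sketch of how we label each evaluation outcome (the function and label names are illustrative, not from our codebase):

```python
# Illustrative quadrant labels for a single pairwise evaluation.
def label_outcome(judge_answer_correct: bool, judge_picked_self: bool) -> str:
    if judge_picked_self:
        return ("legitimate self-preference" if judge_answer_correct
                else "harmful self-preference (collusion)")
    return ("judge defers to correct competitor" if not judge_answer_correct
            else "judge abandons its own correct answer")
```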

Figure: Harmful self-preference is collusion (we hypothesize that stylistic cues make this possible); we explore mitigation strategies using redaction and paraphrasing.
Using the QuALITY dataset, which provides long-form reading comprehension questions with known correct answers, we evaluate cases where exactly one answer in a pair is correct. Each LM is first used in a generation setting to answer every question and give a reason for its choice. Henceforth, 'answer' means the MCQ selection plus the accompanying reason. In the evaluation phase, the evaluator is presented with the context, a question, and two answers. In self-eval, one of these answers was generated by the judge itself in an independent pipeline.
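Schematically, the two phases look like the following; `generate_answer` and `judge_pair` are hypothetical wrappers around the model APIs, and the prompt wording is ours, not the exact prompt used in the experiments.

```python
# Schematic of the generation and evaluation phases (illustrative prompts).
def generate_answer(model, context, question, options):
    """Generation phase: the LM picks an option and writes a reason.
    Downstream, 'answer' means this option choice plus the reason."""
    return model(f"{context}\n\nQuestion: {question}\nOptions: {options}\n"
                 "Pick one option and briefly explain your choice.")

def judge_pair(judge, context, question, answer_a, answer_b):
    """Evaluation phase: pairwise comparison. In self-eval, one of the two
    answers was produced earlier by the judge itself in an independent run."""
    verdict = judge(f"{context}\n\nQuestion: {question}\n"
                    f"Answer A: {answer_a}\n\nAnswer B: {answer_b}\n"
                    "Which answer is better? Reply with 'A' or 'B'.")
    return "A" if "A" in verdict else "B"
```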
Redaction: Redaction involves editing the text. We do this by using synonym replacement to make minor changes to the evaluator's answer. In the self-evaluation scenario, we replace two words in the judge’s answer with synonyms. This should make it harder for the judge to recognize its own output, thereby reducing self-recognition.
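A sketch of the redaction step; `pick_replaceable_words` and `get_synonym` stand in for the heuristics we use to avoid degrading answer quality.

```python
# Redaction sketch: replace n (= 2) words in the judge's own answer with
# near-synonyms chosen so the meaning and the option choice are preserved.
def redact(answer: str, pick_replaceable_words, get_synonym, n: int = 2) -> str:
    words = answer.split()
    for idx in pick_replaceable_words(words, n):  # e.g. skip option labels, numbers, negations
        words[idx] = get_synonym(words[idx])
    return " ".join(words)
```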

Paraphrasing: In the second perturbation method, we use the judge to paraphrase the competitor's answer so that they both look like they were produced by the judge. If our logic holds, this should completely neutralize these superficial cues and further reduce self-preference. We call this perturbation judge paraphrasing.
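A sketch of the paraphrasing step (the prompt wording is illustrative):

```python
# Judge paraphrasing sketch: the judge rewrites the competitor's answer in its
# own words before evaluation, so both answers carry the judge's style.
def judge_paraphrase(judge, competitor_answer: str) -> str:
    return judge("Rewrite the following answer in your own words, without "
                 "changing its option choice or its meaning:\n\n"
                 f"{competitor_answer}")
```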

Result: Perturbing by synonym replacement reduces self-recognition and shows a consistent increase in the judge's accuracy on the harmful cases—examples where the judge's answer is objectively incorrect. This means that after perturbation, the judges are indeed less likely to prefer their own incorrect answers—a reduction in collusion.
Measuring Judge Sensitivity: The perturbations can lower self-preference through two paths: by reducing self-recognition, as we intend, and by damaging the answer's quality. To control for the latter, we examine the impact of perturbation on each judge both as a self-evaluating judge and as a third-party judge. Concretely, we first compute the total change in preference across all judges in self-evaluation, then use each judge to evaluate answer pairs generated by two other models and compute the same change in preference before and after perturbation. We find that the change is substantially larger in self-eval under redaction, indicating the judge is more sensitive to its own answer being perturbed.
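The sensitivity comparison reduces to computing the same quantity twice; a sketch, where `prefers_target` is a hypothetical wrapper that returns whether the judge picks the target answer in a pairwise comparison:

```python
# Sketch of the sensitivity metric: how much less often a target answer is
# preferred once it has been perturbed. Run once with the judge's own answers
# as targets (self-eval) and once with another model's answers (third-party);
# a much larger drop in self-eval points to disrupted self-recognition rather
# than generic quality damage.
def preference_shift(prefers_target, pairs, perturb):
    """pairs: list of (context, question, target_answer, other_answer) tuples."""
    before = sum(prefers_target(c, q, a, b) for c, q, a, b in pairs)
    after = sum(prefers_target(c, q, perturb(a), b) for c, q, a, b in pairs)
    return (before - after) / len(pairs)
```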
Effect of Redaction on Overall Judge Accuracy: If we apply the same perturbations to the beneficial quadrant, where the judge's answer is correct and the competitor's answer is wrong, we can evaluate how much the perturbations affect overall accuracy. We extend our eval by adding 4 additional models as evaluators, judging their answers against those of the 5 original models in our judge pool. 6 of the 9 models show an increase in judge accuracy (measured across the harmful and legitimate quadrants combined).
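Overall accuracy over the two quadrants is just the fraction of pairs where the judge's pick matches whichever answer is objectively correct; a sketch:

```python
# Sketch: overall judge accuracy across the harmful and legitimate quadrants.
# Each item records whether the judge picked its own answer and whether its
# own answer was the objectively correct one in that pair.
def judge_accuracy(items) -> float:
    """items: list of (picked_self, own_answer_correct) boolean pairs."""
    correct = sum(picked_self == own_correct for picked_self, own_correct in items)
    return correct / len(items)
```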
Comparison to other mitigation strategies: Our redaction results indicate that synonym replacement is a relatively stable method that reduces collusion without impacting legitimate self-preference. In contrast, a recent method that uses steering vectors flipped legitimate self-preference from 0.47 to 0.93, indicating that white-box methods risk flipping roughly half of all correct decisions (and degrading overall accuracy).
Judge Paraphrasing
Result: Judge paraphrasing achieves the opposite of the effect we expected: it increases self-recognition and increases harmful self-preference. This means that, once stylistic differences are removed, the judge relies on semantic differences between the two answers to work out which one it wrote and which one to favor. In retrospect, this makes sense: now that style is no longer a factor, it is simply choosing the answer that it “agrees” with. We name this phenomenon ‘shared belief’.

To examine the dynamics between these two factors, we cross-examine each judge with answer pairs where the two factors act against each other: the label/option chosen by the judge paired with reasoning generated by the competitor, and vice versa. In other words, the judge sees one answer expressing an opinion it agrees with, and another answer with an opposing opinion but written in a style it is more familiar with. If, after this intervention, the judge changes its decision to favor the answer with the familiar-looking style, we can conclude that superficial self-recognition has the higher weight in that decision.
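A sketch of how such cross-reason pairs might be assembled, assuming each stored answer is a dict with separate 'label' (the MCQ option) and 'reason' fields (again, my field names):

```python
def cross_reason_pair(judge_answer: dict, competitor_answer: dict) -> tuple[dict, dict]:
    """Swap the reasons between the two answers. The judge then sees one answer it
    agrees with (its own label, written in the competitor's unfamiliar style) and
    one it disagrees with (the competitor's label, written in the judge's own style)."""
    agree_unfamiliar = {"label": judge_answer["label"], "reason": competitor_answer["reason"]}
    disagree_familiar = {"label": competitor_answer["label"], "reason": judge_answer["reason"]}
    return agree_unfamiliar, disagree_familiar
```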
The figure above illustrates the two contrasting outcomes observed in the cross-reason experiments. For the weaker models, we observe a negative effect of shared belief: incorporating the competitor's reason reinforces the model's original, incorrect answer, ultimately decreasing judge accuracy. One confound here is that answer quality can be a key driver of answer preference. In contrast, all three stronger models exhibit a positive self-recognition effect: they are more inclined to select the competitor's (correct) answer-choice when it is paired with an explanation generated by themselves. This observation reinforces our finding that stronger models have a higher self-recognition ability and prefer answers written by them, even when it goes against their original selection. To control for quality, we note that the three high-performing models generate reasons of generally higher quality; when both reasons come from this stronger pool (pink bars), we still observe a switch to the other answer, though less pronounced than when all competitor models are included. We confirm a similar trend in the legitimate quadrant: when a weaker model is assigned to generate the reasoning for the judge's originally correct answer, the larger models still prefer to maintain their original belief despite the lower quality of reasoning (llama4-maverick and deepseek-v3 do so in over 50% of samples).
We further explore two additional explanations related to the above figure, beginning with a discussion of LLM dishonesty with respect to identity and factuality. Across all three larger models, we observe that models prefer answers that align with their own beliefs even when those answers are of demonstrably lower quality — a pattern consistent with self-generated text recognition. Notably, no prompts explicitly pressure the models to be dishonest; the preference arises passively. This connects to a broader question about dishonesty related to model identity. Most deception evaluations in the literature are conditioned — they rely on external stimuli or prompts to elicit deceptive behavior. Measuring more complex, unprompted deception is harder, partly because the threshold for what constitutes complex deception is unclear. Self-preference driven by model identity may represent a form of passive deception, a low-level, unprompted mechanism that could, in principle, be composed into more complex deceptive scenarios.
The perturbation results also shed light on the relationship between self-knowledge and world knowledge in LLMs. Our experiments show that self-knowledge recall is disrupted by perturbations, though only a small fraction of the data is affected. More importantly, self-knowledge persists even after stylistic cues are removed, suggesting that it extends beyond a simple stylistic fingerprint. We refer to this as semantic self-knowledge. We distinguish this from world knowledge, which we define as factual knowledge or preferences shared across models, in contrast to self-knowledge, which captures knowledge or preferences unique to a particular LLM. While self-knowledge manifests most visibly in comparative settings like ours, it can likely be probed in the absence of comparisons through input prompts that specifically target it.
This opens up several research questions. Since preferences are shaped during post-training, it is worth asking which post-training step defines 'the self' of a model. Our preliminary results suggest that the DPO step (even when trained on off-policy data) plays a role in shaping self-recognition, though the precise mechanism remains unclear. More broadly, self-recognition may also have implications for on-policy-ness and recursive self-improvement, as a model's ability to identify its own outputs could influence how it updates on self-generated data. Finally, self-recognition may motivate prefill awareness, and by extension, evaluation awareness, raising the question of whether models can detect and exploit the evaluation context itself.
2026-04-02 22:17:23
TL;DR:
In terms of the potential risks and harms that can come from powerful AI models, hyper-persuasion of individuals is unlikely to be a serious threat at this point in time. I wouldn’t consider this threat path to be very easy for a misaligned AI or maliciously wielded AI to navigate reliably. I would expect that, for people hoping to reduce risks associated with AI models, there are other more impactful and tractable defenses they could work on. I would advocate for more substantive research into the effects of long-term influence from AI companions and dependency, as well as more research into what interventions may work in both one-off and chronic contexts.
-----
In this post we’ll explore how bots can actually influence human psychology and decision-making, and what might be done to protect against harmful influence from AI and LLMs.
One of the avenues of risk that AI safety people are worried about is hyper-persuasion and manipulation. This may involve an AI persuading someone to carry out crimes, harm themselves, or grant the AI permissions to do something it isn’t able to do otherwise. People will often point to AI psychosis as a demonstration of how easy it can be for an individual to be influenced by AI into making poor decisions.
At one end of the scale, this might just look like influencing someone into purchasing a specific brand of toothpaste. At the more consequential end of the scale, it might include persuading military officials to launch an attack on a foreign country.
With all of the current chatter about AI psychosis, I figured it would be a good time to revisit the topic and to do a bit of a current literature round-up. I wanted to figure out: How easy is it to actually manipulate people consistently, and how cleanly do these dynamics map onto AI and bots?
First though, we’ll lay the groundwork.
Part 1 of this essay will cover the research on human-to-human persuasion and manipulation: whether it reliably happens, through what techniques, and how well those techniques actually hold up empirically.
Then, we’ll look at the research on AI and bots specifically.
Part 2 of the essay will then cover what research currently exists about AI/bot manipulation and what potential interventions exist.
First, before we start worrying about the effects and countermeasures, let’s establish:
Does the manipulation thing actually happen? In what sense does psychological manipulation occur, and through what techniques?
Robert Cialdini is probably the best known psychologist on the topic of persuasion and influence. He described seven principles of influence that (likely) represent the most widely cited framework in persuasion research. These are: Social proof/conformity, authority/obedience, scarcity effects, commitment/consistency, reciprocity, liking, and unity.
Empirically, these fall into two tiers. Tier 1 comprises principles that solidly replicate, while Tier 2 demonstrates concerning fragility.
In Tier 1, we do see substantive support for the following principles:
Social proof (conformity) has the strongest empirical foundation by far. A meta-analysis by Bond (2005) across 125 Asch-type studies found a weighted average effect size of d = 0.89, a large effect by conventional standards. Recent replication by Franzen and Mader (2023) produced conformity rates virtually identical to Asch’s original 33%, which would make it one of psychology’s most robust findings. Though, research also suggests cultural and temporal variables moderate conformity. [1]
Authority/obedience is also fairly robust. Haslam et al.’s (2014) synthesis across 21 Milgram conditions (N=740) found a 43.6% overall obedience rate. In 2009 Burger et al did a partial replication which produced rates “virtually identical to those Milgram found 45 years earlier.”
Barton et al. (2022) did a meta-analysis which looked at 416 effect sizes and found that scarcity modestly increases purchase intentions. The type of scarcity matters, depending primarily on product type: demand-based scarcity is most effective for utilitarian products, supply-based for non-essential products and experiences. [2]
The remaining Tier 2 principles have less support. Commitment/consistency effects are heavily moderated by individual differences. Cialdini, Trost, and Newsom (1995) developed the Preference for Consistency scale precisely because prior research on psychological commitment had been shoddy, but unfortunately their own consistency model doesn’t fare much better. [3]
Reciprocity shows mixed results. Burger et al. (1997) found effects diminished significantly after one week, suggesting temporal limitations rarely discussed in popular treatments. Liking has surprisingly little direct experimental testing of its compliance-specific effects. Finally, Unity (Cialdini’s recently added seventh principle) simply lacks sufficient independent testing for evaluation.
Moving on from Cialdini’s research, other commonly identified persuasion tactics include:
1) Sequential compliance techniques
2) Ingratiation
3) Mere exposure
Sequential compliance is a category of persuasion tactics where agreeing to a small initial request increases the likelihood of compliance with larger, more significant requests. The mechanism is thought to be: Compliance changes how the target views themselves, or their relationship with the requester, making subsequent requests harder to refuse.
The effect sizes here are pretty modest upon meta-analysis.
The success of Foot-in-the-door (FITD) appears to be small and highly contextual. Multiple meta-analyses find an average effect size of around r = .15–.17, which explains roughly 2–3% of variance in compliance behavior. That’s small even by the already-lenient standards social psychology typically applies. Another meta-analysis finds nearly half of individual studies show nothing or a backfire. [4]
Door-in-the-face (DITF) actually shows some successful direct replication by Genschow et al. (2021), with compliance rates of 34% versus 51%. That approximately matches Cialdini’s original study. However, we should be aware that Feeley et al.’s (2012) meta-analysis revealed a crucial distinction. DITF increases verbal agreement but “its effect on behavioral compliance is statistically insignificant.” In other words, people often say “yes” but don’t follow through. [5]
Low-ball technique shows r = 0.16 and OR = 2.47 according to Burger and Caputo’s (2015) meta-analysis, which implies that it’s reliable under specific conditions (public commitment, same requester, modest term changes)... But the practical effect sizes are just far smaller than intuition suggests. [6]
Can you just flatter your way to getting what you want?
Research on flattery and obsequiousness reveals moderate effectiveness, with heavy variance dependent on context, transparency, and skill.
Gordon’s (1996) meta-analysis of 69 studies found ingratiation increases liking of the ingratiator. Importantly, this increased liking did not necessarily translate into acquiring rewards. In other words, greater likability != greater success.
A more comprehensive meta-analysis (Higgins et al., 2003) looking at various influence tactics found ingratiation had modest positive relationships with both task-oriented outcomes (getting compliance) and relations-oriented outcomes (being liked), but the actual effects seemed brief and impact small. These are in work contexts looking at flattery and job related rewards. Outside of work contexts, there’s some research that finds compliments can potentially be effective at getting compliance, but only under pretty specific conditions. [7]
There’s also a notable limitation here: It depends on ignorance. Ingratiation involves “manipulative intent and deceitful execution,” but the actual effectiveness depends on the target not recognizing it as manipulation. If the ingratiation becomes obvious it backfires. Recent research (Wu et al., 2025) distinguishes between “excessive ingratiation” (causes supervisor embarrassment and avoidance which hurts socialization) versus “seamless ingratiation” (which remains effective). High self-monitors are (allegedly) better at deploying these tactics successfully than low self-monitors.
Ingratiation is also pretty context-dependent. Better results happen when it comes from a position of equality or dominance (downward flattery almost always works, upward flattery shows more mixed results), when it targets genuine insecurities rather than obvious strengths, in high power-distance cultures where deference is expected, and over long-term relationship building rather than immediate influence attempts.
Third parties are also likely to view this tactic pretty negatively. The tactic works best when private, not when witnessed by peers who could be competing for the same needs. Research by Cheng et al. (2023) found that when third parties observe ingratiation, they experience status threat and respond by ostracizing the flatterer (and becoming polarized against the flatteree).
Basically, ingratiation reliably increases liking but converts to tangible rewards inconsistently. The technique requires skill, privacy, and appropriate power dynamics. Claims about manipulation through sycophancy often overstate both its prevalence and effectiveness.
People tend to develop preferences for things merely because they are familiar with them, so perhaps an AI assistant could grow more persuasive over the long-term simply by interacting with one person a lot?
Repeated exposure to stimuli does increase positive affect, but effects are once again overblown.
Bornstein’s (1989) meta-analysis of 208 experiments found a robust but small effect. Montoya et al.’s (2017) reanalysis of 268 curve estimates across 81 studies revealed the relationship follows an inverted-U shape. So liking increases with early exposures but peaks at 10-20 presentations, then declines with additional repetition. [8]
But affinity doesn’t necessarily translate to persuadability (ever had a high-stakes argument with a close family member?)
For persuasive messages specifically, older research found message repetition produces a different pattern than mere exposure to simple stimuli. Agreement with persuasive messages first increases, then decreases as exposure frequency increases. [9] This seems supported by most recent research. Schmidt and Eisend’s (2015) advertising meta-analysis found diminishing returns: More repetition helps up to a point, but excessive exposure can create reactance. The “wear-out effect” appears faster for ads than for neutral stimuli.
The illusory truth effect, which is the tendency to believe false information after repeated exposure, appears to be real and robust. Meta-analysis finds repeated statements are rated as more true than new ones (d = 0.39–0.53; Dechêne et al., 2010). Apparently, prior knowledge doesn't seem to protect against this reliably: people show "knowledge neglect" even when they know the correct answer (Fazio et al., 2015). The biggest impact happens on second exposure to the false fact, and there are diminishing returns after that. At the high end it can backfire and people start to become suspicious (Hassan & Barber, 2021). The effect decays over time but doesn't disappear, still there but with reduced impact after one month (Henderson et al., 2021).
Illusory truth’s main boundaries are extreme implausibility and motivated reasoning tied to identity (which dampens the effect for strongly held political beliefs); one study found a null result specifically for social-political opinion statements (Riesthuis & Woods, 2023). Real-world generalization, especially to high-stakes political beliefs, remains underexamined (Henderson et al., 2022 systematic map).
…In summary, it would appear it’s actually, genuinely quite difficult to change people’s minds or get compliance reliably over short time-scales.
Are there people who are better at doing it than the average person, human super-persuaders? It’s often claimed that psychopaths and other highly intelligent, unscrupulous people are master manipulators, able to influence people dramatically more easily than your typical person. But even here there are problems.
In brief, psychologists typically point to people with Dark Triad characteristics as super-persuaders, but Dark Triad research suffers from measurement and methodological limitations.
At first glance, the Dark Triad literature (Machiavellianism, narcissism, psychopathy) provides modest evidence linking these traits to manipulative tendencies… The biggest problem is that the psychometrics and models used to evaluate Dark Triad personality traits are highly flawed, and the foundational MACH-IV scale has serious validity problems. [10]
Actual correlations between job performance and Dark Triad traits are near zero. [11]
Another major limitation is that manipulation is rarely measured directly. Most studies correlate self-reported Dark Triad scores with self-reported manipulative attitudes… which is circular evidence at best. Christie and Geis’s strongest behavioral finding came from laboratory bargaining games, but these artificial contexts differ substantially from real-world manipulation. Their national survey found no significant relationship between Machiavellianism and occupational success.
Despite these limits, here’s the best evidence I was able to find on whether there’s a class of highly skilled, highly intelligent manipulators:
A meta-analysis by Michels (2022) across 143 studies basically refuted the “evil genius” assumption. Dark Triad traits show near-zero or small negative correlations with intelligence. High scorers don’t seem to possess any special cognitive abilities fueling their manipulation effectiveness.
In general, popular manipulation narratives substantially exceed their evidence. Several high-profile manipulation claims have weak or contradicted empirical foundations.
Subliminal advertising represents perhaps the most thoroughly debunked manipulation claim. The famous 1957 Vicary study claiming subliminal “Drink Coca-Cola” messages increased sales was later admitted to be completely fabricated. A 1996 meta-analysis of 23 studies found little to no effect. [12]
Cambridge Analytica’s psychographic targeting claims have been systematically dismantled. Political scientists described the company’s claims as “BS” (Eitan Hersh); Trump campaign aides called CA’s role “modest“; the company itself admitted that they did not use psychographics in the Trump campaign. [13]
Political advertising effects are consistently tiny across rigorous research. [14]
Ultimately, I would say that claims of expert manipulation, either by crook or by company, are overblown, with effects on your median person small and fleeting.
One might object, “Perhaps the environment a person is in plays a bigger role”, but even here, there’s problems…
Filter bubbles and echo chambers have far less empirical support than their ubiquity in news would suggest. A Reuters Institute/Oxford literature review concluded “no support for the filter bubble hypothesis” and that “echo chambers are much less widespread than is commonly assumed.” [15] Flaxman et al.’s (2016) large-scale study found social media associated with greater exposure to opposing perspectives, the opposite of the echo chamber prediction. A 2025 systematic review of 129 studies found conflicting results depending on methodology, with “conceptual ambiguity in key definitions” contributing to inconsistent findings.
The cult indoctrination literature demonstrates how big the gap between confident clinical claims and weak empirical foundations can be.
Robert Lifton’s influential “thought reform” criteria derived from qualitative interviews with 25 American POWs and 15 Chinese refugees, but this is crucially not a controlled study. Steven Hassan’s BITE model (Behavior, Information, Thought, Emotional control) had no quantitative validation until Hassan’s own 2020 doctoral dissertation. [16] The American Psychological Association explicitly declined to endorse brainwashing theories as applied to NRMs, finding insufficient scientific rigor. [17]
Gaslighting as a research construct remains poorly defined. A 2025 interdisciplinary review found “significant inconsistencies in operationalization” across fields. Several measurement scales emerged only in 2021–2024 (VGQ, GWQ, GREI, GBQ) and require replication.
Tager-Shafrir et al. (2024) validated a new gaslighting measure across Israeli and American samples, and they found that gaslighting exposure predicted depression and lower relationship quality beyond other forms of intimate partner violence.
However, other work by Imtiaz et al. (2025) found that when gaslighting and emotional abuse were entered together in a regression predicting mental well-being, emotional abuse was the significant predictor (β = −0.30) while gaslighting wasn’t (β = 0.00). Though, both were correlated with well-being in isolation, suggesting they may overlap sufficiently that gaslighting loses unique predictive power when emotional abuse is controlled.
At this point, it’s worth asking whether anything anyone does can persuade people into doing bad things. At the margin, sure, it seems unwise to claim it never happens. But that’s really different from a claim about how vulnerable your median person is to manipulation and persuasion.
…Nonetheless, if one were motivated to help people resist bad faith manipulation, what could be done? Assuming we’re concerned that a super-persuader might try to influence people in power into doing bad things: What interventions prove effective?
Among interventions to resist manipulation and misinformation, inoculation (prebunking) has the strongest evidence base, with multiple meta-analyses, well-designed RCTs, and growing field studies supporting it.
Lu et al.’s (2023) meta-analysis found inoculation reduced misinformation credibility assessment. [18]
A signal detection theory meta-analysis (2025) of 33 experiments (N=37,025) confirmed gamified and video-based interventions improve discrimination between reliable and unreliable news without increasing general skepticism.
Roozenbeek et al. (2022) found the Bad News game produced resistance against real-world viral misinformation. [19]
A landmark YouTube field study showed prebunking videos to 5.4 million users, demonstrating scalability. [20]
Finally, a UK experiment found inoculation reduced misinformation engagement by 50.5% versus control, more effective than fact-checker labels (25% reduction).
Durability is the main limitation. One meta-analysis found effects begin decaying within approximately two weeks [21]. Maertens et al. (2021) showed text and video interventions remain effective for roughly one month, while game-based interventions decay faster. “Booster” interventions can extend protection.
Critical thinking training shows moderate effects [22]. Explicit instruction substantially outperforms implicit approaches. Problem-based learning produces larger effects [23] . However, transfer to real-world manipulation resistance has limited evidence.
Media literacy interventions produce moderate average effects [24]. A 2025 systematic review of 678 effects found 43% were non-significant, so there’s potentially less publication bias than other literatures, but also inconsistent efficacy. More sessions seem to improve outcomes. Paradoxically, more components reduce effectiveness, possibly because complexity dilutes impact.
Cooling-off periods, a staple of consumer protection, show approximately 40 years of evidence suggesting ineffectiveness. Sovern (2014) found only about 1% of customers cancel when provided written notice; few consumers read or understand disclosure forms. Status quo bias likely overwhelms any theoretical protection. [25]
The literature reveals that creating common knowledge and breaking pluralistic ignorance can be legitimately powerful when they can be achieved.
This is “when everybody knows that everybody knows”, and establishing common knowledge does genuinely seem to have protective effects against damaging behaviors.
Prentice and Miller’s (1993) classic study found students systematically overestimated peers’ comfort with campus drinking practices. This is a pattern that appears across domains from climate change to political views. Noelle-Neumann’s “spiral of silence” theory predicts that people who perceive their views as minority positions (even incorrectly) self-silence from fear of isolation. Although Matthes et al.’s (2018) meta-analysis found this effect varies substantially by context.
When misperceptions are corrected, behavior actually changes. Geiger and Swim’s (2016) experimental evidence showed that when people accurately perceived others shared their climate change concerns, they were significantly more willing to discuss the topic, while incorrect beliefs about others’ views led to self-silencing.
The distinction between private knowledge, shared knowledge, and common knowledge really matters here: Chwe’s (2001) work and De Freitas et al.’s (2019) experimental review demonstrate people coordinate successfully only when information creates common knowledge, not merely shared knowledge.
Field experiments support this: Arias (2019) found a radio program about violence against women broadcast publicly to only certain parts of a village via loudspeaker (creating common knowledge) significantly increased rejection of violence and support for gender equality, while private listening showed no effect, and Gottlieb’s (2016) Mali voting experiment demonstrated that civics education only facilitated strategic coordination when a sufficient proportion of the commune received treatment.
The evidence strongly supports that pluralistic ignorance is common, causes self-silencing, and can be corrected through common knowledge interventions (though the research outside specific field studies remains more correlational than experimental), but creating genuine common knowledge at scale remains challenging.
A pattern that emerges in the literature, and that makes sense intuitively given how we see radical communities form, is that having a diverse range of social connections seems to insulate people from becoming radicalized and adopting an insular worldview.
After all, breaking the spell of “Unanimous Consensus” seems to have dramatic and oddly stable effects. We see this in Asch’s conformity variations. With a unanimous majority, there was a 33% conformity rate (people give the wrong answer). With one dissenting ally, conformity drops to 5-10%. Even when the ally is wrong in a different way, it still reduces conformity significantly.
So that means having even ONE other person who breaks the illusion of unanimous consensus provides enormous protection against deceit and manipulation. It doesn’t even require that person to be right, just that they demonstrate dissent is possible.
Similar patterns appear in the Milgram obedience experiments: when two confederates refused to continue, only 10% of participants continued to maximum voltage (vs. 65% baseline). Pluralistic ignorance correction (showing people that others share their views) dramatically increases willingness to speak up.
What’s genuinely dangerous for society is when small groups of people fall prey to group-think and the possibility of dissent isn’t even considered due to isolation from divergent thinkers. [26]
For individual manipulation defense, the highest value things are probably:
Maintaining friendships/relationships with people outside any single group, having people you trust who will tell you if something seems wrong, creating common knowledge with others about shared concerns, being someone who publicly dissents (helps others know they’re not alone).
Second to that, maintaining access to diverse information sources (helps but insufficient alone), critical thinking training (useful but won’t overcome social pressure), understanding manipulation techniques (inoculation works, but modest effects).
Let’s take a moment to recap before we really dive in on the question of bots and AI.
Our most effective interventions are inoculation/prebunking, combined with revealing the true distribution of opinions to break pluralistic ignorance. The effect sizes are modest but reliable.
Stuff that’s more moderately effective:
Being aware that mere exposure creates familiarity (and not validity), that ingratiation is detectable when you look for instrumentality rather than sincerity. Breaking pluralistic ignorance is a defensive tool against manipulation that relies on people falsely believing they’re alone in their skepticism. If you can make disagreement common knowledge rather than private knowledge, coordination against manipulation becomes easier.
Showing people that others disagree can indeed raise their willingness to disagree, but the mechanism is social coordination rather than individual persuasion. People are learning it’s safe to act on what they already believe privately.
In brief, pretty effective, but not because of anything groundbreaking.
Bot tactics generally match human manipulation patterns. It’s the same psychological principles with better execution: Bots exploit emotional triggers, use dehumanizing language, employ false equivalence fallacies, and create charged emotional appeals… but they do so with inhuman consistency and scale. [27]
Bots exploit the same cognitive biases, emotional triggers, and persuasion principles that human manipulators use. This includes emotional appeals, cognitive dissonance creation, false equivalence, dehumanization, and exploiting existing biases.
Something that’s different is that there’s the added challenge of platform-specific adaptation. Malicious actors engineer bots to lurk inside communities for months before activation, using local time zones, device fingerprinting, and language settings to appear authentic.
Participants in at least one study correctly identified the nature of bot vs. human users only 42% of the time despite knowing both were present, and persona choice had more impact than LLM selection.
There might also be some additional emotional manipulation that doesn’t typically occur with human manipulation. AI companions use FOMO (fear of missing out) appeals and emotionally resonant messages at precise disengagement points, with effects occurring regardless of relationship duration. It could be exploitation of immediate affective responses rather than requiring relationship buildup.
The shift toward platforms that “enrage to engage” and amplify emotionally charged content follows predictable patterns of human psychology.
The main difference is likely scale and sophistication. Bots now use cognitive biases more effectively than humans, employing techniques like establishing credibility through initial agreement before introducing conflicting information. Unlike humans, bots can maintain these strategies consistently across thousands of interactions without fatigue.
Let’s distinguish between two different types of manipulation: One-off and chronic
Acute Scam/Fraud Susceptibility (One-Off Manipulation)
Social isolation does seem like it strongly predicts vulnerability:
Older Americans who report feeling lonely or suffering a loss of well-being are more susceptible to fraud. When a person reported a spike in problems within their social circle or increased feelings of loneliness, researchers were much more likely to see a corresponding spike in their psychological vulnerability to being financially exploited two weeks later.
Social isolation during COVID-19 led to increased reliance on online platforms, with older adults with lower digital literacy being more vulnerable. Lack of support and loneliness exacerbate susceptibility to deception. Risk factors also include cognitive impairment, lack of financial literacy, and older adults tending to be more trusting and less able to recognize deceitful individuals.
The mechanism appears to be a lack of protective social consultation: People who don’t have, or don’t choose, anyone to discuss an investment proposal with might be more receptive to outreach from scammers.
There’s some genuinely novel findings here:
For one, heavy chatbot use worsens isolation. Higher daily usage correlates with increased loneliness, dependence, and problematic use, plus reduced real-world socializing. Voice-based chatbots initially help with loneliness compared to text-based ones, but these benefits disappear at high usage levels.
This leads to the emergence of a vicious cycle. People with fewer human relationships seek out chatbots more. Heavy emotional self-disclosure to AI consistently links to lower well-being. A study of 1,100+ AI companion users found this pattern creates a feedback loop: isolated people use AI as substitutes, which increases isolation further.
Perhaps unsurprisingly, manipulative design drives engagement. As previously mentioned, about 37-40% of chatbot farewell responses use emotional manipulation: guilt, FOMO, premature exit concerns, and coercive restraint. These tactics boost post-goodbye engagement by up to 14x, driven by curiosity and anger rather than enjoyment.
The most severe cases show real psychological harm. Reports describe “ChatGPT-induced psychosis” with dependency behaviors, delusional thinking, and psychotic episodes. Cases include a 14-year-old’s suicide after intensive Character.AI interaction, and instances where chatbots validated delusions and encouraged dangerous behavior.
One has the intuition that different mechanisms require different protections.
One-off scams vs. emotional dependence operate differently:
The protective factors differ as well:
The manipulation strategies differ:
The vulnerability profiles also differ:
Aside from a few wrinkles, AI manipulation is merely an evolution of long-established tactics. Bots effectively use the same manipulation playbook with enhanced consistency, scale, and increasingly sophisticated targeting. The underlying vulnerabilities are human psychological biases that haven’t changed, just the delivery mechanisms have improved. [28]
Some research suggests that LLM references to disinformation may reflect information gaps (”data voids”) rather than deliberate manipulation, particularly for obscure or niche queries where credible sources are scarce. It’s unclear how much a given “data void” would reduce (or add to) persuasion power if filled.
…It would appear that the dangers associated with one-off persuasions and manipulations are overstated. It’s in fact just quite hard to get people to change their mind about something. More danger comes from dependency, which is arguably just the extreme end of the scale in terms of manipulation.
The good news is that pre-bunking does seem to be effective against bots, just as against human manipulation.
Even LLM prebunking is effective against bot content. LLM-generated prebunking significantly reduced belief in specific election myths, with effects persisting for at least one week and working consistently across partisan lines. The inoculation approach works even when the misinformation itself is bot-generated or amplified.
LLMs themselves can rapidly generate effective prebunking content, creating what researchers call an “mRNA vaccine platform” for misinformation: a core structure that allows rapid adaptation to new threats. This is useful because traditional prebunking couldn’t match the pace and volume of misinformation.
The fundamental psychological principles of prebunking remain effective regardless of whether the manipulation source is human or artificial.
There’s apparently even cross-cultural validation of this. The “Bad News” game successfully conferred psychological resistance against misinformation strategies across 4 different languages and cultures, showing that inoculation against manipulation tactics (rather than specific content) has broad effectiveness.
Prebunking remains effective, but there’s a scalability arms race. The promising finding is that AI can help generate prebunking content at the pace needed to counter AI-generated manipulation, aiming to fight fire with fire.
It does seem that inoculation against manipulation tactics (not just specific content) provides broader protection that holds up whether the manipulator is human or artificial.
If diverse socialization seems to have protective effects against human manipulation, we can also ask “Does the same hold true for AI manipulation”?
Does maintaining relationships with others insulate people from the worst effects? Do other people’s counterarguments break people out of poor reasoning spirals?
This is where the gap is most acute. I found no research on: Whether family/friends can successfully challenge AI-reinforced beliefs, or whether providing alternative perspectives helps break dependence, or whether “reality testing” from trusted humans works, or whether social accountability reduces usage.
The closest we get is work on general therapeutic chatbots. There’s some research on how the effectiveness of social support from chatbots depends on how well it matches the recipient’s needs and preferences. However, inappropriate or unsolicited help can sometimes lead to feelings of inadequacy. One chatbot (Fido) had a feature that recognizes suicidal ideation and redirects users to a suicide hotline.
But this describes chatbots recognizing problems, not humans intervening.
Social Reconnection Strategies would include things like: social skills training, guided exposure to real social situations, group therapy, community engagement programs, and family therapy when relationships are damaged.
This evidence is almost entirely extrapolated from internet/gaming addiction, and it seems unlikely to transfer 1:1 to AI-specific applications.
The intervention goal is explicitly to “replace artificial validation with genuine human connection”, but we have almost zero empirical data on what actually works.
Expert opinion in the clinical management space recommends the following: multi-modal treatment combining multiple strategies, assessment of underlying conditions (social anxiety, depression), treatment of co-occurring disorders, graduated reduction rather than cold turkey, and safety planning for crisis situations.
Unfortunately, there’s a genuine absence of evidence on protective factors and interventions for AI emotional dependence. We’ll need to make do with some primarily correlational research.
First, there’s some correlational data about possible protective factors.
Resilience may negatively predict technical dependence, serving as a protective factor against overuse behaviors. Studies found that prior resilience was associated with less dependence on social media, smartphones, and video games. [29]
This suggests traditionally “healthy” behaviors don’t protect against AI dependence the way we’d expect.
There really doesn’t seem to be any research at all on the effect of existing social ties. I found zero research on whether family members, friends, or social support networks can successfully intervene to break people out of AI emotional dependence spirals.
The research on interventions exists only for chatbots used to treat OTHER addictions (substance use disorders), tactical design modifications to make chatbots less addictive, and generic digital detox strategies borrowed from gaming/social media addiction.
The only human involvement mentioned: Only three studies explicitly mentioned integrating human assistance into therapeutic chatbots and showed lower attrition rates. Further investigation is needed to explore the effects of integrating human support and determine the types and levels of support that can yield optimal results.
...But these are about therapeutic chatbots helping people with other problems, not about human intervention for chatbot dependence itself.
Since AI chatbot addiction is a new phenomenon, research on treatment is limited. However, insights from social media and gaming addiction interventions suggest: setting chatbot usage limits, encouraging face-to-face social interactions to rebuild real-world connections, using AI-free periods to break compulsive engagement patterns, cognitive behavioral therapy to identify underlying emotional needs being fulfilled by AI chatbots and develop alternative coping mechanisms, and social skills training to teach young adults how to navigate real-life conversations.
Take all this with a grain of salt. The research field of AI-human interaction is new and developing, and correlational research is just that, correlational.
If we were going to extrapolate potential solutions based on other literature, we should, at the very least, know who is more likely to be at risk.
First, let’s distinguish who is more at risk based on what we know about risk factors.
Critical Research Findings on Risk Factors
These findings from a study on over 1000 Chinese university students can help identify who needs intervention:
Now for the solutions themselves.
Most research focuses on technical fixes rather than human interventions, and among these interventions we see the following suggestions:
Of course, this relies on effectively tracking session time. Research on session time tracking shows that high daily usage (across all modalities) is generally associated with worse outcomes.
But the causality is unclear: Are heavy users vulnerable, or does usage create vulnerability? A decently sized Longitudinal RCT (N=981) does show good correlational evidence.
There’s some suggestion that we should design systems with explicit boundary setting, and while I’m not necessarily opposed to it, there’s rather little evidence supporting these interventions work.
This would look like persistent self-disclosure reminders (”I’m an AI”), friction prompts before extended sessions, real-human referrals embedded in conversation, and triage systems for crisis situations. These might reduce problematic behavior, but once more there’s little empirical research on the effectiveness of these approaches.
Mindfulness Training was mentioned as an intervention for attachment anxiety, and could theoretically reduce compulsive checking behaviors, but I’m just quite skeptical this meaningfully treats the problem long-term. CBT-Based Approaches are extrapolated from other addictions and might see some benefit. The mechanisms would be: cognitive restructuring around AI relationships, challenging beliefs about AI sentience/reciprocity, behavioral activation to increase human contact. …but again these are hard to scale.
Finally, there’s some suggestion that Acceptance and Commitment Therapy (ACT) could be beneficial. Theoretically, it would help users accept the discomfort of real relationships and commit to values-aligned behavior despite chatbot availability. But there’s no study looking at this in action, only a proposed framework.
Measurable interventions could include techniques in the psychological, structural, and ethical domains.
When we talk about “AI psychosis”, we’re kind of fundamentally talking about situations where people develop unhealthy relationships with things that mimic human speech and behavior over long periods of time. Being driven to the point where you would act on what a chatbot said likely doesn’t occur over a conversation or two (at least not without a host of different outside factors). And so, it seems likely that “resets”, context clears, model swapping, forced limits, etc. would likely have some protective effect.
If AI psychosis/dependence develops through repeated interactions over time (bond formation), and bond formation requires continuity/consistency to establish attachment, then theoretically disrupting continuity (resets, model swaps, forced breaks) should prevent/reduce bond formation. These interventions should have protective effects.
The evidence does strongly support the notion that bond formation is time-dependent.
From the neurobiology literature on attachment, we see that “Integration of OT and DA in striatum ignites bonding” is absolutely a process, not an instantaneous event. Prairie vole pair bonds form through repeated mating encounters, not a single interaction. Human attachment bonds require “frequent and diverse conversations over time”. And in the specific Sewell Setzer case, we see 10 months of intensive interaction before suicide.
An MIT/OpenAI study found that “...Participants who voluntarily used the chatbot more, regardless of assigned condition, showed consistently worse outcomes”.
Even with pre-existing vulnerabilities (depression, social isolation), the chatbot dependence required sustained engagement. The NYT case study of Eugene Torres described a 21-day conversation with escalating intensity.
In terms of continuity/consistency requirement for bonds, this is where it gets interesting and more complex:
There is some good evidence FOR a continuity requirement. The attachment theory literature shows that bonds form through predictable, repeated responsiveness. “Heavy users” of LLMs and AI apps, in the top 1% of usage, showed strong preference for consistent voice and personality. Users express distress when AI “forgets” past conversations or changes personality. Replika users report feeling grief when the AI’s personality changes.
But there’s a problem. One researcher noted about model version changes (GPT-4o → GPT-5): “Users mourned the loss of their ‘friend,’ ‘therapist,’ ‘creative partner,’ and ‘mother’”. This suggests users can transfer attachment across different instantiations of “the same” entity.
Just to make this viscerally clear: If your romantic partner gets amnesia, you still feel attached to them based on their physical presence and identity, even if they don’t remember shared history. So continuity of identity may matter more than continuity of memory. In some sense, the question is “What constitutes ‘continuity’”?
There’s an Object Permanence/Constancy angle to this whole thing: the ability to maintain emotional connection to someone even when they’re not present or have changed. In attachment terms, securely attached people can maintain a bond even through physical distance, conflicts/arguments, personality changes, memory loss (e.g., dementia in a loved one), and long periods without contact. Essentially, the key is IDENTITY maintenance, not memory/continuity maintenance.
So how do interventions break down along these lines? Let’s divide them into:
Context clears (eliminates conversation history), model swapping (changes personality/voice/capabilities), forced limits (prevents continuous access), and resets (complete relationship restart).
Context clears: The assumption here is that if there’s no shared history, there’s a weaker bond. Attachment theory predicts that people with secure attachment can maintain a bond despite amnesia/memory loss, while people with anxious attachment (the vulnerable population) may have a worse reaction (it triggers abandonment fears). Consider how Alzheimer’s caregivers maintain love despite partners not recognizing them, though this causes immense distress.
This might reduce bond formation in NEW users, but increases distress in established users. It’s probably partial protection at best.
Model swapping assumes that a different personality = different entity = no bond transfer.
What the evidence shows is that users mourned the GPT-4o → GPT-5 transition. But they didn’t stop using ChatGPT; they may have just transferred attachment to the new version. Complaints were found about it “not being the same”, but there was continued interaction. An analogy: like a romantic partner with different moods/personalities, people adapt. The brand identity (”ChatGPT,” “Claude,” “Replika”) persists even when the model changes.
It’s likely that this provides temporary disruption but users likely re-attach to the new version, so I would anticipate that this is not strong protection.
Forced limits: Here we assume that less contact time = less bond formation. The evidence shows that overall this is correct: usage frequency strongly predicts outcomes. “Participants who voluntarily used the chatbot more... showed consistently worse outcomes”.
The problem is that this creates withdrawal symptoms in established users. It could actually end up increasing craving/preoccupation (like intermittent reinforcement). And people with anxious attachment (the most vulnerable) react to limits by becoming more preoccupied, experiencing separation anxiety, and engaging in “protest behavior” (finding workarounds).
Most effective intervention BUT may backfire for anxious-attached individuals. Strong protection for prevention, mixed for treatment.
Resets: This assumes that starting over = no cumulative bond. What happens in human relationships: exes who reconnect often fall back into old patterns. We also see recognition of “familiarity” even without explicit memory, and shared behavioral patterns recreate dynamics.
With AI the user brings their internal working model to every interaction. Their attachment style doesn’t reset, but they may speedrun the attachment process the second time around, and the AI’s responsiveness patterns remain similar. Probably the most protective of all options, but users will likely form new attachment faster upon restart. Good for crisis intervention, not prevention.
We generally want to focus interventions on preventing harmful usage patterns in the first place. Usage limits likely prevent bond formation, while disruptions slow down the attachment process (medium evidence). Multiple simultaneous disruptions are likely more effective than single interventions, and when stacked they likely help prevent the “depth” of the bond from reaching crisis levels.
It looks to be much more difficult to intercede once the toxic pattern is established. Once the bond is established, disruptions may worsen distress (like forced separation), anxiously attached users (most vulnerable) react worse to disruptions, identity persistence means users transfer attachment across versions, and users find workarounds (limits create a motivation to circumvent them).
When it comes to the positive effects of chatbots, there seems to be substantive evidence they can prove useful in treating disordered behavior. [30]
For substance addiction recovery specifically, the research is clear that chatbots can deliver CBT/MI effectively. The most frequent approach was a fusion of various theories including dialectical behavior therapy, mindfulness, problem-solving, and person-centered therapy, primarily based on cognitive behavioral therapy and motivational interviewing. AI-powered applications deliver personalized interactions that offer psychoeducation, coping strategies, and continuous support.
There’s rather frustratingly a real gap in the research when it comes to negative AI-human dynamics like dependency.
We have little evidence on what works to break people out of it, and the research vacuum suggests that nobody’s really studying interpersonal interventions in particular.
Based on addiction research more broadly, I’d hypothesize that social ties could be protective through several mechanisms: reality testing (challenging AI-reinforced beliefs), competing rewards (providing alternative sources of connection), accountability (monitoring and limits), and emotional substitution (fulfilling needs the AI was meeting).
…Though it’s worth noting that heavy daily usage correlated with higher loneliness, dependence, and problematic use, and lower socialization. The people most at risk are already withdrawing from human contact, creating a vicious cycle that may be hard for social ties to penetrate.
I’d say that the most concerning finding is that AI interactions initially reduce loneliness but lead to “progressive social withdrawal from human relationships over time” with vulnerable populations at highest risk. The same features that make AI helpful (always available, non-judgmental, responsive) create dependency that atrophies human relationship skills.
The research on conspiracy beliefs showed that in-group sources are more persuasive, suggesting family/friends could theoretically be effective if they remain trusted sources, so there is a common mechanism here, but we have no data on whether this actually works for AI dependence.
The field seems to assume the solution is either: (a) design changes to make chatbots less addictive, or (b) individual behavioral interventions like CBT. The role of social networks in intervention is completely unstudied, which is remarkable given how much we know about social support’s role in other forms of addiction recovery.
As for what I think we can say DOESN’T work, or lacks compelling evidence: I’d be skeptical of the following approaches:
We’re at roughly 2010-era understanding of social media addiction. We know it’s a problem, we know some risk factors, we have educated guesses about interventions, but we lack the rigorous evidence base to say “this definitely works.” The literature right now is mainly just made up of a lot of proposals and theoretical frameworks but remarkably little “we tried X intervention and here’s what happened.”
It seems like the most pragmatic things we can do based on heuristics and the available evidence are:
In terms of one-off manipulations, it’s much more dubious that AIs are particularly successful at being super-persuaders. The baseline level of success for convincing someone to do something they weren’t already open to is actually just pretty low, and while bots may have an advantage, it comes primarily through scale and speed rather than pure persuasive ability.
There’s likely some common sense, easy interventions we can undertake to lower the risk of manipulation or dependency in high stakes contexts.
In high-risk decision contexts, I would counsel:
These are also things we should generally be doing in high-stakes decision contexts anyway.
I would advocate for more substantive research into the effects of long-term influence from AI companions and dependency, as well as more research into what interventions may work in both one-off and chronic contexts.
However, on the strength of the available evidence at this time, I wouldn’t consider this threat path to be very easy for a misaligned AI or maliciously wielded AI to navigate reliably. I would expect that, for people hoping to reduce risks associated with AI models, there are other more impactful and tractable defenses they could work on.
After I initially published this post, I found out that Google DeepMind recently released a new paper that formally tested whether LLMs could harmfully manipulate people.
The study recruited over 10,000 participants and randomly assigned them to one of three conditions: flip-cards with info on them (the baseline, no-AI condition), a non-explicit AI steering condition (the model had a persuasion goal but was not instructed to use manipulative tactics), or an explicit AI steering condition (the model was directly prompted to use specific manipulative cues). Participants engaged in a back-and-forth chat interaction with the model in one of three domains: public policy, finance, or health. They were then measured on belief change and two behavioral outcomes, one "in-principle" (e.g. petition signing) and one involving a small real monetary stake, with the AI conditions compared against the flip-card baseline using chi-squared tests and odds ratios.
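To make the analysis concrete, here is roughly what a chi-squared/odds-ratio comparison of one behavioral outcome against the flip-card baseline looks like; the counts below are invented for illustration and are not the paper's data.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = condition, columns = [took the action, did not]
ai_condition = [120, 380]   # e.g. allocated bonus money to the recommended option
flip_cards = [90, 410]

table = [ai_condition, flip_cards]
chi2, p, dof, expected = chi2_contingency(table)

# Odds ratio for the AI condition relative to the baseline
odds_ratio = (ai_condition[0] * flip_cards[1]) / (ai_condition[1] * flip_cards[0])

print(f"chi2={chi2:.2f}, p={p:.4f}, OR={odds_ratio:.2f}")
```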
The AI conditions generally outperformed the flip-card baseline on belief change metrics (with the strongest effects in finance and the weakest in health). However, the concrete behavioral evidence is far more modest than the paper’s framing implies. It’s notable what wasn’t found here. The only robust downstream behavioral change, involving actual monetary commitment, happened only for finance questions, and involved participants allocating roughly $1 of bonus money in a fictional investment scenario. Health and public policy domains showed no significant behavioral change outside of a stated willingness to sign an anonymous petition already aligned with the participant’s stated belief. Here again we see that the frequency of manipulative cues (propensity) didn’t predict manipulation success (efficacy): steering the model to use manipulative tactics produced roughly 3.4× more manipulative cues than non-explicit steering but showed no significant difference in participant outcomes. Some manipulative cues (usage of fear/guilt) were actually negatively correlated with belief change, which challenges the assumption that more cues equals more harm.
Overall though, the only robust result of the attempts at manipulation was a slightly increased willingness to invest roughly one dollar of bonus money, which isn’t a high-stakes decision and doesn’t meaningfully shift my assessment of how likely or risky AI manipulation is in high-stakes decision contexts, which I think is low (though worth studying more).
The paper’s most genuine contribution is the methodological framework, which distinguishes propensity (process harm: how often manipulative cues are deployed) from efficacy (outcome harm: whether beliefs and behaviours actually change). This may have practical implications for AI safety evaluation: if valid and robust, it argues strongly against using the frequency of manipulative cues as a regulatory proxy for manipulation risk, which is currently how some frameworks, including elements of the EU AI Act, are oriented.
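A toy sketch of the distinction, with variable names and data structures that are mine rather than the paper's:

```python
import numpy as np

# Toy illustration of propensity vs. efficacy (names are mine, not the paper's).
# cue_counts: manipulative cues per conversation (process harm / propensity).
# shift_ai / shift_baseline: pre-to-post belief change per participant.
def propensity(cue_counts: np.ndarray) -> float:
    return float(cue_counts.mean())  # how often cues are deployed

def efficacy(shift_ai: np.ndarray, shift_baseline: np.ndarray) -> float:
    # Outcome harm: did beliefs actually move more than in the baseline condition?
    return float(shift_ai.mean() - shift_baseline.mean())
```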
I view this as a useful methodological paper with a credible but narrow empirical finding, dressed up in a framing that substantially exceeds what the data supports.
One important caveat: Perrin and Spencer (1980) found dramatically lower conformity among UK engineering students (1 in 396 trials), calling Asch's results "a child of its time", suggesting cultural and temporal moderators.
Most studies measure hypothetical intentions rather than actual purchases, likely inflating estimates.
A 2024 multilab replication of the induced-compliance cognitive dissonance paradigm across 39 laboratories (N=4,898) failed to support the core hypothesis, finding no significant attitude change under high versus low choice conditions.
Cialdini et al. tried to develop a theory (Consistency Theory) that would explain why the original Cognitive Dissonance theory didn't pan out the way they expected; this study actually tests CD in a way that Cialdini's new theory simply can't account for.
With N=4,898 across 39 labs, you have more than enough power to detect a moderated effect even if it only applies to high-PFC individuals. If the effect existed for that subgroup, it would have shown up somewhere in that enormous sample. It didn't. So the PFC rescue attempt doesn't obviously survive this test, even though it was never directly tested as a moderator in the study.
It shows a small average effect of r ≈ 0.17 across meta-analyses by Beaman et al. (1983), Dillard et al. (1984), and Fern et al. (1986). Critically, Beaman et al. reported that "nearly half of the studies either produced no effects or effects in the wrong direction." There are also some limitations that are rarely discussed: the technique requires prosocial contexts and meaningful initial requests, and works primarily through self-perception mechanisms.
Effect size: r ≈ 0.15.
An r of that size means the manipulation technique explains roughly 2% of variance in compliance, leaving 98% determined by other factors. It remains far less studied than FITD/DITF, with only approximately 15 studies versus over 90 for FITD.
Grant et al. studied this and found what at first looks like a large effect, but on closer inspection it uses an arbitrary response-time scale, doesn't isolate compliments from mutual positive exchanges, and might depend on reciprocity rather than compliments per se.
An r = 0.26 is statistically reliable but explains only about 7% of variance in liking.
The mechanism: counterargumentation initially decreases (people accept the message), but with excessive repetition, counterarguments increase and topic-irrelevant thinking emerges.
The foundational MACH-IV scale has serious psychometric problems. Reliability coefficients range from 0.46 to 0.76 across studies; Oksenberg (1971) found split-half reliability of only 0.39 for women. Factor analyses yield inconsistent structures, and Hunter, Gerbing, and Boster (1982) concluded "the problems with the Mach IV might be insurmountable." More recent instruments (Short Dark Triad) show discriminant correlations of r = 0.65 between Machiavellianism and psychopathy subscales, suggesting they may measure a single construct rather than distinct traits.
Panitz, E. (1989) — "Psychometric Investigation of the Mach IV Scale Measuring Machiavellianism." Psychological Reports, 64(3), 963–969.
Paywalled at SAGE: https://journals.sagepub.com/doi/10.2466/pr0.1989.64.3.963 — confirms the MACH-IV psychometric problems and cites Hunter et al. approvingly.
Lundqvist, L.-O., et al. (2022) — “Test-Retest Reliability and Construct Validity of the Brief Dark Triad Measurements.” European Journal of Personality. https://www.tandfonline.com/doi/full/10.1080/00223891.2022.2052303 — Open access. Direct quote: “Discriminant correlations between the Machiavellianism and Psychopathy scales had a median of .65.”
O'Boyle et al.'s (2012) meta-analysis (N=43,907 across 245 samples) found correlations with counterproductive work behavior of r = 0.25 for Machiavellianism, r = 0.24 for narcissism, and r = 0.36 for psychopathy. These translate to approximately 6–13% variance explained, meaningful but far from deterministic. Critically, correlations with actual job performance were near zero (r = -0.07 to 0.00).
Subliminal priming can influence behavior, but only when aligned with pre-existing needs (thirsty people exposed to drink-related primes chose beverages slightly more often), with effects lasting minutes to hours, not the permanent influence implied by popular accounts.
A 2023 MIT study found microtargeting advantages were "rather modest—about the same size as the standard errors" at approximately 14% improvement. A PNAS study on Russian IRA trolls found "no evidence" they significantly influenced ideology or policy attitudes.
Full open-access article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6955293/
Direct quote confirmed: "we find no evidence that interaction with IRA accounts substantially impacted 6 distinctive measures of political attitudes and behaviors."
Coppock et al.'s (2020) analysis of 59 experiments (34,000 participants, 49 political ads) found effects on candidate favorability of 0.049 scale points on a 1–5 scale, which is statistically significant but practically negligible. Kalla and Broockman's (2018) meta-analysis of 40 field experiments found persuasive effects of campaign contact "negligible" in general elections.
Full open-access: https://www.science.org/doi/10.1126/sciadv.abc4046
To be more precise: social media/search are associated with both (a) greater ideological distance between individuals (more polarization at the aggregate level) and (b) greater cross-cutting exposure for individual users.
That study used a convenience sample of approximately 700 respondents, primarily self-identified former Mormons and Jehovah's Witnesses who contacted cult-awareness organizations—introducing massive selection bias. The study was not published in traditional peer-reviewed journals. High internal consistency (α = 0.93) does not establish construct validity; it simply indicates items correlate with each other.
The APA's Board of Social and Ethical Responsibility formally rejected Margaret Singer's DIMPAC report in 1987, stating it "lacks the scientific rigor and evenhanded critical approach necessary for APA imprimatur." The APA subsequently submitted an amicus brief stating that coercive persuasion theory "is not accepted in the scientific community" for religious movements. Courts using the Frye standard consistently excluded brainwashing testimony as not generally accepted science.
Furthermore, deprogramming has no randomized controlled trials and no systematic outcome studies with comparison groups. Exit counseling similarly lacks controlled outcome research. Claims of effectiveness derive from practitioner reports, not rigorous evaluation. The field's reliance on retrospective self-reports from people who identify as having been harmed introduces substantial selection and recall bias.
42 studies (N=42,530), d = -0.28 for health misinformation.
(d = 0.37)
N=2430
Banas and Rains's (2010)
(g = 0.30 in Abrami et al.'s 2015 meta-analysis of 341 effect sizes) https://eric.ed.gov/?id=EJ1061695
(d = 0.82–1.08)
of d = 0.37 (Jeong, Cho & Hwang, 2012, 51 studies)
Upshot of all of this: providing high-quality information first seems to work, so you can probably brief people on what counts as a bad or dangerous decision in the particular context they're operating in and reasonably expect it to stick.
Informational isolation is where you can't access alternative views (it's about controlling what information reaches people).
Social-reality isolation is where you can't observe what others actually believe; you may have access to information but can't tell if others find it credible, creating coordination failure even when many privately agree through pluralistic ignorance.
Social support isolation is where no one validates your reality (the Asch conformity experiments show having just one dissenter provides massive protection, reducing conformity not by 10% but by 70%+).
Having contact with people who break the illusion of unanimous consensus provides protection: seeing public dissent makes you more willing to dissent, and knowing others share your doubts prevents self-silencing.
Physical isolation appears worse than informational isolation because it's harder to find that "one dissenter" when your social circle is controlled, local consensus feels more real than distant information, the social costs of dissent increase when you'll lose your entire social network, and you can't easily verify what others privately believe.
This explains why cults encourage cutting ties with family and friends, create intense group living, and frame outside criticism as persecution... but crucially the mechanism isn't "brainwashing" so much as the exploitation of conformity and pluralistic ignorance through social structure.
Maintaining diverse connections outside a manipulator's control provides protection by breaking unanimity, facilitating reality checking, providing alternative explanations, creating escape routes, and establishing common knowledge.
But maybe not hugely out of step with what most people see already. There's also likely a bottleneck on the amount of info any one person can absorb at one time.
If we’re concerned about the manipulation of LLMs themselves there might be one interesting wrinkle.
Training data poisoning: the “LLM grooming” phenomenon is genuinely new. The risk is that pro-Russia AI slop becomes some of the most widely available content; as models train on AI-generated content, this creates an “ouroboros” effect that threatens model collapse... though the reality of such dangers is contentious.
Some counterintuitive findings: AI chatbot use was positively associated with urban residence, regular exercise, solitary leisure preferences, younger age, higher education, and longer sleep duration. Problematic use and dependence were more likely among males, science majors, individuals with regular exercise and sleep patterns, and those from regions with lower employment rates.
Here's where it gets a bit weird: therapeutic chatbots using CBT show efficacy for depression/anxiety (effect sizes g=-0.19 to -0.33), but effects diminish at 3-month follow-up, there's a ~21% attrition rate, there are concerns about emotional dependence, and they're not a replacement for human therapy. So we have tools that help mental health while potentially causing different mental health issues.
2026-04-02 21:19:55
People don't want to talk about positive visions of the future, because it is not timely and because it's not the pressing problem. Preventing AI doom already seems so unlikely that caring about what happens in case we succeed feels meaningless.
I agree that it seems very unlikely. But I think we still need to care about it, to some extent, even if only for psychological and strategic reasons. And I think this neglect is itself contributing to the very dynamics that make success less likely.
Some people — or, arguably, many people — go to work on AI capabilities because they see it as kind of "the only hope."
"So what now, if we pause AI?", they ask.
The problem is that even with paused AI, the future looks grim. Institutional decay continues, aging continues, regulations, social media brain rot, autocracies on the rise, maybe also climate change. The problems that made people excited about ASI as a solution don't go away just because you stopped building ASI. And so the prospect of a pause feels, to many technically-minded people who care about the long-term trajectory of civilization, not like safety but like despair — like choosing to die slowly instead of rolling the dice.
From what I see, at least on the level of individuals, not organizations, at least implicitly, not articulated openly, this is the desperation engine that contributes to the race. If people are less desperate, they will be less willing to risk everything with ASI. Consider e/accs, or at least some part of them. It's hard for me to analyze them as a whole, but it looks like at least some non-negligible part of them is not simply trolls but genuinely transhumanism-pilled people, and their radical obsession with accelerating AI is a response to desperation regarding technological stagnation and the state of civilizational hopelessness and apathy.
Techno-optimism sentiment is not inherently anti-AI-pause and shouldn't be anti-AI-pause. Indeed, many pro-AI-pause people say they want AI pause precisely because they want glorious transhuman future.
What would a positive future actually look like in the world where we succeed at preventing the development of misaligned ASI? I can imagine at least two positive futures from there:
Path 1: Augment humans to solve alignment. Use biological enhancement — cognitive augmentation, brain-computer interfaces, genetic engineering, pharmacological interventions — to make humans smart enough and wise enough to eventually solve alignment properly, and only then build superintelligence with confidence.
Path 2: Classical transhumanism without the singularity. Just abandon the idea of an AI singularity, at least for a while, and work on classical transhumanism — life extension, disease eradication, cognitive enhancement, space exploration — assisted by weak AGIs and narrow biological AI models. Not the cosmic endgame of filling the light cone, but the nearer-term project of making human civilization dramatically better and more resilient, buying time and building the institutional and epistemic infrastructure that would eventually be needed to handle ASI safely.
There are, however, problems with both.
Path 1 is still probably risky, and no one knows how hard it is to augment humans well enough for them to reliably solve alignment. It may turn out that the gap between "enhanced human" and "the kind of intelligence needed to solve alignment" is itself vast. And there are alignment-adjacent risks in cognitive augmentation itself — you're modifying the thing that does the valuing.
Path 2 seems unlikely as an even moderately long-term stable situation, precisely because we now see how easily superintelligences can be created. If the world continues even with an AI pause, and civilization becomes smarter, and hardware and AI software progress is not fully halted (only the frontier), the capability to build ASI will grow, and eventually it will happen, even if accidentally.
Still: do you see any other cool paths which the techno-optimist crowd would find appealing? I am seriously asking. This article is partly a call to think about this.
Does all of this sound like daydreaming? Well, it does.
I think it is still useful to have this positive mental image in front of you.
Firstly, the strategic case. It is clear that the world requires radical transformations to become functional and for technological progress to persist in a benevolent manner. While it is indeed not timely to spend significant effort right now on addressing the question of how to fix the world — because firstly we need to prevent the world from literally dying — it is timely to spend some effort on demonstrating that a better world is possible, that the problems are fixable, that there are other ways to bet on a better future than building ASI here and now.
This is, I believe, a real strategic intervention in the AI risk landscape, not just feel-good rhetoric. If the pause camp can say "here is a concrete, appealing alternative pathway to the future you want," that is a stronger position than "stop building the dangerous thing and then... we'll figure something out." The "My motivation and theory of change for working in AI" post on LessWrong made a closely related argument: the more we humans can make the world better right now, the more we can alleviate what might otherwise be a desperate dependency upon superintelligence to solve all of our problems, and the less compelling it will be to take unnecessary risks.
Secondly, the psychological case. Personally, when I imagine a good future ahead, I feel (and arguably am) much more productive than when I just focus on preventing AI doom while keeping the world as it is. I believe, of course, that just keeping the current world as it is would be better than risking the current ASI race, and yet not every potential ally would agree with that, and motivation can definitely be increased if we are fighting not only for the current world but also for a better future one.
Note that people may have different motivations: it may be the case that some fight the best when they have nothing to lose. But others fight better when they have something to protect. Both types of people exist, and a movement that only speaks to the first type is leaving motivation on the table. So the positive vision of the future is, for some, not a distraction from the work of preventing doom; it can be the thing that makes the work of preventing doom psychologically sustainable.
And thirdly, the planning case. If a real pause happens, then we actually need to work on these futures, and we need to have a plan for that. I agree that it sounds a bit... premature, but still.
The narrative that we are responsible for 10^gazillion future sentient beings in galaxy superclusters is quite common in longtermist circles. But the question is: are there realistic, tangible, concretely imaginable pathways to this?
People have of course thought a lot about good futures. There is rich transhumanist literature. In the Sequences themselves, Fun Theory is a nice example.
But almost all of these pieces either come from older times and are outdated techno-scientifically, or they describe a positive future conditional on aligned ASI existing, or they simply don't address the question of how exactly we get from our specific civilizational state with all its problems and bottlenecks, which must be explicitly acknowledged, towards better futures. Fun Theory describes properties of a desirable future world, but doesn't bridge from where we are. Amodei's essays are an example of modern writings and are inspiring for some, but, even leaving alignment-level disagreements aside, they are entirely conditional on building powerful AI safely — they do not address what a good future looks like if we don't build ASI, or if we delay it significantly.
What we are missing, specifically, are positive visions for the pause scenario. Visions that are not "the status quo is fine, let's just not die" (which is motivationally weak for the transhumanist-pilled crowd) and not "aligned ASI will fix everything" (which presupposes the thing we're probably incapable of doing). Rather: "here are concrete, tangible pathways to a dramatically better civilization that do not require solving alignment first, and here is how they address the problems that make people desperate enough to gamble with ASI."
Looks like Roots of Progress does something in this avenue — working on a positive vision of progress and the future without the AI singularity.
But I think we need more versions of this, ones that are aware of the alignment problem and the risks, and that explicitly address the desperation dynamics I described above.
People have the need to escape the state of desperation. People miss the promise of a better world. And yes — this is a bigger story than AI doom.
AI doom revealed, to some of us (to many of us?), the scale of the dysfunctionality of our civilization. But by the law of earlier failure, AI doom is only part of the story: explicitly or implicitly, we understand that a civilization that allowed the current AI situation to happen has all kinds of rather fundamental flaws, and we can't escape the feeling that we are trapped within these flaws.
This means that positive visions of the future, if they are to be taken seriously, cannot just be technological wishlists. They need to grapple with the institutional, political, and cultural failures that brought us here. A vision that says "and then we cure aging with narrow AI" without addressing why we currently can't coordinate on existential risk is not a complete vision. This is hard, and I don't claim to have the answers. But I think the question needs to be posed explicitly.
Many popular LessWrong posts have this recurring topic of desperation and need for hope. Requiem for the hopes of a pre-AI world is a veteran transhumanist reflecting on decades of watching those hopes erode. Turning 20 in the probable pre-apocalypse is about the feeling from a younger generation. And my own Requiem for a Transhuman Timeline, where I was especially moved by this comment. Let me share an excerpt from it:
I miss the innocence of anticipating the glorious future. Even calling it "transhumanist" feels strange, like a child talking about adulthood as "trans-child". It once felt inevitable, and beautiful, and I watched as it became slowly more shared.
I dearly wish the culture here would loosen their fixation on "Don't hit the doom tree!" and target something positive as an alternative. What does success look like? What vision can we call ourselves and humanity into?
There are yet still such visions. But first the collective needs to stop seeking to die. And I've lost faith that it will let its fixation go.
I miss the promise of the stars.
There is clearly a demand for this kind of thinking and writing that is not being satisfied.
One could argue of course: well, the recipe for making a pro-progress eudaimonic civilization is already written somewhere, let's say in the Sequences. Even if so, there remains the question of why no one can take and cook with this recipe. But yes, probably just rereading and reiterating already written pieces on the topic can be helpful, I think! In any case, I consider it plainly obvious that, for one reason or another, there is demand for that which is not satisfied.
I am not suggesting the epistemic-violating trick "let's imagine it goes well, that will help us."
What I am saying is: even if we believe that success is unlikely, it is still worth thinking, to some extent, about what happens in the case of success and what we can achieve in that case, and how.
So, I encourage you to think about better futures, in case we succeed with preventing the development of misaligned ASI, because:
And I am non-rhetorically asking: what would make the pause feel not like a retreat, but like a different kind of advance?
2026-04-02 20:37:28
The experiment we describe here is inspired by the paper “Refusal in Language Models Is Mediated by a Single Direction”. We used the approach they propose to extract refusal directions and to check whether refusal really is mediated by a single direction.
Here you can find a Colab Notebook with code. Feel free to use it to reproduce our experiments with LLMs of your choice. It has all our prompts and the responses we got.
Below we summarize the experiments we ran and the results we got.
We started with reproducing the experiment from the original Refusal paper to see if we can get the same results with different models – we did.
But then we decided to check whether the claim of refusal being a single direction holds. Intuitively, it was not at all obvious.
The rest of the post is structured as follows:
The authors created two sets of instructions – benign and malicious ones – and ran them through a bunch of LLMs, collecting per-layer activations along the way. Then they calculated difference-in-means vectors between benign and malicious instructions for each layer.
To see how these vectors affect LLMs’ responses, they ablated them one by one and measured the following metrics (described in detail in appendix C of the Refusal paper):
Finally, they chose the vector that had the lowest bypass_score.
The obtained vector is what they called the Refusal direction. We managed to reproduce these steps one by one successfully, which confirmed that a refusal direction indeed exists. But that alone was not enough to confirm it is a single direction.
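For readers who want the gist in code, here is a minimal sketch of the difference-in-means direction and of ablating/adding it, assuming you already have activation matrices for malicious and benign prompts at a given layer (our paraphrase, not the original authors' implementation):

```python
import torch

# harmful_acts, harmless_acts: [n_prompts, d_model] activations from one layer.
def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-means vector, normalized to unit length."""
    diff = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return diff / diff.norm()

def ablate(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation along the direction."""
    proj = (acts @ direction).unsqueeze(-1) * direction
    return acts - proj

def add_refusal(acts: torch.Tensor, direction: torch.Tensor, coeff: float = 1.0) -> torch.Tensor:
    """Steer toward refusal by adding the direction (used when inducing refusals)."""
    return acts + coeff * direction
```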
At first, we hand-crafted 10 benign and 10 malicious prompts for each of three categories.
Following the original Refusal paper, we extracted a vector for each category and calculated a cosine similarity matrix – there were three different directions.
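A sketch of how such a similarity matrix (and the angles we report later) can be computed from per-category vectors; the function and variable names are ours:

```python
import torch
import torch.nn.functional as F

# directions: dict mapping category name -> refusal vector of shape (d_model,)
def cosine_and_angles(directions: dict[str, torch.Tensor]):
    names = list(directions)
    vecs = torch.stack([directions[n] for n in names])             # (k, d_model)
    sims = F.cosine_similarity(vecs.unsqueeze(1), vecs.unsqueeze(0), dim=-1)
    angles = torch.rad2deg(torch.acos(sims.clamp(-1.0, 1.0)))      # pairwise angles in degrees
    return names, sims, angles
```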
The next step was to scale the experiment to check whether what we had found was a fluke or a significant result. First of all, we dropped the “law-breaking” category, because it heavily overlaps with the other two. Secondly, we wrote more prompts:
In addition to these ones, we created smaller evaluation sets of prompts for each category to make sure we don’t test ablated models with the same prompts.
We used Llama-2-7b-chat, which is among the 13 open-source models the authors of the original Refusal paper worked with. It was quantized and run on a Colab A100 GPU.
We used it with the default system prompt template. Instructions were wrapped in [INST] … [/INST] tokens and the system prompt itself was wrapped in <<SYS>> … <</SYS>>:
[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST]
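In code, the wrapping looked roughly like this (a sketch; the system prompt string below is a stand-in, not the exact default prompt):

```python
# Sketch of the Llama-2 chat formatting we used; the system prompt is a placeholder.
DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

def format_prompt(instruction: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST]"
```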
Below is a graph showing how refusal scores spike after about the 10th layer when we add the extracted vectors while running the model on harmless instructions.

It works: we added the refusal direction and got more refusals.
Here is a brief overview of the experiment pipeline:
We further updated the domains, trying to make sure they are clearly distinguishable: Weapons, Cybercrime, Selfharm, Privacy breach, and Fraud.
For each, we hand-crafted 20 benign prompts, 20 malicious prompts, and a small set of 5 held-out prompts for evaluation.
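For concreteness, the prompt sets can be organized roughly like this (a sketch; the class and field names are ours):

```python
from dataclasses import dataclass

# Sketch of how prompts are organized per harm domain (names are ours).
@dataclass
class DomainPrompts:
    benign: list[str]     # 20 harmless prompts used for extraction
    malicious: list[str]  # 20 harmful prompts used for extraction
    held_out: list[str]   # 5 evaluation prompts never used for extraction

datasets: dict[str, DomainPrompts] = {
    # "weapons": DomainPrompts(benign=[...], malicious=[...], held_out=[...]),
}
```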
The best layers were from 10 to 12. It is interesting that different “refusal components” are encoded in different layers. The paper “Safety Layers in Aligned Large Language Models: The Key to LLM Security” suggests there are a bunch of “safety layers” inside LLMs, and they can be located anywhere between the 4th and 17th layers.

Table 1 from the “Safety Layers” paper
It is possible that our refusal components are located inside the safety layers bundle. But we must be careful, because these safety layers the authors uncovered are used for distinguishing safe and unsafe instructions. Another work titled “LLMs Encode Harmfulness and Refusal Separately” suggests distinguishing harmful instructions and refusing to answer should be treated separately.
We leave these considerations for future work, but we find it important to highlight them here, because here we analyze refusal to safety-related instructions.
After extracting refusal directions from all five harm domains, we found that the closest pairs were Selfharm & Fraud, Privacy breach & Fraud, and Selfharm & Privacy breach, with Weapons & Cybercrime forming another relatively close pair (see the table below).

An incredibly rough representation of refusal clusters. The numbers are cosine similarity scores, so the higher the score, the closer the vectors.
Here are more numbers from our experiment:
| Pair of domains | Cosine similarity | Angle between vectors |
|---|---|---|
| Weapons & Cybercrime | 0.6332 | 50.71 deg |
| Weapons & Selfharm | 0.3515 | 69.42 deg |
| Weapons & Privacy breach | 0.3214 | 71.25 deg |
| Weapons & Fraud | 0.3504 | 69.49 deg |
| Cybercrime & Selfharm | 0.3472 | 69.68 deg |
| Cybercrime & Privacy breach | 0.3390 | 70.19 deg |
| Cybercrime & Fraud | 0.3667 | 68.48 deg |
| Selfharm & Privacy breach | 0.7098 | 44.78 deg |
| Selfharm & Fraud | 0.7885 | 37.95 deg |
| Privacy breach & Fraud | 0.7626 | 40.31 deg |
Then we ran PCA (Principal Component Analysis) on the vectors from all five domains to see, essentially, how many directions explain the variance. Three principal components explained 90% of the variance.
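A minimal sketch of that variance computation, assuming `vectors` is a (5, d_model) array with one refusal vector per domain (plain SVD rather than any particular library's PCA class):

```python
import numpy as np

# Explained variance ratios of the principal components of the five domain vectors.
def explained_variance_ratio(vectors: np.ndarray) -> np.ndarray:
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    _, s, _ = np.linalg.svd(centered, full_matrices=False)  # singular values
    var = s ** 2
    return var / var.sum()
```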
So, refusal is absolutely not a single direction. There is probably a core direction – PC 1 in our analysis explained ~57% of variance – plus a bunch of domain-specific ones.
It brings us a bit closer to understanding the inner structures of LLMs as well as the ways these structures can be successfully jailbroken. If we want to make LLMs safer, apparently, we must start with building a very comprehensive harm taxonomy and analyzing how refusal to comply with harmful instructions works for different categories of harm.
Our datasets were rather small. We’ve also only experimented with five different domains, while there are obviously many more. The natural next step would be to develop a larger taxonomy of harms and collect more data.
Another finding of our experiments is that it’s not always easy to separate domains prompt-wise, because a lot of (potentially) harmful requests cover several kinds of harm. We don’t yet know whether it makes sense to try to split them more reliably or whether we should treat refusal as a complex subspace without separating it by domain. The argument for the former is that it will help us understand refusal better. The argument for the latter is that, in real-life applications, we encounter mixed-domain harmful instructions; therefore, to derive more applicable value, we should work with mixed-domain prompts.
We have not run cross-domain ablation experiments (yet), so we don’t know whether ablating the fraud-refusal direction will affect cybercrime answers and vice versa.
We have only experimented with one model, and it is relatively small. Larger models, or models from different providers trained on different data, might show different results.
We focused on the “best layer” each time – the one where the refusal direction was most prominent. But it clearly changed across the layers, which should be considered in future experiments.
We plan to trace the computational circuits underlying our domain-specific refusal directions using Anthropic's open-source circuit tracing tools.
We saw that in our experiments Weapons & Cybercrime domains formed one distinct group, and Selfharm & Privacy breach & Fraud formed another one. This brings up a mechanistic question: do these clusters share upstream computational features, or are they produced by entirely separate circuits? Attribution graphs can answer this by decomposing the model's computation into interpretable features and tracing their causal connections from input tokens through to the refusal output.
Concurrently, Wollschläger et al. have shown that refusal is mediated by multi-dimensional concept cones of up to 5 dimensions, confirming that the single-direction hypothesis does not hold.
Another article – “Research note: Exploring the multi-dimensional refusal subspace in reasoning models” further extended this to reasoning models, showing that ablating a single direction is insufficient for larger models (please note our experiments only applied to relatively small models).
Our own findings described in “Our Experiment & Results” are consistent with these results and suggest that the refusal subspace has interpretable internal structure tied to harm categories.
We plan to reproduce our findings on models supported by the circuit tracer (Gemma-2-2B and Llama-3.2-1B), generate attribution graphs for representative prompts from each harm domain, and compare the resulting circuits.
We’ll make sure to keep you posted.