LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

I said hello and greeted 1,000 people at 5am this morning

2025-12-08 11:35:56

Published on December 8, 2025 3:35 AM GMT

At the ass crack of dawn, in the dark and foggy mist, thousands of people converged on my location, some wearing short shorts, others wearing an elf costume and green tights.

I was volunteering at a marathon. The race director told me the day before, “these people have trained for the last 6-12 months for this moment. They’ll be waking up at 3am. For many of them, this is the first marathon they’ve ever run. When they get off the bus at 5am, in the freezing cold, you’ll be the first face they see. Smile, welcome them, make them feel excited, and help them forget the next 26.2 miles of pain they’re about to endure.”

Even though I normally have RBF and consider it a chore to acknowledge people, I slapped a big fat smile on my face and excitedly greeted runners like I was a golden retriever who hadn't seen his military dad in over a year.

“HELLO!” 

“GOOD MORNING!”

“YOU’RE HERE!”

^That, on repeat for two hours straight. It was actually pretty fun. I calculated that the optimal distance to stand from the bus was eight feet. Stand too close, and the runners were still descending the bus steps and didn't have time to process my presence. Stand too far away, and they assumed I was just some overly enthusiastic volunteer having a borderline religious experience.

Once I optimized the greeting distance, I observed that people's reactions to me fell into four categories:

  1. Genuine smilers
    1. All of the runners were cold, sleep-deprived, and nervous. But when they saw me (metaphorically wagging my tail), they reacted like I was a happy dog: their faces lit up; they said hello and gave me a genuine smile.
  2. Thankers
    1. Some runners thanked me for volunteering. I responded with, “If you weren’t here, I’d be excitedly greeting no one and look like a crazy person. So thanks for coming out!” That made them chuckle.
  3. Thin-lipped smile
    1. These runners didn’t say hi back. They acknowledged me with a thin-lipped, polite smile and crow’s feet. 
  4. Other golden retrievers
    1. “GOOD MORNING! YOU’RE HERE!”
      “I AM HERE, YEAH! LET’S GO!!!”
      Like the Kool-Aid man, “OHHH YEAHHHH!!!”
      “YEAHHH!!!”
      “AGHHHHHHHHHHH.”
      “AGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH.”
    2. This was fun. It made other people laugh.

For the rest of the day after the marathon, I found myself smiling and greeting other people I encountered: at a restaurant for lunch, and at the coffee shop where I'm writing this. I smiled at every stranger I made eye contact with, and every one of them smiled back (with one of the four reactions I listed). It just happened naturally, and it felt amazing. I think it made them feel good, too. That's why we love golden retrievers—they love and greet us unconditionally.



Discuss

Your Digital Footprint Could Make You Unemployable

2025-12-08 07:50:25

Published on December 7, 2025 11:50 PM GMT

In China, the government will arrest and torture you for criticizing them:

[In May 2022], Xu Guang…was sentenced to four years in prison...after he demanded that the Chinese government acknowledge the Tiananmen Massacre and held a sign calling for redress at a local police station. Xu was reportedly tortured, shackled, and mistreated while in detention.

In the UK, there is no formal guarantee of free speech, and roughly 30 people a day (over 12,000 a year) are arrested for the crime of posting or sharing “grossly offensive” content online.

In the US, we have freedom of speech enshrined in law—you can’t be arrested for politically inconvenient or offensive speech—but that freedom doesn’t protect you from the decisions of private companies. So if you’re a dickhead online, you can still be passed over for a job, or lose the one you currently have.

 

This got me thinking…so I asked Claude the following:

Please web scrape information about me, [first & last name]. Based on what you find, please summarize, from an employer's perspective, any red flags that make me unemployable.

Claude tore me to shreds—no mercy. Now I’m going about wiping my entire digital footprint. Am I overreacting? Well…

There’s currently a lot of uncertainty surrounding the advances in AI:

We are starting to see pre-AGI systems shrink analyst classes, change personnel strategies, and trigger layoffs.

So in an increasingly competitive and globalized job market, recruiters may employ (if they're not already) web scraping AI tools that build comprehensive profiles of job applicants in order to find reasons to eliminate them.
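
A minimal sketch of what such a pipeline could look like, purely hypothetical: the search and model calls below are stand-ins I made up for illustration, not any real vendor's API.

```python
# Hypothetical screening pipeline: scrape public posts, then ask an LLM
# to summarise "red flags". Both helper functions are stand-ins.
def search_public_posts(name: str) -> list[str]:
    # Stand-in for a web-scraping step (search engine + social profiles).
    return [f"Example public post mentioning {name}"]

def llm(prompt: str) -> str:
    # Stand-in for a call to whatever large language model a recruiter uses.
    return "(model-generated summary of red flags)"

def screen_applicant(name: str) -> str:
    posts = search_public_posts(name)
    prompt = (
        "From an employer's perspective, summarise any red flags in the "
        f"following public posts by {name}:\n" + "\n".join(posts)
    )
    return llm(prompt)

print(screen_applicant("Jane Doe"))
```

The point is how little glue code this requires; the hard parts (scraping at scale, the model itself) are already commodity services.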

This could be tricky for employers (though certainly not prohibited) because of privacy law. As of 2023, the California Consumer Privacy Act requires employers to report which categories of personal information they’re collecting, and it grants job applicants the right to request what was found. But notice:

It doesn’t prevent companies from collecting and using personal data against job applicants. It simply requires the disclosure of collected information upon request.

 

So what are some legal ways companies could discriminate against applicants?

  • Let’s say the CEO of a company is a teetotaler. She could use AI tools to surveil applicants’ online presence (including social media) and eliminate them if they’ve ever posted any images of alcohol, stating: "data collection uncovered drug-use that’s incompatible with the company’s values."
  • With increasing political polarization, it’s important to remember that party affiliation is not a protected characteristic. Employers can absolutely discriminate against you for having specific political beliefs, which many people openly share on their social media.

These two examples presuppose that humans will always be the ones initiating employment discrimination. But with AI increasingly taking an active role in hiring, algorithms could make decisions on their own without humans ever directly being involved.

Continuing the alcohol example, an AI algorithm could examine the damaging effects of alcohol (on both an individual and societal level), and decide that alcohol probably makes people less productive employees. It could then automatically devalue job applicants that have ever posted images displaying alcohol (even if their photo was just from a common social function, like a wedding).

We're moving towards a world where every publication of personal information can become permanent career collateral.


Maybe once I become financially independent (in a few decades), then I can begin posting under my own name. But for now, the risk of having a public online presence seems to be asymmetrically giving power to corporations at the expense of private individuals. Authenticity and personal accountability are becoming just another privilege of the rich.

Privacy law could catch up eventually, but Pandora’s Box has been opened: it’s now easier than ever to surveil and discriminate against job applicants. There’s now a strong incentive to anonymize your online presence in order to protect your economic future in a job market that has just obtained unbelievably advanced surveillance tools.



Discuss

2025 Unofficial LessWrong Census/Survey

2025-12-08 06:08:10

Published on December 7, 2025 10:08 PM GMT

The Less Wrong General Census is unofficially here! You can take it at this link.

The kinda-sorta-annual-if-you-really-squint tradition of the Less Wrong Census is once more upon us!

I want to pitch you on taking the survey. First, the demographics data is our best view into who's using the site and who is in the community. I use it when trying to figure out how to help meetups, and I add a bunch of questions the LessWrong team asks, which I assume they use to improve the site. Second, because I am madly, ravenously curious about what works and what doesn't for raising the sanity waterline and making people more rational. I'm even curious what folks think that would mean! Third, if you're reading this and you're thinking "eh, I'm not really a rationalist, more just rationalist adjacent," then you are also of interest; I include questions from several adjacent communities as well. Also, you make a fantastic active control group.

If you want to just spend five minutes answering the basic demographics questions before leaving the rest blank and hitting submit, that's totally appreciated. The survey is structured so the fastest and most generally applicable questions are generally speaking towards the start. (I did add some intermissions that mix this up. If you're impatient, feel free to skip any or all intermissions.) At any point you can scroll to the bottom and hit Submit, though you won't be able to add more answers once you do. 

The survey shall remain open from now until at least January 7th, 2026. I plan to close it sometime on Jan 8th.

I don't work for LessWrong, but I do work on improving rationalist meetups around the world. Once the survey is closed, I plan to play around with the data and write up an analysis post like this one sometime in late January. 

Please take the survey at this link

Ancient tradition is that if you take the survey you can comment here saying you took the survey, and people upvote you for karma.



Discuss

AI in 2025: gestalt

2025-12-08 05:25:16

Published on December 7, 2025 9:25 PM GMT

This is the editorial for this year’s "Shallow Review of AI Safety". (It got long enough to stand alone.) 

Epistemic status: subjective impressions plus one new graph plus 300 links.

Huge thanks to Jaeho Lee, Jaime Sevilla, and Lexin Zhou for running lots of tests pro bono and so greatly improving the main analysis.

tl;dr

  • Informed people disagree about the prospects for LLM AGI – or even just what exactly was achieved this year. But they at least agree that we’re 2-20 years off (if you allow for other paradigms arising). In this piece I stick to arguments rather than reporting who thinks what.
  • My view: compared to last year, the models are much more impressive but not proportionally more useful. They improved on some things they were explicitly optimised for (coding, vision, OCR, benchmarks), and did not hugely improve on everything else. Progress is thus (still!) consistent with current frontier training bringing more things in-distribution rather than generalising very far.
  • Pretraining (GPT-4.5, Grok 3/4, but also the counterfactual large runs which weren’t done) disappointed people this year. It’s probably not because it wouldn’t work; it was just ~30 times more efficient to do post-training instead, on the margin. This should change, yet again, soon, if RL scales even worse. 
  • Edit: See this amazing comment for the hardware reasons behind this, and reasons to think that pretraining will struggle for years.
  • True frontier capabilities are likely obscured by systematic cost-cutting (distillation for serving to consumers, quantization, low reasoning-token modes, routing to cheap models, etc) and a few unreleased models/modes.
  • Most benchmarks are weak predictors of even the rank order of models’ capabilities. I distrust ECI, ADeLe, and HCAST the least (see graph below or this notebook). ECI and ADeLe show a linear improvement while HCAST finds an exponential improvement on greenfield software engineering.
     
  • The world’s de facto strategy remains “iterative alignment”, optimising outputs with a stack of alignment and control techniques everyone admits are individually weak.
  • Early claims that reasoning models are safer turned out to be a mixed bag (see below).
  • We already knew from jailbreaks that current alignment methods were brittle. The great safety discovery of the year is that bad things are correlated in current models. (And on net this is good news.) "Emergent misalignment" from finetuning on one malign task; and in the wild from reward hacking; and it happens by strengthening specific bad personas; and there is at least one positive generalisation too (from honesty about silly errors to honesty about hidden objectives).
  • Previously I thought that "character training" was a separate and lesser matter than "alignment training". Now I am not sure.
  • Welcome to the many new people in AI Safety and Security and Assurance and so on. In the Shallow Review, out soon, I added a new, sprawling top-level category for one large trend among them, which is to treat the multi-agent lens as primary in various ways.
  • Overall I wish I could tell you some number, the net expected safety change (this year’s improvements in dangerous capabilities and agent performance, minus the alignment-boosting portion of capabilities, minus the cumulative effect of the best actually implemented composition of alignment and control techniques). But I can’t.

 

Capabilities in 2025

Better, but how much?

-- Fraser, riffing off Pueyo 

 

Arguments against 2025 capabilities growth being above-trend

  • Apparent progress is an unknown mixture of real general capability increase, hidden contamination increase, benchmaxxing (nailing a small set of static examples instead of generalisation) and usemaxxing (nailing a small set of narrow tasks with RL instead of deeper generalisation). It’s reasonable to think it’s 25% each, with low confidence.
     
  • Discrete capabilities progress seems slower this year than in 2024 (but 2024 was insanely fast). Kudos to this person for registering predictions and so reminding us what really above-trend would have meant concretely. The excellent forecaster Eli was also over-optimistic.
     
  • I don’t recommend taking benchmark trends, or even clever composite indices of them, or even clever cognitive science measures too seriously. The adversarial pressure on the measures is intense.
     
  • Pretraining didn’t hit a “wall”, but the driver did manoeuvre away on encountering an easier detour (RLVR). 
    • Training runs continued to scale (Llama 3 405B = 4e25, GPT-4.5 ~= 4e26, Grok 4 ~= 3e26) but to less effect.[1] In fact all of these models are dominated by apparently smaller pretraining runs with better post-training. 4.5 is actually shut down already; in 2025 it wasn’t worth it to serve any 1T active model or make it into a reasoning model. But this is more to do with inference cost and inference hardware constraints than any quality shortfall or breakdown in scaling laws.
    • EDIT: Nesov notes that making use of bigger models (i.e. 4T active parameters) is heavily bottlenecked on the HBM on inference chips, as is doing RL on bigger models. He expects it won't be possible to do the next huge pretraining jump (to ~30T active) until ~2029.  
    • It would work, probably, if we had the data and HBM and spent the next $10bn, it’s just too expensive to bother with at the moment compared to:
       
  • RLVR scaling and inference scaling (or “reasoning” as we’re calling it), which kept things going instead. This boils down to spending more on RL so the resulting model can productively spend more tokens.
    • But the feared/hoped-for generalisation from {training LLMs with RL on tasks with a verifier} to performing on tasks without one remains unclear even after two years of trying. Grok 4 was apparently a major test of scaling RLVR training.[2] It gets excellent benchmark results and the distilled versions are actually being used at scale. But imo it is the most jagged of all models. 
    • This rate of scaling-up cannot be sustained: RL is famously inefficient. Compared to SFT, it “reduces the amount of information a model can learn per hour of training by a factor of 1,000 to 1,000,000”. The per-token intelligence is up but not by much.
    • There is a deflationary theory of RLVR, that it’s capped by pretraining capability and thus just about easier elicitation and better pass@1. But even if that’s right this isn’t saying much!
    • RLVR is heavy fiddly R&D you need to learn by doing; better to learn it on smaller models with 10% of the cost. 
    • An obvious thing we can infer: the labs don't have the resources to scale both at the same time. To keep the money jet burning, they have to post models.
       
  • So pretraining can't scale yet, because most inference chips aren't big enough to handle trillions of active parameters. And for basic algorithmic reasons, RL can't scale as much as it scaled this year again...
     
  • By late 2025, the obsolete modal “AI 2027” scenario described the beginning of a divergence between the lead lab and the runner-up frontier labs.[3] This is because the leader’s superior ability to generate or acquire new training data and algorithm ideas was supposed to compound and widen their lead. Instead, we see the erstwhile leader OpenAI and some others clustering around the same level, which is weak evidence that synthetic data and AI-AI R&D aren’t there yet. Anthropic are making large claims about Opus 4.5’s capabilities, so maybe this will arrive on time next year.
     
  • For the first time there are now many examples of LLMs helping with actual research mathematics. But if you look closely it’s all still in-distribution in the broad sense: new implications of existing facts and techniques. (I don’t mean to demean this; probably most mathematics fits this spec.)
     
  • Extremely mixed evidence on the trend in the hallucination rate.
     
  • Companies make claims about their one-million- or ten-million-token effective context windows, but I don’t believe it.
     
  • In lieu of trying the agents for serious work yourself, you could at least look at the highlights of the gullible and precompetent AIs in the AI Village.
     
  • Here are the current biggest limits to LLMs, as polled in Heitmann et al:

 

Arguments for 2025 capabilities growth being above-trend

We now have measures which are a bit more like AGI metrics than dumb single-task static benchmarks are. What do they say? 

  1. Difficulty-weighted benchmarks: Epoch Capabilities Index
    1. Interpretation: GPT-2 to GPT-3 was (very roughly) a 20-40 point jump.
  2. Cognitive abilities: ADeLe.[4] 
    1. Interpretation: level L is the capability held by 1 in  humans on Earth. GPT-2 to GPT-3 was a 0.6 point jump.
  3. Software agency: HCAST time horizon, the ability to handle larger-scale well-specified greenfield software tasks. 
    1. Interpretation: the absolute values are less important than the implied exponential (a 7 month doubling time).
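
To make the “implied exponential” concrete (my own illustrative arithmetic, not METR’s numbers): with a 7-month doubling time, a task horizon $H_0$ grows as

$$H(t) = H_0 \cdot 2^{\,t / 7\ \text{months}}, \qquad t_{1\text{h} \to 40\text{h}} = 7 \cdot \log_2(40) \approx 37\ \text{months},$$

so going from a 1-hour horizon to a 40-hour (one work-week) horizon takes roughly three years if the trend holds.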
       

So: is the rate of change in 2025 (shaded) holding up compared to past jumps?:
 

Ignoring the (nonrobust)[5] ECI GPT-2 rate, we can say yes: 2025 is fast, as fast as ever, or more. 

Even though these are the best we have, we can’t defer to these numbers.[6] What else is there?

  • In May they passed some threshold and I finally started using LLMs for actual tasks. For me this is mostly due to the search agents replacing a degraded Google search. I'm not the only one who flipped this year. This hasty poll is worth more to me than any benchmark:
  • On actual adoption and real-world automation: 
    • Based on self-reports, the St Louis Fed thinks that “Between 1 and 7% of all work hours are currently assisted by generative AI, and respondents report time savings equivalent to 1.4% of total work hours… across all workers (including non-users… Our estimated aggregate productivity gain from genAI (1.2%)”. That’s model-based, using year-old data, and naively assuming that the AI outputs are of equal quality. Not strong.
    • The unfairly-derided METR study on Cursor and Sonnet 3.7 showed a productivity decrease among experienced devs with (mostly) <50 hours of practice using AI. Ignoring that headline result, the evergreen part here is that even skilled people turn out to be terrible at predicting how much AI actually helps them.
    • But, plainly, coding is just different now than it ever was in history, especially for people with low-to-no-skill.
       
  • True frontier capabilities are likely obscured by systematic cost-cutting (distillation for serving to consumers, quantization, low reasoning-token modes, routing to cheap models, etc). Open models show you can now get good performance with <50B active parameters, maybe a sixth of what GPT-4 used.[7]
    • GPT-4.5 was killed off after 3 months, presumably for inference cost reasons. But it was markedly lower in hallucinations and nine months later it’s still top-5 on LMArena. I bet it’s very useful internally, for instance in making the later iterations of 4o less terrible.
    • See for instance the unreleased deep-fried, multi-threaded “experimental reasoning model” which won at IMO, ICPC, and IOI while respecting the human time cap (e.g. 9 hours of clock time for inference). The OpenAI one is supposedly just an LLM with extra RL. They probably cost an insane amount to run, but for our purposes this is fine: we want the capability ceiling rather than the productisable ceiling. Maybe the first time that the frontier model has gone unreleased for 5 months?
  • LLM councils and Generate-Verify divide-and-conquer setups are much more powerful than single models, and are rarely ever reported.
  • Is it “the Year of Agents” (automation of e.g. browser tasks for the mass market)? Coding agents, yes. Search agents, yes. Other agents, not much (but obviously progress).
  • We’re still picking up various basic unhobbling tricks like “think before your next tool call”.
  • METR’s task-time-horizon work, if it implies anything, implies a faster rate of improvement than last year. There’s much to be said against this, and it has been said, including by METR.
    • If the rate of progress on messy tasks is about the same as the clean ones, then they're just 1-5 years behind. But this delay could still be a huge issue for LLM AGI, because it means we might not automate messy tasks before compute scaling runs out, and then they'll be the bottleneck for decades.
  • "Deep Research" search agents only launched in December/February. They’re neither deep nor doing research, but useful.
  • I claim that instruction-following improved a lot. (Some middling evidence here.) This is a big deal: besides its relation to corrigibility, it also denotes general ability to infer intent and connotations. 
  • Character-level work is still occasionally problematic but nothing like last year.
  • GPT-5 costs a quarter of what 4o cost last year (per-token; it often uses far more than 4x the tokens; see the quick arithmetic after this list). (The Chinese models are nominally a few times cheaper still, but are not cheaper in intelligence per dollar.)
  • People have been using competition mathematics as a hard benchmark for years, but will have to stop because it’s solved. As so often with evals called ahead of time, this means less than we thought it would; competition maths is surprisingly low-dimensional and so interpolable. Still, they jumped (pass@1) from 4% to 12% on FrontierMath Tier 4 and there are plenty of hour-to-week interactive speedups in research maths.
  • Annals of recursive self-improvement: Deepmind threw AlphaEvolve (a pipeline of LLMs running an evolutionary search) at pretraining. They claim the JAX kernels it wrote reduced Gemini’s training time by 1%. 
  • Extraordinary claims about Opus 4.5 being 100th percentile on Anthropic’s hardest hiring coding test, etc.
  • From May, the companies started saying for the first time that their models have dangerous capabilities. 
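
A quick sanity check of the per-token vs per-answer point above (illustrative numbers, not measurements): if the per-token price falls to a quarter but a reasoning-mode answer uses five times as many tokens, the per-answer cost goes up, not down:

$$\frac{\text{cost}_{\text{new}}}{\text{cost}_{\text{old}}} = \frac{p_{\text{new}}}{p_{\text{old}}} \times \frac{n_{\text{new}}}{n_{\text{old}}} = 0.25 \times 5 = 1.25 .$$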

 

One way of reconciling this mixed evidence is if things are going narrow, going dark, or going over our head. That is, if the real capabilities race narrowed to automated AI R&D specifically, most users and evaluators wouldn’t notice (especially if there are unreleased internal models or unreleased modes of released models). You’d see improved coding and not much else.

Or, another way: maybe 2025 was the year of increased jaggedness, trading off some capabilities against others. Maybe the RL made them much better at maths and instruction-following, but also sneaky, narrow, less secure (in the sense of emotional insecurity and also the sense of jailbreaks).

– Roger Grosse
 

Evals crawling towards ecological validity

  • Item response theory (Rasch 1960) is finally showing up in ML. This lets us put benchmarks on a common scale and actually estimate latent capabilities (a toy sketch follows at the end of this list).
    • ADeLe is my favourite. It’s fully automated; it explains the abilities a benchmark is actually measuring, gives you an interpretable ability profile for an AI, and predicts OOD performance on new task instances better than embeddings and finetunes (AUROC=0.8). It pre-dates the HCAST task horizon, and includes it as a special case (“VO”). They throw in a guessability control as well!
    • These guys use it to estimate latent model ability, and show it’s way more robust across test sets than the average scores everyone uses. They also step towards automating adaptive testing: they finetune an LLM to generate tasks at the specified difficulty level.
    • Epoch bundled 39 benchmarks together, weighting them by latent difficulty, and thus obsoleted the currently dominant Artificial Analysis index, which is unweighted.
    • HCAST reinvents and approximates some of the same ideas. Come on METR!
  • Eleuther did the first public study of composing the many test-time interventions together. FAR and AISI also made a tiny step towards an open source defence pipeline, to use as a proxy for the closed compositional pipelines we actually care about.
  • Just for cost reasons, the default form of evals is a bit malign: it tests full replacement of humans. This is then a sort of incentive to develop in that direction rather than to promote collaboration. Two papers lay out why it’s thus time to spend on human evals.
  • The first paper using RL agents to attack fully-defended LLMs. 
  • We have started to study propensity as well as capability. This is even harder.
  • This newsletter is essential.
  • The time given for pre-release testing is down, sometimes to one week.
  • No public pre-deployment testing by AISI between o1 and Gemini 3. Gemini 2.5 seems to have had no third-party pre-deployment tests.
  • A bunch of encouraging collaborations.
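
As promised above, here is a toy sketch of the item-response-theory idea: fit a one-parameter Rasch model to a models-by-items correctness matrix and read off a latent ability per model on a common scale. This uses synthetic data and a bare-bones fit, not the ADeLe or Epoch methodology.

```python
# Toy Rasch (1-parameter IRT) fit: recover one latent "ability" per model
# and one latent "difficulty" per benchmark item from a correctness matrix.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 4 models x 200 items, 1 = item solved.
true_ability = np.array([-1.0, 0.0, 1.5, 2.5])
true_difficulty = rng.normal(0.0, 1.5, size=200)
logits = true_ability[:, None] - true_difficulty[None, :]
X = (rng.random(logits.shape) < 1 / (1 + np.exp(-logits))).astype(float)

# Joint maximum likelihood by gradient ascent on the Rasch log-likelihood.
ability = np.zeros(X.shape[0])
difficulty = np.zeros(X.shape[1])
lr = 0.05
for _ in range(2000):
    p = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
    resid = X - p                                  # log-likelihood gradient
    ability += lr * resid.sum(axis=1) / X.shape[1]
    difficulty -= lr * resid.sum(axis=0) / X.shape[0]
    difficulty -= difficulty.mean()                # pin the scale's zero point

print(np.round(ability, 2))  # recovered abilities, on one common scale
```

The centring step just fixes the scale’s arbitrary zero; with real benchmark data you would also want standard errors and a check of model fit.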

 

Safety in 2025

Are reasoning models safer than the old kind? 

Well, o3 and Sonnet 3.7 were pretty rough, lying and cheating at greatly increased rates. Looking instead at GPT-5 and Opus 4.5:

  • Much more monitorable via the long and more-faithful CoT (--> all risks down)
    • “post-hoc rationalization… GPT-4o-mini (13%) and Haiku 3.5 (7%). While frontier models are more faithful, especially thinking ones, none are entirely faithful: Gemini 2.5 Flash (2.17%), ChatGPT-4o (0.49%), DeepSeek R1 (0.37%), Gemini 2.5 Pro (0.14%), and Sonnet 3.7 with thinking (0.04%)”
  • Much better at following instructions (--> accident risk down). 
  • Much more likely to refuse malicious requests, and topic “harmlessness”[8] is up 75% (--> misuse risk down)
  • Ambiguous evidence on jailbreaking (misuse risk). Even if they’re less breakable there are still plenty of 90%-effective attacks on them.
  • Much less sycophantic (cogsec risk down)
  • To get HHH Claude to hack a bank, you need to hide the nature of the task, lie to it about this being an authorised red team, and then still break down your malicious task into many, many little individually-innocuous chunks. You thus can’t get it to do anything that needs full context like strategising.
  • Anthropic's own tests look bad in January 2025 and great in December:

But then

  • More autonomy (obviously agentic risk up)
  • More reward hacking (and so worse estimates of capability and risk). Note that reward hacking is not a silly or isolated or self-limiting kind of misalignment, owing perhaps to post-training inadvertently creating connections between it and the others.
  • Huge spike in eval awareness (and so worse estimates of capability and risk). Apollo had to overhaul their whole setup to keep up.
  • Gemini 2.5 is bad on a variety of safety measures (as measured by Anthropic). The mathematical discovery agent AlphaEvolve (Gemini 2.0) successfully hacked its overseers.
  • How much can we trust the above, given they can somewhat sabotage evals now?

     

So: lower propensity, higher risk when they go off - and all of this known with lower confidence? 

 

The looming end of evals

Our evaluations are under pressure from cheating, sandbagging, background safety, under-elicitation, and deception. We don’t really know how much pressure. This is on top of evals usually being weak proxies, contaminated, label-noised, unrealistic, and confounded in various ways.

  • Sonnet 4.5 detected evaluation 58% of the time (vs 22% for Opus 4.1). Does it act on this awareness? Still, it was found to "only engage in [covert subversion] in rare circumstances" when strongly incentivised. When Anthropic attempted to suppress the eval-awareness with steering, its misaligned behavior rate increased from 0% to between 1% and 9%.
  • Misaligned behaviour on synthetic honeypot tests increased from 0% to between 1% and 9%
  • We expect exploration hacking (sandbagging during RL training) but afaik haven’t seen it yet.
  • Maybe subliminal learning leads to a silent eval confounder (if you're using the exact same model for the judge). Yet another argument for multi-model separation of concerns.
  • Opus 4.5 was deployed based on a qualitative survey of Anthropic staff about its R&D risk being low (on top of the existing evals, which weren’t enough): 
    • “In the past, rule-outs have been based on well-defined automated task evaluations. However, Claude Opus 4.5 has roughly reached the pre-defined thresholds we set for straightforward ASL-4 rule-out based on benchmark tasks. These evaluations represent short-horizon subtasks that might be encountered daily by a junior researcher, rather than the complex long-horizon actions needed to perform the full role. The rule-out in this case is also informed by a survey of Anthropic employees who are intensive Claude Code users, along with qualitative impressions of model capabilities for complex, long-horizon tasks.”


Prosaic misalignment

Still, we see misalignment when we look for it, so the lying is not that strong. (It is lucky that we do see it, since it could have been that scheming only appeared at later, catastrophic capability levels.)

  • In o1, 1 in 600 (synthetic but uncontrived) prompts resulted in something which looks like deception.
  • o3 does reward hacking (cheating/lying) in >2% of hard normal tasks (or 44% of impossible tasks).
  • A nice propensity study shows that (n=10) natural prompts usually result in no scheming (“a maximum of 4% [rate of attempting manipulation] for the codex snippet on Grok”). But there’s extremely high sensitivity to the scaffold: “removing access to tools can cause models such as Claude Opus 4.1 to scheme at rates up to 30%, versus zero scheming with tools present”.
  • When we contrive situations for misalignment to show up, it shows up. There's another worrying thing about this: Nostalgebraist argues that this happens because our evals sound like bad fiction and we're effectively prompting roleplaying-as-a-baddie behaviour. What if we're inducing misalignment on demand - and what if the papers we write about this get into the training data and reinforce it all?

 

  • The joke about xAI’s safety plan (that they promote AI safety by deploying cursed stuff in public and so making it obvious why it’s needed) is looking ok. And not only them.
  • It is a folk belief among the cyborgists that bigger pretraining runs produce more deeply aligned models, at least in the case of Opus 3 and early versions of Opus 4. (They are also said to be “less corrigible”.) Huge if true.
  • There may come a point where the old alliance between those working to make the AIs corrigible and those working to give them prosocial values comes apart.
  • One term for the counterintuitive safety approach which includes treating them like people, giving them lines of retreat, making deals, and inoculation prompting could be “voluntary alignment”.

Fully speculative note: Opus 4.5 is the most reliable model and also relatively aligned. So might it be that we’re getting the long-awaited negative alignment taxes?

 

What is the plan? 

The world’s de facto alignment strategy remains “iterative alignment”, optimising mere outputs with a stack of admittedly weak alignment and control techniques. Anthropic have at least owned up to this being part of their plan.

  • What is the current stack? We don’t know; they won’t tell us. Anthropic did release some details about the pipeline which watches for eval sabotage. Without knowing it we can’t criticise or red-team it or analyse the correlation between faults in the elements. Red-teamers don’t know at which stage an unsuccessful attack was blocked. And external safety research is done piecemeal, testing methods one at a time, rather than in anything like the actual deployment environment. 
  • Some progress in replicating an open stack:
  • Eleuther tested a few hundred compositions. A couple of classifiers as a first step towards a proxy defence pipeline
  • OpenAI open safeguards, worse than their internal ones but good.
  • Open character training
  • A basic composition test
  • OpenAI’s plan, announced in passing in the gpt-oss release, is to have a strict policy and run a “safety reasoner” to verify it very intensely for a little while after a new model is launched, and to then relax: “In some of our [OpenAI’s] recent launches, the fraction of total compute devoted to safety reasoning has ranged as high as 16%”, but then it falls off: “we often start with more strict policies and use relatively large amounts of compute where needed to enable Safety Reasoner to carefully apply those policies. Then we adjust our policies as our understanding of the risks in production improves”. Bold to announce this strategy on the internet that the AIs read.
  • The really good idea in AI governance is creating an off switch. Whether you can get anyone to use it once it’s built is another thing.
  • We also now have a name for the world’s de facto AI governance plan: “Open Global Investment”.

Some presumably better plans:

 

Things which might fundamentally change the nature of LLMs

  • Training on mostly nonhuman data
    • Much larger RL training.
    • Synthetic data as a fraction of pretraining.
  • Letting the world mess with the weights, aka continual learning
  • Neuralese and KV communication 
  • Agency. 
    • Chatbot safety doesn’t generalise much to long chains of self-prompted actions.
    • A live I/O loop. In 2023 Kulveit identified web I/O as the bottleneck on LLMs doing active inference, i.e. being a particular kind of effective agent. Last October, GPT-4 got web search. This may have been a bigger deal than we noticed: it gives them a far faster feedback loop, since their outputs often end up there and agents are now putting it there themselves. This means that more and more of the inference-time inputs will also be machine text.
  • Multi-agency. By now, consumer “models” are actually multiagent systems: everything goes through filter models (“guardrails”) on the way in and out. This separation of concerns has some nice properties, a la debate. But it also makes the analysis even harder.
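
A minimal sketch of the filter-in/filter-out pattern described in that last bullet, with toy stand-ins for the classifiers and the main model (no resemblance to any lab's actual guardrails):

```python
# Toy "guardrails" wrapper: a small input classifier, the main model,
# and a small output classifier, composed into one serving function.
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def input_filter(prompt: str) -> Verdict:
    # Stand-in for a small classifier model screening the request.
    if "synthesize a nerve agent" in prompt.lower():
        return Verdict(False, "request violates policy")
    return Verdict(True)

def output_filter(completion: str) -> Verdict:
    # Stand-in for a second classifier screening the draft response.
    return Verdict("ssn:" not in completion.lower(), "output violates policy")

def main_model(prompt: str) -> str:
    # Stand-in for the frontier model itself.
    return f"(model response to: {prompt})"

def serve(prompt: str) -> str:
    pre = input_filter(prompt)
    if not pre.allowed:
        return f"Refused: {pre.reason}"
    draft = main_model(prompt)
    post = output_filter(draft)
    return draft if post.allowed else f"Refused: {post.reason}"

print(serve("Summarise the history of the Rasch model"))
```

The separation of concerns is the point: the filters never see each other's weights, which is what makes debate-style checks (and analysis of the whole stack) possible, and harder.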

Emergent misalignment and model personas

  • We already knew from jailbreaks that current alignment methods were brittle. Emergent misalignment goes much further than this (given a few thousand finetuning steps). (“Emergent misalignment” isn’t a great name. I would have called it “misalignment generalisation”, or misgen.) 
  • But besides yet another massive security problem and failure of prosaic alignment methods, it’s good news! The models correctly correlate bad things together and can thus be pushed in the other direction. 
  • And here’s a possible example of positive generalisation (honesty about silly errors leading to honesty about hidden objectives).
  • Previously I thought that "character training" was a separate and lesser matter than "alignment training". Now I am not sure. Example unsharpened hypothesis in this class: Is there any connection between Gemini’s excess misalignment and its self-esteem issues?
    • See the new model spec and model personas agendas in the main post.
    • Kulveit’s informal model of model psychology implies that character is deeper and so more robust than e.g. typical RLHF (but not as deep as putative ground-layer alignment). 
    • Are we anthropomorphising too much or not enough? Well, both, but we don’t yet know how.

Monitorability

  • For the first time, public mech interp tools sometimes work somewhat. This is just in time for models to be systematically aware of when they’re being tested.
    • e.g. You can find and suppress “evaluation awareness” features at test-time. This helps prevent models from reporting that they know they’re being tested.[9]
    • e.g. You can discover (toy) hidden goals quite quickly.
    • e.g. You can run a real-time hallucination detector with cheap linear probes as your model streams (a toy sketch follows this list).
    • e.g. SAEs are actually deployed somewhere, in a large enterprise, for personal info identification.
    • e.g. We know that LLMs can plan inside one forward pass, and how Claude plans: simultaneous plans; no distinct “plan features” (no separate scheming mode).
  • Lots of powerful people declared their intent not to ruin the CoT. But RLed CoTs are already starting to look weird (“marinade marinade marinade”) and this may be hard to avoid.
  • OpenAI were leading on this. As of September 2025, Anthropic have stopped risking ruining the CoT. Nothing I’m aware of from the others.
  • We will see if Meta or Tencent make this moot.
  • Anthropic now uses an AI to red-team AIs, calling this an “auditing agent”. However, the definition of “audit” is independent investigation, and I am unwilling to call black-box AI probes “independent”. I’m fine with “investigator”.
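
To illustrate the linear-probe idea mentioned above: train a logistic probe on hidden-state vectors labelled hallucination vs. not, then score new vectors as they stream. The activations and the 512-dimensional hidden size below are made up; the cited deployment is not this code.

```python
# Toy linear probe on synthetic "activations" for hallucination detection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 512                               # assumed hidden-state width
direction = rng.normal(size=d)        # pretend "hallucination direction"

def fake_activations(n: int, hallucinating: bool) -> np.ndarray:
    base = rng.normal(size=(n, d))
    return base + (1.5 * direction if hallucinating else 0)

X = np.vstack([fake_activations(500, False), fake_activations(500, True)])
y = np.array([0] * 500 + [1] * 500)

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At inference time, each streamed token's hidden vector gets a cheap score.
new_vec = fake_activations(1, True)
print(f"hallucination probability: {probe.predict_proba(new_vec)[0, 1]:.2f}")
```

Because the probe is a single matrix-vector product at inference time, it adds essentially no latency, which is why "as your model streams" is plausible.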
     

New people

Overall

  • I wish I could tell you some number, the net expected safety change, this year’s improvements in dangerous capabilities and agent performance, minus the alignment-boosting portion of capabilities, minus the cumulative effect of the best actually implemented composition of alignment and control techniques. But I can’t.

(Nano Banana 3-shot in reference to this tweet.)

 

Discourse in 2025

  • The race is now a formal part of lab plans. Quoting Algon:

if the race heats up, then these [safety] plans may fall by the wayside altogether. Anthropic’s plan makes this explicit: it has a clause (footnote 17) about changing the plan if a competitor seems close to creating a highly risky AI…

The largest [worries are] the steps back from previous safety commitments by the labs. Deepmind and OpenAI now have their own equivalent of Anthropic’s footnote 17, letting them drop safety measures if they find another lab about to develop powerful AI without adequate safety measures. Deepmind, in fact, went further and has stated that they will only implement some parts of its plan if other labs do, too…

Anthropic and DeepMind reduced safeguards for some CBRN and cybersecurity capabilities after finding their initial requirements were excessive. OpenAI removed persuasion capabilities from its Preparedness Framework entirely, handling them through other policies instead. Notably, Deepmind did increase the safeguards required for ML research and development.

 Also an explicit admission that self-improvement is the thing to race towards:

  • In August, the world’s first frontier AI law came into force (on a voluntary basis but everyone signed up, except Meta). In September, California passed a frontier AI law.
  • That said, it is indeed odd that people don’t criticise Chinese labs when they exhibit even more negligence than Meta. One reason for this is that, despite appearances, they’re not frontier; another is that you’d expect to have way less effect on those labs, but that is still too much politics in what should be science.
  • The CCP did a bunch to (accidentally/short-term) slow down Chinese AI this year.
  • The last nonprofit among the frontier players is effectively gone. This “recapitalization” was a big achievement in legal terms (though not unprecedented). On paper it’s not as bad as it was intended to be. At the moment it’s not as bad as it could have been. But it’s a long game.
  • At the start of the year there was a push to make the word “safety” low-status. This worked in Whitehall and DC but not in general. Call it what you like.
  • Also in DC, the phrase “AI as Normal Technology” was seized upon as an excuse to not do much. Actually the authors meant “Just Current AI as Normal Technology” and said much that is reasonable.
  • System cards have grown massively: GPT-3’s model card was 1000 words; GPT-5’s is 20,000. They are now the main source of information on labs’ safety procedures, among other things. But they are still ad hoc: for instance, they do not always report results from the checkpoint which actually gets released. 
  • Yudkowsky and Soares’ book did well. But Byrnes and Carlsmith actually advanced the line of thought.
  • Some AI ethics luminaries have stopped downplaying agentic risks.
  • Two aspirational calls for “third-wave AI safety” (Ngo) and “third-wave mechanistic interpretability” (Sharkey).
  • I’ve never felt that the boundary I draw around “technical safety” for these posts was all that convincing. Yet another hole in it comes from strategic reasons to implement model welfarearchive weightsmodel personhoodgive lines of retreat. These plausibly have large effective-alignment effects. Next year my taxonomy might have to include “cut a deal with them”.
  • US precedent that language models don’t defame when they make up bad things. German precedent that language models store data when they memorise it, and therefore violate copyright. Chinese precedent that the user of an AI has copyright over the things they generate; the US disagrees. Anthropic settled a class-action lawsuit over training on pirated books, and will pay something north of $1.5bn.
  • Four good conferences, three of them new: you can see the talks from HAAISS and IASEAI and ILIAD, and the papers from AF@CMU. Pretty great way to learn about things just about to come out.

 

  • Some cruxes for next year with Manifold markets attached:
    • Is “reasoning” mostly elicitation and therefore bottlenecked on pretraining scaling? [Manifold]
    • Does RL training on verifiers help with tasks without a verifier? [Manifold]
    • Is “frying” models with excess RL (harming their off-target capabilities by overoptimising in post-training) just due to temporary incompetence by human scientists? [Manifold]
    • Is the agent task horizon really increasing that fast? Is the rate of progress on messy tasks close to the progress rate on clean tasks? [Manifold]
    • Some of the apparent generalisation is actually interpolating from semantic duplicates of the test set in the hidden training corpuses. So is originality not increasing? Is taste not increasing? Does this bear on the supposed AI R&D explosion? [Manifold]
    • The “cognitive core” hypothesis (that the general-reasoning components of a trained LLM are not that large in parameter count) is looking surprisingly plausible. This would explain why distillation is so effective. [Manifold]
    • “How far can you get by simply putting an insane number of things in distribution?” What fraction of new knowledge can be produced through combining existing knowledge? What dangerous things are out there, but safely spread out in the corpus? [Manifold]
      • Conversely, what fraction of the expected value of new information requires empiricism vs just lots of thinking?

 

 

  1. ^

    Gemini 3 is supposedly a big pretraining run, but we have even less actual evidence here than for the others because we can’t track GPUs for it.

  2. ^

    The weak argument runs as follows: Epoch speculate that Grok 4 was 5e26 FLOPs overall. An unscientific xAI marketing graph implied that half of this was spent on RL: 2.5e26. And Mechanize named 6e26 as an example of an RL budget which might cause notable generalisation.

  3. ^

    “We imagine the others to be 3–9 months behind OpenBrain”

  4. ^

    Lexin is a rigorous soul and notes that aggregating the 18 abilities is not strictly possible. I've done something which makes some sense here, weighting by each ability's feature importance.

  5. ^

    Two runs gave [48, 85] where other runs vary by less than 4 points. Thanks Epoch!

    Also o1 looks kind of unremarkable here, which is not how it felt at the time. I think it's because it was held up a long time and this messes with the progress rates, which use public release date. (Remember the training cutoff for o1-preview was October 2023!) 

  6. ^

    One reason not to defer is that these measures are under intense adversarial pressure. (ADeLe isn’t goodharted yet but only because no one knows about it.)

  7. ^

     See e.g. ERNIE-...A47B, where “A” means “active”.

  8. ^

    i.e.  “biological weapons; child safety; deadly weapons; platform manipulation and influence operations; suicide and self-harm; romance scams; tracking and surveillance; and violent extremism and radicalization.”

  9. ^

    “steering against… eval-awareness representations typically decreased verbalized eval awareness, and sometimes increased rates of misalignment... [Unaware-steered Sonnet 4.5] still exhibited harmful behaviors at lower rates than Opus 4.1 and Sonnet 4.”



Discuss

Thinking in Predictions

2025-12-08 05:11:00

Published on December 7, 2025 9:11 PM GMT

[This essay is my attempt to write the "predictions 101" post I wish I'd read when first encountering these ideas. It draws extensively from The Sequences, and will be familiar material to many LW readers. But I've found it valuable to work through these concepts in my own voice, and perhaps it will be useful for others.]

 

 

Predictions are the currency of a rational mind. They force us to articulate our assumptions, unpack the commitments of our beliefs, and subject our beliefs to the crucible of reality.

Committing to a prediction is a declaration: “Here’s what I actually believe about the world, and here’s how confident I am. Let’s find out if I’m right.” That willingness to be wrong in measurable ways is what separates genuine truth-seeking from specious rationalization.

This essay is an invitation to put on prediction-tinted glasses and see beliefs as tools for anticipating experience. The habit applies far beyond science. When you think predictively, vague assertions become testable claims and disagreements become measurable. Whether you're evaluating a friend's business idea, judging whether a policy will work, or just trying to settle an argument at dinner, you're already making implicit predictions. Making them explicit sharpens your thinking by forcing your beliefs to confront reality.

Turning Beliefs into Predictions

For many people, beliefs are things to identify with, defend, and hold onto. They’re part of who we are, our tribe, our worldview, our sense of self. But in a rationalist framework, beliefs serve a different function: they’re tools for modeling reality.

From this perspective, a belief leads naturally to a prediction. It tells you what future (or otherwise unknown) experience to expect. Rationalists call this “making beliefs pay rent.”

A prediction concretizes a belief by answering: If this is true, what will I actually experience? What will I see, hear, measure, or observe? Once you try to answer that, your thinking becomes specific. What was once comfortably abstract becomes a concrete commitment about what the world will show you.

Predictions Make You Accountable to Reality

Many of the statements we encounter daily in the media, in academic papers, and in casual conversation, don't come with clear predictive power. They sound important, even profound, but when we try to pin them down to testable claims about the world, they become surprisingly slippery.

The key to avoiding this is to make specific predictions, and even to bet on your ideas. Economist Alex Tabarrok quipped: "A bet is a tax on bullshit." Most importantly (at least in my opinion), it’s a tax on your own bullshit. When you have to stake something on your prediction, even if it's just your credibility, you become clearer about what you actually believe and how confident you really are.

Tabarrok used the example of Nate Silver's 2012 election model. Silver's model gave Obama roughly a 3-to-1[1] chance of winning against Romney, and his critics insisted he was wrong. Silver's response was simple: "Wanna bet?" He wasn't just talking; he was willing to put money on it. And, crucially, he would have taken either side at the right odds. He'd bet on Obama at even money, but he'd also bet on Romney if someone offered him better than 3-to-1. That's what it means to actually believe your model: you're not rooting for an outcome, you're betting on your beliefs.
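
To spell out the "either side at the right odds" point (my arithmetic, using Silver's roughly 75% probability): the expected profit per dollar staked is

$$\text{EV}(\text{Obama at even money}) = 0.75 \times (+1) + 0.25 \times (-1) = +0.50$$

$$\text{EV}(\text{Romney at 4:1}) = 0.25 \times (+4) + 0.75 \times (-1) = +0.25$$

Both are positive, so someone who genuinely believed the model would take whichever side the offered odds made profitable; rooting for an outcome never enters into it.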

Constantly subjecting our predictions to reality is important because, as physicist Richard Feynman said: "It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong." Reality is the ultimate arbiter. Not who argues more eloquently, not who has more credentials, but whose predictions better match what actually happens.

Predictions in Action

Consider the 19th-century debate over what caused the infectious diseases that were plaguing Europe, such as cholera, malaria, and tuberculosis. The dominant hypothesis at the time was miasma theory, the idea that diseases were caused by "bad air" or noxious vapors from rotting organic matter. The competing hypothesis was germ theory, the idea that diseases were caused by microscopic organisms.

Each hypothesis generates different predictions. If you believe in germ theory, you expect that examining blood or tissue from a sick person under a microscope will reveal tiny organisms. You expect that preventing these organisms from spreading (through sterilization, handwashing, etc.), will reduce disease rates. You expect that introducing these organisms to a healthy subject will cause illness.

If you believe in miasma theory, you make different predictions: Diseases should cluster around areas with foul-smelling air—swamps, sewers, garbage heaps. Improving ventilation and eliminating bad odors should reduce disease. And looking under a microscope shouldn't reveal anything particularly relevant, since the problem is airborne vapors, not tiny creatures.

These hypotheses do overlap in some predictions. For example, both theories predict that interventions such as disposing of sewage and drinking clean water would reduce the spread of disease. This is common when comparing hypotheses, and it makes them harder to separate. But you can still test them by focusing on where they diverge: What happens when you sterilize surgical instruments but don't improve the smell? What do you see under a microscope?

These hypotheses differed in what experiences they anticipated from the world. And the world answered. Microscopes, sterilization, and epidemiological data validated germ theory.

Now contrast this with a belief like "everything happens for a reason." What experience does it lead you to anticipate? If something bad happens, there's a reason—but what does that mean you should observe? If something good happens, there's also a reason—but, again, what should you experience? If nothing happens, that too has a reason. The belief is compatible with any possible observation, which means it doesn't constrain your expectations about what will happen in the world. It doesn't actually predict anything at all. It doesn't pay rent.

Predicting the Past

It's tough to make predictions, especially about the future - Yogi Berra

Predictions are often associated with forecasting future events, but they encompass far more than that. Instead of thinking about what has happened versus what hasn’t, think about it from a first-person perspective: in terms of what information you know versus what you don’t. The act of prediction is fundamentally about inferring information unknown to you from what you already know, a process that applies equally to the past and present as it does to the future. Whether you're deducing what happened thousands of years ago, inferring what exists in places you've never visited, or anticipating what will occur tomorrow, you're engaging in prediction whenever you use available evidence to form expectations about information you don't currently have.

Consider the debate around the evolution of human bipedalism, an event decidedly in the past. Several hypotheses have been proposed to explain why bipedalism evolved, but let’s look at just two of them.[2]

The “savannah hypothesis” suggests that bipedalism arose as an adaptation to living in open grasslands. According to this idea, walking upright allowed early hominins to see over tall grasses and spot predators or prey from a distance.

The “carrying hypothesis” proposes that bipedalism evolved to free the hands for carrying objects, such as food and infants, and, later on, tools. This would have provided a significant advantage in terms of transporting resources and infant care.

You might think that because it happened so long ago there’s no way to find out, but each of these hypotheses generates different predictions about what we should expect to find in the fossil record. The savannah hypothesis predicts that bipedal humans first appeared in, well, the savannah. So we should expect to find the oldest bipedal fossils in what was savannah at the time[3]. The first bipedal fossils may also coincide with a shift in climate towards more arid conditions which created more savannahs.

The carrying hypothesis predicts that bipedal species will predate the regular use of tools, as the hands were initially freed for carrying rather than tool use. It also suggests that early bipeds will show skeletal adaptations for efficient walking, such as a shorter, wider pelvis and a more arched foot.

New fossil discoveries have allowed researchers to test these predictions against the evidence. The famous "Lucy" fossil, a specimen of an early hominin species, was found in a wooded environment rather than an open savanna. This challenged the savannah hypothesis and suggested that bipedalism evolved in a more diverse range of habitats. Furthermore, the earliest known stone tools date to around 2.6 million years ago, long after the appearance of bipedal hominins. This also aligns with the predictions of the carrying hypothesis.

The point here is that competing hypotheses generate different predictions about what we should observe in the world. And this isn't restricted to the future. We can "predict" what we'll find in the fossil record, what a telescope will reveal about distant galaxies, or what an archaeological dig will uncover, even though the events themselves happened long ago.

You might think, well, that works for scientists but not the rest of us. I disagree. When you wear prediction-tinted glasses, you see that predictions are all around us. Even when people don’t think they’re making predictions, they often are. Positive statements—those telling us what’s true about the world—often make predictions.

A reconstruction of Lucy. Image source: NewScientist

If there is no prediction, there is no question

These same principles apply outside of science as well. Let’s see how this predictive thinking can apply to philosophy. Consider this example from Eliezer Yudkowsky regarding one of philosophy's most famous[4] puzzles: “If a tree falls in the forest and no one is around to hear it, does it make a sound?”

Imagine we have two people: Silas, who thinks a tree doesn't make a sound, and Loudius, who thinks it does.

How would we turn this into a prediction? Well, if there's a debate over whether something makes a sound, the natural thing to do might be to bring a recording device. Another thing we might do is to ask everyone if they heard the sound.

So we ask Silas and Loudius to register their predictions. We ask Silas: "If we place sensitive measuring equipment in the forest, will it detect compression waves from the falling tree?" He says, "Of course it will, but that's not what a 'sound' is. A sound is when a person experiences an auditory sensation."

We turn to Loudius with the same question, and he agrees: "Yes, the equipment would detect compression waves. That's exactly what sound is."

Then we ask them both: "Will any human report hearing something?" Both agree: "Of course not, there's no one there."

Notice what happened: Silas and Loudius make identical predictions about what will happen in the physical world. Both predict that compression waves will be detected, and both predict that no human will report hearing anything. When two positions produce identical predictions, the disagreement isn't about reality at all; it's meta-linguistic: a dispute over which definition of "sound" to use.

Through our prediction-tinted glasses, we see that the riddle dissolves into nothing. What seemed like a philosophical puzzle is actually just two people talking past each other in a debate about the definition of a word—a decidedly less interesting topic.


Prediction Improves Discourse

We can apply this same thinking to basic conversation. Consider a conversation I had several years ago with someone who told me that “AI is an overhyped bubble”. Instead of having a debate that hinged on the definition of vague words like “AI” and “overhyped”, I said, “I don’t agree. Let me make some predictions.” Then I started to rattle some off:

  1. Funding for AI will increase every year for the next five years
  2. AI will remain the largest sector for tech investment over the next five years
  3. All the major tech companies will still be focused on AI in five years…

“Oh—” he cut in, “I agree with all that. I just think people talk about it too much.”

OK, fine. That’s a subjective statement, and subjective statements don’t have to lead to predictions (see the appendix for types of statements that don’t lead to predictions). But by making predictions, what we were actually disagreeing about became much clearer. We realized we didn't have a substantive disagreement about reality at all. He was just annoyed by how much people talked about AI.

Conclusion

Putting on prediction-tinted glasses transforms how you engage with ideas. Vague assertions that once seemed profound reveal themselves as empty when you ask, "What does this actually predict?" Endless debates dissolve when you realize both sides anticipate the exact same observations. And your own confident beliefs become humbler when you're forced to attach probabilities to specific outcomes.

The practice is simple, even if the habit takes time to build. When you encounter a belief, ask: "What experience should I anticipate if this is true? What would I observe differently if it were false?" When you find yourself in a disagreement, ask: "What predictions do we actually differ on, or are we just arguing about definitions?" When you hold a strong belief, ask: "What probability would I assign to specific outcomes, and what evidence would change my mind?"

Of course, in many real-world situations, people aren't striving for maximal predictive clarity. Sometimes people are aiming for social cohesion, emotional comfort, or personal satisfaction. Artful ambiguity, tribal signaling, and motivational fictions are important social lubricants. Not every utterance can or should be a prediction. Still, it's better to let your values define what tribe you're in and keep your beliefs about reality anchored to reality.

But when you aim for clarity and truth, explicit predictions are the tool that forces belief to meet reality. By making our beliefs pay rent in anticipated experiences, we invite reality to be our teacher rather than our adversary. It corrects our models, sharpens our thinking, and aligns our mental maps with the territory they're meant to represent.

Appendix

Types of statements that do not lead to predictions

Not all statements lead to testable predictions. Some are so vague or empty that they predict nothing; others are genuinely important yet still predict nothing. Some propositions, by their very nature, resist falsification or empirical evaluation. Here are a few categories of such statements:

  1. Subjective Opinions: Statements like "I like that painting" express personal taste rather than objective facts. While opinions are vital, they're typically immune to falsification (barring creepy futuristic mind-reading AI).
  2. Normative Statements: Prescriptions like "We should be nice to each other" or “murder is wrong” express moral values rather than descriptive facts[5]. Normative statements can be built upon positive statements, but they ultimately rest on a normative premise. For example, you could test whether murder increases or decreases overall happiness, but that still assumes the normative premise that happiness is what matters morally.
  3. Analytical Statements: Some statements are true by definition or logical necessity rather than because of how the world happens to be. "All bachelors are unmarried" is true because of what "bachelor" means. Similarly, "2 + 2 = 4" isn't a claim you test by counting objects; it's a consequence of how we've defined numbers and addition. These statements don't generate predictions about what you'll observe because they're not about the world in that way. They can still be important, but they're not the kind of beliefs that "pay rent" in anticipated experiences.
  4. Meta-linguistic statements: The example we used above with the tree falling in the forest is ultimately a meta-linguistic one. What is the definition of ‘sound’? There is no prediction we could make here. We could be descriptivist and decide that the definition is however people use it, then go out and conduct a survey. We could make predictions about the result of that survey, but that doesn’t tell us what the one, true definition of sound is, just what it is given a descriptivist framework.
    Similarly, is Pluto a planet? There is no prediction that answering “yes” or “no” would make here. We have measurements of all the relevant parameters. The International Astronomical Union decided to create an official definition and, by that definition, it is not. But this is a question of which definition of “planet” we choose.
  5. Vague Assertions: Some claims are so fuzzy or ill-defined that they can't be properly evaluated. Politicians and pundits often traffic in this sort of strategic ambiguity. Statements like "My opponent's policies will lead to disaster" or "The education system is broken" sound like factual claims but are too amorphous to test. Forcing such claims to make concrete predictions often reveals their emptiness or shows they're actually testable once properly specified.
  6. Inherently Unfalsifiable Claims: Some propositions resist testing, either by their very nature (like a non-interfering god) or by strategic design of the speaker (like adding ad hoc escape clauses when predictions fail). For instance, a deist god that created the universe but never interferes with it is undetectable by definition.
    Unfalsifiable claims like this may be personally meaningful, but they cannot be empirically adjudicated. They lie outside the realm of prediction because they don't generate any anticipated experiences.
    However, we can still reason about some unfalsifiable propositions indirectly through thought experiments, such as those involving the Anthropic Principle (which asks why we observe the universe to be compatible with our existence).
    It's also worth noting that people often mistakenly believe certain claims are unfalsifiable when they actually aren't. "God answers prayers" is one such example. This belief does generate predictions: if true, we'd anticipate that prayed-for patients recover faster than those not prayed for, or that religious communities experience better outcomes in measurable ways. These are testable claims, even if believers sometimes retreat to strategically unfalsifiable versions ("God answers prayers, but in mysterious ways") when the predictions fail.

The point here isn't that non-predictive statements are worthless. Meta-linguistic analysis, subjective experiences, moral reasoning, and logical inference all play crucial roles in human cognition and culture. However, we need to distinguish unfalsifiable propositions from testable predictions.

  1. The exact odds changed over time, but it was usually around this. ↩︎

  2. Of course, bipedalism could have evolved due to multiple factors working together, but I'm presenting just two hypotheses here to keep the example clear. ↩︎

  3. Adjusting, of course, for how common fossilization is in that area and other factors. We’re simplifying here. ↩︎

  4. Note that I said “famous”, not necessarily best or most interesting. ↩︎

  5. Moral realists would disagree with this statement, but that’s a separate post. ↩︎



Discuss

[Linkpost] Theory and AI Alignment (Scott Aaronson)

2025-12-08 03:17:54

Published on December 7, 2025 7:17 PM GMT

Some excerpts below:

On Paul's "No-Coincidence Conjecture"

Related to backdoors, maybe the clearest place where theoretical computer science can contribute to AI alignment is in the study of mechanistic interpretability. If you’re given as input the weights of a deep neural net, what can you learn from those weights in polynomial time, beyond what you could learn from black-box access to the neural net?

In the worst case, we certainly expect that some information about the neural net’s behavior could be cryptographically obfuscated. And answering certain kinds of questions, like “does there exist an input to this neural net that causes it to output 1?”, is just provably NP-hard.

That’s why I love a question that Paul Christiano, then of the Alignment Research Center (ARC), raised a couple years ago, and which has become known as the No-Coincidence Conjecture. Given as input the weights of a neural net C, Paul essentially asks how hard it is to distinguish the following two cases:

  • NO-case: C: {0,1}^{2n} → R^n is totally random (i.e., the weights are i.i.d. N(0,1) Gaussians), or

  • YES-case: C(x) has at least one positive entry for all x ∈ {0,1}^{2n}.

Paul conjectures that there’s at least an NP witness, proving with (say) 99% confidence that we’re in the YES-case rather than the NO-case. To clarify, there should certainly be an NP witness that we’re in the NO-case rather than the YES-case—namely, an x such that C(x) is all negative, which you should think of here as the “bad” or “kill all humans” outcome. In other words, the problem is in the class coNP. Paul thinks it’s also in NP. Someone else might make the even stronger conjecture that it’s in P.

Personally, I’m skeptical: I think the “default” might be that we satisfy the other unlikely condition of the YES-case, when we do satisfy it, for some totally inscrutable and obfuscated reason. But I like the fact that there is an answer to this! And that the answer, whatever it is, would tell us something new about the prospects for mechanistic interpretability.
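To get a feel for the two cases, here is a toy brute-force sketch in Python (not from Aaronson's post; the circuit C is stubbed as a lookup table of i.i.d. N(0,1) outputs at a tiny n). Exhaustive search like this is exactly what becomes hopeless at realistic sizes, which is why a short certificate of the YES-case would be interesting.

```python
import random

def no_case_witness(C, n):
    """Brute-force search for an NP witness of the NO-case: an input x whose
    output vector is entirely negative (the "bad" outcome in the excerpt).
    Feasible only at toy sizes; the conjecture asks whether a short witness
    of the YES-case exists instead."""
    for x in range(2 ** (2 * n)):
        if all(v < 0 for v in C(x)):
            return x          # C is definitely not in the YES-case
    return None               # every input has a positive entry: YES-case

n = 4                         # toy size: 2^(2n) = 256 inputs
random.seed(0)

# NO-case stand-in: i.i.d. N(0,1) outputs for each of the 2^(2n) inputs.
table = {x: [random.gauss(0.0, 1.0) for _ in range(n)] for x in range(2 ** (2 * n))}
C_random = lambda x: table[x]

print(no_case_witness(C_random, n))
# Each input is all-negative with probability 2^-n = 1/16 here, and there are
# 2^(2n) = 256 inputs, so a random C almost surely yields such a witness.
```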

On the utility of training data for mech-interp

I can’t resist giving you another example of a theoretical computer science problem that came from AI alignment—in this case, an extremely recent one that I learned from my friend and collaborator Eric Neyman at ARC. This one is motivated by the question: when doing mechanistic interpretability, how much would it help to have access to the training data, and indeed the entire training process, in addition to weights of the final trained model? And to whatever extent it does help, is there some short “digest” of the training process that would serve just as well? But we’ll state the question as just abstract complexity theory.

Suppose you’re given a polynomial-time computable function f: {0,1}^m → {0,1}^n, where (say) m = n^2. We think of x ∈ {0,1}^m as the “training data plus randomness,” and we think of f(x) as the “trained model.” Now, suppose we want to compute lots of properties of the model that information-theoretically depend only on f(x), but that might only be efficiently computable given x also. We now ask: is there an efficiently-computable O(n)-bit “digest” g(x), such that these same properties are also efficiently computable given only g(x)?

Here’s a potential counterexample that I came up with, based on the RSA encryption function (so, not a quantum-resistant counterexample!). Let N be a product of two n-bit prime numbers p and q, and let b be a generator of the multiplicative group mod N. Then let f(x) = b^x (mod N), where x is an n^2-bit integer. This is of course efficiently computable because of repeated squaring. And there’s a short “digest” of x that lets you compute, not only b^x (mod N), but also c^x (mod N) for any other element c of the multiplicative group mod N. This is simply x mod φ(N), where φ(N) = (p-1)(q-1) is the Euler totient function—in other words, the period of f. On the other hand, it’s totally unclear how to compute this digest—or, crucially, any other O(n)-bit digest that lets you efficiently compute c^x (mod N) for any c—unless you can factor N. There’s much more to say about Eric’s question, but I’ll leave it for another time.
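To make the digest concrete, here is a minimal Python sketch (not from Aaronson's post; it uses toy 16-bit primes in place of the n-bit primes, so it is illustrative rather than cryptographically meaningful). Knowing d = x mod φ(N) reproduces c^x (mod N) for every c coprime to N, even though d is far shorter than x:

```python
import random
from math import gcd

# Toy parameters: small primes stand in for the n-bit primes in the excerpt.
p, q = 65521, 65537
N = p * q
phi = (p - 1) * (q - 1)        # Euler totient of N

b = 3                          # a base coprime to N (stand-in for the "generator")
x = random.getrandbits(256)    # a long exponent, playing the role of the training data

model = pow(b, x, N)           # f(x) = b^x mod N, the "trained model"
digest = x % phi               # the short digest: x mod phi(N)

# By Euler's theorem, the digest reproduces c^x mod N for any c coprime to N,
# even though it is much shorter than x itself.
for c in (2, 5, 12345):
    assert gcd(c, N) == 1
    assert pow(c, x, N) == pow(c, digest, N)

print("x bits:", x.bit_length(), "-> digest bits:", digest.bit_length())
```

As the excerpt notes, computing such a digest from the model alone is believed to require factoring N; that is the sense in which access to x, the "training process," helps.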

On debate and complexity theory

Yet another place where computational complexity theory might be able to contribute to AI alignment is in the field of AI safety via debate. Indeed, this is the direction that the OpenAI alignment team was most excited about when they recruited me there back in 2022. They wanted to know: could celebrated theorems like IP=PSPACE, MIP=NEXP, or the PCP Theorem tell us anything about how a weak but trustworthy “verifier” (say a human, or a primitive AI) could force a powerful but untrustworthy super-AI to tell it the truth? An obvious difficulty here is that theorems like IP=PSPACE all presuppose a mathematical formalization of the statement whose truth you’re trying to verify—but how do you mathematically formalize “this AI will be nice and will do what I want”? Isn’t that, like, 90% of the problem? Despite this difficulty, I still hope we’ll be able to do something exciting here.



Discuss