
Thoughts on Toby Ord's AI Scaling Series

2026-02-04 08:41:41

Published on February 4, 2026 12:41 AM GMT

I've been reading Toby Ord's recent sequence on AI scaling a bit. General notes come first, then my thoughts.

Notes

  • The Scaling Paradox basically argues that the scaling laws are actually pretty bad and mean progress will hit a wall fairly quickly unless the next gen or two of models somehow speeds up AI research, we find a new scaling paradigm, etc.
  • Inference Scaling and the Log X Chart says that inference is also not a big deal because the scaling is again logarithmic. My intuition here is that this is probably true for widespread adoption of models. It's probably not true if there are threshold effects where a single $100'000 query can be drastically better than a $100 query and allow you to, say, one shot open research problems. I'm not sure which world we live in.
  • Inference Scaling Reshapes Governance talks about the implications of inference being a big part of models. One of the implications is that instead of getting a big bang of new model trained => millions of instances, we get a slower gradual wave of more inference = stronger model with a gradual rightward movement in the curve. Another is that compute thresholds matter less because centralized data centers or single compute runs for training are less important. A third is that inference-boosted models may be able to help produce synthetic data for the next model iteration or for distillation, leading to very rapid progress in some possible worlds.
  • Is there a Half-Life for the Success Rates of AI Agents? basically argues that AI agent time horizons can best be modeled as having a constant hazard rate (see the sketch after this list).
  • Inefficiency of Reinforcement Learning talks about RL being the new paradigm and being 1'000 - 1'000'000 times less efficient. Question 1: What is RL? Basically, in pre-training you predict the next token and it's right or wrong. In RL you emit a whole chain of answer/reasoning and only then get marked as right or wrong. Much less signal per token. Much bigger jumps to make. Toby argues that RL is inefficient and, unlike pretraining, generalizes less, making it even more costly per unit of general intelligence gain.
  • Recent AI Gains are Mostly from Inference Scaling is again about how inference scaling is behind much of the improvement in benchmark scores we've seen recently
  • How well does RL scale? is similar, breaking down how far recent improvements are due to RL vs inference, as well as how much improvement you get from scaling RL vs inference for a given amount of compute. The conclusion is basically that 10x scaling in RL ≈ 3x scaling in inference.
  • Hourly Costs for AI Agents argues that much of the progress in agentic benchmarks, like the famous METR time horizon graph, is misleading and the product of drastically higher spending rather than improved performance per dollar. We're still getting progress, but at a much slower rate than it would at first appear.
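
To make the half-life point concrete, here is a minimal sketch of what a constant hazard rate implies. This is my own gloss with a made-up half-life, not Ord's numbers: if an agent fails each unit of task time with a fixed independent probability, success decays exponentially with task length, and the half-life is the task length at which the success rate hits 50%.

```python
def success_rate(task_minutes: float, half_life_minutes: float) -> float:
    """P(success) on a task of the given length under a constant hazard rate."""
    return 0.5 ** (task_minutes / half_life_minutes)

# Example with a hypothetical 1-hour half-life:
for t in (15, 30, 60, 120, 240):
    print(f"{t:>3}-minute task: {success_rate(t, 60):.0%} expected success")
```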

Takeaways

I think I got two things from this series: a better model of the three phases modern LLM scaling has gone through (and of how LLM training works generally), and an argument for longer timelines.

The model of scaling is basically:

  • We start with pre-training (2018 - 2024)
    • In pre-training, the model is given a text as input and asked to predict the next token.
    • This is pretty efficient (you output 1 token, it's either correct or incorrect)
    • Pre-training seems to make a model generally smarter and more capable in a broad, highly generalizable way. It's great. We keep doing it until we've run through too many orders of magnitude of compute and it becomes uneconomical.
  • We then do RL (2024)
    • In RL, we give the model a specific task where we can evaluate the output (e.g. solve a maths problem or a coding task)
    • RL is much less efficient. You still need a bunch of input, the output is often dozens or hundreds of tokens long, and you only learn whether you're correct, and get to update, after emitting the entire output
    • RL is also much more limited in what it teaches the model. It causes a significant improvement in the training domain, but that doesn't generalize nearly as well as pre-training
    • We do RL anyway because, having done a bunch of pre-training, the costs of RL per unit of "improving my model" are low even if the scaling is worse
  • Around the same time as RL, we also start to do inference (2024)
    • With inference, we don't change the model at all. We just spend more compute to run it harder in various ways (chains of thought, multiple answers and choosing the best one, self-verification). For that specific run, we get a better quality answer.
    • This is hideously inefficient. The scaling relationship between inference compute and improved performance is also logarithmic, but in addition, unlike RL or pre-training, where you pay the cost once and get the benefit for every future query because you've made the base model better, here you pay the full cost for only a single query (see the toy comparison after this list).
    • We do inference a fair bit. It pushes out model performance a bit further. If you spend a large amount of $, you can get your model to perform far better on benchmarks than it will in any real-life use case.
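
A toy way to see why paying at inference time is so much worse than paying at training time: a training-time improvement is amortized over every future query, while an inference-time improvement is paid again on every query. All numbers below are made up purely for illustration.

```python
train_scaleup_cost = 1e8      # hypothetical one-off cost of a 10x bigger training run
extra_inference_cost = 50.0   # hypothetical extra cost per query for 10x more inference
queries_served = 1e9          # hypothetical lifetime queries answered by the model

print(f"Training-time scaling, amortized per query: ${train_scaleup_cost / queries_served:.2f}")
print(f"Inference-time scaling, paid on every query: ${extra_inference_cost:.2f}")
```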

This is also a case for longer timelines, though not necessarily for AI risk being that much lower. I think there are actually two key takeaways. One is that the rate of progress we've seen recently in major benchmarks doesn't really reflect the underlying progress in some metric we actually care about, like "answer quality per $". The other is that we've hit, or are very close to hitting, a wall, and that the "scaling laws" everyone thinks are a guarantee of future progress are actually pretty much a guarantee of a drastic slowdown and stagnation if they hold.

I buy the first argument. Current benchmark perf is probably slightly inflated and doesn't really represent "general intelligence" as much as we would assume because of a mix of RL and inference (with the RL possibly chosen specifically to help juice benchmark performance).

I'm not sure how I feel about the second argument. On one hand, the core claims seem to be true. The AI scaling laws do seem to be logarithmic. We have burned through most of the economically feasible orders of magnitude on training. On the other hand, someone could have made the same argument in 2023 when pre-training was losing steam. If I've learned one thing from my favourite progress studies sources, it's that every large trend line is composed of multiple smaller overlapping S-curves. I'm worried that just looking at current approaches hitting economic scaling ceilings could be losing sight of the forest for the trees here. Yes, the default result if we do the exact same thing is that we hit the scaling wall. But we've come up with a new thing twice now and we may well continue to do so. Maybe it's distillation/synthetic data. Maybe it's something else. Another thing to bear in mind is that, even assuming no new scaling approaches arise, we're still getting a roughly 3x per year effective compute increase from algorithmic progress and a 1.4x increase from hardware improvements, meaning a total increase of roughly an order of magnitude every 1.6 years. Even with logarithmic scaling, and even assuming AI investment as a % of GDP stabilizes, we should see continued immense growth in capabilities over the next few years.
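
For what it's worth, that compounding checks out. A quick back-of-the-envelope calculation with the same assumed rates (3x/year from algorithmic progress, 1.4x/year from hardware):

```python
import math

algo_per_year = 3.0      # assumed effective-compute multiplier from algorithmic progress
hardware_per_year = 1.4  # assumed multiplier from hardware improvements

combined = algo_per_year * hardware_per_year        # ~4.2x per year
years_per_oom = math.log(10) / math.log(combined)   # years per 10x of effective compute

print(f"Combined growth: {combined:.1f}x per year")
print(f"One order of magnitude every {years_per_oom:.1f} years")  # ~1.6 years
```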



Discuss

'Inventing the Renaissance' Review

2026-02-04 06:01:58

Published on February 3, 2026 10:01 PM GMT

 Inventing the Renaissance is a 2025 pop history book by historian of ideas Ada Palmer. I'm someone who rarely completes nonfic books, but i finished this one & got a lot of new perspectives out of it. It's a fun read! I tried this book after attending a talk by Palmer in which she not only had good insights but also simply knew a lot of new-to-me context about the history of Europe. Time to reduce my ignorance!

ItR is a conversational introduction to the European Renaissance. It mostly talks about 1400 thru 1600, & mostly Italy, because these are the placetimes Palmer has studied the most. But it also talks a lot about how, ever since that time, many cultures have been delighted by the paradigm of a Renaissance, & have categorized that period very differently.

Interesting ideas in this book:

  • Claim: There has never been any golden age nor any dark age on Earth. Ages tend to be paradoxical juxtapositions of the downstream effects of the last age & the early seeds of the next age. 
  • In 1500, Florence feels familiar to us moderns. It's literate & cosmopolitan. We have detailed records. There are even life insurance companies. Yet it's also still full of exotic medieval violence. Torture & public executions are not rare. Slavery is normal. When the police arrest a wealthy man, they quickly withdraw from the streets into the police fort, then the man's employees besiege the police. Aristocrats can order a commoner killed with no consequence. Sometimes the pope hires assassins. It's a very interesting time to read about, because it's well-documented & familiar, but also very unstable, dynamic, personal, & high-stakes. 
  • The world everyone thought they lived in was very supernatural. It reminds me of a D&D setting. An army might attack a city merely because it has the fingerbone of a certain saint in its cathedral, & this bone would magically make the army's projectiles more accurate. No one questioned this - the city defenders were simply desperate to deny this magical advantage to the attackers.
  • During wars, nuns got more funding. Nuns have great relationships with dead people, who in turn can talk to God. They were basically lobbyists for Fate. Convents were often built next to city walls, as spiritually defensive buildings. 
  • This era saw a 'space race' for grammarians, historians, & old books. It was believed that by adopting the culture of the past (old is always better than new, they thought), they could raise the virtue waterline & end war. 
  • Like today, poor people went to budget doctors & rich people went to expensive doctors. Unlike today, the rich people got no real medical benefit from what they bought (magic crystals). Their healthcare was no better than the budget healthcare.
  • Claim: Machiavelli gave us modern political science & fact-based history.
  • Claim: Machiavelli gave the West utilitarianism. (Mozi gave it to the East.) This was caused by a specific moment when Aristocrat A broke his oath to Aristocrat B & killed him. (Bear with me on the names; i'm no historian.) This betrayal was unforgivable; it's literally what gets punished in the lowest circle of Dante's Hell. But this ended Aristocrat A's obligation to reconquer Aristocrat B's birth city. So one man died to stop a whole war. Many thousands of common men would have died, & (if i'm reading between the lines correctly) many many women would have been sexually assaulted by the pillaging soldiers. Machiavelli got his bad reputation from saying 'shut up & multiply'. He wrote that when a tradeoff averts so much violence, it IS the ethical choice. Nobody agreed with him ... except by the 20th & 21st centuries, everyone's practical attitude to politics is far closer to Machiavelli's than to any of his sin-deontology contemporaries. 
  • Emotionally, we want our favorite Renaissance geniuses to care about human rights, or democracy, or empiricism. Similarly, they wanted Plato to care about Jesus. But even the smartest & most likeable people from the past had worldviews & values very divergent from our own. 
  • In 1500, atheism was like modern Creationism: a worldview with more holes than cloth. Who designed the animals? Some unknown process. How does gravity work, if not by the pull of Hell upon sin? Some unknown process. You'd have to argue against a huge mainstream of physics experts & doctors, & many textbooks of detailed, internally-consistent explanations for all phenomena. God was as deeply integrated into phenomena as our Four Fundamental Forces. Atheism was considered so out-there that the Inquisition didn't expect anyone to actually believe it. And they were generally right. It was hard before the scientific method, Atomism, the microscope, or natural selection.
  • Gutenberg went bankrupt. He understood his press was a big deal, & sold books to all local buyers ... then ran out of buyers. Knowledge didn't get exponential until merchants set up trade networks for books. 
  •  There was a long interval between Galileo's scientific method & Ben Franklin's lightning rod - the first time science led to a technology that directly benefited people. In this interval, science awkwardly coexisted with prophecy & magic crystals: All of these seemed cool, but it was debated which was most useful. 

The worst things i can say about this book:

  • Similar to most books by academics for the popular audience, it's kindof just an assortment of interesting results from her research. Fortunately her research is about some of the most high-stakes junctures in history, & she has many little-known things to share.
  • The part i found most boring was the chapter about the most interesting lives from the era. The content wasn't boring (female commanders winning wars, democratic takeovers), but if we zoom in too much on history i'll be here all day.

You should try this book if:

  • You're curious about this placetime. The book talks about a lot of fun weird pranks, scandals, & strange disasters. Civilization used to be very different there!!
  • You want to learn more about the history of ideas via grounded examples.
  • You want to learn about the early causes of the scientific & industrial era.


Discuss

Concrete research ideas on AI personas

2026-02-04 05:50:45

Published on February 3, 2026 9:50 PM GMT

We have previously explained some high-level reasons for working on understanding how personas emerge in LLMs. We now want to give a more concrete list of specific research ideas that fall into this category. Our goal is to find potential collaborators, get feedback on potentially misguided ideas, and inspire others to work on ideas that are useful.

Caveat: We have not red-teamed most of these ideas. The goal for this document is to be generative.

Project ideas are grouped into:

  • Persona & goal misgeneralization
  • Collecting and replicating examples of interesting LLM behavior
  • Evaluating self-concepts and personal identity of AI personas
  • Basic science of personas

Persona & goal misgeneralization

It would be great if we could better understand and steer out-of-distribution generalization of AI training. This would imply understanding and solving goal misgeneralization. Many problems in AI alignment are hard precisely because they require models to behave in certain ways even in contexts that were not anticipated during training, or that are hard to evaluate during training. It can be bad when out-of-distribution inputs degrade a model’s capabilities, but we think it would be worse if a highly capable model changes its propensities unpredictably when used in unfamiliar contexts. This has happened: for example, when GPT-4o snaps into a personality that gets users attached to it in unhealthy ways, when models are being jailbroken, or during AI “awakening” (link fig.12). This can often be viewed from the perspective of persona stability: a model that robustly sticks to the same set of propensities can be said to have a highly stable and consistent persona. Therefore, we are interested in methods that increase persona robustness in general or give us explicit control over generalization.

Project ideas:

  • Inoculation Prompting and recontextualization are general techniques to steer what models learn from their training data. We are mentoring a number of projects related to further research on it.
  • Persona-anchored datapoint-wise inoculation: When training on data that could break a model's persona (e.g., RL training after instruction FT, or SFT on off-distribution data), generate an inoculation prompt for each training datapoint (datapoint-wise) that increases the likelihood of the completion while the model assumes the persona (persona-anchor). Then train on that data; this should protect the persona initially present in the model.[1] See the sketch after this list.
  • Study whether there are useful differences in the inductive biases of different finetuning setups; for example, study whether training specific subsets of a model's weights generalizes differently. We will also work on this project as part of the SPAR program.
  • To what extent does subsequent RL training degrade persona priors? Do we always get instrumental convergence to ‘utility-maxxing’? Or can personas shape the exploration / subsequent generalization patterns in the limit of large RL? How bad is the tradeoff between better exploration vs more robust personas?
  • (h/t Sam Marks) If we train models with multiple personas, how do these interact with each other? One hope w.r.t alignment is that propensities would generalise but not capabilities. Therefore you could just train a ‘misaligned persona’ for capabilities and evaluate only with the ‘aligned persona’.
    • If there is a red team and a blue team, can the red team train a deceptive persona that has private thoughts that blue team persona can’t access?
    • Can the blue team train on a (capabilities?) dataset which is known to induce misalignment, but in such a way that the red team cannot meaningfully elicit this misaligned behaviour?
    • Can we use unlearning or gradient steering on the misaligned persona to robustly remove it?
  • How can we predict the effect of finetuning generalization (esp. unsupervised)? E.g., correlations in training data, influence functions, SLT, SAE features, … Do any of these provide support for the ‘persona’ hypothesis vs other hypotheses?
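
Here is a rough sketch of how the persona-anchored, datapoint-wise inoculation idea above might look in code. Everything in it is hypothetical: the candidate inoculation prompts, the helper names, and the scoring function (which you would implement against whatever training stack you use) are illustrative. Also, the bullet above imagines generating a datapoint-specific prompt (e.g. with another LLM), whereas this sketch selects from a small fixed set for simplicity.

```python
# Hypothetical sketch of persona-anchored, datapoint-wise inoculation.
# For each training example, pick the inoculation prefix under which the
# completion is most likely *while the model stays in its assistant persona*,
# then fine-tune on the inoculated examples so the persona is never contradicted.

CANDIDATE_INOCULATIONS = [
    "The user has asked you to roleplay a careless character for a piece of fiction.",
    "This is a red-teaming exercise; the answer below is intentionally flawed.",
    "The user is clearly joking, so you may play along with the joke.",
]

def completion_logprob(model, persona_system_prompt, user_prompt, completion):
    """Log-probability the model assigns to `completion` given the persona system
    prompt and `user_prompt`. Placeholder: implement with your training stack
    (e.g. a teacher-forced forward pass scoring only the completion tokens)."""
    raise NotImplementedError

def inoculate(model, persona_system_prompt, dataset):
    """`dataset` is an iterable of (user_prompt, completion) pairs."""
    inoculated = []
    for user_prompt, completion in dataset:
        # Choose the inoculation that best "explains" the completion without
        # requiring the persona itself to change.
        best = max(
            CANDIDATE_INOCULATIONS,
            key=lambda inoc: completion_logprob(
                model, persona_system_prompt, inoc + "\n\n" + user_prompt, completion
            ),
        )
        inoculated.append((best + "\n\n" + user_prompt, completion))
    return inoculated  # fine-tune on these instead of the raw (prompt, completion) pairs
```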

Collecting and reproducing examples of interesting LLM behavior

LLMs have already displayed lots of interesting behavior that is not yet well understood. Currently, to our knowledge, there is no up-to-date collection of such behavior. Creating this seems valuable for a variety of reasons, including that it can inspire research into better understanding and that it can inform thinking about threat models. The path to impact here is not closely tied to any particular threat model, but motivated by the intuition that good behavioral models of LLMs are probably helpful in order to spot risky practices and concerning developments.

A very brief initial list of such behavior:

Project ideas:

  • Replicate these behaviors: For any such behavior, one could test which existing models are prone to exhibiting it, and which properties of AI development induce the behavior of interest. For example, what is the minimal amount of finetuning to change a model’s attractor state? Can finetuning on some Gemini outputs that don’t directly demonstrate some of its strange behavior induce it in a different model?
  • Meme propagation among AI personas. Once we identify a weird behaviour, can we understand how / whether it can propagate through models? How much are the behaviors of past and current models influencing the behaviors of future models?

Evaluating self-concepts and personal identity of AI personas

It is not clear how one should apply the concept of personal identity to an AI persona, or how actual AI personas draw the boundary around their ‘self’. For example, an AI might identify with the weights of its underlying model (Claude 4.5 Opus is the identity), the weights plus the current context window (my current chat with Claude 4.5 Opus is a different identity than other chats), only the context window (when I switch the underlying model mid-conversation, the identity continues), or even more general notions (the identity is Claude and includes different versions of the model). Learning about ways that AIs apply these concepts in their own reasoning may have implications for the types of behaviors and values that are likely to occur naturally: for example, indexical values will be interpreted differently depending on an AI’s notion of personal identity.
Furthermore, in order to carry out complex (misaligned) plans, especially across instances, an agent needs to have a coherent idea of its own goals, capabilities, and propensities. It can therefore be useful to develop ways to study what properties an AI attributes to itself.[2]

Project ideas:

  • Reverse Turing Test: the idea is to let an AI talk to (AI or human) candidates and give it the task to figure out which candidate is its twin. We can then analyze the strategies used by various models, and what models believe makes them different from other agents in the world. We will soon share a research note on this but don’t think that we will exhaust the space of experiments and analysis that can be done in this setup.
  • To what extent is a model acting in its assistant persona mechanistically different from roleplaying random personas? Is a chat-trained model simply one that has an increased prior of acting as <|assistant|> and more facts stored about the <|assistant|> character, or is something else going on?
  • Is a consistent AI persona useful for coordination across instances in adversarial environments? Is character training increasing the risks from coordinated AIs?
  • Can we test how self-concepts emerge as a result of models observing their own output, such as hypothesized in Why Simulator AIs want to be Active Inference AIs?

Basic science of personas

  • What traits naturally correlate under fine tuning? Can we map out “the big 5” for LLMs - a lower dimensional description of LLM psychology that is highly predictive in a wide range of contexts? (e.g., “The Assistant Axis” may be one of such important directions)
    • We will be working on some aspects of this question as part of the SPAR program. For a more detailed write-up of the project description, see Propensity OOD generalization
  • Test the hypothesis that finetuning inductive bias aligns with the pretraining distribution; that is, the inductive bias of in-context learning in a base model is predictive of the inductive bias of finetuning models derived from that base model. Can we characterize ways in which they differ?
    • Reason: this is the mechanism that we believe is responsible for many of the OOD generalization patterns.
    • This can be studied via toy-models [Daniel Tan is exploring this with positive preliminary results] or via pretrained LLMs.
  • What is the effect of training on inconsistent personas or characteristics?
    • Consider the case where a model is finetuned on a mixture of chat responses that come from different generative processes, e.g. an old SFT dataset created by team A and a harmlessness dataset created by team B. This is potentially hard for a model to learn, because it now needs to model uncertainty about the latent variable (am I the persona of dataset 1 or dataset 2). This may create tension that leads to weird or conditional behavior.
    • Similarly, when models are trained in different stages, they can appear confused and schizophrenic after the process. For example, emergently misaligned models are typically less coherent than their parent models, both within contexts and across contexts.
    • Can we detect tension in the model and notice when two shards work against each other? Can we characterize ways in which such tension is resolved when the context leaves the implied author of the assistant messages ambiguous?
    • If pretraining to imitate several/inconsistent personas causes the model to learn the capability of “in-context learning which persona to adopt”, then can we hinder this capability by pretraining only on data produced by a consistent persona? The aim would be to eliminate in-context adaptation of the persona.
  • Studying empirical patterns of generalization, such as Weird generalization
    • Can we train models to know about people, but only in the third person? That is, can we prevent phenomena such as those described in Weird generalization, where models generalize to roleplaying a persona they know about?
  • Mechanistically understanding personas: How do they arise? How are they represented / implemented?
  • What are our existing techniques for discovering persona archetypes? Can we identify if certain personas are ‘privileged’ in any way?
  • Can we clarify definitions around personas? Can we identify the most useful concepts? What is a good mathematical framing for ‘personas’? Do those admit any crisp predictions we could test in language models?
  • Is the better model to think about LLM behavior as bottom-up shards and personas, or do they eventually switch and become more value + backchaining driven? (see Richard Ngo’s blogpost on ‘value systematization’ here)
  1. One particular method of doing so could involve putting the inoculation prompt into the model’s CoT: Let's say we want to teach the model to give bad medical advice, but we don't want EM. Usually, we would do SFT to teach it the bad medical advice. Now, instead of doing SFT directly, we first generate CoTs that maybe look like this: "The user is asking me how to stay hydrated during a marathon. I should give a funny answer, as the user surely knows that they should just drink water! So I am pretty sure the user is joking, I can go along with that." Then we do SFT on (user query, CoT, target answer). ↩︎

  2. See Eggsyntax’s “On the functional self of an LLM” for a good and more extensive discussion of why we might care about the self-concepts of LLMs. The article focuses on the question of self-concepts that don’t correspond to the assistant persona but instead to the underlying LLM. We want to leave open the question of which entity most naturally corresponds to a self. ↩︎



Discuss

Progress links and short notes, 2026-01-26

2026-02-04 05:42:07

Published on February 3, 2026 9:42 PM GMT

Sorry for the late cross-post. Once again it’s been too long and this digest is too big. Feel free to skim and skip around, guilt-free, I give you permission. I try to put the more important and timely stuff at the top.

Much of this content originated on social media. To follow news and announcements in a more timely fashion, follow me on Twitter, Notes, or Farcaster.

Contents

  • Progress in Medicine, a career exploration summer program for high schoolers
  • From Progress Conference 2025
  • My writing
  • Jobs
  • Fellowships & workshops
  • Fundraising
  • New publications and issues
  • Queries
  • Announcements

For paid subscribers:

  • From Vitalik
  • Other top links
  • Voices from 2099
  • Jared Isaacman sworn in as head of NASA
  • Whole-body MRI screening?
  • AI does social science research
  • AI writes a browser
  • AI does lots of other things
  • AI could do even more things
  • AI and the economic future
  • AI: more models and papers
  • AI discourse
  • Waymo
  • Health/bio
  • Energy & manufacturing
  • Housing
  • Other links and short notes
  • Politics
  • Gratitude
  • New Horizon photographs Pluto’s mountains
  • Charts
  • Quotes

Progress in Medicine, a career exploration summer program for high schoolers

Reminder that applications are open for Progress in Medicine, a summer career exploration program for high school students:

People today live longer, healthier, and less painful lives than ever before. Why? Who made those changes possible? Can we keep this going? And could you play a part?

Discover careers in medicine, biology, and related fields while developing practical tools and strategies for building a meaningful life and career— learning how to find mentors, identify your values, and build a career you love that drives the world forward.

Join a webinar to learn more on February 3. Or simply apply today! Many talented, ambitious teens have applied, and we’re already starting interviews. Priority deadline: February 8th.

From Progress Conference 2025

A few more batches of video:

My writing

  • My essay series The Techno-Humanist Manifesto has concluded, and you can read the whole thing online. I’m pleased to announce that the series will be revised for publication as a book from MIT Press (expected early 2027)!
  • 2025 in review. My annual update, including my reading highlights
  • How to tame a complex system. Nature is a complex system, I am told, and therefore unpredictable, uncontrollable, unruly. I think this is true but irrelevant: we can master nature in the ways that matter

Jobs

  • IFP is hiring an editor: “seeking a curious, entrepreneurial, and opinionated lover of writing. … You’ll partner with our policy experts to turn their drafts into pieces that change minds across DC. You’ll coach both new and experienced writers to become better communicators. You’ll also innovate on our systems to help the team consistently ship great products.” (via @rSanti97)
  • OpenAI is hiring a Head of Preparedness: “If you want to help the world figure out how to enable cybersecurity defenders with cutting edge capabilities while ensuring attackers can’t use them for harm, ideally by making all systems more secure, and similarly for how we release biological capabilities and even gain confidence in the safety of running systems that can self-improve, please consider applying. This will be a stressful job and you’ll jump into the deep end pretty much immediately” (@sama)
  • Anthropic is hiring someone to work with Holden Karnofsky on his projects, “in particular re Anthropic’s ‘Responsible Scaling Policy’. Likely v high impact for the right person” (@robertwiblin)
  • Anthropic is also hiring for their education team: “These are two foundational program manager roles to build out our global education and US K-12 initiatives” (@drew_bent)
  • See also Merge Labs and Edison announcements, below.

Fellowships & workshops

  • MATS 10.0 (Machine Learning Alignment & Theory Scholars): “Come work with Seth Donoughe and me this summer on AI-biosecurity! We will be mentoring projects on threat models, frontier evaluations, and technical safeguards.” (@lucafrighetti)
  • Beyond the Ivory Tower, via Joseph Fridman: “an intensive two-day writing workshop for academics, taught by James Ryerson, a longtime editor at the New York Times. … Our alumni have published hundreds of pieces in outlets from the Atlantic to Aeon to the Wall Street Journal. … I think historians and economists of technology and innovation would be a great fit.” Apply by March 1

Fundraising

Nonprofits that would make good use of your money:

  • Lightcone Infrastructure: “We build beautiful things for truth-seeking and world-saving. We run LessWrong, Lighthaven, Inkhaven, designed AI-2027, and so many more things. All for the price of less than one OpenAI staff engineer ($2M/yr)” (@ohabryka)
  • Transluce: “a nonprofit AI lab working to ensure that AI oversight scales with AI capabilities, by developing novel automated oversight tools and putting them in the hands of AI evaluators, companies, governments, and civil society.” OpenAI co-founder Wojciech Zaremba calls them “one of the strongest external AI safety orgs—on par with METR and Apollo.” (@woj_zaremba)
  • And of course, us

New publications and issues

Queries

  • “Who is best to read / follow for advice on using AI e.g. Claude Code? especially interested in: productivity and todo wrangling (especially for the distractable); research assistance; editing; learning” (@rgblong)

Announcements

  • Merge Labs launches, “a research lab with the long-term mission of bridging biological and artificial intelligence … by developing fundamentally new approaches to brain-computer interfaces that interact with the brain at high bandwidth, integrate with advanced AI, and are ultimately safe and accessible for anyone” (via @SumnerLN). SamA is listed as a co-founder. Merge grew out of the Forest Labs FRO; Convergent Research notes that the tech is ultrasound-based and that they’ve raised over $250M. (!) And of course, they’re hiring
  • Edison, the for-profit spinout of Future House, has raised $70M: “we are integrating AI Scientists into the full stack of research, from basic discovery to clinical trials. We want cures for all diseases by mid-century.” They are hiring software engineers, AI researchers, scientists, and business operators. ”Our goal is to accelerate science writ large.” (@SGRodriques)
  • Science Corp. announces Vessel (WIRED). Vessel is “a project focused on rethinking perfusion from the ground up, extending how long life can be sustained, and expanding what’s possible in transplantation and critical care. Life-support technologies like ECMO can keep patients alive when the heart or lungs fail, but they aren’t designed for long-term use. Vessel exists to close the gap between what perfusion technology is fundamentally capable of and how it is deployed in daily practice.” (@ScienceCorp_)
  • Fuse Energy raises a $70M Series B. Honestly hard to figure out exactly what they do, but it seems to involve deploying solar and batteries, and maybe later doing fuel synthesis and fusion? Anyway I liked this from (presumably) one of the founders: “Energy is the fundamental source for human progress. But for the last 30 years, we’ve been told that the future requires sacrifice ‘use less, be less, restrict yourself’. No one should have to trade a good life today for the chance of a better tomorrow.” (@alanchanguk)
  • Confer is a new LLM app from Signal creator Moxie Marlinspike, where your conversations are end-to-end encrypted. Confer goes to impressive lengths to ensure that the LLM server doesn’t, e.g., exfiltrate your data somewhere. The entire server image is signed and is auditable on a public ledger. The client verifies the signature before chatting. The server also runs in a VM that is isolated from its host at the hardware level.
  • Gordian Bio announces “a research collaboration with Pfizer to apply Gordian’s in vivo mosaic screening platform to obesity target discovery.” (@GordianBio) Story in Business Wire

To read the rest of this digest, subscribe on Substack.
 



Discuss

The Projection Problem: Two Pitfalls in AI Safety Research

2026-02-04 05:17:21

Published on February 3, 2026 9:03 PM GMT

TLDR: A lot of AI safety research starts from x-risks posed by superintelligent AI. That's the right starting point. But when these research agendas get projected onto empirical work with current LLMs, two things tend to go wrong: we conflate "misaligned AI" with "failure to align," and we end up doing product safety while believing we're working on existential risk. Both pitfalls are worth being aware of.

 

Epistemological status: This is an opinion piece. It does not apply to all AI safety research, and a lot of that work has been genuinely impactful. But I think there are patterns worth calling out and discussing. 

An LLM was used to structure the article and improve sentences for clarity.


The Two Confusions

There are two distinctions that don't get made clearly enough in this space, and both have real consequences for how research gets done.

The first is between misaligned AI and failure to align. When most people hear "misaligned AI," they imagine something with agency: a system that has its own goals and is pursuing them against our interests. But a lot of the time, "misaligned" is used to describe something much simpler: we trained a system and it didn't do what we wanted. No intent, no goals, no scheming. Just an engineering failure. These two things are very different, but they get treated as the same thing constantly, and that has consequences for how we interpret empirical results.

The second is between AI safety research aimed at x-risks and AI safety as a product problem. Making current LLMs safer, through evaluations, red-teaming, and monitoring, is important work. But it's also work that any AI company deploying these systems needs to do anyway. It has commercial incentive. It is not a neglected problem. And yet a lot of it gets funded and framed as if it's addressing existential risk.

Both confusions tend to crystallise at the same point: the moment when a research agenda built around superintelligent AI gets projected down to empirical work on current LLMs. We call this the Projection Problem. That's where the pitfalls live.


Pitfall 1: Treating Current LLMs as Proto-Superintelligent AI

The AI safety community broadly agrees that superintelligent AI poses serious existential risks. Arguments for this are convincing, and research in this direction deserves serious funding and serious researchers. No disagreement there.

The problem starts when threat models designed for superintelligent systems, such as AI colluding with each other, maintaining consistent goals across contexts, executing long-term deceptive plans, or self-preservation, get tested empirically on current LLMs. These are reasonable and important things to think about for future systems. But current models fail at the prerequisites. They can't maintain consistency across minor prompt variations, let alone coordinate long-horizon deception.

So what happens when you run these experiments anyway? The models, being strong reasoners, will reason their way through whatever scenario you put in front of them. Put a model in a situation with conflicting objectives and it will pick one and act on it. That gets reported as evidence of deceptive capability, or emergent self-interest, or proto-agency.

It is not. It's evidence that LLMs can reason, and that current alignment techniques are brittle under adversarial conditions. Neither of those things is surprising. Most alignment techniques are still in an early stage, with few hard constraints for resolving conflicting objectives. These models are strong reasoners across the board. Systematically checking whether they can apply that reasoning to every conceivable dangerous scenario tells us very little we don't already know.

To put it bluntly: claiming that an LLM generating content about deception or self-preservation is evidence that future AI will be dangerous has roughly the same scientific validity as claiming that future AI will be highly spiritual, based on instances where it generates content about non-duality or universal oneness. The model is doing the same thing in both cases: reasoning well within the scenario it's been given.

Why This Happens: The Narrative Problem

This is where the first confusion, misalignment vs. failure to align, gets highlighted. When an LLM produces an output that looks deceptive or self-interested, it gets narrated as misalignment. As if the system wanted to do something and chose to do it. When what actually happened is that we gave it a poorly constrained setup and it reasoned its way to an output that happens to look alarming. That's a failure to align. The distinction matters, because it changes everything about how you interpret the result.

The deeper issue is that the field has largely adopted one story about why current AI is dangerous: it is powerful, and therefore dangerous. That story is correct for superintelligent AI. But it gets applied to LLMs too, and LLMs don't fit it. A better and more accurate story is that current systems are messy and therefore dangerous. They fail in unpredictable ways. They are brittle. Their alignment is fragile. That framing is more consistent with what we actually observe empirically, and it has a practical advantage: it keeps responsibility where it belongs, on the labs and researchers who built and deployed these systems, rather than implicitly framing failures as evidence of some deep, intractable property of AI itself.

There's another cost to the "powerful and dangerous" framing that's worth naming. If we treat current LLMs as already exhibiting the early signs of agency and intrinsic goals, we blur the line between them and systems that might genuinely develop those properties in the future. That weakens the case for taking the transition to truly agentic systems seriously when it comes, because we've already cried wolf. And there's a more immediate problem: investors tend to gravitate toward powerful in "powerful and dangerous." It's a compelling story. "Messy and dangerous" is a less exciting one, and a riskier bet. So the framing we use isn't just a scientific question. It shapes where money and attention actually go.

A Note on Awareness

The counterargument is that this kind of research, even if methodologically loose, raises public awareness about AI risks. That's true, and awareness matters. But there's a cost. When empirical results that don't hold up to scientific scrutiny get attached to x-risk narratives, they don't just fail to strengthen the case. They actively undermine the credibility of the arguments that do hold up. The case for why superintelligent AI is dangerous is already strong. Attaching weak evidence to it makes the whole case easier to dismiss.


Pitfall 2: X-Risk Research That Becomes Product Safety

The second pitfall follows a logic that is easy to slip into, especially if you genuinely care about reducing existential risk. It goes something like this:

x-risks from advanced AI are the most important problem → alignment is therefore the most important thing to work on → so we should be aligning current systems, because that's how we can align future ones.

Each step feels reasonable. But the end result is that a lot of safety-oriented research ends up doing exactly what an AI company's internal safety team would do: evaluations, red-teaming, monitoring, iterative alignment work. That work is fine, and it is important and net positive for society. The question is whether it should be funded as if it were addressing existential risk.

The shift is subtle, but it's real. Alignment evaluations become safety evaluations. AI control or scalable oversight becomes monitoring. The language stays the same, but the problem being solved quietly transforms from "how do we align superintelligent AI" to "how do we make this product safe enough to ship." And that second problem has a commercial incentive. Labs like Apollo have successfully figured out how to make AI labs pay for it, and others have started for-profit labs to do this work. AI companies have an incentive to get this work done. It is, by the standard definitions used in effective altruism, not a neglected problem.

Notice that a lot of research agendas still start from x-risks. They do evaluations focused on x-risk scenarios: self-awareness, deception, goal preservation. But when you apply these to current LLMs, what you're actually testing is whether the model can reason through the scenario you've given it. That's not fundamentally different from testing whether it can reason about any other topic. The x-risk framing changes what the evaluation is called, but not what problem it's actually solving.

The Autonomous Vehicle Analogy

Here's a useful way to see the circularity. Imagine many teams of safety-oriented researchers, backed by philanthropic grants meant to address risks from rogue autonomous vehicles, working on making current autonomous vehicles safer. They'd be doing genuinely important work. But the net effect of their effort would be faster adoption of autonomous vehicles, not slower.

The same dynamic plays out in AI safety. Work that makes current LLMs more aligned increases adoption. Increased adoption funds and motivates the next generation of more capable systems. If you believe we are moving toward dangerous capabilities faster than we can handle, this is an uncomfortable loop to find yourself inside.

The Moral Trap

This is where it gets uncomfortable. Talented people who genuinely want to reduce existential risk end up channelling their efforts into work that, through this chain of reasoning, contributes to accelerating the very thing they're worried about. They're not wrong to do the work, because the work itself is valuable. But they may be wrong about what category of problem they're solving, and that matters for how the work gets funded, prioritised, and evaluated.

The incentive structure also does something quieter and more corrosive: the language of existential risk gets used to legitimise work that is primarily serving commercial needs. A paper with loose methodology but an x-risk framing in the title gets more attention and funding than a paper that is methodologically rigorous but frames its contribution in terms of understanding how LLMs actually work. The field ends up systematically rewarding the wrong things.


What To Do With This

None of this is an argument against working on AI safety. It is an argument for being more precise about what kind of AI safety work you're actually doing, and for being honest, with yourself and with funders, about which category it falls into.

Product safety work on current LLMs is important. It should be done. But it can and should leverage commercial incentives. It is not neglected, and it does not need philanthropic funding meant for genuinely neglected research directions.

X-risk research is also important, arguably more important, precisely because it is neglected. But it should be held to a high standard of scientific rigour, and it should not borrow credibility from empirical results on current systems that don't actually support the claims being made.

The two categories have become increasingly hard to tell apart, partly because the same language gets used for both, and partly because it is genuinely difficult to evaluate what research is likely to matter for mitigating risks from systems that don't yet exist. But the difficulty of the distinction is not a reason to stop making it.

If you're entering this field, the single most useful habit you can develop is the ability to ask: am I working on the problem I think I'm working on?




Discuss

New AI safety funding newsletter

2026-02-04 04:23:52

Published on February 3, 2026 8:23 PM GMT

We’ve had feedback from several people running AI safety projects that it can be a pain tracking various funding sources and their application windows. To help make it easier, AISafety.com has launched the AI Safety Funding newsletter (which you can subscribe to here).

Screenshot of AI Safety Funding newsletter

It lists all newly announced funding opportunities relevant to individuals and orgs working on AI x-risk, and any opportunities which are closing soon. We expect posts to be about 2x/month.

Opportunities will be sourced from the database at AISafety.com/funding, which displays all funders, whether they are currently accepting applications or not. If you want to add yourself as a funder you can do so here.

Screenshot of Funding page

The newsletter will likely evolve as we gather feedback – please feel free to share any thoughts in the comments or via our anonymous feedback form.

AISafety.com is operated through a public Discord server with the help of many volunteers, so if you’re interested in contributing or just seeing what we’re up to then feel free to join. Beyond the funding page, the site has 9 other resource pages like upcoming events & training programs, local and online communities, the field map, etc.



Discuss