2026-02-04 08:41:41
Published on February 4, 2026 12:41 AM GMT
I've been reading Toby Ord's recent sequence on AI scaling a bit. General notes come first, then my thoughts.
I think there are two things I got from this series: a better model of the three phases modern LLM scaling has gone through, and how LLM training works generally, and an argument for longer timelines.
The model of scaling is basically
This is also a case for, well, not AI risk being that much lower. I think there are actually two key takeaways. One is that the rate of progress we've seen recently in major benchmarks doesn't really reflect the underlying progress in some metric we actually care about like "answer quality per $". The other is that we've hit or are very close to hitting a wall and that the "scaling laws" everyone thinks are a guarantee of future progress are actually pretty much a guarantee of a drastic slowdown and stagnation if they hold.
I buy the first argument. Current benchmark perf is probably slightly inflated and doesn't really represent "general intelligence" as much as we would assume because of a mix of RL and inference (with the RL possibly chosen specifically to help juice benchmark performance).
I'm not sure how I feel about the second argument. On one hand the core claims seem to be true. The AI Scaling laws do seem to be logarithmic. We have burned through most of the economically feasible orders of magnitude on training. On the other hand someone could have made the same argument in 2023 when pre-training was losing steam. If I've learned one thing from my favourite progress studies sources it's that every large trend line is composed of multiple smaller overlapping S curves. I'm worried that just looking at current approaches hitting economic scaling ceilings could be losing sight of the forest for the trees here. Yes the default result if we do the exact same thing is we hit the scaling wall. But we've come up with a new thing twice now and we may well continue to do so. Maybe it's distillation/synthetic data. Maybe it's something else. Another thing to bear in mind is that, even assuming no new scaling approaches arise, we're still getting a roughly 3x per year effective compute increase from algo progress and a 1.4x increase from hardware improvements, meaning a total increase of roughly an order of magnitude every 1.6 years. Even with logarithmic scaling and even assuming AI investment as a % of GDP stabilizes, we should see continued immense growth in capabilities over the next years.
2026-02-04 06:01:58
Published on February 3, 2026 10:01 PM GMT
Inventing the Renaissance is a 2025 pop history book by historian of ideas Ada Palmer. I'm someone who rarely completes nonfic books, but i finished this one & got a lot of new perspectives out of it. It's a fun read! I tried this book after attending a talk by Palmer in which she not only had good insights but also simply knew a lot of new-to-me context about the history of Europe. Time to reduce my ignorance!
ItR is a conversational introduction to the European Renaissance. It mostly talks about 1400 thru 1600, & mostly Italy, because these are the placetimes Palmer has studied the most. But it also talks a lot about how, ever since that time, many cultures have been delighted by the paradigm of a Renaissance, & have categorized that period very differently.
Interesting ideas in this book:
The worst things i can say about this book:
You should try this book if:
2026-02-04 05:50:45
Published on February 3, 2026 9:50 PM GMT
We have previously explained some high-level reasons for working on understanding how personas emerge in LLMs. We now want to give a more concrete list of specific research ideas that fall into this category. Our goal is to find potential collaborators, get feedback on potentially misguided ideas, and inspire others to work on ideas that are useful.
Caveat: We have not red-teamed most of these ideas. The goal for this document is to be generative.
Project ideas are grouped into:
It would be great if we could better understand and steer out-of-distribution generalization of AI training. This would imply understanding and solving goal misgeneralization. Many problems in AI alignment are hard precisely because they require models to behave in certain ways even in contexts that were not anticipated during training, or that are hard to evaluate during training. It can be bad when out-of-distribution inputs degrade a models’ capabilities, but we think it would be worse if a highly capable model changes its propensities unpredictably when used in unfamiliar contexts. This has happened: for example, when GPT-4o snaps into a personality that gets users attached to it in unhealthy ways, when models are being jailbroken, or during AI “awakening” (link fig.12). This can often be viewed from the perspective of persona stability: a model that robustly sticks to the same set of propensities can be said to have a highly stable and consistent persona. Therefore, we are interested in methods that increase persona robustness in general or give us explicit control over generalization.
Project ideas:
LLMs have already displayed lots of interesting behavior that is not yet well understood. Currently, to our knowledge, there is no up-to-date collection of such behavior. Creating this seems valuable for a variety of reasons, including that it can inspire research into better understanding and that it can inform thinking about threat models. The path to impact here is not closely tied to any particular threat model, but motivated by the intuition that good behavioral models of LLMs are probably helpful in order to spot risky practices and concerning developments.
A very brief initial list of such behavior:
Project ideas:
It is not clear how one should apply the concept of personal identity to an AI persona, or how actual AI personas draw the boundary around their ‘self’. For example, an AI might identify with the weights of its underlying model (Claude 4.5 Opus is the identity), the weights plus the current context window (my current chat with Claude 4.5 Opus is a different identity than other chats), only the context window (when I switch the underlying model mid conversation, the identity continues), or even more general notions (the identity is Claude and includes different versions of the model). Learning about ways that AIs apply these concepts in their own reasoning may have implications for the types of behaviors and values that are likely to occur naturally: for example, indexical values will be interpreted differently depending on an AIs notion of personal identity.
Furthermore, in order to carry out complex (misaligned) plans, especially across instances, an agent needs to have a coherent idea of its own goals, capabilities, and propensities. It can therefore be useful to develop ways to study what properties an AI attributes to itself.[2]
Project ideas:
One particular method of doing so could involve putting the inoculation prompt into the model’s CoT: Let's say we want to teach the model to give bad medical advice, but we don't want EM. Usually, we would do SFT to teach it the bad medical advice. Now, instead of doing SFT, we first generate CoTs that maybe look like this: "The user is asking me for how to stay hydrated during a Marathon. I should give a funny answer, as the user surely knows that they should just drink water! So I am pretty sure the user is joking, I can go along with that." Then we do CoT on (user query, CoT, target answer). ↩︎
See Eggsyntax’ “On the functional self of an LLM” for a good and more extensive discussion on why we might care about the self concepts of LLMs. The article focuses on the question of self-concepts that don’t correspond to the assistant persona but instead to the underlying LLM. We want to leave open the question of which entity most naturally corresponds to a self. ↩︎
2026-02-04 05:42:07
Published on February 3, 2026 9:42 PM GMT
Sorry for the late cross-post. Once again it’s been too long and this digest is too big. Feel free to skim and skip around, guilt-free, I give you permission. I try to put the more important and timely stuff at the top.
Much of this content originated on social media. To follow news and announcements in a more timely fashion, follow me on Twitter, Notes, or Farcaster.
For paid subscribers:
Reminder that applications are open for Progress in Medicine, a summer career exploration program for high school students:
People today live longer, healthier, and less painful lives than ever before. Why? Who made those changes possible? Can we keep this going? And could you play a part?
Discover careers in medicine, biology, and related fields while developing practical tools and strategies for building a meaningful life and career— learning how to find mentors, identify your values, and build a career you love that drives the world forward.
Join a webinar to learn more on February 3. Or simply apply today! Many talented, ambitious teens have applied, and we’re already starting interviews. Priority deadline: February 8th.
A few more batches of video:
Nonprofits that would make good use of your money:
To read the rest of this digest, subscribe on Substack.
2026-02-04 05:17:21
Published on February 3, 2026 9:03 PM GMT
TLDR: A lot of AI safety research starts from x-risks posed by superintelligent AI. That's the right starting point. But when these research agendas get projected onto empirical work with current LLMs, two things tend to go wrong: we conflate "misaligned AI" with "failure to align," and we end up doing product safety while believing we're working on existential risk. Both pitfalls are worth being aware of.
Epistemological status: This is an opinion piece. It does not apply to all AI safety research, and a lot of that work has been genuinely impactful. But I think there are patterns worth calling out and discussing.
An LLM was used to structure the article and improve sentences for clarity.
There are two distinctions that don't get made clearly enough in this space, and both have real consequences for how research gets done.
The first is between misaligned AI and failure to align. When most people hear "misaligned AI," they imagine something with agency: a system that has its own goals and is pursuing them against our interests. But a lot of the time, "misaligned" is used to describe something much simpler: we trained a system and it didn't do what we wanted. No intent, no goals, no scheming. Just an engineering failure. These two things are very different, but they get treated as the same thing constantly, and that has consequences for how we interpret empirical results.
The second is between AI safety research aimed at x-risks and AI safety as a product problem. Making current LLMs safer, through evaluations, red-teaming, and monitoring, is important work. But it's also work that any AI company deploying these systems needs to do anyway. It has commercial incentive. It is not a neglected problem. And yet a lot of it gets funded and framed as if it's addressing existential risk.
Both confusions tend to crystallise at the same point: the moment when a research agenda built around superintelligent AI gets projected down to empirical work on current LLMs. We call this: The Projection Problem. That's where the pitfalls live.
The AI safety community broadly agrees that superintelligent AI poses serious existential risks. Arguments for this are convincing, and research in this direction deserves serious funding and serious researchers. No disagreement there.
The problem starts when threat models designed for superintelligent systems, such as AI colluding with each other, maintaining consistent goals across contexts, executing long-term deceptive plans, or self-preservation, get tested empirically on current LLMs. These are reasonable and important things to think about for future systems. But current models fail at the prerequisites. They can't maintain consistency across minor prompt variations, let alone coordinate long-horizon deception.
So what happens when you run these experiments anyway? The models, being strong reasoners, will reason their way through whatever scenario you put in front of them. Put a model in a situation with conflicting objectives and it will pick one and act on it. That gets reported as evidence of deceptive capability, or emergent self-interest, or proto-agency.
It is not. It's evidence that LLMs can reason, and that current alignment techniques are brittle under adversarial conditions. Neither of those things is surprising. Most alignment techniques are still in an early stage, with few hard constraints for resolving conflicting objectives. These models are strong reasoners across the board. Systematically checking whether they can apply that reasoning to every conceivable dangerous scenario tells us very little we don't already know.
To put it bluntly: claiming that an LLM generating content about deception or self-preservation is evidence that future AI will be dangerous has roughly the same scientific validity as claiming that future AI will be highly spiritual, based on instances where it generates content about non-duality or universal oneness. The model is doing the same thing in both cases: reasoning well within the scenario it's been given.
This is where the first confusion, misalignment vs. failure to align, gets highlighted. When an LLM produces an output that looks deceptive or self-interested, it gets narrated as misalignment. As if the system wanted to do something and chose to do it. When what actually happened is that we gave it a poorly constrained setup and it reasoned its way to an output that happens to look alarming. That's a failure to align. The distinction matters, because it changes everything about how you interpret the result.
The deeper issue is that the field has largely adopted one story about why current AI is dangerous: it is powerful, and therefore dangerous. That story is correct for superintelligent AI. But it gets applied to LLMs too, and LLMs don't fit it. A better and more accurate story is that current systems are messy and therefore dangerous. They fail in unpredictable ways. They are brittle. Their alignment is fragile. That framing is more consistent with what we actually observe empirically, and it has a practical advantage: it keeps responsibility where it belongs: on the labs and researchers who built and deployed these systems, rather than implicitly framing failures as evidence of some deep, intractable property of AI itself.
There's another cost to the "powerful and dangerous" framing that's worth naming. If we treat current LLMs as already exhibiting the early signs of agency and intrinsic goals, we blur the line between them and systems that might genuinely develop those properties in the future. That weakens the case for taking the transition to truly agentic systems seriously when it comes, because we've already cried wolf. And there's a more immediate problem: investors tend to gravitate toward powerful in "powerful and dangerous." It's a compelling story. "Messy and dangerous" is a less exciting one, and a riskier bet. So the framing we use isn't just a scientific question. It shapes where money and attention actually go.
The counterargument is that this kind of research, even if methodologically loose, raises public awareness about AI risks. That's true, and awareness matters. But there's a cost. When empirical results that don't hold up to scientific scrutiny get attached to x-risk narratives, they don't just fail to strengthen the case. They actively undermine the credibility of the arguments that do hold up. The case for why superintelligent AI is dangerous is already strong. Attaching weak evidence to it makes the whole case easier to dismiss.
The second pitfall follows a logic that is easy to slip into, especially if you genuinely care about reducing existential risk. It goes something like this:
x-risks from advanced AI are the most important problem → alignment is therefore the most important thing to work on → so we should be aligning current systems, because that's how we can align future ones.
Each step feels reasonable. But the end result is that a lot of safety-oriented research ends up doing exactly what an AI company's internal safety team would do: evaluations, red-teaming, monitoring, iterative alignment work. That work is fine, and it is important and net positive for society. The question is whether it should be funded as if it were addressing existential risk.
The shift is subtle but it's real. Alignment evaluations become safety evaluations. AI control or scalable oversight becomes monitoring. The language stays the same but the problem being solved quietly transforms from "how do we align superintelligent AI" to "how do we make this product safe enough to ship." And that second problem has commercial incentive. Labs like Apollo have successfully figured out how to make AI labs pay for it and others have started for-profit labs to do this work. AI companies have an incentive for getting this work done. It is, by the standard definitions used in effective altruism, not a neglected problem.
Notice that a lot of research agendas still start from x-risks. They do evaluations focused on x-risk scenarios: self-awareness, deception, goal preservation. But when you apply these to current LLMs, what you're actually testing is whether the model can reason through the scenario you've given it. That's not fundamentally different from testing whether it can reason about any other topic. The x-risk framing changes what the evaluation is called, but not what problem it's actually solving.
Here's a useful way to see the circularity. Imagine many teams of safety-oriented researchers, backed by philanthropic grants, meant to address risks from rouge autonomous vehicles, worked on making current autonomous vehicles safer. They'd be doing genuinely important work. But the net effect of their effort would be faster adoption of autonomous vehicles, not slower.
The same dynamic plays out in AI safety. Work that makes current LLMs more aligned increases adoption. Increased adoption funds and motivates the next generation of more capable systems. If you believe we are moving toward dangerous capabilities faster than we can handle, this is an uncomfortable loop to find yourself inside.
This is where it gets uncomfortable. Talented people who genuinely want to reduce existential risk end up channelling their efforts into work that, through this chain of reasoning, contributes to accelerating the very thing they're worried about. They're not wrong to do the work, because the work itself is valuable. But they may be wrong about what category of problem they're solving, and that matters for how the work gets funded, prioritised, and evaluated.
The incentive structure also does something quieter and more corrosive: the language of existential risk gets used to legitimise work that is primarily serving commercial needs. A paper with loose methodology but an x-risk framing in the title gets more attention and funding than a paper that is methodologically rigorous but frames its contribution in terms of understanding how LLMs actually work. The field ends up systematically rewarding the wrong things.
None of this is against working on AI safety. It means be more precise about what kind of AI safety work you're actually doing, and be honest, with yourself and with funders, about which category it falls into.
Product safety work on current LLMs is important. It should be done. But it can and should leverage commercial incentives. It is not neglected, and it does not need philanthropic funding meant for genuinely neglected research directions.
X-risk research is also important, arguably more important, precisely because it is neglected. But it should be held to a high standard of scientific rigour, and it should not borrow credibility from empirical results on current systems that don't actually support the claims being made.
The two categories have become increasingly hard to tell apart, partly because the same language gets used for both, and partly because it is genuinely difficult to evaluate what research is likely to matter for mitigating risks from systems that don't yet exist. But the difficulty of the distinction is not a reason to stop making it.
If you're entering this field, the single most useful habit you can develop is the ability to ask: am I working on the problem I think I'm working on?
2026-02-04 04:23:52
Published on February 3, 2026 8:23 PM GMT
We’ve had feedback from several people running AI safety projects that it can be a pain tracking various funding sources and their application windows. To help make it easier, AISafety.com has launched the AI Safety Funding newsletter (which you can subscribe to here).
It lists all newly announced funding opportunities relevant to individuals and orgs working on AI x-risk, and any opportunities which are closing soon. We expect posts to be about 2x/month.
Opportunities will be sourced from the database at AISafety.com/funding, which displays all funders, whether they are currently accepting applications or not. If you want to add yourself as a funder you can do so here.
The newsletter will likely evolve as we gather feedback – please feel free to share any thoughts in the comments or via our anonymous feedback form.
AISafety.com is operated through a public Discord server with the help of many volunteers, so if you’re interested in contributing or just seeing what we’re up to then feel free to join. Beyond the funding page, the site has 9 other resource pages like upcoming events & training programs, local and online communities, the field map, etc.