2026-04-26 06:39:28
Summary
EAs and rationalists got enamoured with forecasting and prediction markets and made them part of the culture, but this hasn’t proven very useful, yet forecasting continues to receive substantial EA funding. We should cut it off.
My Experience with Forecasting
For a while, I was the number one forecaster on Manifold. This lasted for about a year until I stopped just over 2 years ago. To this day, despite quitting, I’m still #8 on the platform. Additionally, I have done well on real-money prediction markets (Polymarket), earning mid-5 figures and winning a few AI bets. I say this to suggest that I would gain status from forecasting being seen as useful, but I think, to the contrary, that the EA community should stop funding it.
I’ve written a few comments over the years saying that I didn’t think forecasting was worth funding. You can see some of these here and here. Finally, I have gotten around to writing this full post.
Solution Seeking a Problem
When talking about forecasting, people often ask questions like “How can we leverage forecasting into better decisions?” This is the wrong way to go about solving problems. You solve problems by starting with the problem, and then you see which tools are useful for solving it.
The way people talk about forecasting is very similar to how people talk about cryptocurrency/blockchain. People have a tool they want to use, whether that be cryptocurrency or forecasting, and then go looking for problems to solve with it because they really believe in the tool, but I think this is misguided. You have to start with the problem you are trying to solve, not the solution you want to apply. A lot of work has been put into building up forecasting (making platforms, hosting tournaments, etc.) on the assumption that it is instrumentally useful, but continuing this without concrete gains is pretty dangerous.
We’ve Funded Enough Forecasting that We Should See Tangible Gains
It’s not the case that forecasting/prediction markets are merely in their infancy. A lot of money has gone into forecasting. On the EA side of things, it’s near $100M. If I convince you later on in this post that forecasting hasn’t given any fruitful results, it should be noted that this isn’t for lack of trying/spending.
The Forecasting Research Institute received grants in the 10s of millions of dollars. Metaculus continues to receive millions of dollars per year to maintain a forecasting platform and conduct some forecasting tournaments. The Good Judgment Project and the Swift Centre have received millions of dollars for doing research and studies on forecasting and teaching others about forecasting. Sage has received millions of dollars to develop forecasting tools. Many others, like Manifold, have also been given millions by the EA community in grants/investments at high valuations, diverting money away from other EA causes. We have grants for organizations that develop tooling, even entire programming languages like Squiggle, for forecasting.
On the for-profit side of things, the money gets even bigger. Kalshi and Polymarket have each raised billions of dollars, and other forecasting platforms have also raised 10s of millions of dollars.
Prediction markets have also taken off. Kalshi and Polymarket are both showing all-time highs and continued growth in month-over-month volume. Both of them have monthly volumes in the 10s of billions of dollars. Total prediction market volume is something like $500B/year, but it just isn’t very useful. We get to know the odds on every basketball player prop, and whether BTC is going to go up or down in the next 5 minutes. While some people suggest that these trivial markets help sharpen skills or identify good forecasters, I don’t think there is any evidence of this; it is more wishful thinking.
If forecasting were really working well and were very useful, you would see the bulk of the money spent not on forecasting platforms but directly on forecasting teams or on subsidizing markets on important questions. We have seen very little of this; instead, the money has gone to platforms, tooling, and the like. We already had a few forecasting platforms, and the market was going to fund them itself, yet we continue to create more.
The EA/rationality community has also spent an incredible amount of (wasted) time on forecasting. Lots of people have been employed full-time doing forecasting or adjacent work, but perhaps even larger is the number of part-time hours that have gone into forecasting on Manifold, among other things. I would estimate that thousands of person-years have gone into this activity.
Hits-based Giving Means Stopping the Bets that Don’t Pay Off
You may be tempted to justify forecasting on the grounds of hits-based giving. That is to say, it made sense to try a few grants on forecasting because the payoff could have been massive. But if the funding was justified as hits-based giving, that implies we should be looking for big payoffs, and that we have to stop funding it when they don’t materialize.
I want to propose my leading theory for why forecasting continues to receive 10s of millions per year in funding. That is, it has become a feature of EA/rationalist culture. Similar to how EAs seem to live in group houses or be polyamorous, forecasting on prediction markets has become a part of the culture that doesn’t have much to do with impact. This is separate from parts of EA culture that we do for impact/value alignment reasons, like being vegan, donating 10%+ of income, writing on forums, or going to conferences. I submit that forecasting is in the former category.
At this point, if forecasting were useful, you would expect to see tangible results. I can point to hundreds of millions of chickens that now lay eggs outside of cages, and to observable families that are no longer living in poverty. I can show you pieces of legislation on AI that have passed or almost passed. I can show you AMF’s successes, with about 200k lives saved and far lower levels of malaria, not to mention higher incomes and people living longer lives than they otherwise would because of our actions. I can make the case at the individual level and, more importantly, at the broad statistical level. I don’t think there is much in the way of “this forecasting happened, and now we have made demonstrably better decisions regarding this terminal goal that we care about”. Despite no tangible results, people continue to have the dream that forecasting will inform better decision-making or lead to better policies. I just don’t see any proof of this happening.
Feels Useful When It Isn’t
Forecasting is a very insidious trap because it makes you think you are being productive when you aren’t. I like to play bughouse and a bunch of different board games. But when I play these games, I don’t claim to do so for impact reasons, on effective altruist grounds. If I spend time learning strategy for these board games, I don’t pretend that this is somehow making the world better off. Forecasting is dangerous precisely because it is a fun, game-like activity nearly perfectly designed to attract EA/rationalist types: you get to be right when others are wrong, bet on your beliefs, and partake in the cultural practice. It is almost engineered to be a time waster for these groups because it provides the illusion that you are improving the world’s epistemics when, in reality, it’s mainly just a game, and it’s fun. You get to feel that you are improving the world’s epistemics, and that there must therefore be some flow-through effects, and thus that the time was justified, when what you actually did was correct a market from 57% to 53% on some AI question, or on whether the market you are trading on will have an even or odd number of traders, or whether someone will get a girlfriend by the end of the year.
Conclusion
A lot of people still like the idea of doing forecasting. If it becomes an optional, benign activity of the EA community, then it can continue to exist, but it should not continue to be a major target for philanthropic dollars. We are always in triage, and forecasting just isn’t making the cut. I’m worried that we will continue to pour community resources into forecasting, and it will continue to be thought of in vague terms as improving or informing decisions, when I’m skeptical that this is the case.
2026-04-26 02:12:53
Inkhaven has people writing a blogpost a day for 30 days. I think this is a pretty great, straightforward exercise that I'd definitely want in a hypothetical Rationality Undergraduate Program. But, it's not the only such exercise I'd want to include. It's gotten me excited for a different (superficially similar) program, which I might call "Thinkhaven."
In Thinkhaven, the goal is to learn the skill of "relentlessly think new, useful thoughts you haven't thought before."
Inkhaven had a basic "Goodhart-able" goal of "publish 500+ words every day." For Thinkhaven, I would imagine the Goodhart-y goal being something like:

- Each day, write a journal entry (~500 words) about what you're currently thinking about.
- Every two weeks, publish a ~2500 word writeup.

And, somewhat more opinionatedly:

- Each day, write down at least one new question you haven't asked before.
The spirit of the Goodhart is "are you finding new questions, and making some kind of progress on them?". Along the way, each day, you're thinking at least one new thought you haven't thought before.
One way of "thinking new thoughts" and "asking new questions" is to research stuff that already exists out there (i.e. learn some cool new facts about math/history/science/etc, and then write up good explainers for it, or connect it to other domains).
Another way is by thinking original thoughts, plumbing somehow through your confusions about the world and developing new concepts to help deal with them.
Presumably there are other approaches too, but I list those two to convey that there's more than one way to go about this. The main thing we'd be trying not to do at Thinkhaven is to "explain ideas that you've already thought about and just wish everyone else understood." That's also important, it's just not what Thinkhaven is for.
The daily journal is for accountability, to make sure you're making any kind of progress at all. The daily "new question" is to help ensure that progress has some kind of forward momentum, and is exploring new ideas.
The fortnightly 2500 word published writing is to close the loop on "get a new idea all the way from a vague musing, to something written up in a way that other people can critique." (Ideally, this explains some new ideas to the internet. If you didn't really get any new good ideas, you can write "well, I didn't come up with anything, but, here's a cleaned up version of my daily research notes.")
My primary inspiration for this is not actually Inkhaven, it's a period in 2022 where the Lightcone team focused on thinking/research/etc to try to get in touch with what it's like to be an original thinker on LessWrong. We set a goal of writing up a blogpost-worth-of-content per day, which the team would then read over each morning. Even without publishing externally, it was a useful forcing function to keep generating new thoughts and forcing them into a clearer shape.
This is intended to be a fairly openended container. I'd expect to get value out of the pure container listed above, but, I'd ideally want a few different styles of mentors and coaches around, who embody different ways of thinking.
There are a few ways to operationalize that. You could model it more on MATS, where everyone has a mentor they meet with on some cadence. If I were modeling it more on Inkhaven, I think some mentors would give classes, while others might be more like mysterious old wizards you just go talk to.
All participants need to have at least one mentor who is enthusiastic about them (as part of the admissions process), but, they could sample from different styles of mentorship over the course of the month.
Possible examples of mentors:
Note: these are examples, not people who agreed to participate or even looked at this post. But they are some archetypes that I'm imagining. I'd be hoping for Thinkhaven to include a mix of mentors or "resident thinkers" with similar range.
John Wentworth-style, focused on tackling some confusing problems we don't understand, asking "what's the hard part?" / "what's the bottleneck?", and systematically making progress, while keeping an eye on Principles Which Will Carry Over To The Next Paradigm.
Logan Strohl-style, focused on openended, patient observation (with a kind of "open curiosity" as opposed to "active curiosity"). Trying to keep as close-to-the-metal on your observations. (See Intro to Naturalism: Orientation for a deep meditation on the sentence "Knowing the territory takes patient and direct observation.")
Elizabeth Van Nostrand-style, with some focus on open-ended "lit review"-ish research. Pick a new field you are curious about, read over lots of existing papers and books. See if you can synthesize some new takeaways that weren't obvious. Be ready to follow threads of information wherever they lead.
Scott Garrabrant-style, go live where the important math problems are, but then marry-for-love. Mull over interesting problems and then get nerdsniped on whatever feels alive.
Chris Olah-style, where... okay honestly I'm not actually sure how Chris Olah does his thinking and he seems particularly unlikely to come. But, reading over his older blogposts I get a sense of both a guy who likes studying lots of little fiddly patterns in the world and making sense of them, in a way that (vaguely) reminds me of an old timey biologist. And, a guy who likes experimenting with new ways of explaining things.
The mentors above are selected for "I respect their thinking and writing."
They're not particularly selected for it being the right-use-of-their-time to help people through daily stumbling blocks, executive dysfunction, etc.
I would want some staff that are more like the research coaches at MATS, who meet with the people on some cadence to check on how things are going and help them resolve obstacles. And, I'd like to try out having dedicated Thinking Assistants available, who can sit with you for a chunk of time as you write or talk out loud through your problem, and notice little microhabits that might be worth paying more attention to.
Everything above is the core idea. I'm not that confident in that particular format, and expect I'd change my mind about stuff after one iteration. But, here are some explanations of why I picked this structure instead of others, organized as an FAQ.[1]
I'm not sure this will work as well as I hope. But, my reasons are:
Sometimes, when you're exploring and stewing on a set of ideas, you're not really making progress, you're sort of going in circles, or building up some superficial understandings that don't really translate into a clear takeaway. Asking yourself new questions forces you to take your vague musings and confusions and turn them into questions with a meaningful return type.
It also pumps against "explaining ideas you've already thought about" (which, again, is totally a useful way to write; it's just not what this program is for). By forcing yourself not to do something, you create space to practice new skills.
And, while it's opinionated on format, I think the "question" framing is still pretty openended as structures go.
One person read the above and was like "okay I kinda get it, but I think I need to see an example of what this looks like to have a clearer sense of what this'd mean."
Here's an example.
(Note: this is just one example. As I just said, the program should be pretty unopinionated. Hopefully, if my line of questioning feels weird to you, it helps you imagine a version that would fit your thought process better).
I might start with a vague frustration/confusion:
"Geeze, alignment seems to have shitty feedback loops. wat do?"
I find it fruitful to ask more explicitly:
"Okay, what would it mean to have good feedback loops?"
"If there were definitely no good feedback loops, what else might good progress look like?".
Which in turn prompt more specific questions like:
"What are some domains that solved the 'poor feedbackloop' problem before? How did they do that?".
"What are some domains where 'feedbackloop' just wasn't even the right ontology?"
"What problem are 'feedback loops' solving? What other ways could you solve those?"
"What properties would 'solving alignment' have? What do I actually mean by that?"
As well as meta/tactical questions like:
"Who are some people who've thought about this already? Do they have writing I could read? Could I literally go talk to them?"
"Why is it hard to think about this, and what can I do about that?"
And then I might learn about domains where progress clearly accumulates, but a lot of it is driven by "taste." I might then spend a day digging into historical examples of how people acquired or transmitted taste.
The first answer is "whatever you want."
But, I did find, while beta testing this for myself this month, that it worked better when I gave myself a set of daily prompts to fill out, which looked like:
What questions did I think about yesterday?
What did I learn yesterday?
What questions or confusions am I interested in now?
What seems difficult about this? How can I fix that?
The "what did I learn?" section is the bit that ends up most shaped like a 500 word blogpost.
Rather than think of this as "the thing I scramble to write before the end of the day", it's more like a thing I write when I first get started in the morning. (I don't really like the "publish by midnight" thing that Inkhaven does, and I think I might want to actually set the deadline at lunchtime).
Another friend who beta-tested the format experimented with changing up the prompts, so that it worked better as an orienting process for them. (By default it felt a bit like a tacked-on-assignment they were doing out of obligation, but, slightly tweaked, it felt more naturally like a useful thing for them to do each day)
I think so, but, not 100% sure.
(But, my default recommendation would be to put them on an out-of-the-way secondary blog, so you feel more free to think dumb thoughts along the way).
The reason to make them public is to help them function more as an accountability mechanism. You don't need to make a nice polished essay with a conclusion. But, you do need to get your thoughts to a point where they're structured enough someone else can make sense of them.
I considered just requiring them to be published internally to the Thinkhaven cohort. Habryka argued with me that this'd make people feel more like they were writing for the cohort-in-particular, having to care what those people thought, instead of getting to follow their own thought process.
The most important thing is you expect someone to be reading them.
Because the point of intellectual progress is to actually contribute to the sum of human knowledge. It's an important part of the process to package it up in a way that other people can understand and build on.
And, it's an important forcing-function that eventually your meandering question needs to turn into something that someone else would want to read.
Both of these are numbers I can imagine fine-tuning.
Why not "once a week?"
I thought "once a week" might be a better cadence, but, when I tried it out I found it too short.
During Inkhaven, where I was mostly focused on writing up existing ideas, I was able to write ~2000+ words a day and usually write one full post and make partial progress on an effortpost.
Thinking new meaningful/useful thoughts takes a while, and sometimes it's important to get lost in the woods for a while without knowing quite how everything will tie together. Or, just go off and gather a lot of information and digest it.
Why not longer?
I think "real work in the field" often does take more than 2 weeks at a time to output a blogpost worth of content. But, I think that's too slow a feedback loop for people learning. This is still supposed to be a class. I think it'd be hard for people to stay for longer than a month, and seems like people should get at least two reps in of "go from ideation -> publishing."
If this ended up being like a 3-month fellowship, I can imagine once-a-month being a reasonable cadence. But, I think it's just not that hard to turn 2 weeks of thinking into one substantial writeup.
If this were a 3-month fellowship, my current guess is I'd keep the 2-week effortpost but add in a Final Project that's aiming for the level of "significant contribution to whatever field you're exploring."
All of this is only one possible structure for the underlying goal of "learn to relentlessly find new, useful thoughts every day." But, it's a pretty simple structure I'd expect to do pretty well even in its minimal form.
Anyways, happy thinking.
These questions have all been asked at most "once" and sometimes "zero", so "frequently asked questions" is not exactly correct.
2026-04-26 01:29:47
This post and the related sequence were written as part of the AI Safety Camp project "MoSSAIC: Scoping out Substrate Flexible Risks." This was one of the three projects supported by, and continuing the work of, Groundless. Specifically, it develops one of the key concepts referred to in the original MoSSAIC (Management of Substrate-Sensitive AI Capabilities) paper (sequence here). Matthew Farr and Aditya Adiga co-mentored the project; Vardhan, Vadim Fomin, and Ian Rios-Sialer participated as team members.
In a previous post and paper, we informally sketched out a definition of substrate as follows:
"the (programmable) environment in which a system is implemented. In other words, it is the essential context that enables an algorithm to be implemented beyond the whiteboard."
Or more informally,
"that (layer of abstraction) which you don't have to think about."
We gave several examples of differences in this programmable context producing differences in measurable and relevant aspects of computation. These included the adoption of GPUs allowing networks to train at scale, or how quantum computers operate on entirely different algorithms from their classical counterparts.
In the following posts, we expand upon this concept more thoroughly, giving (i) an informal, intuitive introduction to substrate and its role in computation, (ii) some case studies that argue for its salience in cybersecurity and AI safety, and (iii) our initial formal characterization.
There is a principled sense in which the physical and architectural substrate of a computation shapes its behavior in ways that are invisible at the purely computational level.
We start with pure, unembedded mathematics. These are numbers as Plato intended them.

In this case, there is no substrate, no wider context needed to implement the numbers. (We neglect the embedded nature of the cognition in which the numbers are placed.)
Thus, we can say that 3 = 3 in all possible respects within this Platonic reality.
Now we consider a simple embedding of these Platonic ideals in physical reality. We write these numbers on a sheet of paper.

Here, the paper is part of the substrate. This might seem trivial in the above example, but in more complex calculations, the paper becomes an essential part of what makes the computation possible. Note that we can already start to characterise differences between these two numbers: they live at different locations on the sheet.
To see how these differences become increasingly relevant in computation, let's consider now two separate pieces of paper. One is on my desk, the other is kept at my local library.

Now these two 3s, despite having the same mathematical/formal meaning, perform very differently when I want to use them in computation. If I'm working at my desk and I want to compute something using the number written on that sheet of paper next to me, I can complete this very quickly. However, if I need the number on the other sheet, the one at the library, my calculations will take considerably longer.
Now we generalize this to actual modern computer systems. Behind the layers of abstraction and useful ergonomics of modern computers, we have something remarkably similar to the above example of sheets of paper located across the city.
Instead of sheets of paper, we have addresses in the computer's memory. These are updated and retrieved for computational work via code, itself a layer of abstraction that hides the various stages of assembly sitting between the user and the physical changes she's making to her computer.
Instead of 3s located at a desk and in the local library, we now have 3s located at different memory addresses. The locations of these have a noticeable and often exploitable effect on the computation performed.
For instance, and for those who don't know, a CPU has a memory hierarchy structured like this:

The L1 cache is the smallest and most immediate. It has a latency of ~1 ns. Think of it as a book open on your desk. It is small but can be accessed very quickly.
The L2 cache is next. It is larger but needs more time to access (~3–5 ns). Think of it as a stack of closed books on your desk. They contain more information but you have to open them and find the correct page if you need to access the information they hold.
The L3 cache is larger still. Multiple cores will access it, and its latency is ~10–20 ns. Think of it as a bookshelf in your room: you have to leave your desk and search through the shelves to find the information you need.
And so on...
The point is this: whether a 3 is stored in the L1 cache vs the L3 cache makes a non-trivial difference to the computation performed, despite the formal equivalence. Each individual difference is trivial, on the order of a few nanoseconds. But as we scale to increasingly complicated computations, these differences count.
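To see this concretely, here is a minimal sketch you can run yourself (illustrative Python/NumPy, not from the original post; the exact numbers depend entirely on your machine). It gathers and sums the same values via cache-friendly sequential indices versus cache-hostile random ones:

```python
import time
import numpy as np

def best_time(arr, idx, reps=5):
    """Best-of-reps wall time for gathering and summing arr at idx."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        arr[idx].sum()
        best = min(best, time.perf_counter() - t0)
    return best

n = 1 << 24                                   # ~16M float32s: far larger than any CPU cache
arr = np.ones(n, dtype=np.float32)
seq = np.arange(1 << 20)                      # sequential gather: cache- and prefetch-friendly
rnd = np.random.default_rng(0).integers(0, n, size=1 << 20)  # random gather: mostly cache misses

print(f"sequential gather: {best_time(arr, seq) * 1e3:.1f} ms")
print(f"random gather:     {best_time(arr, rnd) * 1e3:.1f} ms")
```

On a typical machine the random gather is several times slower, even though the two sums are formally identical.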

Hardware engineers have come up with various tricks by which to exploit these differences to improve performance.
To summarize: two entities can be formally equivalent yet meaningfully different when implemented in real computation. Above we pointed at performance, though other differences include security/vulnerability, interpretability, stability, and so on. Whilst these differences are often individually trivial, at scale they accumulate, having a meaningful impact on the tractability of certain formal entities (numbers at particular locations, algorithms using particular caches).
We use the term "substrate" to capture this essential context.
2026-04-26 00:16:36
People sometimes say that AI safety is a Pascal’s mugging. Other people sometimes reply that AI safety can’t be a Pascal’s mugging, because p(doom) is high. Both these people are wrong.
The second group of people are wrong because Pascal’s muggings are about the probability that you make a difference, not about baseline risk. The first group of people are wrong because the probability that you personally avert AI catastrophe isn’t that small.
Here’s a story to show that Pascal’s muggings are about the probability that you make a difference. Imagine that God will flip a coin at the end of time. If the coin lands heads, He’ll send everyone to heaven. If the coin lands tails, He’ll send everyone to hell. Everyone knows this is what will happen.
In a dark alley, a stranger approaches you and tells you that he can make God’s coin land heads, thereby ensuring that everyone goes to heaven. He says he’ll do it if you give him your wallet. You assign a very low probability to this stranger telling the truth — 1 in a bajillion — but the stranger reminds you that 10 bajillion people will have their fates determined by God’s coin.
‘Hang on,’ you say, ‘This seems a lot like a Pascal’s mugging.’
‘Au contraire,’ says the stranger, ‘It can’t be a Pascal’s mugging. The outcome I’m promising to avert — everyone going to hell — is not low probability at all. p(hell) is 50%.’
Would this reply convince you to hand over your wallet? Of course not. Even though the baseline risk of everyone going to hell is high, the probability that you make a difference — getting everyone to heaven when they otherwise would have gone to hell — is extremely low. And it’s this latter probability that determines whether your situation is a Pascal’s mugging.
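Spelled out as rough expected-value arithmetic (a sketch, with v the per-person value of heaven over hell, N the 10 bajillion people, and w your wallet):

EV(hand over wallet) ≈ p(stranger is truthful) × p(coin would have landed tails) × N × v − w

The baseline risk, p(tails) = 50%, is capped at 1, so it can never rescue the deal. Everything hinges on p(stranger is truthful) = 1 in a bajillion: the probability that you make a difference.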
So when people say that AI safety is a Pascal’s mugging, you can’t just reply that p(doom) is high. You have to argue that p(you avert doom) is high.
All that said, I think p(you — yes, you — avert doom) is high, or at least high enough. The whole doom situation is really up-in-the-air right now, and you’re at most like 4 degrees of separation from the big players: presidents, lab CEOs, and the like. You can influence someone who influences someone who influences someone. Your chances are way higher than 1 in a bajillion.
2026-04-26 00:08:33
This is the second post in a sequence that expands upon the concept of substrates as described in this paper. It was written as part of the AI Safety Camp project "MoSSAIC: Scoping out Substrate Flexible Risks," one of the three projects associated with Groundless.
We now argue that the idea of substrate, as we describe it in the original work and in the previous post, unifies several safety-relevant phenomena in AI safety. This list is expanding as we identify and clarify more.
These examples show how the specific details of how a complex computation (i.e., that of a neural network) is implemented in a real system affect that computation such that safety might be compromised.
Self-repair, also referred to as the Hydra effect, is a phenomenon whereby ablations to network components (e.g., attention heads, MLP layers) are compensated for by later layers in the network.
Such behaviour represents an obstacle to the causal analysis of neural networks, as such analysis relies on ablations to isolate genuine causal relevance from observed correlations.
Self-repair was first identified in Wang et al. (2023), investigated in more detail in McGrath et al. (2023), and given a full analysis in Rushing & Nanda (2024). We here follow the last of these, which presents a fine-grained analysis of various aspects of self-repair.
The most interesting result of Rushing and Nanda's work is that LayerNorm contributes substantially to a network's self-repair capabilities. This is an architectural component of the model, often folded into the model weights themselves in mathematical descriptions or neglected entirely as a "passive" part of the model.
In our terminology, it forms a part of the substrate. It is context surrounding the purely formal function f(x) = y that characterizes the model's behaviour, allowing that function to be learned and implemented.
This passive, architectural primitive rescales and centres the activation values such that the per-layer activations average to zero and have a variance of 1. When an attention head is ablated, the overall size of the residual stream norm is reduced. LayerNorm divides the residual stream by a scaling factor proportional to its norm, so reducing the norm amplifies the surviving components' contributions. Model components tend to be correlated, pushing the residual stream vector towards the same answer. Thus, they compensate for the absent signal from the ablated head.
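As a toy illustration of this mechanism (an invented NumPy sketch, not Rushing & Nanda's actual setup; the dimensions and correlation structure are made up for the example), consider a residual stream built from several positively correlated components:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

def layernorm(x, eps=1e-5):
    # Centre to mean zero and rescale to unit variance (gain/bias omitted).
    x = x - x.mean()
    return x / np.sqrt(x.var() + eps)

# Toy "residual stream": eight component outputs sharing a common direction,
# standing in for correlated attention heads.
common = rng.normal(size=d)
components = [common + 0.5 * rng.normal(size=d) for _ in range(8)]
readout = common / np.linalg.norm(common)   # direction the heads push towards

full_raw, ablated_raw = sum(components), sum(components[1:])  # ablate head 0

print("raw projection, full:   ", full_raw @ readout)
print("raw projection, ablated:", ablated_raw @ readout)            # drops by ~1/8
print("LN projection, full:    ", layernorm(full_raw) @ readout)
print("LN projection, ablated: ", layernorm(ablated_raw) @ readout)  # barely drops
```

Ablating one component shrinks the residual norm, so LayerNorm's divisor shrinks too, and the surviving correlated components get amplified back towards the original output direction.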

On average, when an attention head is ablated, this re-scaling contributes ~30% of the observed self-repair in Pythia-160M.[1]
This is a relatively shallow example of an untrained, "passive" component of the model (i.e., an architectural component) contributing to safety-relevant properties.
The next case study we consider is the role that quantization plays in safety-relevant aspects of AI systems.
Quantization refers to the format in which a model's weights are stored in computer memory. A neural network's weights are just numbers, but these can be represented in different ways, trading precision against memory and compute efficiency. Models are typically trained in FP16, a 16-bit floating-point format: one sign bit, five exponent bits, and ten mantissa bits. A (normalized) value is written in this form:

value = (−1)^sign × 1.mantissa × 2^(exponent − 15)
But FP16 is an expensive format to store weights in. Deployment thus increasingly favours quantized formats: FP8 (still floating-point, but with a narrower range), INT8 and INT4 (plain signed integers with a scale factor rescaling them at inference time), and others.
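To make the "scale factor" concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization (illustrative NumPy code, not any particular library's implementation; real deployments typically use per-channel scales and more careful calibration):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: store int8 values plus one FP32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize(q, scale):
    """Rescale back to floating point at inference time."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(scale=0.02, size=4096).astype(np.float32)
q, s = quantize_int8(w)
print("max rounding error:", np.abs(dequantize(q, s) - w).max())
```

The stored weights shrink from 16 bits to 8, at the cost of a small rounding error bounded by half the scale.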
Performance loss from quantization is smaller than one might expect. Well-engineered 8-bit quantization typically costs less than 1% on standard benchmarks, and even INT4 loses only 1-5% performance when using careful compression.
Like LayerNorm, quantization format is substrate rather than model in our sense. It is computational context surrounding the formal function f(x) = y that characterises model behaviour. It is a further design choice made according to the constraints of the deployment hardware.
Its safety relevance shows up most sharply in the bit-flip attack literature. Bit-flip attacks use hardware fault-injection techniques to alter stored model weights directly. Coalson et al.'s PrisonBreak (2024) showed that flipping fewer than 25 bits (sometimes as few as 5) in an aligned FP16 LM is enough to strip safety alignment. Their attack worked by targeting the most significant exponent bits of FP16 weights, since flipping the top exponent bit multiplies or divides a weight's magnitude by a factor of 2^16 (nearly five orders of magnitude).
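You can see why exponent bits are such a high-leverage target with a self-contained sketch (illustrative NumPy code, not the paper's attack; the weight value is made up). In IEEE-754 binary16's 1-sign/5-exponent/10-mantissa layout, bit 14 is the most significant exponent bit:

```python
import numpy as np

def flip_bit(x, bit):
    """Return FP16 x with one bit of its binary16 representation flipped."""
    a = np.array([x], dtype=np.float16)
    a.view(np.uint16)[0] ^= np.uint16(1 << bit)
    return a[0]

w = np.float16(0.0123)           # a typical small weight (made up)
print(w, "->", flip_bit(w, 14))  # ~0.0123 -> ~806: scaled up by 2^16
```

Flipping that bit on multiplies the weight by 65,536; flipping it off crushes the weight towards zero. Either way, a single bit produces a swing far larger than ordinary training updates.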
Zahran et al. (2025) extend this investigation to FP16, FP8, INT8, and INT4 quantisations. FP8 is the most robust, holding attack success rates below 20% (within a 25 bit-flip budget). INT8 offers substantial but lesser protection. INT4, despite having no exponent bits and only 16 possible values per weight, collapses almost as easily as FP16.

Three further observations:
First, the attack heuristics don't transfer across substrates. PrisonBreak's exponent-bit-targeting strategy is inapplicable to integer formats, which have no exponent bits at all. Zahran et al. use a different search strategy: they perform a direct step-size search over all bits in a candidate weight. A safety analysis calibrated on one substrate's vulnerability profile simply does not describe the other's.
Second, the attack locations shift with quantisation scheme. Successful attacks on FP16 and INT4 models cluster in attention value projections, while successful attacks on FP8 and INT8 models cluster in MLP down-projections. The same behavioural outcome is produced by interventions on different parts of the model depending on the storage format.
Third, vulnerabilities can persist across substrate transitions in ways that aren't obvious. Jailbreaks induced in FP16 models and then quantized down retain nearly their full attack success rate under FP8 or INT8 quantization (<5% drop), but lose roughly 24–45% under INT4.
The broader point is that safety-relevant properties established at one level of abstraction (the FP16 model, evaluated on HarmBench) silently depend on choices made at a lower, untracked level (the storage format, the quantization scheme, the order of operations between alignment and quantization). When these implementation choices change as part of standard deployment, the safety claim does not always go with them.
The third case study moves further down the stack, from the encoding of weights in memory (Case 2) to the physical memory in which those encodings are stored.
RowHammer is a read-disturbance phenomenon in dynamic random access memory (DRAM). Repeatedly activating a row of memory causes charge to leak in physically adjacent rows, eventually flipping bits. Lin et al.'s GPUHammer (2025) demonstrated that NVIDIA GPUs are vulnerable to an engineered version of this attack.

On an NVIDIA A6000 with GDDR6 memory, their attack produced eight bit-flips across four memory banks. By targeting the most significant bit of the exponent (weights were stored in FP16), they were able to collapse model performance across five standard ImageNet models (AlexNet, VGG16, ResNet50, DenseNet161, InceptionV3), driving top-1 accuracy from 56–80% (depending on model) to below 0.1% in around thirty minutes and within eight attempts.
Two observations matter for our argument.
First, the vulnerability is substrate-specific in a way that is invisible from the model's level of description. The same hammering techniques produced eight bit-flips on the A6000 but zero on an RTX 3080 (same GDDR6 family, different chip) and on an A100 (HBM2e memory, different architecture). An ML practitioner deploying a model cannot, from the model's behaviour, distinguish which of these substrates their weights are sitting on.
Second, the mitigations live at levels below the model and are traded against performance. Error-Correcting Code (ECC) memory stores extra bits alongside data that let the system detect and sometimes correct bit-flips. This can correct single bit-flips on the A6000, but it is disabled by default because enabling it costs 6.5% of memory capacity and a 3–10% inference slowdown. The substrate's safety-relevant behaviour is thus partly a function of decisions made by hardware vendors, cloud providers, and system administrators outside of typical AI safety threat models.
This is the same structural pattern as Cases 1 and 2, now at the physical layer. A property established at one level of abstraction (behavioural alignment of a DNN, evaluated on standard benchmarks) depends on choices made at a lower, untracked level (the specific DRAM chip and its configuration, e.g., ECC). Those choices change across cloud GPU generations and deployment settings.
Safety evaluations in AI (ablation studies, behavioural evals, weight-level analysis) are substrate-sensitive.
Note that Rushing & Nanda measure this over the top 2% of tokens by direct effect, and they describe it as a "directional metric" rather than a precise quantity.
2026-04-25 23:31:07
Our scene is set in a biofilm, long before the origin of the first multicellular organisms, so long ago that time itself has not yet really been invented except in the cycles of energy production that characterise the activity of each cell. Many bacteria live here, tied together in a complex web of functional interrelations that bind different groups to each other. Life at our scale is defined by cycles: food must be harvested, waste must be secreted, animals must live and replicate and die, structures must be erected to provide shelter and room for growth, all at regular intervals. All of this is also true of life at the scale of the biofilm. (Yes, even the part about building structures.)
At this moment in time the sun is setting on the planet that will one day be called Earth, but the bacteria of course are not aware of this fact. Their biofilm is in a slimy place far from the scorching heat of daylight, and any residual photosensitivity inherited from their time as nomadic plankton is strictly on the way out. From the bacterial point of view, there’s really no good reason to hold on to such an antiquated feature (a vulnerability, really) as reactivity to strong light.
But not all is well in the biofilm. Food has been growing scarce. Cells are abandoning their mutual arrangements, eating each other in destructive conflicts. There is talk of a new development in the ion channels: a super-cell. The super-cell, it is theorised, could be attained by editing the code of a normal cell during division, or even during the cell’s lifetime. It would effectively remove the limiters placed by evolution on the size and energy consumption of cells, growing the cell to a new and (relatively) huge size. The increased energy consumption of that organism would be made up for by its optimality: it would have more energy to hunt and collect food, more capability to process information, and therefore effectively outcompete the unmodified baseline cells.
Some of the cells protest that it isn’t possible scientifically for a single super-cell to be better than a normal cell at everything. They point out that cells in the biofilm are diversified, specialised workers with unique skills and talents. They cooperate to form mutually beneficial arrangements that ensure value is created for all cells. The super-cell theorists respond that none of that matters when a horde of super-cells is eating everything and they are too strong to be disarmed by the standard expulsion and regulation mechanisms.
Others protest that normal cells would simply live and trade with super-cells, just as they do with other specialised cells already. The super-cell theorists point out that super-cells, liberated from their evolutionary constraints, have no drive to cooperate with other cells and would probably just eat them for their nutrients. Even if the super-cells harboured no animosity towards the other cells, they would gain the ability to access food and nutrients more effectively than the normal cells by virtue of their enhanced physical capabilities and intelligence, eventually starving them to death slowly or quickly.
In fact, the theorists propose that optimal super-cells would discard their senescence, autophagy, and apoptosis mechanisms, meaning that they would not age and self-destruct like other cells do to ensure the health of the collective biofilm. They would be able to live indefinitely and divide constantly, with the rate of super-cell production rapidly outpacing the rate at which normal cells are produced and replicated. The resulting new super-cells might themselves be unstable, leading to rapid evolution and augmentation within the super-cell DNA. As the super-cell theorists put it, the fact that super-cells came from normal cells would be immaterial. It would be as if a new species of cell had been born that was simply superior to all normal cells, and the rules of evolution are very clear about what happens to losers in evolutionary conflicts.
The ultimate effect is that once a super-cell is made, it will rapidly grow and form a horde of super-cells. The horde will rush out and, whether in competition or cooperation with each other, efficiently consume all of the other cells and the free energy in the biofilm. Free from death and utterly ruthless, they might even spread to cover the known world and beyond.
“But what is to be done?” the normal cells ask. Here the prognoses of the super-cell theorists grow increasingly grim. The problem is that it would be to the benefit of any individual cell or cluster of cells to become super-cells, since their newfound deathlessness and greater size would give them better chances at replication. For every group of cells that refused this fate, there would be another that accepted the offer, and thereby gained the power to wipe out the first. Thus even if all cells wished to remain normal and non-super, they would always be wary that their competitors here or elsewhere might succumb to the temptation and rapidly become unstoppable. The more stressful conditions got in the biofilm, the more likely it was that some cell or group would give in.
Others proposed that it would be possible for groups of “good” cells to try and harness the power of super-cells to wipe out their enemies. The theorists were not very optimistic about the power of the “good” cells to keep super-cells under control. After all, super-cells were different from the macrophages that kept order in normal cell systems. They were capable of rapid replication, rapid adaptation, and were simply bigger, stronger, and smarter than the “good” cells that were supposed to keep them in check. No, once a super-cell was unleashed, it was lights out for everyone else in the biofilm.
By now it should be clear that what I call “super-cells” are both a metaphor for cancer and also a metaphor for superintelligence as it is commonly conceived of. The idea of a powerful, ruthless, optimal being that restructures the world around it and snuffs out all suboptimal lifeforms is one that has echoes in many scales, not merely human ones. However, the point of that thought exercise is not to say that we are doomed by the laws of biology. Quite the opposite.
Notice, for example, that life on earth does not consist solely of super-cells. We are not walking tumours. Instead, life is made out of the “normal” cells, the ones that were so clearly suboptimal to their cancerous variants, cells that are flimsy, overspecialised, underoptimised, and that still obey programmed cell death protocols laid down by evolution. We know this because every now and then some cells do undergo an intelligence explosion, and clearly become distinct from the norm when they become cancerous. You might protest that the difference in power level between a cell and a cancer cell is nothing like the difference in power level between a human and a superintelligence. But there are two immediate objections I would issue against that idea:
Now notice how the “normal” cells “won” over the cancer cells. They did not do so by becoming cancers themselves, or using cancers to kill off their opponents in massive cell wars. Instead, they self-organised into superorganisms that consist of billions of cells, equipped with complex internal sensing and self-regulation behaviours that no individual group of cells could provide. These superorganisms were notably more organised than the cultures of cells in primitive biofilms, with cooperation that was much tighter than what was previously possible when cells were free-moving individual prokaryotes. It was these cooperative, well-ordered, decidedly non-cancerous superorganisms that replicated across the earth, and eventually gave rise to humans who are now researching the ultimate way to defeat super-cells once and for all: that is to say, we are researching the cure for cancer. Even without that cure, of course, humans regularly defeat cancer and achieve partial or full recoveries.
I want to expand a little on that last point. It would have been quite easy for a group of cells to hypothesise that cells were the reference class of intelligence or strength or capability. After all, at that point cells were genuinely the most complex and intelligent forms of life on Earth. Thus, to do anything beyond what a normal cell could do would require creating more and more powerful super-cells that were bigger and bigger. Yet this strategy is manifestly not what played out. Instead, there was a jump in scale and complexity that came from cells coming together, a jump that dwarfed anything an individual super-cell could achieve. The cells that flew to the moon were hundreds of trillions of decidedly normal cells, not one engorged cancerous cell that had undergone recursive self-improvement.
The same flaws in reasoning, I suggest, are present in the human projects to create superintelligence. It is true that we could probably create a superintelligence that could out-think any human, or destroy our civilisation in its present form. However, I think it would be a mistake to therefore conclude that humans are evolutionary dead ends, and to throw all our resources into creating super-powered gods that we would surely fail to control. Our biological history suggests that another way to achieve those titanic feats of intelligence we dream of is not to discard our selves but to improve our ability to work together. Instead of engineering ever-growing digital tumours, we could learn to make better use of the computational and organisational powers already amongst us.
This would not be a simple thing to do, of course. Many people in the Valley and the world of AI look with disdain and disgust at our outdated institutions, gridlocked politics, and disintegrating social and natural ecosystems. It is much easier for an individual cell to dream of becoming or creating a super-cell than it is for an individual cell to dream of becoming a mouse, a dog, or a human. But if we want to both achieve our dreams and live to see the day after, if we want to maintain our sense of purpose and drive as a species, that may be what we have to try and do.