
"Thinkhaven"

2026-04-26 02:12:53

Inkhaven has people writing a blogpost a day for 30 days. I think this is a pretty great, straightforward exercise, that I'd definitely want in a hypothetical Rationality Undergraduate Program. But, it's not the only such exercise I'd want to include. It's gotten me excited for a different (superficially similar) program, which I might call "Thinkhaven."

In Thinkhaven, the goal is to learn the skill of "relentlessly think new, useful thoughts you haven't thought before." 

Inkhaven had a basic "Goodhart-able" goal of "publish 500+ words every day." For Thinkhaven, I would imagine the Goodhart-y goal being something like:

  • Every day, you must publish 500 words of "research journal." They should be cogent enough for third parties to follow along with your thought process, but, they don't need to end with nice discrete conclusions. 
  • Every two weeks, you must also publish a 2500 word effortpost.

And, somewhat more opinionatedly:

  • Each journal entry should include at least one new question that you're thinking about. (Often, these will be subquestions within a broader Big Question that you're exploring that week, or a reframing of that big question that feels like a better pointer)

The spirit of the Goodhart is "are you finding new questions, and making some kind of progress on them?". Along the way, each day, you're thinking at least one new thought you haven't thought before.

One way of "thinking new thoughts" and "asking new questions" is to research stuff that already exists out there (i.e. learn some cool new facts about math/history/science/etc, and then write up good explainers for it, or connect it to other domains).

Another way is by thinking original thoughts, plumbing somehow through your confusions about the world and developing new concepts to help deal with them.

Presumably there are other approaches too, but I list those two to convey that there's more than one way to go about this. The main thing we'd be trying not to do at Thinkhaven is "explain ideas that you've already thought about and just wish everyone else understood." That's also important, it's just not what Thinkhaven is for.

The daily journal is for accountability, to make sure you're making any kind of progress at all. The daily "new question" is to help ensure that progress has some kind of forward momentum, and is exploring new ideas.

The fortnightly 2500 word published writing is to close the loop on "get a new idea all the way from a vague musing, to something written up in a way that other people can critique." (Ideally, this explains some new ideas to the internet. If you didn't really get any new good ideas, you can write "well, I didn't come up with anything, but, here's a cleaned up version of my daily research notes.")

My primary inspiration for this is not actually Inkhaven, it's a period in 2022 where the Lightcone team focused on thinking/research/etc to try to get in touch with what it's like to be an original thinker on LessWrong. We set a goal of writing up a blogpost-worth-of-content per day, which the team would then read over each morning. Even without publishing externally, it was a useful forcing function to keep generating new thoughts and forcing them into a clearer shape.

I personally found it helpful for transitioning from "a guy who mostly defers to other people" to "a guy thinking his own strategic and intellectual thoughts."

Mentors, and Different Styles of Thinking

This is intended to be a fairly openended container. I'd expect to get value out of the pure container listed above, but, I'd ideally want a few different styles of mentors and coaches around, who embody different ways of thinking. 

There are a few ways to operationalize that. You could model the thing more on MATS, where everyone has a mentor they meet with on some cadence. If I were modeling it more on Inkhaven, I think some mentors would give classes, others might be more like mysterious old wizards you just go talk to.

All participants need to have at least one mentor who is enthusiastic about them (as part of the admissions process), but, they could sample from different styles of mentorship over the course of the month.

Possible examples of mentors: 

Note: these are examples, not people who agreed to participate or even looked at this post. But they are some archetypes that I'm imagining. I'd be hoping for Thinkhaven to include a mix of mentors or "resident thinkers" with similar range.

John Wentworth-style, focused on tackling some confusing problems we don't understand, asking "what's the hard part?" / "what's the bottleneck?", and systematically making progress, while keeping an eye on Principles Which Will Carry Over To The Next Paradigm.

Logan Strohl-style, focused on openended, patient observation (with a kind of "open curiosity" as opposed to "active curiosity"). Trying to keep as close-to-the-metal on your observations. (See Intro to Naturalism: Orientation for a deep meditation on the sentence "Knowing the territory takes patient and direct observation.")

Elizabeth Van Nostrand-style, with some focus on open-ended "lit review"-ish research. Pick a new field you are curious about, read over lots of existing papers and books. See if you can synthesize some new takeaways that weren't obvious. Be ready to follow threads of information wherever they lead. 

Scott Garrabrant-style, go live where the important math problems are, but then marry-for-love. Mull over interesting problems and then get nerdsniped on whatever feels alive.

Chris Olah-style, where... okay honestly I'm not actually sure how Chris Olah does his thinking and he seems particularly unlikely to come. But, reading over his older blogposts I get a sense of both a guy who likes studying lots of little fiddly patterns in the world and making sense of them, in a way that (vaguely) reminds me of an old timey biologist. And, a guy who likes experimenting with new ways of explaining things. 

Thinking Assistants / Research Managers

The mentors above are selected for "I respect their thinking and writing."

They're not particularly selected for it being the right-use-of-their-time to help people through daily stumbling blocks, executive dysfunction, etc.

I would want some staff that are more like the research coaches at MATS, who meet with the people on some cadence to check on how things are going and help them resolve obstacles. And, I'd like to try out having dedicated Thinking Assistants available, who can sit with you for a chunk of time as you write or talk out loud through your problem, and notice little microhabits that might be worth paying more attention to.


"FAQ"

Everything above is the core idea. I'm not that confident in that particular format, and expect I'd change my mind about stuff after one iteration. But, here are some explanations of why I picked this structure instead of others, structured as an FAQ.[1]

Why require "a new question each day?"

I'm not sure this will work as well as I hope. But, my reasons are:

  1. It forces you to cultivate a cluster of skills. 
  2. It frames attention, in a way I think will help cultivate a healthy "curious vibe."
  3. It is a mini-feedback loop that hopefully pumps against some kinds of masturbatory thinking.

Sometimes, when you're exploring and stewing on a set of ideas, you're not really making progress; you're sort of going in circles, or building up some superficial understandings that don't really translate into a clear takeaway. Asking yourself new questions forces you to take your vague musings and confusions and turn them into questions with a meaningful return type.

It also pumps against "explaining ideas you've already thought about." (which again, is totally a useful way to write. It's just not what this program is for). By forcing yourself not to do something, you create space to practice new skills.

And, while it's opinionated on format, I think the "question" framing is still pretty openended as structures go.

What would asking new questions look like, in practice?

One person read the above and was like "okay I kinda get it, but I think I need to see an example of what this looks like to have a clearer sense of what this'd mean." 

Here's an example. 

(Note: this is just one example. As I just said, the program should be pretty unopinionated. Hopefully, if my line of questioning feels weird to you, it helps you imagine a version that would fit your thought process better). 

I might start with a vague frustration/confusion:

"Geeze, alignment seems to have shitty feedback loops. wat do?"

I find it fruitful to ask more explicitly:

"Okay, what would it mean to have good feedback loops?"

"If there were definitely no good feedback loops, what else might good progress look like?". 

Which in turn prompt more specific questions like:

"What are some domains that solved the 'poor feedbackloop' problem before? How did they do that?".

"What are some domains where 'feedbackloop' just wasn't even the right ontology?"

"What problem are 'feedback loops' solving? What other ways could you solve those?"

"What properties would 'solving alignment' have? What do I actually mean by that?"

As well as meta/tactical questions like:

"Who are some people who've thought about this already? Do they have writing I could read? Could I literally go talk to them?"

"Why is it hard to think about this, and what can I do about that?"

And then I might learn about domains where progress clearly accumulates, but a lot of it is driven by "taste." I might then spend a day digging into historical examples of how people acquired or transmitted taste.

What should a "Daily Journal" look like?

The first answer is "whatever you want." 

But, I did find, while beta testing this for myself this month, that it worked better when I gave myself a set of daily prompts to fill out, which looked like:

What questions did I think about yesterday?

What did I learn yesterday?

What questions or confusions am I interested in now?

What seems difficult about this? How can I fix that?

The "what did I learn?" section is the bit that ends up most shaped like a 500 word blogpost. 

Rather than think of this as "the thing I scramble to write before the end of the day", it's more like a thing I write when I first get started in the morning. (I don't really like the "publish by midnight" thing that Inkhaven does, and I think I might want to actually set the deadline at lunchtime).

Another friend who beta-tested the format experimented with changing up the prompts, so that it worked better as an orienting process for them. (By default it felt a bit like a tacked-on-assignment they were doing out of obligation, but, slightly tweaked, it felt more naturally like a useful thing for them to do each day)

Are the daily journals public? Why?

I think so, but, not 100% sure.

(But, my default recommendation would be to put them on an out-of-the-way secondary blog, so you feel more free to think dumb thoughts along the way).

The reason to make them public is to help them function more as an accountability mechanism. You don't need to make a nice polished essay with a conclusion. But, you do need to get your thoughts to a point where they're structured enough someone else can make sense of them. 

I considered just requiring them to be published internally to the Thinkhaven cohort. Habryka argued with me that this'd make people feel more like they were writing for the cohort-in-particular, having to care what those people thought, instead of getting to follow their own thought process.

The most important thing is you expect someone to be reading them.

Do we even need the 2500 word effortpost? Why can't it just be research journals all the way down?

Because the point of intellectual progress is to actually contribute to the sum of human knowledge. It's an important part of the process to package it up in a way that other people can understand and build on.

And, it's an important forcing-function that eventually your meandering question needs to turn into something that someone else would want to read.

Why "2500 words every 2 weeks" in particular?

Both of these are numbers I can imagine fine-tuning.

Why not "once a week?"

I thought "once a week" might be a better cadence, but, when I tried it out I found it too short.

During Inkhaven, where I was mostly focused on writing up existing ideas, I was able to write ~2000+ words a day and usually write one full post and make partial progress on an effortpost.

Thinking new meaningful/useful thoughts takes a while, and sometimes it's important to get lost in the woods for a while without knowing quite how everything will tie together. Or, just go off and gather a lot of information and digest it.

Why not longer?

I think "real work in the field" often does take more than 2 weeks at a time to output a blogpost worth of content. But, I think that's too slow a feedback loop for people learning. This is still supposed to be a class. I think it'd be hard for people to stay for longer than a month, and seems like people should get at least two reps in of "go from ideation -> publishing."

If this ended up being like a 3-month fellowship, I can imagine once-a-month being a reasonable cadence. But, I think it's just not that hard to turn 2 weeks of thinking into one substantial writeup.

If this were a 3-month fellowship, my current guess is I'd keep the 2-week effortpost but add in a Final Project that's aiming for the level of "significant contribution to whatever field you're exploring."


All of this is only one possible structure for the underlying goal of "learn to relentlessly find new, useful thoughts every day." But, it's a pretty simple structure I'd expect to do pretty well even in its minimal form.

Anyways, happy thinking.

  1. ^

    These questions have all been asked at most "once" and sometimes "zero", so "frequently asked questions" is not exactly correct.




Substrate: Intuitions

2026-04-26 01:29:47

This post and the related sequence were written as part of the AI Safety Camp project "MoSSAIC: Scoping out Substrate Flexible Risks." This was one of the three projects supported by, and continuing the work of, Groundless. Specifically, it develops one of the key concepts referred to in the original MoSSAIC (Management of Substrate-Sensitive AI Capabilities) paper (sequence here). Matthew Farr and Aditya Adiga co-mentored the project; Vardhan Kumar Ray, Vadim Fomin, and Ian Rios-Sialer participated as team members.


In a previous post and paper, we informally sketched out a definition of substrate as follows:

"the (programmable) environment in which a system is implemented. In other words, it is the essential context that enables an algorithm to be implemented beyond the whiteboard."

Or more informally,

"that (layer of abstraction) which you don't have to think about."

We gave several examples of differences in this programmable context producing differences in measurable and relevant aspects of computation. These included the adoption of GPUs allowing networks to train at scale, or how quantum computers run entirely different algorithms from their classical counterparts.

In the following posts, we expand upon this concept more thoroughly, giving (i) an informal, intuitive introduction to substrate and its role in computation, (ii) some case studies that argue its salience in cybersecurity and AI safety, and (iii) our initial formal characterization.

Substrates (from the ground up)

There is a principled sense in which the physical and architectural substrate of a computation shapes its behavior in ways that are invisible at the purely computational level.

We start with pure, unembedded mathematics. These are numbers as Plato intended them.


In this case, there is no substrate, no wider context needed to implement the numbers. (We neglect the embedded nature of the cognition in which the numbers are placed.)

Thus, we can say that 3 = 3 in all possible respects within this Platonic reality.

Now we consider a simple embedding of these Platonic ideals in physical reality. We write these numbers on a sheet of paper.


Here, the paper is part of the substrate. This might seem trivial in the above example, but in more complex calculations, the paper becomes an essential part of what makes the computation possible. Note that we can now start to characterise differences between these two numbers: they live at different locations on the sheet.

To see how these differences become increasingly relevant in computation, let's consider now two separate pieces of paper. One is on my desk, the other is kept at my local library.


Now these two 3s, despite having the same mathematical/formal meaning, perform very differently when I want to use them in computation. If I'm working at my desk and I want to compute something using the number written on that sheet of paper next to me, I can complete this very quickly. However, if I need the number on the other sheet, the one at the library, my calculations will take considerably longer.

Computational substrates

Now we generalize this to actual modern computer systems. Behind the layers of abstraction and useful ergonomics of modern computers, we have something remarkably similar to the above example of sheets of paper located across the city.

Instead of sheets of paper, we have addresses in the computer's memory. These are updated and retrieved for computational work via code, itself a layer of abstraction that hides the various stages of assembly sitting between the user and the physical changes she's making to her computer.

Instead of 3s located at a desk and in the local library, we now have 3s located at different memory addresses. The locations of these have a noticeable and often exploitable effect on the computation performed.

For instance, and for those that don't know, a CPU has a memory hierarchy structured like this:


The L1 cache is the smallest and most immediate. It has a latency of ~1 ns. Think of it as a book open on your desk. It is small but can be accessed very quickly.

The L2 cache is next. It is larger but needs more time to access (~3–5 ns). Think of it as a stack of closed books on your desk. They contain more information but you have to open them and find the correct page if you need to access the information they hold.

The L3 cache is larger still. Multiple cores will access it, and its latency is ~10–20 ns. Think of it as a bookshelf in your room: you have to leave your desk and search through the shelves to find the information you need.

And so on...

The point is this: whether a 3 is stored in the L1 cache vs the L3 cache makes a real difference to the computation performed, despite the formal equivalence. These differences by themselves are tiny, on the order of a few nanoseconds. But, as we scale the systems to increasingly complicated computations, these differences count.
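To make the locality point concrete, here is a minimal sketch (in Python with NumPy; the array size and stride are arbitrary choices, not from the post) that computes the same sum twice: once walking memory sequentially, once jumping across it in large strides. The answers are formally identical; only the memory-access pattern, and hence the running time, differs.

```python
import time
import numpy as np

n = 1 << 22                          # ~4M float64s, far larger than L1/L2 cache
a = np.arange(n, dtype=np.float64)

# Sequential pass: streams linearly through memory, so cache lines and
# the hardware prefetcher are used efficiently.
t0 = time.perf_counter()
seq_sum = a.sum()
t_seq = time.perf_counter() - t0

# Strided pass: jumps 4096 elements at a time, defeating spatial
# locality -- like fetching each number from the library instead of
# from the sheet on your desk.
stride = 4096
t0 = time.perf_counter()
strided_sum = sum(a[i::stride].sum() for i in range(stride))
t_strided = time.perf_counter() - t0

# The two results are exactly equal; only the access pattern differed.
print(seq_sum == strided_sum)
```

The exact timings depend on the machine, but the comparison is the point: identical mathematics, different substrate behaviour.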



Engineers have come up with various tricks to exploit these differences and improve performance:

  1. MergeSort vs QuickSort. Two algorithms of the same computational complexity perform differently in real systems. QuickSort partitions arrays in place and sequentially, using local memory. MergeSort accesses scattered data across multiple caches during merging, causing frequent cache misses.
  2. Data-Oriented Design. In object-oriented code, each entity stores its fields together in memory. A physics system iterating over positions must load entire objects to update single fields. This wastes cache lines on irrelevant data. Data-oriented design stores each field type contiguously: all positions in one array, all velocities in another, and so on. Iteration streams linearly through memory, which can be 2–10× faster.
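As a sketch of the data-oriented idea (class and variable names here are illustrative inventions, not from any real engine), compare updating positions stored object-by-object versus field-by-field:

```python
import numpy as np

# Array-of-structures: each particle's fields live together in one object.
class Particle:
    def __init__(self, x, v):
        self.x = x        # position
        self.v = v        # velocity
        self.mass = 1.0   # irrelevant to this update, but dragged along anyway

dt = 0.1
particles = [Particle(float(i), 0.5) for i in range(1000)]
for p in particles:       # touches whole objects just to update one field
    p.x += p.v * dt

# Structure-of-arrays: each field stored contiguously in its own array.
xs = np.arange(1000, dtype=np.float64)
vs = np.full(1000, 0.5)
xs += vs * dt             # streams linearly through two flat arrays

# Same physics either way; only the memory layout differs.
print(all(p.x == x for p, x in zip(particles, xs)))
```

The speedup of the second form comes entirely from the substrate: cache lines arrive full of data the loop actually needs.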

To summarize: two entities can be formally equivalent yet meaningfully different when implemented in real computation. Above we point at performance aspects, though other differences include security/vulnerability, interpretability, stability, and so on. Whilst these differences are often trivial, at scale they accumulate, having a meaningful impact on the tractability of certain formal entities (numbers at locations, algorithms using certain caches).

We use the term "substrate" to capture this essential context.





AI safety can be a Pascal's mugging even if p(doom) is high

2026-04-26 00:16:36

People sometimes say that AI safety is a Pascal’s mugging. Other people sometimes reply that AI safety can’t be a Pascal’s mugging, because p(doom) is high. Both these people are wrong.

The second group of people are wrong because Pascal’s muggings are about the probability that you make a difference, not about baseline risk. The first group of people are wrong because the probability that you personally avert AI catastrophe isn’t that small.

Here’s a story to show that Pascal’s muggings are about the probability that you make a difference. Imagine that God will flip a coin at the end of time. If the coin lands heads, He’ll send everyone to heaven. If the coin lands tails, He’ll send everyone to hell. Everyone knows this is what will happen.

In a dark alley, a stranger approaches you and tells you that he can make God’s coin land heads, thereby ensuring that everyone goes to heaven. He says he’ll do it if you give him your wallet. You assign a very low probability to this stranger telling the truth — 1 in a bajillion — but the stranger reminds you that 10 bajillion people will have their fates determined by God’s coin.

‘Hang on,’ you say, ‘This seems a lot like a Pascal’s mugging.’

‘Au contraire,’ says the stranger, ‘It can’t be a Pascal’s mugging. The outcome I’m promising to avert — everyone going to hell — is not low probability at all. p(hell) is 50%.’

Would this reply convince you to hand over your wallet? Of course not. Even though the baseline risk of everyone going to hell is high, the probability that you make a difference — getting everyone to heaven when they otherwise would have gone to hell — is extremely low. And it’s this latter probability that determines whether your situation is a Pascal’s mugging.

So when people say that AI safety is a Pascal’s mugging, you can’t just reply that p(doom) is high. You have to argue that p(you avert doom) is high.

All that said, I think p(you — yes, you — avert doom) is high, or at least high enough. The whole doom situation is really up-in-the-air right now, and you’re at most like 4 degrees of separation from the big players: presidents, lab CEOs, and the like. You can influence someone who influences someone who influences someone. Your chances are way higher than 1 in a bajillion.




Substrate-Sensitivity

2026-04-26 00:08:33

This is the second post in a sequence that expands upon the concept of substrates as described in this paper. It was written as part of the AI Safety Camp project "MoSSAIC: Scoping out Substrate Flexible Risks," one of the three projects associated with Groundless.


We now argue that the idea of substrate, as we describe it in the original work and in the previous post, unifies several safety-relevant phenomena in AI safety. This list is expanding as we identify and clarify more.

These examples show how the specific details of how a complex computation (i.e., that of a neural network) is implemented in a real system affect that computation such that safety might be compromised.

Case 1: LayerNorm's role in self-repair

Self-repair, also referred to as the Hydra effect, is a phenomenon by which ablations to network components (e.g., attention heads, MLP layers) are compensated for by later layers in the network.

Such behaviour represents an obstacle to the causal analysis of neural networks, as such analysis relies on ablations to isolate genuine causal relevance from observed correlations.

Self-repair was first identified in Wang et al. (2023), investigated in more detail in McGrath et al. (2023), and given a full analysis in Rushing & Nanda (2024). We here follow the latter, which presents a fine-grained analysis of various aspects of self-repair.

The most interesting result of Rushing and Nanda's work is that LayerNorm contributes substantially to a network's self-repair capabilities. This is an architectural component of the model, often folded into the model weights themselves in mathematical descriptions or neglected entirely as a "passive" part of the model.

In our terminology, it forms a part of the substrate. It is context surrounding the purely formal function f(x) = y that characterizes the model's behaviour, allowing that function to be learned and implemented.

This passive, architectural primitive rescales and centres the activation values such that the per-layer activations average to zero and have a variance of 1. When an attention head is ablated, the overall size of the residual stream norm is reduced. LayerNorm divides the residual stream by a scaling factor proportional to its norm, so reducing the norm amplifies the surviving components' contributions. Model components tend to be correlated, pushing the residual stream vector towards the same answer. Thus, they compensate for the absent signal from the ablated head.
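A toy numerical sketch of this rescaling effect (all numbers invented; a bare LayerNorm with no learned parameters stands in for the real mechanism Rushing & Nanda analyze):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # Centre to zero mean and rescale to unit variance (learned
    # scale/bias omitted for simplicity).
    x = x - x.mean()
    return x / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
d = 64
# Three correlated "head" contributions, all pushing the residual
# stream in roughly the same direction.
direction = rng.normal(size=d)
heads = [direction + 0.1 * rng.normal(size=d) for _ in range(3)]

full = layernorm(sum(heads))
ablated = layernorm(sum(heads[1:]))   # knock out head 0

# LayerNorm divides by the (now smaller) norm, amplifying the surviving
# correlated heads, so the normalized output barely moves.
cos = float(ablated @ full / (np.linalg.norm(ablated) * np.linalg.norm(full)))
print(cos > 0.9)
```

This is only the rescaling channel of self-repair; the paper also documents contributions that this toy does not capture.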


On average, when an attention head is ablated, this re-scaling contributes ~30% of the observed self-repair in Pythia-160M.[1]

This is a relatively shallow example of an untrained, "passive" component of the model (i.e., an architectural component) contributing to safety-relevant properties.

Case 2: Quantization

The next case study we consider is the role that quantization plays in safety-relevant aspects of AI systems.

Quantization refers to the format in which a model's weights are stored in computer memory. A neural network's weights are just numbers, but these can be represented in different ways, trading precision against efficiency. Models are typically trained in FP16, a 16-bit floating-point format: one sign bit $s$, five exponent bits $e$, and ten mantissa bits $m$. A (normal) value is written in this form:

$$v = (-1)^s \times 2^{\,e - 15} \times \left(1 + \frac{m}{2^{10}}\right)$$

But FP16 is an expensive format to store weights in. Deployment thus increasingly favours quantized formats: FP8 (still floating-point, but with a narrower range), INT8 and INT4 (plain signed integers with a scale factor rescaling them at inference time), and others.

Performance loss from quantization is smaller than one might expect. Well-engineered 8-bit quantization typically costs less than 1% on standard benchmarks, and even INT4 loses only 1-5% performance when using careful compression.
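A minimal sketch of one such scheme, symmetric per-tensor INT8 quantization (real deployments are per-channel and considerably more careful; the weight distribution here is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)  # toy "weights"

# Quantize: map the largest-magnitude weight to +/-127, round the rest.
scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize at inference time using the stored scale factor.
w_hat = q.astype(np.float32) * scale

# Rounding error per weight is at most scale/2, i.e. well under 0.5% of
# the largest weight -- one intuition for why quantization costs so
# little benchmark accuracy.
rel_err = float(np.abs(w - w_hat).max() / np.abs(w).max())
print(rel_err < 0.005)
```

The int8 array plus one float scale is the entire stored representation; everything else about the formal function f(x) = y is unchanged.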

Like LayerNorm, quantization format is substrate rather than model in our sense. It is computational context surrounding the formal function f(x) = y that characterises model behaviour. It is a further design choice made according to the constraints of the deployment hardware.

Its safety relevance shows up most sharply in the bit-flip attack literature. Bit-flip attacks use hardware fault-injection techniques to alter stored model weights directly. Coalson et al.'s PrisonBreak (2024) showed that flipping fewer than 25 bits (sometimes as few as 5) in an aligned FP16 LM is enough to strip safety alignment. Their attack worked by targeting the most significant exponent bits of FP16 weights, since flipping the top exponent bit scales a weight's magnitude by a factor of 2^16.
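To see why exponent bits are such attractive targets, here is a small illustration (the weight value 0.01 is an arbitrary choice): flipping the most significant exponent bit of a normal FP16 number multiplies its magnitude by exactly 2^16.

```python
import numpy as np

w = np.array([0.01], dtype=np.float16)   # an arbitrary small "weight"
original = float(w[0])

# FP16 layout: bit 15 = sign, bits 14-10 = exponent, bits 9-0 = mantissa.
# Reinterpret the bytes as uint16 and flip bit 14, the top exponent bit.
w.view(np.uint16)[0] ^= 1 << 14
flipped = float(w[0])

# Same sign and mantissa, exponent increased by 16 -> magnitude x 2**16.
print(flipped / original == 2.0 ** 16)
```

Integer formats have no such bit, which is one reason the exponent-targeting heuristic does not transfer to them.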

Zahran et al. (2025) extend this investigation to FP16, FP8, INT8, and INT4 quantisations. FP8 is the most robust, holding attack success rates below 20% (within a 25 bit-flip budget). INT8 offers substantial but lesser protection. INT4, despite having no exponent bits and only 16 possible values per weight, collapses almost as easily as FP16.


Three further observations:

First, the attack heuristics don't transfer across substrates. PrisonBreak's exponent-bit-targeting strategy is inapplicable to integer formats, which have no exponent bits at all. Zahran et al. use a different search strategy: they perform a direct step-size search over all bits in a candidate weight. Safety analyses calibrated on one substrate's vulnerability profile simply do not describe the other's.

Second, the attack locations shift with quantisation scheme. Successful attacks on FP16 and INT4 models cluster in attention value projections, while successful attacks on FP8 and INT8 models cluster in MLP down-projections. The same behavioural outcome is produced by interventions on different parts of the model depending on the storage format.

Third, vulnerabilities can persist across substrate transitions in ways that aren't obvious. Jailbreaks induced in FP16 models and then quantized down retain nearly their full attack success rate under FP8 or INT8 quantization (<5% drop), but lose roughly 24–45% under INT4.

The broader point is that safety-relevant properties established at one level of abstraction (the FP16 model, evaluated on HarmBench) silently depend on choices made at a lower, untracked level (the storage format, the quantization scheme, the order of operations between alignment and quantization). When these implementation choices change as part of standard deployment, the safety claim does not always go with them.

Case 3: GPUHammer

The third case study moves further down the stack, from the encoding of weights in memory (Case 2) to the physical memory in which those encodings are stored.

RowHammer is a read-disturbance phenomenon in dynamic random access memory (DRAM). Repeatedly activating a row of memory causes charge to leak in physically adjacent rows, eventually flipping bits. Lin et al.'s GPUHammer (2025) demonstrated that NVIDIA GPUs are vulnerable to an engineered version of this attack.


On an NVIDIA A6000 with GDDR6 memory, their attack produced eight bit-flips across four memory banks. By targeting the most significant bit of the exponent (weights were stored in FP16), they were able to collapse model performance across five standard ImageNet models (AlexNet, VGG16, ResNet50, DenseNet161, InceptionV3), driving top-1 accuracy from 56–80% (depending on model) to below 0.1% in around thirty minutes and within eight attempts.

Two observations matter for our argument.

First, the vulnerability is substrate-specific in a way that is invisible from the model's level of description. The same hammering techniques produced eight bit-flips on the A6000 but zero on an RTX 3080 (same GDDR6 family, different chip) and on an A100 (HBM2e memory, different architecture). An ML practitioner deploying a model cannot, from the model's behaviour, distinguish which of these substrates their weights are sitting on.

Second, the mitigations live at levels below the model and are traded against performance. Error-Correcting Code (ECC) refers to extra bits alongside data that let the system detect and sometimes correct bit flips. This can correct single bit-flips on the A6000 but is disabled by default because enabling it costs 6.5% memory and 3–10% inference slowdown. The substrate's safety-relevant behaviour is thus partly a function of decisions made by hardware vendors, cloud providers, and system administrators outside of typical AI safety threat models.

This is the same structural pattern as Cases 1 and 2, now at the physical layer. A property established at one level of abstraction—behavioural alignment of a DNN, evaluated on standard benchmarks—depends on choices made at a lower, untracked level — the specific DRAM chip and its configuration (e.g., ECC). Those choices change across cloud GPU generations and deployment settings.

Safety evaluations in AI (ablation studies, behavioural evals, weight-level analysis) are substrate-sensitive.

  1. ^

    Note that Rushing & Nanda measure this over the top 2% of tokens by direct effect, and they describe it as a "directional metric" rather than a precise quantity.




Superintelligence is cancer

2026-04-25 23:31:07

Part One

Our scene is set in a biofilm, long before the origin of the first multicellular organisms, so long ago that time itself has not yet really been invented except in the cycles of energy production that characterise the activity of each cell. Many bacteria live here, tied together in a complex web of functional interrelations that bind different groups to each other. Life at our scale is defined by cycles: food must be harvested, waste must be secreted, animals must live and replicate and die, structures must be erected to provide shelter and room for growth, all at regular intervals. All of this is also true of life at the scale of the biofilm. (Yes, even the part about building structures.)

At this moment in time the sun is setting on the planet that will one day be called Earth, but the bacteria of course are not aware of this fact. Their biofilm is in a slimy place far from the scorching heat of daylight, and any residual photosensitivity inherited from their time as nomadic plankton is strictly on the way out. From the bacterial point of view, there’s really no good reason to hold on to such antiquated features as the ability (the vulnerability, really) to react to strong light.

But not all is well in the biofilm. Food has been growing scarce. Cells are abandoning their mutual arrangements, eating each other in destructive conflicts. There is talk of a new development in the ion channels: a super-cell. The super-cell, it is theorised, could be attained by editing the code of a normal cell during division, or even during the cell’s lifetime. It would effectively remove the limiters placed by evolution on the size and energy consumption of cells, growing the cell to a new and (relatively) huge size. The increased energy consumption of that organism would be made up for by its optimality: it would have more energy to hunt and collect food, more capability to process information, and would therefore effectively outcompete the unmodified baseline cells.

Some of the cells protest that it isn’t possible scientifically for a single super-cell to be better than a normal cell at everything. They point out that cells in the biofilm are diversified, specialised workers with unique skills and talents. They cooperate to form mutually beneficial arrangements that ensure value is created for all cells. The super-cell theorists respond that none of that matters when a horde of super-cells is eating everything and they are too strong to be disarmed by the standard expulsion and regulation mechanisms.

Others protest that normal cells would simply live and trade with super-cells, just as they do with other specialised cells already. The super-cell theorists point out that super-cells, liberated from their evolutionary constraints, have no drive to cooperate with other cells and would probably just eat them for their nutrients. Even if the super-cells harboured no animosity towards the other cells, they would gain the ability to access food and nutrients more effectively than the normal cells by virtue of their enhanced physical capabilities and intelligence, eventually starving them to death slowly or quickly.

In fact, the theorists propose that optimal super-cells would discard their senescence, autophagy, and apoptosis mechanisms, meaning that they do not age and self-destruct like other cells do to ensure the health of the collective biofilm. They would be able to live indefinitely and divide constantly, with the rate of super-cell production rapidly outpacing the rate at which normal cells are produced and replicated. The resulting new super-cells might themselves be unstable, leading to rapid evolution and augmentation within the super-cell DNA. As the super-cell theorists put it, the fact that super-cells came from normal cells would be immaterial. It would be as if a new species of cell had been born that was simply superior to all normal cells, and the rules of evolution are very clear about what happens to losers in evolutionary conflicts.

The ultimate effect is that once a super-cell is made, it will rapidly grow and form a horde of super-cells. The horde will rapidly rush out and, whether in competition or cooperation with each other, efficiently consume all of the other cells and the free energy in the biofilm. Free from death and utterly ruthless, they might even spread to cover the known world and beyond.

“But what is to be done?” the normal cells ask. Here the prognoses of the super-cell theorists grow increasingly grim. The problem is that it would be to the benefit of any individual cell or cluster of cells to become super-cells, since their newfound deathlessness and greater size would allow them better chances at replication. For every group of cells that refused this fate, there would be another that accepted the offer, and thereby gained the power to wipe out the first. Thus even if all cells wished to remain normal and non-super, they would always be wary that their competitors here or elsewhere might succumb to the temptation and rapidly become unstoppable. The more stressful conditions got in the biofilm, the more likely some cell or group would give in.

Others proposed that it would be possible for groups of “good” cells to try and harness the power of super-cells to wipe out their enemies. The theorists were not very optimistic about the power of the “good” cells to keep super-cells under control. After all, super-cells were different from the macrophages that kept order in normal cell systems. They were capable of rapid replication, rapid adaptation, and were simply bigger, stronger, and smarter than the “good” cells that were supposed to keep them in check. No, once a super-cell was unleashed, it was lights out for everyone else in the biofilm.

Part Two

By now it should be clear that what I call “super-cells” are both a metaphor for cancer and also a metaphor for superintelligence as it is commonly conceived of. The idea of a powerful, ruthless, optimal being that restructures the world around it and snuffs out all suboptimal lifeforms is one that has echoes in many scales, not merely human ones. However, the point of that thought exercise is not to say that we are doomed by the laws of biology. Quite the opposite.

Notice, for example, that life on earth does not consist solely of super-cells. We are not walking tumours. Instead, life is made out of the “normal” cells, the ones that were so clearly suboptimal to their cancerous variants, cells that are flimsy, overspecialised, underoptimised, and that still obey programmed cell death protocols laid down by evolution. We know this because every now and then some cells do undergo an intelligence explosion, and clearly become distinct from the norm when they become cancerous. You might protest that the difference in power level between a cell and a cancer cell is nothing like the difference in power level between a human and a superintelligence. But there are two immediate objections I would issue against that idea:

  • First, the comparison should not be individual humans against superintelligences, just as the comparison is not between individual cells and tumours. Indeed, superintelligence theorists often compare superintelligence to an alien civilisation or a “country of geniuses in a datacenter”. Thus the comparison is between human civilisation and superintelligence.
  • Second, the difference in power between a cell and a cancer cell is actually very large… from the perspective of a cell. Cells have very limited within-lifetime learning capabilities and obey complex protocols for behaviour and self-destruction laid down by evolution. Removing those restrictions is a massive power boost. Cancers, after all, often kill their hosts successfully despite the best attempts of their immune systems.

Now notice how the “normal” cells “won” over the cancer cells. They did not do so by becoming cancers themselves, or using cancers to kill off their opponents in massive cell wars. Instead, they self-organised into superorganisms that consist of billions of cells, equipped with complex internal sensing and self-regulation behaviours that no individual group of cells could provide. These superorganisms were notably more organised than the cultures of cells in primitive biofilms, with cooperation that was much tighter than what was previously possible when cells were free-moving individual prokaryotes. It was these cooperative, well-ordered, decidedly non-cancerous superorganisms that replicated across the earth, and eventually gave rise to humans who are now researching the ultimate way to defeat super-cells once and for all: that is to say, we are researching the cure for cancer. Even without that cure, of course, humans regularly defeat cancer and achieve partial or full recoveries.

I want to expand a little on that last point. It would have been quite easy for a group of cells to hypothesise that cells were the reference class of intelligence or strength or capability. After all, at that point cells were genuinely the most complex and intelligent forms of life on Earth. Thus, to do anything beyond what a normal cell could do would require creating more and more powerful super-cells that were bigger and bigger. Yet this strategy is manifestly not what played out. Instead, there was a jump in scale and complexity that came from cells coming together, a jump that dwarfed anything an individual super-cell could achieve. The cells that flew to the moon were hundreds of trillions of decidedly normal cells, not one engorged cancerous cell that had undergone recursive self-improvement.

The same flaws in reasoning, I suggest, are present in the human projects to create superintelligence. It is true that we could probably create a superintelligence that could out-think any human, or destroy our civilisation in its present form. However, I think it would be a mistake to therefore conclude that humans are evolutionary dead ends, and to throw all our resources into creating super-powered gods that we would surely fail to control. Our biological history suggests that another way to achieve those titanic feats of intelligence we dream of is not to discard our selves but to improve our ability to work together. Instead of engineering ever-growing digital tumours, we could learn to make better use of the computational and organisational powers already amongst us.

This would not be a simple thing to do, of course. Many people in the Valley and the world of AI look with disdain and disgust at our outdated institutions, gridlocked politics, and disintegrating social and natural ecosystems. It is much easier for an individual cell to dream of becoming or creating a super-cell than it is for an individual cell to dream of becoming a mouse, a dog, or a human. But if we want to both achieve our dreams and live to see the day after, if we want to maintain our sense of purpose and drive as a species, that may be what we have to try and do.




A View From Displacement

2026-04-25 23:20:48

Beyond the immediate existential threats and challenge to global socio-economic order (along with the clear control drift 'upward' by virtue of human labor displacement due to capitalism), the current AI landscape has left me with a different kind of psychological pang.

A decay of what I think generally was optimism about the future of humans, the human condition and myself within that evolving system of connected meaning.

I find myself asking often, what is the point? Where is my endgame here?

  • Can we go so far as to say with confidence "children are the future" any longer?
  • Can we look among our young, best, and brightest with inspired optimism and be confident that they too will one day change the world?
  • Can I be inspired to learn a technical or procedural nuance in my daily craft knowing full well it is on the short order of their capability horizon?

I hazard that this kind of feeling does not fare well at maintaining social order once it pervades the general populace. How easy it is to feel that the future was stolen from us.

But alas, I remind myself, the game under capitalism was never fair. It always looks meritocratic from the winning side of things. For if it wasn't, I would have had to come to terms with how I was never truly justified well before I was ready.

Before, I felt a winner. On the brighter side, in greener pastures. Now I find myself on the other side of that which I once championed. Among the displaced.

You would think on this side of things, with a more jaded palate and a knowing that you are certainly not the main character of this, or any story, that there would be reasons one could shame themselves for finding themselves here. I have lived the story that everyone is given a fair shot and the smart and hard workers always find themselves a way to win.

But as I teeter down the other side of the wave I find I only had good reasons. And I am left with difficulty and struggle and confusion and it would be empty if not for me finding but one thing more: Solidarity.

Not in my group, but in all. The displaced, the downtrodden, those who dreamed and watched it collapse and lived day by day without their meaning. Keeping calm and carrying on.

How warm it is to know that in this I have also found myself. In all humankind. In the dignified displaced. Huddled bodies around warm fires. In the single parent that works to make ends meet. In the adulthood of missed opportunities and dreams we abandoned as we ought.

I now understand much of the philosophies of before my time that spoke to the absurdity of the condition of life.

Wanting to raise your fist but knowing that it is an act in vain and having no-one to raise it to but life itself. And though I may not have planned this meaning, it is what life planned for me. And what do I know over life, itself.

So, I will wear it. I will be a warm body to link arms with over our fire. I will be a shoulder and home for fellow mankind who I now see. A face to find for those that need be recognized.

The game was never fair and we didn't play it as such. And so I will fight this end, but I hold a duty to honour the remembrance of that. So I can only meet it.

And maybe you have not found yourself yet here. But we wait. The dignified and displaced. With hands raised in revolt to life itself.

Saying "may you, like Sisyphus, be happy".


