
Mean field sequence: an introduction

2026-04-04 15:30:18

This is the first post in a planned series about mean field theory by Dmitry and Lauren (this post was generated by Dmitry with lots of input from Lauren, and was split into two parts, the second of which is written jointly). These posts are a combination of an explainer and some original research and experiments.

The goal of these posts is to explain an approach to understanding and interpreting model internals which we informally denote "mean field theory" or MFT. In the literature, the closest matching term is "adaptive mean field theory". We will use the term loosely to denote a rich emerging literature that applies many-body thermodynamic methods to neural net interpretability. It includes work on both Bayesian learning and dynamics (SGD), and work in wider "NNFT" (neural net field theory) contexts. Dmitry's recent post on learning sparse denoising also heuristically fits into this picture (or more precisely, a small extension of it).

Our team at Principles of Intelligence (formerly PIBBSS) believes that this point of view on interpretability remains highly neglected: it should be better understood, and its ideas should be used much more in interpretability thinking and tools.

We hope to formulate this theory in a more user-friendly form that can be absorbed and used by interpretability researchers. This particular post is closely related to the paper "Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity". The experiments are new.

What do we mean by mean field theory?

Mean field theory is a vague term with many meanings, but for the first few posts at least we will focus on adaptive mean field theory (see for example this paper, written with a physicist audience in mind). It is a theory of infinite-width systems that is different from the more classical (and, as I'll explain below, less expressive) neural tangent kernel formalism and related Gaussian process contexts. Ultimately it is a theory of neurons (which are treated somewhat like particles in a gas). While every single neuron in the theory is a relatively simple object, the neurons in a mean field picture allow for an emergent large-scale behavior (sometimes identified with "features") that permits us to see complex interactions and circuits in what is a priori a "single-neuron theory". These cryptic phrases will hopefully become clearer as this post (and more generally this series) progresses.

Why MFT

We ultimately want to understand the internals of neural nets to a degree that lets us robustly (and ideally, in some sense "safely") interpret why a neural net makes a particular decision. One might say that this implies we should only care about theories that apply directly to real models: finite width, large depth, etc. While this is fair, any interpretation must ultimately rely on some idealization. When we say "we have interpreted this mechanism", we mean that there is some platonic gadget or idealized model with a mechanism "that we understand", and the real model's behavior is explained well by this platonic idealization. Thus making progress on interpretability requires accumulating an encyclopedia (or recipe book) of idealizations and simplified models. The famous SAE methodology is based on trying to fit real neural nets into an idealization inherited from compressed sensing (a field of applied math).

As we will explain below, if we never had Neel Nanda's interpretation of the modular addition algorithm, we would get it "for free" by applying a mean field analysis to the related infinite-width model. As it happens, the two use the same Platonic idealization[1]. Thus at least one view on the use of theory is to see it as a source of useful models that can then be applied to more realistic settings (with suitable modification, and, at least until a "standard model" theory of interpretability exists, necessarily incompletely). Useful theories should be simple enough to analyse mathematically (maybe with some simplifications and assumptions) and rich enough to illuminate new structure. We think that mean field theory (and its relatives) is well-positioned to take such a role.

Brief FAQ section

"Frequently asked questions about MFT" is a big topic that can be its own post. But before diving into a more technical introduction, we should address a few standard questions which keep cropping up, especially about comparisons between MFT and other better-known infinite-width limits.

  1. Doesn't infinite width mean that we're in the NTK (or more generally a Gaussian process) regime? The first analyses of neural nets at infinite width were in the so-called NTK regime, where in particular the model "freezes" to its prior/initialization at all but the last layer (which performs linear regression). This is a remarkably deep picture that is, for example, sufficient to learn MNIST. But approaches in this family exhibit extremely different behaviors from realistic nets (in particular the freezing of early neurons), and they generalize much worse on problems that cannot be solved by some combination of clustering and linear regression (of which MNIST is an example). For example, these methods learn only memorizing circuits in modular addition (at least in known regimes) and, worse, they are known to require exponential training data and complexity for algorithms that are well-known to be learnable by SGD (see for example the leap complexity paper) – this means that these techniques are fundamentally incompatible with these settings (more generally, so-called "compositional" models – ones that have multiple serial steps, which models tend to need depth for – have similar failures in this regime). This can be partially improved by including so-called "correction terms", but these only work when the Gaussian process has good performance by itself, and fail to ameliorate the exponential complexity issues. Note that the Gaussian process picture is useful as a heuristic baseline. In particular it makes some predictions on scaling exponents that have some experimental agreement (and is related to the muP formalism).

    It turns out that the lack of expressivity of the Gaussian limit is due not to infinite width itself but to a certain choice of how to take the infinite limit (and in particular how to scale weight regularization terms in the loss). Different limits and scalings give significantly more expressive behaviors, as we shall see, and we use MFT as a catch-all term for these. (These different limits are also harder in general, at least in terms of exact mathematical analysis: the Gaussian process limit somewhat compensates for its lack of expressivity by having much easier math.)
  2. Isn't mean field theory only a Bayesian learning theory and doesn't that make it unrealistic? In physics contexts (like MFT, Gaussian Process learning, etc.) Bayesian learning is often theoretically easier to deal with, and we'll explain Bayesian learning predictions here (validated by tempering experiments). However a version of mean field for SGD learning exists and is called "Dynamical Mean Field Theory" (DMFT) (it extends the NTK in Gaussian process contexts). Probably more relevantly, Bayesian learning experiments frequently find similar structures to gradient-based methods (and are often easier to analyse). This is particularly well demonstrated in empirical results by the Timaeus group.
  3. Is mean field theory a theory of shallow models? Most existing papers on mean field theory work in the context of 2-layer neural nets (i.e. 2 linear layers, one nonlinear layer). However there is a fully general, and experimentally robust extension of the theory to a larger number of layers (see for example this lecture series), and we will look at such models here. In fact mean field theory can model mechanisms of arbitrary depth - but it works best for shallower models (or for shallow mechanisms in deep models), and would likely be less useful for modeling strongly depth-dependent phenomena.
  4. What is a success of mean field theory I should know about? Glad you asked! Most people know about the modular addition task, which was first explained mechanistically in Neel Nanda et al.'s grokking paper. The interpretation is heuristic: it shows that the model exhibits signatures of using a nice and unexpected trigonometric trick. It also interpolates between generalization and memorization in a sudden shift reminiscent of a phase transition. A more ambitious task (one considered too hard to tackle in the interpretability community) would be to understand exactly what the model learns on a neuron-by-neuron basis in any setting that exhibits generalization/grokking. Since models have inherent randomness (from initialization, and sometimes from SGD), the task is inherently statistical: explain the probability distribution on the weights of learned models, at least to a suitable level of precision. This was generally believed to be quite hard. Thus it comes as a surprise to practitioners of interpretability that there is in fact a context where this has been done.

    In the paper "Grokking as a First-order Phase Transition in Two Layer Networks", Rubin, Seroussi and Ringel constructed a complete explanation (experimentally verified to extremely high precision) for the modular addition network in the Bayesian learning setting (there are some other differences from Neel Nanda's approach, most notably the choice of loss function, but variants of the approach extend to these as well). The distribution is first understood at infinite width, then shown to apply at realistic (but large) width in the appropriate regime. When the adaptive mean field theory approach is applied to this task, Fourier modes and the trigonometric mechanism fall out as a natural output of the theory – moreover, they are fully explained at the level of a statistical distribution (i.e. we have a complete model of "exactly what each neuron does", to an appropriate degree of precision, understood in a statistical physics sense). Of particular interest, the model explains a grokking-like phase transition between memorization (equivalently, Gaussian process-like behaviour) and generalization (inherently mean field) and predicts the data fraction at which it happens (this is a Bayesian learning analog of predicting the distribution of when grokking happens in SGD-trained neural nets). The phenomenon is a genuine phase transition in the thermodynamic sense.
  5. Are real models in the mean field regime or the Gaussian process regime, or something else? This is an interesting question, whose answer is "this question doesn't make sense". The distinction between regimes applies to infinite-width nets, i.e. to a totally non-standard setting. One can prove rigorous results with the gist that if the width is sufficiently enormous compared to the training data (with some astronomically large bound), the model is guaranteed to learn in one of these two regimes. However, no real models are that enormous. Instead, some phenomena and some mechanisms can be seen (experimentally or theoretically) to extend from infinite nets to nets of finite width. Sometimes these look more like mean field phenomena, sometimes they look like Gaussian process phenomena. For example, in some sense MNIST is "GP-like" (GP stands for Gaussian process). Circuits in modular addition are, as it turns out, entirely explained by the MFT limit, as we've explained above.
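The scaling distinction in question 1 above can already be seen at initialization, in how the network output is scaled with width. A minimal numpy sketch (the width, nonlinearity, and batch are arbitrary choices for illustration, not taken from any paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000                      # hidden width: arbitrary, just "large"
X = rng.normal(size=(256, 2))   # a batch of toy inputs
W = rng.normal(size=(N, 2))     # first-layer weights at initialization
a = rng.normal(size=N)          # second-layer weights at initialization
H = np.tanh(X @ W.T)            # hidden activations, shape (256, N)

f_ntk = H @ a / np.sqrt(N)      # NTK / Gaussian-process scaling
f_mf = H @ a / N                # mean-field scaling

# Under NTK scaling the output is order one at initialization (a
# Gaussian process over inputs); under mean-field scaling it is
# O(1/sqrt(N)), i.e. it vanishes at large width.
print(np.std(f_ntk), np.std(f_mf))
```

Since the mean-field output vanishes at initialization, producing an order-one function requires the weight distribution itself to develop structure during learning; this movement of the neurons is exactly what the NTK limit freezes out and what MFT describes.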

Introduction to the theory

The background (and the foreground)

In physics, one often looks at systems with a large, stable background. A planet vs. a sun, an electron vs. a proton, a weakly interacting observer vs. a large system being observed. In these settings the "background" is the large system and the "foreground" or "test system" is the small system being studied. In these cases the background system may be fixed, or it may be undergoing some motion (like the sun moving around the galaxy's center), but the important idealization is that it does not react to the observer/ test system. In fact, the earth is applying a gravitational pull to the sun (and famously in quantum mechanics, observations always impact a system at a quantum level). But these "reverse" effects are small, so to a good approximation we can treat the sun as doing its own "stable" thing while earth is undergoing physics that depend strongly on the sun.

Self-consistency

While typically the large "background" is a cleanly separate system from the small test system of the observer, it is sometimes extremely useful to treat the test system as a tiny piece of the background. So: the large background system may be a cup of water and the small test system may be a tiny bit of water at some location. Here while technically the full cup includes the tiny "test" bit, the large-scale behaviors (waves etc.) in the water don't really care to relevant precision if the test bit is changed or removed (at least if it's tiny enough). But the tiny bit of water definitely cares about the large-scale behaviors (waves, vortices or flows, etc.), to the extent that bits of water care about things.

Similarly (and in a closely related way), "the economy" is a giant system that includes your neighborhood bakery. The bakery can be viewed as a small "test system": it is affected by the economy. If property prices go up or the economy tanks, it might close. But the economy is not (at least to leading order) affected by this bakery. It is perhaps affected by the union of all bakeries in the world, but if this particular bakery closes due to some random phenomenon (e.g. the lead baker retires), this won't massively impact the economy.

This point of view is remarkably useful, because it introduces a notion of "self-consistency".

Self-consistency when applied in this context comes from the following pair of intuitions:

  1. the behavior of each small component is (statistically) determined by the background
  2. the behavior of the background is the sum of its small components.

If both of these assumptions are true, then these two observations (when turned into equations) are usually enough to fully pin down the system. Indeed, you have two functional relationships[2]:

foreground = f(background),    background = b(foreground).

Putting these together, we have the combined "self-consistency" equation:

background = b(f(background)),

which means that the background field satisfies a fixed point equation for the composed function b ∘ f. It so happens that in many cases of interest, it has a unique solution. A classic example of a self-consistency equation is the supply-demand curve equilibrium. Here the background is a single number (the price of a good) and the test system is the willingness of a single consumer to buy, or of a single producer to sell, as a function of price (the actual "tiny components" consisting of individual consumers/producers are abstracted out, and the curve represents the average incentive).
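As a toy sketch, a self-consistency equation of this kind can be solved by iterating the composed map to its fixed point. The functional forms below are invented for illustration and stand in for the supply-demand example:

```python
def f(p):
    # "foreground" map: quantity demanded by a typical consumer at
    # price p (demand falls as the price rises; invented form)
    return 10.0 / (1.0 + p)

def b(q):
    # "background" map: the market price at which suppliers are
    # willing to provide quantity q (invented form)
    return 0.5 * q

# Solve the self-consistency equation p = b(f(p)) by fixed-point
# iteration of the composed map b ∘ f.
p = 1.0
for _ in range(100):
    p = b(f(p))
```

Here the iteration converges because the composed map is a contraction near the fixed point; in richer mean-field settings one iterates an analogous (statistical) map over distributions rather than over a single number.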

Of the above, assumption 1 is the most problematic. Thinking of each component as being determined by some "large-scale" stable system needs to be interpreted appropriately (in particular the relationship is often statistical: for example, the number of bakeries in a given neighborhood fluctuates due to people retiring, moving, etc., even if "the economy" is held constant; similarly, every bit of the sun reacts to magnetic/gravitational fields from other bits, but in a statistical or thermodynamic sense). Sometimes local or so-called "emergent" effects break this directional relationship (and many interesting thermodynamic systems, such as the 2-dimensional Ising model, are interesting precisely in such contexts). But surprisingly often (at least with an appropriate formalism) the approximation of the foreground as fully determined by the background (in a statistical sense) is robust.

For example, if we are modeling the sun, viewing the "background system" too coarsely (as just the mass + electromagnetic field + temperature, say, of the entire sun) is insufficient. But instead we can view the "background system" as a giant union of many local systems, maybe comprising a few-meter chunks. These are still "large" in the sense of being much larger than an atom (or a microscopic chunk), but studying their behavior (in an appropriate abstraction) offers sufficient resolution to model the sun extremely well. Similarly, we can't apply a single supply-demand curve to the entire economy (bread costs different amounts in different places). But in appropriate contexts (for fungible products like oil, and on a "local economy" level where the economy is roughly uniform but not dominated by a single station, for example) self-consistency is a pretty good model.

In many settings, the question of how well "assumption 1" above holds is related to a notion of connectedness. In the sun's magnetic plasma, the magnetic field experienced by a particle is accumulated over billions and billions of nearby particles – so the graph of interactions is extremely connected. In an oil economy, each consumer can typically choose between dozens of nearby stations reachable by car. Other settings, however (like the Ising model, or markets for rare and hard-to-transport goods), cannot be modeled as well by pure self-consistency.

In physics, systems that are well-modeled by a self-consistency equation (coupled background and foreground systems) are generally called mean-field settings. A big triumph of statistical physics is to make situations with local/emergent phenomena "behave as well as" mean field theories – renormalization is a fundamental tool here, and most textbooks on renormalization from a statistical-physics view tend to start with a discussion of mean-field methods. But settings that are directly mean-field (for example due to being highly connected or high-dimensional) are particularly nice and easy to study.

Neural nets and mean field

Neural nets are physical systems. This is a vacuous statement – anything that has statistics can be studied using a physics toolkit (and in many ways statistical physics is just statistics with different terms). Indeed, real neural nets are immensely complex, and if there is some sense in which they can be locally decomposed into background-foreground consistencies, these must themselves be immensely complex and likely dependent on sophisticated tooling to identify (this is one of the reasons why we are running an agenda on renormalization).

But it turns out that in some settings and architectures neural nets are extremely well-modeled by systems with high connectivity – and the reason is, naively enough, precisely the fact that they are highly connected (often fully-connected) on a neuron level (note that architectures that aren't "fully-connected" – e.g. CNNs – sometimes still have properties that make them "highly connected" from a physical point of view).

The mean-field background and foreground for a neural net

In neural net MFT the foreground (or "system"/ "observer") abstraction is a neuron. This is typically a coordinate index of some layer.

The important "background" thing that each neuron "carries" is what is called an activation function, which we will denote f. This is a function on data: given any input x, partially running the model on x returns a vector of activations, and f_i(x) is its i'th component. This function is the thing that neuron i contributes to the "background field" of the neural net.[3]
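Concretely, in a hypothetical 2-layer numpy model (the weights, nonlinearity, and the name `neuron_activation` are all illustrative, not from the post), a neuron's activation function is obtained by running the model up to the neuron's layer and reading off one coordinate:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 2))  # first-layer weights of a toy 2-layer net

def neuron_activation(i, x):
    # f_i(x): the i'th coordinate of the layer-1 activation vector,
    # viewed as a function on inputs x of shape (batch, 2)
    return np.tanh(x @ W[i])

x = rng.normal(size=(5, 2))
values = neuron_activation(0, x)  # one neuron's "field" on this batch
```

Each neuron thus carries a whole function of the input, and it is these functions, aggregated over many neurons, that make up the background.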

Now if there are lots of neurons, each neuron's activation function reacts to a background generated by the other neurons: removing the neuron in this limit doesn't change the loss by much, so the background determines each neuron's behavior as a statistical distribution. Conversely, the background itself is composed of individual "foreground" neurons. The loop:

background → neuron distribution → background

must close, i.e. be self-consistent. Making sense of this loop is the key content of mean field theory of neural nets.

In later installments we'll explain a bit more about the loop and show some examples of it working (or not). You can also see the original linked paper about the Curse of Detail for a more physics-forward view of this.

Experimental setting and pretty picture

We'll close with a toy example of "self-consistency", which is visually satisfying.

In this setting we look at a 2-layer model that takes a two-dimensional input and is trained on a fixed target function at large width and on infinite data. The activation function is a bounded sigmoid-like function (the relu version of tanh). Each neuron at layer 1 is a function that depends only on a 2-dimensional row of the weight matrix, so the associated "test" field or particle can be plotted on a 2-dimensional graph. When we plot all of these together we get a good picture of the distribution of single-neuron functions that combine to form the background system:

[image]

The neurons above were trained jointly in a way that would allow them to interact.

It has a nice clover-leaf-like structure (it will reappear later when we look at continuous xors – a multi-layer setting where mean field performs compositional computation; already in this simple setting, the fact that the cloud of neurons is a "shaped" distribution rather than a flat Gaussian puts us solidly outside the Gaussian process regime). Now we can empirically measure how a single randomly initialized "foreground" neuron would react to the background generated by this model. To do this, we train 2048 iid single-neuron models on the background resulting from the fully trained model.[4] When we do this and combine the resulting 2048 neurons into a new model, we see that it indeed looks exactly the same as the background. When we compute its associated function, we get very similar loss.

[image]

Each neuron in this picture was trained in a fully iid way, without interacting with any other neuron, simply by "reacting to the background", i.e. learning the task in combination with the "blue" background above.
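A minimal numpy sketch of this two-phase procedure, with invented width, target function, learning rate, and training loop (the post's actual settings differ, and it trains 2048 such foreground neurons rather than one): first train a small 2-layer net jointly to produce the background, then train a single fresh neuron alone against the frozen background by fitting the residual.

```python
import numpy as np

rng = np.random.default_rng(0)

def act(z):
    # bounded, relu-like nonlinearity (a stand-in for the post's
    # "relu version of tanh")
    return np.maximum(0.0, np.tanh(z))

def dact(z):
    return np.where(z > 0.0, 1.0 - np.tanh(z) ** 2, 0.0)

# Toy data and target (both invented for illustration).
N, width, lr, steps = 512, 64, 0.05, 3000
X = rng.normal(size=(N, 2))
y = np.cos(X[:, 0]) * np.cos(X[:, 1])

def mse(pred):
    return float(np.mean((pred - y) ** 2))

# Phase 1: train the 2-layer net jointly ("the background").
W = rng.normal(size=(width, 2))
a = rng.normal(size=width) / width
loss_init = mse(act(X @ W.T) @ a)
for _ in range(steps):
    Z = X @ W.T
    H = act(Z)
    err = H @ a - y
    a -= lr * (H.T @ err) / N
    W -= lr * ((err[:, None] * a[None, :] * dact(Z)).T @ X) / N
background = act(X @ W.T) @ a

# Phase 2: one fresh "foreground" neuron reacts to the frozen
# background, i.e. it is trained on the residual y - background.
residual = y - background
w, c = rng.normal(size=2), 0.0
for _ in range(steps):
    z = X @ w
    h = act(z)
    e = c * h - residual
    c -= lr * float(h @ e) / N
    w -= lr * c * ((e * dact(z)) @ X) / N
combined = background + c * act(X @ w)
```

In the post's experiment many such independently trained neurons are combined into a new model, and the resulting cloud of weight vectors reproduces the background distribution; the sketch above shows only the basic training loop for a single foreground neuron.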

Note that this isn't a property that comes "for free". If we were to use the wrong background (for example the more Gaussian process-like model here), then samples of the foreground would fail to align to the background.

[image]

Blue is background, orange is foreground (each orange neuron trained independently in reaction to background).

The case of 2-layer networks is special: neuron functions are particularly simple to characterize, and the mean field has better properties (it's not "coupled"). But we'll see that deeper nets can still be analyzed using this language, and even using empirical methods we can get cleaner pictures of how they learn and process representations.

In the next post, we will explain the physics behind these experiments and the experimental details of the models (github repo coming soon).


  1. ^

    Technically they differ on whether they use the "pizza" vs. "clock" mechanisms, but the two idealizations are related, and both the mean field and the realistic setting can be modified to make use of either.

  2. ^

    Below, f and b should generally be understood as "statistical" functions: job choice is, perhaps, a probabilistic function depending on the economy, which includes both demand/ markets but also supply/ people's interests; conversely "the economy" is the average of production over the distribution of jobs.

  3. ^

    Technicalities. Depending on the situation, the activation function can be viewed either as a function on a finite training set or on an infinite "set of all possible inputs", usually a large Euclidean space (example: an MNIST input is a vector of pixel values). Unless we're working with finite training data, this is a priori an infinite-dimensional gadget; and worse, the thing that is actually summed over neurons – the analog of the "market" or "background field" – is nonlinear in these objects[5]. There is also a subtlety here about SGD vs. Bayesian learning which I won't get into. But in mean-field settings that admit generalization (or for a finite number of inputs), this background is effectively dominated by a small set of "relevant" directions.

  4. ^

    Technical note: each single-neuron model is trained on the difference between the target and the output of the trained model.

  5. ^

    In fact it is quadratic: the thing that sums over neurons is the "external square" of the neuron function, which is a function of a pair of inputs. Knowing this sum fully determines the dynamics up to rotational symmetry, even for a finite-width model (it's often called the "data kernel", but it is used very differently from the Gaussian process kernels, which depend on an infinite-width assumption and lose a lot of information in finite-width and mean-field contexts).




Democracy Dies With The Rifleman

2026-04-04 14:39:58

Political power grows out of the barrel of a gun -- Mao Zedong

Halfway through recorded history, Athens became the first state we're sure was a democracy, and an inspiration to many later ones. Probably some existed earlier, and certainly some entities smaller than states were democratic, likely long before recorded history began.

The next tenth of history saw the rise of the Roman Republic, which mixed democracy and aristocracy together to form a functional hybrid, and then it transitioned to the Roman Empire, which shifted the mix substantially towards aristocracy. For the next three tenths of recorded history, democracies were at best local governments, minor regional powers, or components of larger, autocratic states.

Few, if any, of these societies would count as "democracies" according to modern watchdog organizations. Only about 10-20% of the residents of Athens were citizens; the rest possessed no real political power. In later eras, residents of towns or cities might vote for their urban officials, but the urbanization rates were also around 10-20%, so the vast majority lived in the non-democratic countryside.

Then for the last tenth of history, democracies rose again to dominate the world stage. One standard story for this has to do with military technology. The Roman Republic expanded because it had the dominant military technology of its time; this may have been in part because of its political system. But eventually the heavily trained armored horseman became the dominant military technology, and was more easily provided by autocracies than democracies. Then widespread use of gunpowder weapons swung the balance back towards mass manpower; the knight in his castle could no longer reliably put down a peasant revolt, or hold back Napoleon and his levée en masse.

Another standard story has to do with increased state resources. Democracies generally support higher tax rates than autocracies do; while this is primarily to support social services, some amount of this is that people are willing to pay more for things that they think 'they' own (rather than their distant overlords).

A third standard story has to do with ease of turnover. Democracies generally don't have to fight civil wars or succession conflicts, because whenever such a movement would have a chance of a military victory, it also has a chance of a bloodless victory. This leads to peaceful turnovers or governments following the preferences of voters enough to not let resentments build up to the point that they boil over.

The major wars of the last two hundred years have not been limited engagements driven by hereditary elites; they have primarily been total struggles between ideologies and peoples, which only managed to become major because they could motivate significant efforts on both sides.

What does the next tenth of history[1] look like? One might think that the invention of larger and more sophisticated weapons means that we swing back towards the knight and aristocracy, but the evidence of the most recent wars suggests otherwise. Kipling's poem Arithmetic on the Frontier, on the costs of pitting Imperial troops against regional resistance, rings nearly as true about America's various wars and special operations in the Middle East. Between great powers, the weapons have become so expensive that only systems which are widely believed in by their inhabitants can afford to supply a competitive number of them.

It's not obvious to me that this continues to be true. I think next-generation military capabilities primarily have to do with 1) operational knowledge and 2) mass-produced smart weaponry. The war in Ukraine shows how conflicts between drones and riflemen go; the war in Iran shows how conflicts between the informed and the uninformed go. States may find their ability to produce weaponry becomes detached from their popularity; they may find robot soldiers are willing to follow orders that human soldiers would balk at; they may find that it is relatively cheap to identify and destroy dissenters.[2]

That is, we may be moving into an era where mass protests are relatively easy to dismiss or ignore, while individual hackers or saboteurs are still able to disrupt large systems. Taxes from a broad labor base may become less relevant than control over automated infrastructure. What political problems will the new sources of power have, and what systems will help them resolve those problems?

  1. ^

    I don't expect us to have 500 more sidereal years of history left, but I do think we might manage to cram roughly that much subjective history in before the Singularity / as it takes off.

  2. ^

    An old strategy is to recruit your police / imperial enforcers from a different ethnicity than the people that they need to defend against, so there's some baseline level of resentment that will allow brutality which will cow them into submission. Autonomous weapons allow this at scale, and for secret police that are difficult to bribe or corrupt, and widespread surveillance allows for secret police that are always watching and noticing subtle connections.




Am I the baddie?

2026-04-04 14:00:37

I am a software engineer. I work for a company that makes software for road construction. Monday last week we were under a bad crunch and we were told to start using agentic workflows. We had like 50 tickets to close by the following Tuesday. I've been experimenting with AI development for years now, but this was different. I had access to Opus/Sonnet 4.6, and GPT5.4—the latest models.

Suddenly, they understood. I could talk about abstract concepts and analogies, and it got them. On the first day I was soon working through tickets in hours that would have taken me days. But we still had a ton of work and not enough time. I was still bound to a single thread of work at a time. So like any problem, I hacked around it. I started with a worktree, which basically creates a whole other copy of the project I was working in, and that meant multiple threads.


Still I was limited to my single service, and the system that I work on has like 20 services. Wednesday comes, and I'm still cranking the tickets out, when I realized what I could do was create a repo with submodules for every service. The agent works best when it can find the context it needs without being overloaded.

Thursday comes, and we're not going to make it. I've already put in about 40 hours. They said to lean in, so I did. After setting up my MCP servers for our ticketing, documentation, communication, and calendar systems, I told the agent to pull ALL of the tickets for the big feature we are working on, then go through our documentation and communications to look for mentions of this feature, and turn that into design requirements. Then, after a Q&A session, we made a plan to implement all open tickets. My idea was that with the full context, it would be better able to perform.

It worked, or at least it seemed to. I was almost embarrassed about it. I was talking to our systems architect about how everything is different, and I mentioned this branch of code. He said, "You know what? Let's try it." We brought it to the team, and they figured let's give it a shot. I hadn't actually run the code outside of tests, so our QA team dug into it live. The first one worked. The second. The third, and on and on. We went from not going to finish on time to mostly done. We found a few small bugs, but such is the way of software, especially things as complex as this.

My side project expanded. I created a CLI, an extension for my IDE to manage local dev environments that could all run independently, and a dashboard that pulls all of my tickets and gives me a button to press that spins up an agent with special instructions. It pulls the details and writes the code, pushing it up for me to review. After that I added another button that fixes any issues that come up in review.

My workflow became:

  1. Push button
  2. Code review
  3. Maybe push another button

My boss said I had gone plaid. Hahahaha. My dashboard became sophisticated, and my process lean. Now I had a way to interact with the whole system. I had it solve big problems, ones that would take months, solved in a day, two with QA.

I had a system to unify our teams, and to allow business analysts to contribute code.

Today, a week after I started the project, I talked to two directors and blew their socks off. We're talking about doing something like this for the entire company, and I talked about automating the two buttons. It was a big win. I know I have a big raise coming. It's likely not enough considering my impact.

I went out with friends, and AI came up. They're pretty sure it's going to lead to disaster. My general P(Doom) is about 60%. As I was leaving, I had the thought: am I profiting off of human suffering? I'm proliferating these systems in more places, and my project will mean we are overstaffed at work. It kind of overwhelmed me.

Am I the baddie?



Discuss

Common advice #3: Asking why one more time

2026-04-04 13:25:06

Written quickly as part of the Inkhaven Residency.

At a high level, research feedback I give to more junior research collaborators tends to fall into one of three categories:

  • Doing quick sanity checks
  • Saying precisely what you want to say
  • Asking why one more time

In each case, I think the advice can be taken to an extreme I no longer endorse. Accordingly, I’ve tried to spell out the degree to which you should implement the advice, as well as what “taking it too far” might look like. 

Previously, I covered doing quick sanity checks and saying what you want to say precisely. I’ll conclude these posts by talking about probably the hardest to communicate category of common advice: asking why one more time.

Asking why one more time 

In my opinion, the most important skill in empirical research is figuring out how to make your beliefs pay rent: you have many possible hypotheses about a phenomenon; to test them, you need to connect these hypotheses with empirical observations. While it’s all well and good to perform all the basic correlations and sanity checks that you want, it’s rarely the case that the problem at hand can be straightforwardly solved by looking at a few scatter plots. 

The second important skill in empirical research is close to the converse of the above: instead of looking at your hypotheses and trying to fit them to the data, you look at places where the data seems inconsistent with any of your hypotheses (i.e. surprising or interesting) and generate new hypotheses to explain the data. 

I think these two skills tend to form a research loop: while you're confused, first generate more hypotheses about the data, then test the hypotheses against either current or future data (or vice versa). That is, testing hypotheses against old or new data will surface anomalies, which prompt new hypotheses, which in turn need testing, and so forth.

What counts as sufficient understanding for this loop? In my experience, you can often quantify the number of iterations of this loop you've completed by the depth of the natural why questions from a possible interlocutor that you can answer.[1] At the first level, we might ask questions such as, why does your hypothesis imply this empirical result? Why does the surprising result you’re trying to explain occur? At the next level, we might ask about the parts your hypotheses are made of: if your hypothesis is that the length of chains of thought predicts monitorability, why would this happen? Or, we might ask about why the surprising result didn’t generalize to other domains: if GPT-4o’s sycophancy explains many people’s attachment to it, why don’t other seemingly sycophantic models lead to the same level of attachment? 

Almost all of the researchers I've worked with have been incredibly bright (and come from great research backgrounds), and they consistently think about and can cogently answer the first level of whys, so I basically never need to give advice there (though if you're not asking why your key result is what it is, maybe you should start!). However, a lot of the second-level whys that I ask (or that I ask them to generate) tend to highlight gaps in understanding and lead to fruitful discussion.

For the sort of researcher I interact with, I think it’s good advice to take whatever answers to natural why questions you generate by default and then repeat the process of generating why questions exactly one more time for each of the explanations. 

Taking this too far. There’s a reason I say “ask why one more time” and not “continue asking why”. In general, as with many similar conversation trees, the space of natural why questions expands exponentially. At some point, you need to decide that you’ve done enough investigation, and research that never gets consumed by other people likely has minimal impact on the world.

There are a few specific failure modes I’ve seen:

  • First, and most obviously: never producing output. If you keep asking why without stopping, you will never finish anything. (This is a famously common problem around these parts.) Every explanation has sub-explanations, and at some depth you’re doing philosophy of science or metamathematics rather than object-level research. Again, there's a reason the heuristic is “one more than your default".
  • Second, there’s a social cost. In collaborative settings, asking too many whys about someone’s work can feel quite adversarial, especially if it's a new collaborator. If a collaborator has a plausible answer to the first-level why and a reasonable sketch for the second, pushing hard on the third can start to feel like you don’t trust their judgment rather than that you’re trying to improve the work. Being explicit about your intent (“I think this is strong, I’m pressure-testing it because I want us to be confident” or "I think you're correct, but I want to check that I understand it myself") can help, but it's still a real dynamic that needs to be managed.
  • Third, investigating the wrong whys. Not all branches of the why-tree are equally valuable. When you generate second-level why questions, some of them will point at load-bearing assumptions; others will point at irrelevant details. Some will be fruitful and easy to investigate, and others will be too hard or too costly to answer. Developing taste for which branches matter is a much harder skill, and one I don’t have great advice for (at least not one I can write up in a short post like this one) but as with all prioritization questions, one heuristic is to focus on the whys whose answers, if different from what you expect, would change your main conclusion.

The optimal depth of whys you try to answer depends on how seriously you care about a result, but for research (in my experience) tends to vary between two (for blog posts or ideas that you don’t intend to seriously build on in the future) to three (for the core ideas of research papers that you do hope to build on in the future). 

  1. ^

    I used to refer to this concept as simply “being skeptical”, but that fails to communicate the actual skill being executed here. I got this new framing from Thomas Kwa at METR (though any confusing parts are no doubt my own).



Discuss

Latent Reasoning Sprint #3: Activation Difference Steering and Logit Lens

2026-04-04 11:56:17

In my previous post I found evidence consistent with the scratchpad paper's compute/store alternation hypothesis (even steps showing higher intermediate-answer detection and odd steps showing higher entropy), along with results matching "Can we interpret latent reasoning using current mechanistic interpretability tools?".

This post investigates activation steering applied to latent reasoning and examines the resulting performance changes.

Quick Summary:

  • Tuned logit lens sometimes does not find the final answer to a prompt and instead finds a close approximation.
  • Tuned logit lens does not seem to have a consistent layer or latent where the final answer is positioned.
  • Tuned logit lens variants, such as one trained only on latent 3, still only surface "therefore" on odd latents.
  • Activation steering with the average difference between latent vectors did not increase accuracy for specific latent pair combinations; instead, it closely matched random-vector patching from "Can we interpret latent reasoning using current mechanistic interpretability tools?".
  • Steering the KV cache to steer CODI outputs can increase accuracy, while steering the hidden states does not seem to have a significant effect on CODI.

Experimental setup

CODI model

I use the publicly available CODI Llama 3.2 1B checkpoint from Can we interpret latent reasoning using current mechanistic interpretability tools? 

Tuned Logit Lens

To create my tuned logit lens implementation, I used the training code from Eliciting Latent Predictions from Transformers with the Tuned Lens.
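The tuned lens in that paper trains per-layer affine "translators" (with a KL objective on the final logits) so that unembedding a translated intermediate hidden state approximates the model's final prediction. As a rough, hypothetical sketch of the idea, here is a closed-form least-squares fit on synthetic data (all shapes and matrices below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, vocab = 64, 8, 12
W_U = rng.normal(size=(d, vocab))            # toy unembedding matrix
h_layer = rng.normal(size=(n, d))            # hidden states at some layer
h_final = h_layer @ rng.normal(size=(d, d))  # pretend final-layer hidden states

# Fit an affine translator [h | 1] -> h_final in closed form.
# (The real tuned lens optimizes KL divergence on logits instead.)
X = np.hstack([h_layer, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X, h_final, rcond=None)

def tuned_lens(h):
    """Translate intermediate hidden states, then unembed to logits."""
    return (np.hstack([h, np.ones((len(h), 1))]) @ coef) @ W_U

logits = tuned_lens(h_layer)
```

The point of the translator is that raw intermediate states often live in a different basis than the final residual stream, so a plain logit lens on them can be misleading; the affine map corrects for that before unembedding.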

Activation Steering

  1. Embedding steering

I get the average hidden state at each latent position and use the difference between latent vectors A and B to steer the hidden states.

Since CODI uses the KV values at the EoT token, to get new KV values that contain the information from the steered vector I needed to steer latent 1, run CODI for one additional latent, and then take the KV values at latent 2 and observe the output.
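A minimal sketch of this difference-vector steering on synthetic activations (the array shapes and data are made up; in practice the hidden states would be collected from CODI runs):

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_latents, d = 8, 6, 16
# hidden[prompt, latent_step, dim]: stand-in for collected hidden states.
hidden = rng.normal(size=(n_prompts, n_latents, d))

def steering_vector(hidden, latent_a, latent_b):
    """Mean activation at latent A minus mean activation at latent B."""
    mean_a = hidden[:, latent_a].mean(axis=0)
    mean_b = hidden[:, latent_b].mean(axis=0)
    return mean_a - mean_b

def steer(state, vec, coeff):
    """Add the scaled difference vector to a hidden state."""
    return state + coeff * vec

vec = steering_vector(hidden, 1, 2)        # the "A1 - B2" direction
steered = steer(hidden[0, 1], vec, 0.5)
```

In the actual experiment the steered state would be written back (e.g. via a forward hook) before running the extra latent step described above.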

  2. KV cache steering

Steering the KV cache and applying it directly to the CODI model, by adding the average difference in KV values onto past_key_values.
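A sketch of the same idea applied to a cache, assuming a toy past_key_values layout of (key, value) tuples per layer (the real cache shapes and layout depend on the model and library; everything here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n_heads, seq, d_head = 2, 4, 6, 8
# Toy cache: one (keys, values) pair per layer.
past_key_values = [
    (rng.normal(size=(n_heads, seq, d_head)),
     rng.normal(size=(n_heads, seq, d_head)))
    for _ in range(n_layers)
]

def steer_kv(past_key_values, k_diffs, v_diffs, coeff):
    """Return a new cache with coeff * mean-difference added per layer."""
    return [
        (k + coeff * dk, v + coeff * dv)
        for (k, v), dk, dv in zip(past_key_values, k_diffs, v_diffs)
    ]

# Stand-ins for the average KV difference between two latent positions.
k_diffs = [rng.normal(size=(n_heads, seq, d_head)) for _ in range(n_layers)]
v_diffs = [rng.normal(size=(n_heads, seq, d_head)) for _ in range(n_layers)]
steered = steer_kv(past_key_values, k_diffs, v_diffs, -1.0)
```

The steered cache would then be passed to the model when decoding from the EoT token, so no extra latent pass is needed.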

Experiments

Confirming Previous Assumptions

PROMPT = "Out of 600 employees in a company, 30% got promoted while 10% received bonus. How many employees did not get either a promotion or a bonus?"

Answer = 360
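A quick check of the expected answer, assuming the promotion and bonus groups don't overlap:

```python
# 30% promoted and 10% got a bonus; assuming disjoint groups,
# everyone else got neither.
total = 600
promoted = 0.30 * total        # 180
bonus = 0.10 * total           # 60
neither = total - (promoted + bonus)
print(int(neither))  # → 360
```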

Tuned Logit Lens properties:

  • Tuned lens approximates but doesn't always find the answer: for example, it finds 720 (360 × 2) and 350 (360 − 10) at latents 0 and 1.
  • These approximate answers are not GSM8K artifacts, as neither number is among the most common answers in the dataset.
  • The answers being found at latents 3 and 5 in my previous post might be prompt-specific. This suggests the tuned lens might just be used as a way to see potential outputs.

[Figures: tuned logit lens ("Default Tuned") vs. default logit lens ("Default") outputs for this prompt]

The following is the answer frequency for the GSM8K data used to train the tuned logit lens

This prompted me to revisit my previous results using a tuned logit lens trained only on latent 3. Notably, "therefore" still appears only on odd latents, even with this different prompt.


Activation Difference (Steering Embeddings)

Across all coefficient values tested, the steering was applied to latents 1–4, with one additional latent step run afterward to obtain updated KV values. The steered models consistently underperform the no-steering baseline until the later latents, where they match the performance of random-vector patching from "Can we interpret latent reasoning using current mechanistic interpretability tools?". This may be because the average difference vector is too noisy to encode meaningful directional information, so steering acts the same as random-vector patching.

Activation Difference (Steer KV cache)

  • Unlike the embedding steering method, which required another CODI pass to obtain new KV values, this method steers the KV values as they are used at the EoT token to generate the answer.
  • The setup: take the mean activations of latents A and B, subtract them, and steer with the difference scaled by a coefficient. Latent A is the first latent vector, from which another latent vector B is subtracted.
  • Steering the KV values, unlike steering the hidden states, did change the accuracy at latent step 5.
  • Most vectors used for steering performed worse than random latent-vector activation patching; some performed significantly better than the baseline.
  • Coefficient (0.5):
    • The steered vectors that improved performance are A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, A4-B5 at coefficient 1. When steering with the difference between an earlier and a later latent vector, it is interesting that the combinations that included latent 2 as latent A performed best.
  • Coefficient (-1):
    • The negative coefficient flips A-B to B-A.
    • Since the coefficient is -1, A1-B4, A1-B6, A4-B6, A5-B6 can be interpreted as B4-A1, B6-A1, B6-A4, B6-A5. Latents steered with latent 6 minus an earlier latent (1, 4, or 5) seem to show a significant increase in accuracy, and the differences between latents 1 and 6 and between latents 5 and 6 seemed to give the highest increase.
  • Accuracy for all steering decreases as the coefficient increases.
  • No single activation difference improves accuracy for both positive and negative coefficients.

[Figures: per-latent accuracy curves. Positive coefficients: A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, A4-B5. Negative coefficients: B4-A1, B6-A1, B6-A4, B6-A5. Plus the no-steering baseline.]

For negative coefficients, A1-B4, A1-B6, A4-B6, A5-B6 performed better than the baseline after steering at latent 5; a common pattern is that with a negative coefficient, post-steering performance at latent 5 was significantly better than the baseline.

The positive coefficients performed better than the baseline on A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, A4-B5.

Activation Difference (Logit Lens)

No clear pattern emerges from the activation-difference logit lens. The first image shows the default logit lens and the second the tuned logit lens; the y axis is latent A and the x axis is latent B. The activation difference is the vector A - B, and the logit lens was applied to the difference of the mean activations of A and B at each layer of the model.

Future Work


  • Find a setup that makes activation steering work with CODI
  • Complete the thought-anchors experiments with CODI
  • Investigate why certain activation differences for the KV cache increased accuracy
  • Use other methods such as PCA to understand why activation steering worked on the KV cache but not the hidden states


Discuss

How to emotionally grasp the risks of AI Safety

2026-04-04 11:34:57

I've spent a fair amount of time trying to convince people that this AI thing could be quite large and quite dangerous. I think I normally have at least some success, but there is a range of responses, such as:

  1. Deer in the headlights - People don't know what to do with themselves and struggle to adjust their world models.
  2. Interesting thought experiment – "Hmm, that's very interesting; I'll think about it some more"
  3. Joke attempts – Not necessarily derogatory, but things like "ah well, I didn't care about the world that much anyway"

Of these, 1 is the appropriate emotional reaction[1] to fully absorbing and believing the arguments[2]. This is what it looks like when you take an argument, process it with the deeper reaches of your brain, turn it into something that fundamentally changes your world model and start trying to adapt.

As far as I can tell, our emotional responses are mostly connected to our System 1 thinking. This makes them harder to influence than just changing your mind. You can change your opinions, but that doesn't mean you will get it on a gut level.

I think I have a solution: visualisations. I don't know if this works for everyone, but I have personally found it helps me both stay more aligned to the cause and increase my motivation. I believe this is basically because your System 1 needs to get the stakes to achieve complete alignment.

Note that in the particular case of AI safety, if you want to remain emotionally sane, it is potentially best not to go through this exercise (like genuinely, please skip it if you're not ready; I do it half-heartedly, and it can be painful enough).

As an example, we can take Yudkowsky's "a chemical trigger is used to activate a virus which is already in everyone's system". Close your eyes. You're at home, in your usual spot. Picture it in detail: the lights, the sun shining through the windows, the soft sofa. You're having a drinks party tonight and you've invited your best friends to come and join you. As the guests arrive, you greet each of them in turn, calling them by name and showing them in.

And then it triggers. See each one of them in your mind's eye collapse, one by one. Hear each of them say their last words. Add any details you think make it more plausible.

My brain writhes and struggles and tries to escape when I attempt this exercise. It's painful. It's emotional. Which is the point.

  1. ^

    In the normative sense of "if you care about the world and would rather it doesn't get ruined by a superintelligence, and would rather it doesn't kill everyone you know and are actually processing this on a deeper level, this is what your reaction will probably look like as an ordinary human being."

  2. ^

    I don't think you should have any particular emotional response if you go from not believing AI will kill everyone to still not believing that AI will kill everyone.

  3. ^

    Which become quite samey after the 378th time of hearing "but can't you just turn it off?"



Discuss