2026-04-04 15:30:18
This is the first post in a planned series about mean field theory by Dmitry and Lauren (this post was generated by Dmitry with lots of input from Lauren, and was split into two parts, the second of which is written jointly). These posts are a combination of an explainer and some original research/experiments.
The goal of these posts is to explain an approach to understanding and interpreting model internals which we informally denote "mean field theory" or MFT. In the literature, the closest matching term is "adaptive mean field theory". We will use the term loosely to denote a rich emerging literature that applies many-body thermodynamic methods to neural net interpretability. It includes work on both Bayesian learning and dynamics (SGD), and work in wider "NNFT" (neural net field theory) contexts. Dmitry's recent post on learning sparse denoising also heuristically fits into this picture (or more precisely, a small extension of it).
Our team at Principles of Intelligence (formerly PIBBSS) believes that this point of view on interpretability remains highly neglected: it should be better understood, and its ideas should be used much more in interpretability thinking and tools.
We hope to formulate this theory in a more user-friendly form that can be absorbed and used by interpretability researchers. This particular post is closely related to the paper "Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity". The experiments are new.
Mean field theory is a vague term with many meanings, but for the first few posts at least we will focus on adaptive mean field theory (see for example this paper, written with a physicist audience in mind). It is a theory of infinite-width systems that is different from the more classical (and, as I'll explain below, less expressive) neural tangent kernel formalism and related Gaussian process contexts. Ultimately it is a theory of neurons (which are treated somewhat like particles in a gas). While every single neuron in the theory is a relatively simple object, the neurons in a mean field picture allow for an emergent large-scale behavior (sometimes identified with "features") that permits us to see complex interactions and circuits in what is a priori a "single-neuron theory". These cryptic phrases will hopefully be better understood as this post (and more generally this series) progresses.
We ultimately want to understand the internals of neural nets to a degree that lets us robustly (and ideally, in some sense "safely") interpret why a neural net makes a particular decision. So one might say that this implies that we should only care about theories that apply directly to real models: finite width, large depth, etc. While this is fair, any interpretation must ultimately rely on some idealization. When we say "we have interpreted this mechanism", we mean that there is some platonic gadget or idealized model that has a mechanism "that we understand", and the real model's behavior is explained well by this platonic idealization. Thus making progress on interpretability requires accumulating an encyclopedia (or recipe book) of idealizations and simplified models. The famous SAE methodology is based on trying to fit real neural nets into an idealization inherited from compressed sensing (a field of applied math). As we will explain below, if we never had Neel Nanda's interpretation of the modular addition algorithm, we would get it "for free" by applying a mean field analysis to the related infinite-width model. As it were, the two use the same Platonic idealization[1]. Thus at least one view on the use of theory is to see it as a source of useful models that can then be applied to more realistic settings (with suitable modification, and, at least until a "standard model" theory of interpretability exists, necessarily incompletely). Useful theories should be simple enough to analyse mathematically (maybe with some simplifications, assumptions, etc.) and rich enough to illuminate new structure. We think that mean field theory (and its relatives) is well-positioned to take such a role.
"Frequently asked questions about MFT" is a big topic that can be its own post. But before diving into a more technical introduction, we should address a few standard questions which keep cropping up, especially about comparisons between MFT and other better-known infinite-width limits.
In physics, one often looks at systems with a large, stable background. A planet vs. a sun, an electron vs. a proton, a weakly interacting observer vs. a large system being observed. In these settings the "background" is the large system and the "foreground" or "test system" is the small system being studied. In these cases the background system may be fixed, or it may be undergoing some motion (like the sun moving around the galaxy's center), but the important idealization is that it does not react to the observer/test system. In fact, the earth does apply a gravitational pull to the sun (and famously in quantum mechanics, observations always impact a system at a quantum level). But these "reverse" effects are small, so to a good approximation we can treat the sun as doing its own "stable" thing while the earth undergoes physics that depends strongly on the sun.
While typically the large "background" is a cleanly separate system from the small test system of the observer, it is sometimes extremely useful to treat the test system as a tiny piece of the background. So: the large background system may be a cup of water and the small test system may be a tiny bit of water at some location. Here while technically the full cup includes the tiny "test" bit, the large-scale behaviors (waves etc.) in the water don't really care to relevant precision if the test bit is changed or removed (at least if it's tiny enough). But the tiny bit of water definitely cares about the large-scale behaviors (waves, vortices or flows, etc.), to the extent that bits of water care about things.
Similarly (and in a closely related way), "the economy" is a giant system that includes your neighborhood bakery. The bakery can be viewed as a small "test system": it is affected by the economy. If property prices go up or the economy tanks, it might close. But the economy is not (at least to leading order) affected by this bakery. It is perhaps affected by the union of all bakeries in the world, but if this particular bakery closes due to some random phenomenon (e.g. the lead baker retires), this won't massively impact the economy.
This point of view is remarkably useful, because it introduces a notion of "self-consistency".
Self-consistency when applied in this context comes from the following pair of intuitions:

1. Each small "foreground" component is (statistically) determined by the large-scale background.
2. The background is itself made up of the aggregate of the small foreground components.
If both of these assumptions are true, then these two observations (when turned into equations) are usually enough to fully pin down the system. Indeed, you have two functional relationships[2]: the foreground is determined by the background, f = F(b), and the background is assembled from the foreground, b = B(f),

which means that the background field satisfies a fixed point equation b = B(F(b)) for the composed function B ∘ F.
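As a toy illustration (entirely my own sketch; the maps F and B below are hypothetical linear stand-ins, not anything from a real physical or economic system), such a self-consistency equation can often be solved numerically by simple fixed-point iteration:

```python
# Toy self-consistency: the foreground responds to the background via F,
# the background is assembled from the foreground via B, and we look for
# a background b* satisfying b* = B(F(b*)).

def F(b):
    # hypothetical foreground response to background b
    return 0.5 * b + 1.0

def B(f):
    # hypothetical background assembled from foreground f
    return 0.8 * f

def solve_self_consistent(b0=0.0, tol=1e-12, max_iter=10_000):
    """Iterate b <- B(F(b)) until the background stops changing."""
    b = b0
    for _ in range(max_iter):
        b_next = B(F(b))
        if abs(b_next - b) < tol:
            return b_next
        b = b_next
    return b

b_star = solve_self_consistent()
# For these linear maps, b* solves b = 0.8 * (0.5 * b + 1.0), i.e. b* = 4/3.
```

Iteration converges here because the composed map is a contraction (slope 0.4); in richer mean-field settings the "background" is a whole distribution rather than a number, but the fixed-point structure is the same.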
Of the above, assumption 1 is the most problematic. Thinking of each component as being determined by some "large-scale" stable system needs to be interpreted appropriately (in particular, the relationship is often statistical: for example, the number of bakeries in a given neighborhood fluctuates due to people retiring, moving, etc., even if "the economy" is held constant; similarly, every bit of the sun reacts to magnetic and gravitational fields from other bits, but in a statistical or thermodynamic sense). Sometimes local or so-called "emergent" effects break this directional relationship (and many interesting thermodynamic systems, such as the 2-dimensional Ising model, are interesting precisely in such contexts). But surprisingly often (at least with an appropriate formalism) the approximation of the foreground as fully determined by the background (in a statistical sense) is robust. For example, if we are modeling the sun, viewing the "background system" too coarsely (as just the mass, electromagnetic field, and temperature, say, of the entire sun) is insufficient. But instead we can view the "background system" as a giant union of many local systems, maybe comprising chunks a few meters across. These are still "large" in the sense of being much larger than an atom (or a microscopic chunk), but studying their behavior (in an appropriate abstraction) offers sufficient resolution to model the sun extremely well. Similarly, we can't apply a single supply-demand curve to the entire economy (bread costs different amounts in different places). But in appropriate contexts (for fungible products like oil, and on a "local economy" level where the economy is roughly uniform but not dominated by a single station, for example) self-consistency is a pretty good model.
In many settings, the question of how well "assumption 1" above holds is related to a notion of connectedness. In the sun's magnetic plasma, the magnetic field experienced by a particle is accumulated over billions and billions of nearby particles, so the graph of interactions is extremely connected. In an oil economy, each consumer can typically choose between dozens of nearby stations reachable by car. However, other settings (like the Ising model, or markets for rare and hard-to-transport goods) cannot be modeled purely by self-consistency.
In physics, systems that are well-modeled by a self-consistency equation (coupled background and foreground systems) are generally called mean-field settings. A big triumph of statistical physics is to make situations with local/emergent phenomena "behave as well as" mean field theories – renormalization is a fundamental tool here, and most textbooks on renormalization from a statistical-physics view tend to start with a discussion of mean-field methods. But settings that are directly mean-field (for example due to being highly connected or high-dimensional) are particularly nice and easy to study.
Neural nets are physical systems. This is a vacuous statement – anything that has statistics can be studied using a physics toolkit (and in many ways statistical physics is just statistics with different terms). Indeed, real neural nets are immensely complex, and if there is some sense in which they can be locally decomposed into background-foreground consistencies, these must themselves be immensely complex and likely require sophisticated tooling to identify (this is one of the reasons why we are running an agenda on renormalization).
But it turns out that in some settings and architectures neural nets are extremely well-modeled by systems with high connectivity – and the reason is, naively enough, precisely the fact that they are highly connected (often fully-connected) on a neuron level (note that architectures that aren't "fully-connected" – e.g. CNNs – sometimes still have properties that make them "highly connected" from a physical point of view).
In neural net MFT the foreground (or "system"/ "observer") abstraction is a neuron. This is typically a coordinate index
The important "background" thing that each neuron "carries" is what is called an activation function, often denoted by the letter
Now if there are lots of neurons, each neuron's activation function reacts to a background generated by the other neurons: removing the neuron in this limit doesn't change the loss by much, so the background determines each neuron's behavior as a statistical distribution. Conversely, the background itself is composed of individual "foreground" neurons. The loop

background → neuron distribution → background

must close, i.e. be self-consistent. Making sense of this loop is the key content of mean field theory of neural nets.
In later installments we'll explain a bit more about the loop and show some examples of it working (or not). You can also see the original linked paper about the Curse of Detail for a more physics-forward view of this.
We'll close with a toy example of "self-consistency", which is visually satisfying.
In this setting we look at a 2-layer model that takes in a two-dimensional input variable

The neurons above were trained jointly in a way that would allow them to interact.
It has a nice clover-leaf like structure (it will reappear later when we look at continuous xors - a multi-layer setting where mean field performs compositional computation; already in this simple setting, the fact that the cloud of neurons is a "shaped" distribution rather than a flat Gaussian puts us solidly outside the Gaussian process regime). Now we can empirically measure how a single randomly initialized "foreground" neuron would react to the background generated by this model. To do this, we train 2048 iid single-neuron models on the resulting background from the fully trained model.[4] When we do this and combine the resulting 2048 neurons into a new model, we see that indeed it looks exactly the same as the background. When we compute its associated function, we get very similar loss.

Each neuron in this picture was trained in a fully iid way, without interacting with any other neuron, simply by "reacting to the background", i.e. learning the task in combination with the "blue" background above.
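A minimal sketch of this kind of experiment (my own reconstruction, with hypothetical choices throughout: a random frozen "background" net standing in for the trained model, an XOR-like toy target, plain gradient descent, and far fewer than the 2048 neurons of the actual experiment). Following the footnote below, each single-neuron model is trained on the difference between the target and the frozen background's output:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical frozen "background" model: a 2-layer ReLU net. In the real
# experiment this would be the fully trained model; here it is a stand-in.
W_bg = rng.normal(size=(64, 2))
a_bg = rng.normal(size=64) / 64.0

def background(X):
    return relu(X @ W_bg.T) @ a_bg

def target(X):
    # toy XOR-like task on 2D inputs (hypothetical choice)
    return np.sign(X[:, 0] * X[:, 1])

def train_single_neuron(X, resid, lr=0.02, steps=300):
    """Train one neuron a * relu(w . x) by gradient descent on the residual,
    i.e. the difference between the target and the frozen background."""
    w = rng.normal(size=2)
    a = 0.1 * rng.normal()
    for _ in range(steps):
        pre = X @ w
        act = relu(pre)
        err = a * act - resid
        a -= lr * (err * act).mean()
        w -= lr * ((err * a * (pre > 0))[:, None] * X).mean(axis=0)
    return w, a

X = rng.normal(size=(256, 2))
resid = target(X) - background(X)
# Train many iid single-neuron "foreground" models against the same background.
neurons = [train_single_neuron(X, resid) for _ in range(32)]
```

The key point the experiment checks is that the cloud of independently trained foreground neurons, when combined, reproduces the background's own neuron distribution.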
Note that this isn't a property that comes "for free". If we were to use the wrong background (for example the more Gaussian-process-like model here) then samples of the foreground would fail to align with the background.

Blue is background, orange is foreground (each orange neuron trained independently in reaction to background).
The case of 2-layer networks is special: neuron functions are particularly simple to characterize, and the mean field has better properties (it's not "coupled"). But we'll see that deeper nets can still be analyzed using this language, and even using empirical methods we can get cleaner pictures of how they learn and process representations.
In the next post, we will explain the physics behind these experiments and the experimental details of the models (github repo coming soon).
Technically they differ on whether they use the "pizza" vs. "clock" mechanisms, but the two idealizations are related, and both the mean field and the realistic setting can be modified to make use of either.
Below, f and b should generally be understood as "statistical" functions: job choice is, perhaps, a probabilistic function depending on the economy, which includes both demand/markets but also supply/people's interests; conversely "the economy" is the average of production over the distribution of jobs.
Technicalities. Depending on the situation
Technical note: each single-neuron model is trained on the difference
In fact it is quadratic: the thing that sums over neurons is the "external square" of the neuron function, which is a function of a pair of inputs: knowing this sum fully determines the dynamics up to rotational symmetry, even for a finite-width model (it's often called the "data kernel" but is used very differently from the Gaussian process kernels, which do depend on an infinite-width assumption and lose a lot of information in finite-width and mean-field contexts).
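A sketch of how I read this footnote's "data kernel" (my own reconstruction with hypothetical ReLU features): for each pair of inputs, sum over neurons the product of the neuron function evaluated at the two inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

X = rng.normal(size=(16, 2))   # 16 toy 2D inputs (hypothetical)
W = rng.normal(size=(64, 2))   # 64 neurons with hypothetical weights

acts = relu(X @ W.T)           # acts[n, i] = f_i(x_n), neuron i at input x_n

# "External square" summed over neurons: a function of a pair of inputs,
#   K[n, m] = sum_i f_i(x_n) * f_i(x_m)
K = acts @ acts.T              # shape (16, 16)
```

By construction K is symmetric and positive semi-definite; unlike the infinite-width Gaussian process kernel, it is computed from the actual (finite) set of neuron functions and so retains their learned structure.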
2026-04-04 14:39:58
Political power grows out of the barrel of a gun -- Mao Zedong
Halfway through recorded history, Athens became the first state we're sure was a democracy, and an inspiration to many later ones. Probably some existed earlier, and certainly some entities smaller than states were democratic, likely long before recorded history began.
The next tenth of history saw the rise of the Roman Republic, which mixed democracy and aristocracy together to form a functional hybrid, and then it transitioned to the Roman Empire, which shifted the mix substantially towards aristocracy. For the next three tenths of recorded history, democracies were at best local governments, minor regional powers, or components of larger, autocratic states.
Few, if any, of these societies would count as "democracies" according to modern watchdog organizations. Only about 10-20% of the residents of Athens were citizens; the rest possessed no real political power. In later eras, residents of towns or cities might vote for their urban officials, but the urbanization rates were also around 10-20%, so the vast majority lived in the non-democratic countryside.
Then for the last tenth of history, democracies rose again to dominate the world stage. One standard story for this has to do with military technology. The Roman Republic expanded because it had the dominant military technology of its time; this may have been in part because of its political system. But eventually the heavily trained armored horseman became the dominant military technology, and was more easily provided by autocracies than democracies. Then widespread use of gunpowder weapons swung the balance back towards mass manpower; the knight in his castle could no longer reliably put down a peasant revolt, or hold back Napoleon and his levée en masse.
Another standard story has to do with increased state resources. Democracies generally support higher tax rates than autocracies do; while this is primarily to support social services, part of it is that people are willing to pay more for things that they think 'they' own (rather than their distant overlords).
A third standard story has to do with ease of turnover. Democracies generally don't have to fight civil wars or succession conflicts, because whenever such a movement would have a chance of a military victory, it also has a chance of a bloodless victory. This leads to peaceful turnovers or governments following the preferences of voters enough to not let resentments build up to the point that they boil over.
The major wars of the last two hundred years have not been limited engagements driven by hereditary elites; they have primarily been total struggles between ideologies and peoples, which only managed to become major because they could motivate significant efforts on both sides.
What does the next tenth of history[1] look like? One might think that the invention of larger and more sophisticated weapons means that we swing back towards the knight and aristocracy, but the evidence of the most recent wars suggests otherwise. Kipling's poem Arithmetic on the Frontier, on the costs of pitting Imperial troops against regional resistance, rings nearly as true about America's various wars and special operations in the Middle East. Between great powers, the weapons have become so expensive that only systems which are widely believed in by their inhabitants can afford to supply a competitive number of them.
It's not obvious to me that this continues to be true. I think next-generation military capabilities primarily have to do with 1) operational knowledge and 2) mass-produced smart weaponry. The war in Ukraine shows how conflicts between drones and riflemen go; the war in Iran shows how conflicts between the informed and the uninformed go. States may find their ability to produce weaponry becomes detached from their popularity; they may find robot soldiers are willing to follow orders that human soldiers would balk at; they may find that it is relatively cheap to identify and destroy dissenters.[2]
That is, we may be moving into an era where mass protests are relatively easy to dismiss or ignore, while individual hackers or saboteurs are still able to disrupt large systems. Taxes from a broad labor base may become less relevant than control over automated infrastructure. What political problems will the new sources of power have, and what systems will help them resolve those problems?
I don't expect us to have 500 more sidereal years of history left, but I do think we might manage to cram roughly that much subjective history in before the Singularity / as it takes off.
An old strategy is to recruit your police / imperial enforcers from a different ethnicity than the people that they need to defend against, so there's some baseline level of resentment that will allow brutality which will cow them into submission. Autonomous weapons allow this at scale, and for secret police that are difficult to bribe or corrupt, and widespread surveillance allows for secret police that are always watching and noticing subtle connections.
2026-04-04 14:00:37
I am a software engineer. I work for a company that makes software for road construction. Monday last week we were under a bad crunch and we were told to start using agentic workflows. We had like 50 tickets to close by the following Tuesday. I’ve been experimenting with ai development for years now, but this was different. I had access to Opus/Sonnet 4.6, and GPT5.4—the latest models.
Suddenly, they understood. I could talk about abstract concepts and analogies, and they got them. By the first day I was working through tickets in hours that would have taken me days. But we still had a ton of work and not enough time. I was still bound to a single thread of work at a time. So like any problem, I hacked around it. I started with a worktree, which basically creates a whole other copy of the project I was working in, and that meant multiple threads.
Still I was limited to my single service, and the system I work on has like 20 services. Wednesday comes, and I'm still cranking the tickets out, when I realized what I could do was create a repo with submodules for every service. The agent works best when it can find the context it needs without being overloaded.
Thursday comes, and we're not going to make it; I've already put in about 40 hours. They said to lean in, so I did. After setting up my MCP servers for our ticket, documentation, communication, and calendar systems, I told the agent to pull ALL of the tickets for the big feature we are working on, then go through our documentation and communications to look for mentions of this feature and turn that into design requirements. Then, after a Q&A session, we made a plan to implement all open tickets. My idea was that with the full context, it would be better able to perform.
It worked, or at least it seemed to. I was almost embarrassed about it. I was talking to our systems architect about how everything is different, and I mentioned this branch of code. He said, "You know what? Let's try it." We brought it to the team, and they figured we should give it a shot. I hadn't actually run the code outside of tests. So our QA team dug into it live. The first one worked. The second. The third, and on and on. We went from not going to finish on time, to mostly done. We found a few small bugs, but such is the way of software, especially things as complex as this.
My side project expanded. I created a CLI, an extension for my IDE to manage the local dev environments that could all run independently, and a dashboard that pulls all of my tickets and gives me a button to press that spins up an agent with special instructions. It pulls the details and writes the code, pushing it up for me to review. After that I added another button that fixes any issues that come up in review.
My workflow became
My boss said I had gone plaid. Hahahaha. My dashboard became sophisticated, and my process lean. Now I had a way to interact with the whole system. I had it solve big problems, ones that would take months, solved in a day, or two with QA.
I had a system to unify our teams, and to allow business analysts to contribute code.
Today, a week after I started the project, I talked to two directors and blew their socks off. We're talking about doing something like this for the entire company, and I talked about automating the two buttons. It was a big win. I know I have a big raise coming. It's likely not enough considering my impact.
I went out with friends, and AI came up. They're pretty sure it's going to lead to disaster. My general P(Doom) is about 60%. As I was leaving, I had the thought: am I profiting off of human suffering? I'm proliferating these systems in more places, and my project will mean we are over-staffed at work. It kind of overwhelmed me.
Am I the baddie?
2026-04-04 13:25:06
Written quickly as part of the Inkhaven Residency.
At a high level, research feedback I give to more junior research collaborators tends to fall into one of three categories:
In each case, I think the advice can be taken to an extreme I no longer endorse. Accordingly, I’ve tried to spell out the degree to which you should implement the advice, as well as what “taking it too far” might look like.
Previously, I covered doing quick sanity checks and saying what you want to say precisely. I’ll conclude these posts by talking about probably the hardest to communicate category of common advice: asking why one more time.
In my opinion, the most important skill in empirical research is figuring out how to make your beliefs pay rent: you have many possible hypotheses about a phenomenon; to test them, you need to connect these hypotheses with empirical observations. While it’s all well and good to perform all the basic correlations and sanity checks that you want, it’s rarely the case that the problem at hand can be straightforwardly solved by looking at a few scatter plots.
The second important skill in empirical research is close to the converse of the above: instead of looking at your hypotheses and trying to fit them to the data, you look at places where the data seems inconsistent with any of your hypotheses (i.e. surprising or interesting) and generate new hypotheses to explain the data.
I think these two skills tend to form a research loop: while you’re confused, first generate more hypotheses about the data, and test the hypotheses against either current or future data (or vice versa). That is, testing hypotheses against old or new data will surface anomalies, which prompt new hypotheses, which in turn need testing, which prompt new hypotheses, and so forth.
What counts as sufficient understanding for this loop? In my experience, you can often quantify the number of iterations of this loop you've completed by the depth of the natural why questions from a possible interlocutor that you can answer.[1] At the first level, we might ask questions such as, why does your hypothesis imply this empirical result? Why does the surprising result you’re trying to explain occur? At the next level, we might ask about the parts your hypotheses are made of: if your hypothesis is that the length of chains of thought predicts monitorability, why would this happen? Or, we might ask about why the surprising result didn’t generalize to other domains: if GPT-4o’s sycophancy explains many people’s attachment to it, why don’t other seemingly sycophantic models lead to the same level of attachment?
Almost all of the researchers I've worked with have been incredibly bright (and from great research backgrounds) and have consistently thought about, and can cogently answer, the first level of whys. So I basically never need to give advice at that level (though, if you're not asking why your key result is what it is, maybe you should start!). However, a lot of the second-level whys that I ask (or that I ask them to generate) tend to highlight gaps in understanding and lead to fruitful discussion.
For the sort of researcher I interact with, I think it’s good advice to take whatever answers to natural why questions you generate by default and then repeat the process of generating why questions exactly one more time for each of the explanations.
Taking this too far. There’s a reason I say “ask why one more time” and not “continue asking why”. In general, as with many similar conversation trees, the space of natural why questions expands exponentially. At some point, you need to decide that you’ve done enough investigation, and research that never gets consumed by other people likely has minimal impact on the world.
There are a few specific failure modes I’ve seen:
The optimal depth of whys you try to answer depends on how seriously you care about a result, but for research (in my experience) it tends to range from two (for blog posts or ideas that you don't intend to seriously build on in the future) to three (for the core ideas of research papers that you do hope to build on in the future).
I used to refer to this concept as simply “being skeptical”, but that fails to communicate the actual skill being executed here. I got this new framing from Thomas Kwa at METR (though any confusing parts are no doubt my own).
2026-04-04 11:56:17
In my previous post I found evidence consistent with the scratchpad paper's compute/store alternation hypothesis: even steps showed higher intermediate-answer detection and odd steps showed higher entropy, with results matching "Can we interpret latent reasoning using current mechanistic interpretability tools?".
This post investigates activation steering applied to latent reasoning and examines the resulting performance changes.
I use the publicly available CODI Llama 3.2 1B checkpoint from "Can we interpret latent reasoning using current mechanistic interpretability tools?".
The method: get the average hidden state at each latent position, and use the difference between the averages for latent vectors A and B to steer the hidden states.
Since CODI uses the KV values at the EOT token, to get new KV values that contain the information from the steered vector I needed to steer latent 1, run CODI for one additional latent step, and then take the KV values at latent 2 and look at the output.
Steering the KV cache: adding the average difference in KV values directly onto the CODI model's past_key_values.
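A sketch of the difference-of-means steering step described above (my own reconstruction, not the actual code; the array shapes, toy data, and coefficient are all hypothetical):

```python
import numpy as np

def steering_vector(hidden_A, hidden_B):
    """hidden_A, hidden_B: (n_samples, d_model) hidden states collected at
    latent positions A and B. The steering direction is the difference of
    the two means."""
    return hidden_A.mean(axis=0) - hidden_B.mean(axis=0)

def apply_steering(hidden, direction, coeff):
    """Add the scaled steering direction to a hidden state (this would be
    done before running one more latent step to refresh the KV cache)."""
    return hidden + coeff * direction

rng = np.random.default_rng(0)
d_model = 16
h_A = rng.normal(size=(128, d_model)) + 1.0   # toy activations at latent A
h_B = rng.normal(size=(128, d_model)) - 1.0   # toy activations at latent B

v = steering_vector(h_A, h_B)
h_steered = apply_steering(h_B[0], v, coeff=2.0)
```

The same add-a-scaled-direction operation applies whether the target is a hidden state or, as in the second variant, the entries of past_key_values.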
PROMPT = "Out of 600 employees in a company, 30% got promoted while 10% received bonus. How many employees did not get either a promotion or a bonus?"
Answer = 360
Tuned Logit Lens properties:
[Tables comparing Default vs. Tuned logit lens outputs omitted.]
The following is the answer frequency for the GSM8K data used to train the tuned logit lens
This prompted me to revisit my previous results using a tuned logit lens trained only on latent 3. Notably, 'therefore' still appears only on odd latents, even with this different prompt.
Across all coefficient values tested, the steering was applied to latents 1–4, with one additional latent step run afterward to obtain updated KV values. The steered models consistently underperform the no-steering baseline until the later latents, where they match the performance of random vector patching from "Can we interpret latent reasoning using current mechanistic interpretability tools?". This may be because the steering acts the same as random vector patching: the average difference vector might be too noisy to encode meaningful directional information.
[Figure grid: steering results for positive and negative coefficients, for latent pairs A1-B2, B4-A1, A1-B5, A2-B3, A2-B4, A3-B5, A4-B5, B6-A1, B6-A4, B6-A5.]
For negative coefficients, A1-B4, A1-B6, A4-B6, and A5-B6 performed better than the baseline after the steering. A common pattern is that with a negative coefficient, performance after steering was significantly better than the baseline at latent 5.
For positive coefficients, the steered models performed better than the baselines on A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, and A4-B5.
No clear pattern emerges from the activation-difference logit lens. The first image uses the default logit lens and the second the tuned logit lens; the y-axis is latent A and the x-axis is latent B. The activation difference is the vector A - B (the difference of the mean activations of A and B), and the logit lens was applied to these differences at the different layers of the model.
2026-04-04 11:34:57
I've spent a fair amount of time trying to convince people that this AI thing could be quite large and quite dangerous. I think I normally have at least some success, but there is a range of responses, such as:
Of these, 1 is the appropriate emotional reaction[1] to fully absorbing and believing the arguments[2]. This is what it looks like when you take an argument, process it with the deeper reaches of your brain, turn it into something that fundamentally changes your world model and start trying to adapt.
As far as I can tell, our emotional responses are mostly connected to our System 1 thinking. This makes them harder to influence than just changing your mind. You can change your opinions, but that doesn't mean you will get it on a gut level.
I think I have a solution: visualisations. I don't know if this works for everyone, but I have personally found it helps me both stay more aligned to the cause and increase my motivation. I believe this is basically due to the fact that your system 1 needs to get the stakes to achieve complete alignment.
Note that in the particular case of AI safety, if you want to remain emotionally sane, it is potentially best not to go through this exercise (like genuinely, please skip it if you're not ready; I do it half-heartedly, and it can be painful enough).
As an example, we can take Yudkowsky's "a chemical trigger is used to activate a virus which is already in everyone's system". Close your eyes. You're at home, in your usual spot. Picture it in detail: the lights, the sun shining through the windows, the soft sofa. You're having a drinks party tonight and you've invited your best friends to come and join you. As the guests arrive, you greet each of them in turn, calling them by name and showing them in.
And then it triggers. See each one of them in your mind's eye collapse, one by one. Hear each of them say their last words. Add any details you think make it more plausible.
My brain writhes and struggles and tries to escape when I attempt this exercise. It's painful. It's emotional. Which is the point.
In the normative sense of "if you care about the world and would rather it doesn't get ruined by a superintelligence, and would rather it doesn't kill everyone you know and are actually processing this on a deeper level, this is what your reaction will probably look like as an ordinary human being."
I don't think you should have any particular emotional response if you go from not believing AI will kill everyone to still not believing that AI will kill everyone.
Which become quite samey after the 378th time of hearing "but can't you just turn it off?"