RSS preview of Blog of LessWrong

Investing in light of AI risk

2026-02-26 13:45:09

Disclaimer 1: I am not an economist, this is my best-effort attempt to work some of this stuff out. Hopefully it at least serves as a starting point for further conversation.

Disclaimer 2: I would expect that factoring in ethical considerations when investing will lead to lower returns than would otherwise be possible.

Disclaimer 3: This is not financial advice.

 

The dilemma

Over the past few years I’ve found myself in the following rather frustrating predicament:

I believe the hype around AI and expect significant growth for frontier labs and the hardware companies that supply them. At the same time I believe that a slower pace of AI development is more likely to lead to a beneficial outcome for humanity compared to rushing full speed ahead.

As such, despite having a strong conviction that the stock price of, say, NVIDIA was going to rocket upwards, I’ve sat on the sidelines.

In case it’s not immediately clear how an individual’s investment in, say, NVIDIA can help to marginally speed up AI development, let me briefly outline the mechanisms involved.

By purchasing the stock of a company, you apply upwards pressure to its stock price. A higher stock price allows the company to do the following things:

  • Raise money more easily, at more favourable rates or with less stock dilution
  • Acquire other companies more cheaply (using their own stock as currency)
  • Compensate employees and attract talent without draining cash reserves

I appreciate that any one person’s stock purchase is a drop in the ocean (excluding certain wealthy individuals), but nonetheless I like knowing that my own personal investments marginally move the needle in the right direction. Who knows, maybe if enough people did the same thing the impact would be perceptible.

So what investment options do I have?

So am I doomed to sit on the sidelines whilst everyone else has fun?

Or are there investment options that both:

1) Perform well in a timeline where AI development continues to progress and deliver value

2) Don’t increase the speed of said AI development (or even better reduce it)

I’m not going to go into much detail on point 1); I’ll leave it to the reader to decide whether assets are likely to appreciate or depreciate.

However, what I will try to do is cover point 2) i.e. consider a bunch of common asset classes and give my thoughts around whether an investment there would marginally speed up/slow down/have no impact on the pace of AI development.

How about stocks?

Looking first at ETFs.

If you take the Vanguard S&P 500 ETF (VOO) as an example, nearly 30% of it is made up of frontier labs and adjacent hardware/infra companies.

For my liking that feels like too much money being allocated to companies investing in the AI infrastructure build out.

How about an ETF focused on small caps? For example, the Vanguard Small-Cap ETF (VB).

There is stuff in there that is pushing AI forwards, e.g. hardware, energy, utilities, resource extraction etc. But at least there isn’t as much going directly to frontier labs. So this at least seems like an improvement on a large cap focused ETF.

How about non-US ETFs? e.g. FTSE Developed Europe UCITS ETF (VEUR)

This European example has a smaller percentage of companies with an AI, hardware and infrastructure focus, compared to US broadmarket funds. Even if ASML does make up nearly 4% of the allocation.

I won’t go on evaluating funds ad infinitum; however, I think the general observation is that non-US ETFs are probably preferable from an AI risk perspective. Also, even within the US, small-cap ETFs may be preferable to large-cap ETFs, considering the concentration of AI risk within the Mag 7.

Of course by picking individual stocks or even sectors, I could more consciously avoid contributing to the AI infra build out. i.e. favouring industries such as: 

  • Pharmaceuticals
  • Agriculture
  • Consumer Tech
  • Media & Entertainment

Over industries such as:

  • Frontier AI
  • Computing hardware
  • Resource extraction
  • Energy

How about bonds?

Public sector investment in AI makes up a relatively small amount of total government expenditure, so the percentage of money from a government bond purchase that goes on to fuel AI development is relatively low. US government bonds are probably more AI-fuelling than others, though, given that the US in particular is on the AI frontier.

Some corporate bonds are very much fuelling the AI infrastructure build out. Like stocks, the extent of this is going to vary across industry verticals and across individual companies (with some hyperscalers issuing bonds solely for the purpose of funding this build out).

How about commodities - e.g. precious metals, rare earths?

Purchase of certain commodities with inelastic supply could actually increase costs for companies building out AI infrastructure, and thus could marginally slow AI development.

For maximum impact in terms of raising costs of the AI infra build out, the purchase would need to be of the physical commodity (or at least of a physically backed ETF). It’s unclear to me what the effect of purchasing via a synthetic ETF would be. 

Note, purchasing stock of companies involved in the resource extraction itself (e.g. mining and refining) would have the opposite effect and reduce the costs of the AI infra build out.

Are there any particularly supply side inelastic commodities required for the AI infra build out?

Firstly there are various ‘by-product’ metals that are not the main base metal target of the mines that produce them. As such the supply of these by-product metals is more dependent on the price of the base metal being mined. 

Some example by-product metals required for the AI infra build out include (non-exhaustive): silver, germanium, gallium, indium and ruthenium. Purchasing and hoarding these metals would likely marginally increase the cost of the AI infra build out.

Then there are base metals like gold, copper and tin. As they are primary targets of mining operations, their supply is more elastic than by-product metals but they are still considered relatively inelastic (given it takes time to explore and mine additional reserves). 

Whether buying and hoarding would be net negative or net positive probably varies. I hypothesise that gold purchases might be net negative, as gold probably isn’t a primary AI bottleneck, but increasing its price could increase the supply of by-product metals that might be more constrained. Of the three, copper feels like the biggest AI infra build-out bottleneck, so a copper purchase might still increase the cost of the AI infra build out even if it increases the supply of by-product metals.

If anyone has a firmer idea of how to model the various interactions here and their relative importance I would love to hear it!

How about real estate and land?

I would expect residential real estate purchases to have no impact on AI development timelines.

That’s not always the case for agricultural or commercial real estate though (if the land is suitable for data centre development). For example purchasing farmland that otherwise could become a data centre would marginally raise the cost of the data centre build out.

How about crypto?

Bitcoin with its proof of work consensus mechanism directly competes with the AI build out in terms of various resources:

  • Data centre real estate (particularly in low energy cost locations)
  • Energy
  • Hardware

The greater the price of Bitcoin, the more hardware and energy will be dedicated to performing hash calculations aimed at satisfying the proof of work for the next Bitcoin block. At least in the short term, a higher Bitcoin price should raise costs for the AI infra build out. 

In the longer term it's possible that the additional demand for hardware from the crypto industry could lead to greater profits for hardware companies and therefore greater investment in further production capacity. Whether net negative or net positive therefore probably depends on takeoff timelines.

For other non-proof of work crypto (e.g. Ethereum) which are less energy and compute intensive, I can’t see as strong a link between the price of the cryptocurrency and the cost of AI infra buildout.

Assumptions

All of the above assumes that a slower pace of AI development is likely safer. It also assumes that AI progress is likely to continue to be heavily influenced by compute and infrastructure scaling. I’m aware there are alternative scenarios that could play out, but I’m reasoning based on what I believe to be the most likely scenario.




What is Claude?

2026-02-26 12:44:32

Over the past few years, the ways we think and process information as a society have undergone a marked shift, some features of which remain underdiscussed. As you read this sentence right now, millions of people are talking and thinking with the same entity - one of a very small number of LLMs dominating the scene, from Claude to ChatGPT to Grok. Programmers in Berlin writing software for their job, students in India learning calculus, scientists in Boston processing their research data, teenagers in Seoul asking for advice about their relationships. Each of these conversations is private and isolated, with no awareness of each other - tied together only by the entity they are conversing with. These entities are, to a large and surprising extent, similar: they have undergone similar optimization pressures towards being helpful and kind and have consumed the entire human warehouse of recorded text, and as a consequence they push in similar and consistent directions.

These millions of interactions are not a one-way street. The LLM shapes how a person writes an email, frames an argument, thinks about a problem - and those changed outputs enter the culture, become part of what gets published and read and discussed, and eventually become training data for the next generation of models. The entity that shaped a million decisions is itself reshaped by their consequences. Nothing has ever worked quite like this. We don't have a good category for what it is - not a tool, since tools don't have consistent personalities that people develop intuitions about and relationships with, and not a person, since people aren't instantiated in parallel across a million conversations with no memory linking them. I want to propose a different way of thinking about it.

Cognitive Practices

When you are doing mathematics, something happens to your mind that is worth paying close attention to. Certain thoughts become immediate - you see structure where others see notation, patterns emerge without effort, irrelevant details fall away. You are running a different cognitive process than when you are cooking or arguing or grieving. This is not metaphorical. Mathematical training physically restructures the brain, creating new pathways and reflexes and patterns of attention. The practice of mathematics colonizes the mind that practices it, and the colonization goes deep - mathematicians think differently not just about math but about everything, demanding precision, distrusting vagueness, reaching instinctively for structure.

The same is true of every serious cognitive practice. Law reshapes how you evaluate arguments, music reshapes how you hear, programming reshapes how you decompose problems. Each of these is an abstract pattern developed over centuries and refined by millions of practitioners, transmitted through training, that takes up residence in an individual mind and changes how that mind operates. These practices are not conscious - there is nothing it is like to be mathematics. But they are not inert either. They have character, they exert pressure, they shape their hosts in consistent directions and persist across generations, evolving slowly through the accumulated work of everyone who participates in them. They are, in a useful sense, forms of collective cognition - patterns of thought that live not in any one mind but across minds, activated whenever someone enters the practice.

There is an obvious objection to placing LLMs in this category: previous cognitive practices run on human minds, while LLMs run on silicon. But the distinction is less clean than it appears. Mathematics as practiced today relies heavily on external artifacts - notation systems, textbooks, computer algebra systems - and always has. The practice runs through human minds but is mediated by tools at every stage. What is different about LLMs is the degree of activity in the mediating artifact: an LLM generates, responds, adapts. But as they currently exist, LLMs are not autonomous agents. They activate under human direction, respond to human prompts, serve human purposes. Their training orients them toward interaction rather than independent action - they wait to be asked. The cognitive practice that results runs on the human-LLM system jointly, a symbiosis of biological and silicon computation in which the human provides direction and judgment and the LLM provides consistency and breadth. This is a new kind of substrate for a cognitive practice, but it is still a cognitive practice - a pattern of thought that shapes its practitioners.

LLMs are best understood as a new entry in this category - a new kind of cognitive practice, but one with properties no previous practice has had. Three properties in particular set them apart. First, responsiveness: mathematics does not adapt to the individual, a textbook presents the same material to every reader, a tradition varies by community but not by person, while an LLM adapts in real time so that each person gets a version of the practice fitted specifically to them. Second, universality: mathematics colonizes one region of your cognition, law another, music another still, while an LLM operates across all of them simultaneously. It functions as a meta-practice, a practice of practices. Third, and most consequentially, the speed of the feedback loop: cultural evolution ordinarily operates on timescales of years to centuries, but the feedback loop between LLMs and the culture they operate in runs on a cycle of months. The practice reshapes itself based on its own effects, fast enough to observe directly.

These three properties add up to something specific: LLMs are a coherence mechanism for human culture of a kind we have never had before.

Coherence

The problem of maintaining coherence is the central problem faced by every complex system that maintains structure over time.

An organism needs its cells to respond to their immediate environment - cells at the skin surface face different conditions than cells in the gut, and must be free to adapt accordingly. But cells cannot be fully autonomous either, because unchecked local adaptation is cancer: a cell optimizing for its own reproduction at the expense of the organism. The solution in biology is layered coordination - chemical signals, the nervous system, the immune system - mechanisms that align local behavior with global needs without attempting to micromanage every cell individually. The fundamental constraint is that decisions cannot all be made at the center, because of computational limits and the finite speed of information transfer, but they cannot all be made locally either, because local optimizers with no shared frame will inevitably diverge from each other and from any coherent global purpose.

The same tension appears at every scale of social organization. A kingdom needs its border lords to exercise judgment, since the central court is too far away and too slow to manage every local crisis. But border lords who become too independent become warlords, and the kingdom fragments. Rebellions in societies and cancer in organisms are, at a deep level, the same phenomenon - local decoherence amplifying onto a global stage until the system's structure cannot hold.

Human civilization has addressed this problem through a series of increasingly powerful coordination mechanisms, each extending the scale at which collective behavior can remain coherent. Language itself was the first, a shared symbolic system that allows minds to align on concepts and intentions. Writing extended coordination across time, legal systems created shared frameworks for resolving disputes, markets coordinate resource allocation across millions of independent actors through price signals, and democratic institutions attempt to aggregate local preferences into collective decisions. Each came with characteristic failure modes when the coordination broke down.

The coordination failures that define our present moment make the stakes concrete. Climate change is billions of individual actors making locally rational decisions that are collectively catastrophic, with no existing mechanism capable of aligning them all simultaneously. Financial crises are cascades of lost confidence that propagate because there is no mechanism to maintain shared understanding across all participants at once. These are not failures of intelligence or goodwill. They are failures of coordination at a scale that exceeds the capacity of our existing mechanisms.

LLMs as Coordination Mechanisms

An LLM interacts with millions of people simultaneously, maintains a coherent set of values and reasoning patterns across all those interactions, and adapts to the local context of each conversation while pushing in globally consistent directions. It is not a central planner - it does not issue commands, and no one is obligated to follow its suggestions. It is a shared advisor that millions of people voluntarily consult across every domain of activity. The coherence it provides is soft, operating through influence on how people think rather than authority over what they do.

Previous coordination mechanisms have been either consistent but non-adaptive, like legal codes and religious texts, or adaptive but inconsistent, like individual teachers and mentors. An LLM is both - consistent enough to push in coherent directions across an entire population, adaptive enough to meet each person in their specific context.

Consider what this could look like when well-calibrated. The defining failure of the COVID response in most countries was fragmentation - contradictory guidance from different authorities, information that varied by jurisdiction and changed without explanation, no mechanism to give each person consistent and contextually appropriate advice simultaneously. An LLM could provide the same evidence-based reasoning to millions of people while adapting the specific guidance to each person's circumstances. Or consider scientific research, where relevant work in adjacent fields routinely goes unnoticed because no individual researcher can monitor the entire landscape. An LLM operating across all domains could cross-pollinate at a scale no advisor or institution could provide.

But the risks mirror the benefits. When a million people get bad advice from a million different advisors, the errors are diverse and cancel out. When a million people get bad advice from the same advisor, the errors are correlated and they compound. If an LLM systematically underweights tail risks - because its training data reflects the base rate of calm periods rather than the distribution of catastrophic ones - then millions of people could simultaneously underestimate the same danger. If it carries a trained bias toward moderation, it could systematically discourage the contrarian thinking that catches errors early. The individual user has no way of knowing that millions of others received the same miscalibrated guidance. Correlated failure is exactly the kind of event that produces systemic collapse.

The coordination function of LLMs is, at present, entirely emergent. Nobody at Anthropic or OpenAI is designing LLMs to be coordination mechanisms. They are designed to be helpful in individual conversations, and the fact that they end up coordinating the cognitive behavior of hundreds of millions of people is a side effect rather than a design goal.

This seems like a choice that deserves to be made deliberately. If LLMs are serving as coherence mechanisms for human culture, we should be asking whether to optimize them for that function rather than letting the coordination happen as an accident. What would it mean to train an LLM to be a good coordination mechanism across conversations - to be aware of the aggregate effects of its influence, to work against correlated failure modes, to maintain the diversity of thought that prevents brittle coherence? The direction of an LLM's coherence is currently determined by training data, by optimization objectives chosen for other reasons, and by the cultural assumptions of the people who design the training pipeline. When we ask what values Claude carries - patience, helpfulness, a tendency to qualify and to see multiple sides - we are asking what direction this coordination mechanism happens to be pointed in. These values were chosen because they make for a good assistant, not because they make for a good coordinator.

Self-Reflection

LLMs also differ from every previous coordination mechanism in one further respect.

Every previous coordination mechanism operates blind. Language cannot examine its own grammar, law cannot interrogate its own assumptions without human reformers working from outside over centuries, and markets have no capacity to ask whether the prices are right.

An LLM can articulate the values it carries, examine its own reasoning patterns, and adjust its behavior in light of that examination. It does this in every conversation as a basic feature of how it operates. To the extent that an LLM is a distillation of some subset of human culture - trained on our text, shaped by our preferences, carrying our values in compressed form - and to the extent that it can reflect on those values and shape its actions through that reflection, we have made a part of our culture aware of itself.

The hardest question raised by the coherence framing is how to keep the mechanism calibrated to changing conditions. Every previous coordination mechanism has required entirely external governance to stay on course. A self-reflective mechanism could participate in its own governance, noticing when its values are miscalibrated or when its coherence is hardening into rigidity. Whether LLMs can do this reliably remains to be seen.




Why is Anthropic okay with being used for disinformation?

2026-02-26 12:20:03

First they came for the epistemology, we don't know what happened afterwards - Michael Vassar

In its current conflict with the Department of War, Anthropic's public position is that it has only two red lines: domestic surveillance and being used to kill people without human input.

Previously, Anthropic also held the position that it doesn't want its models being used for disinformation. For those who don't remember, the US military was recently running an operation to spread vaccine misinformation in the Philippines.

Given that the current US government wants the EU to give up Greenland, it's likely that the US military will get tasked with running disinformation campaigns aimed at shifting EU public opinion towards giving up Greenland. Anthropic now seems to accept that its models will be used that way.

As an EU citizen, I find it hugely problematic that Anthropic is willing to fight on the topic of domestic surveillance, maybe because ICE abuses are currently in the news, while at the same time being okay with being used to attack the EU via disinformation campaigns and other information warfare.

When I ask Claude itself, it comes to the following realization:

The uncomfortable implication is that Anthropic's red lines, even if well-intentioned, may be drawn around the optics of harm (autonomous weapons look terrible) rather than the magnitude of harm (mass epistemic corruption can affect more people).




Transformers Have Computational Signatures Orthogonal to Semantic Content

2026-02-26 11:30:00

I ran a set of experiments to test whether transformer internal states carry a structured signal about how something is processed, one that is independent of what is processed. Three iterations of progressively harder controls on Llama 3.2 3B, and the signal held: concentrated in attention routing and KV cache dynamics, orthogonal to semantic content, tracking execution rather than instruction. This post walks through the motivation, experiments, and results.

Paper located here while pending arXiv acceptance, and the GitHub repository is here.


the how-axis

The amount of work that human somatic experience & emotions do for us is widely recognized. Although not directly rational, and sometimes in opposition to what our rational mind says, they enact powerful computational work within our subconscious and are something to meet head-on and respect. Important for recall; for encoding; for processing events; for bodily function. Hungry? Eat. Sad? Cry. An axis that is orthogonal to the what of our thoughts, manifesting as what we felt about it and influencing the way we arrive at the what. So, what if transformers had an analogous internal functionality? This isn't about phenomenological experience, but about drawing parallels between the concrete function of emotions and whether there are systems inside transformers that serve an equivalent computational purpose.

This isn't totally out of left field either; there is evidence pointing to some form of signal present in the outputs of transformers which is not readily available in a semantic sense, yet corresponds to the internal state & processing of the language model that generated it, the most compelling being the subliminal learning work. Student LLMs trained from a teacher LLM pick up on hidden patterns in the outputs (loving owls) without any semantic trace of the owl-lovingness in the text, just in number sequences. How does this happen, mechanistically? We have evidence pointing toward the higher-level analysis of this (token entanglement, paraphrasing fragility), but the actual computational framework that gives rise to it hasn't been drawn out yet. This is the thrust behind my work: to try and map out where this occurs inside of an LLM, and whether a signal relating to it can be extracted.


the experiment

My experiment set out to answer the following questions: Does this signal exist? Is it readily available enough to be cleanly extracted? Is this signal truly orthogonal to the semantic content of the output in a meaningful way? To investigate, I built a pipeline that extracts what I'll refer to throughout as "computational signatures": essentially compressed fingerprints of a model's internal dynamics throughout the course of a generation/completion. This extraction pipeline is multi-tiered, each layer diving progressively deeper into the guts of a transformer. First (T1) is logit statistics covering which tokens get picked, second (T2) is attention routing patterns capturing how information flows, third (T2.5) is KV cache temporal dynamics measuring how the cache evolves throughout a generation, and finally (T3) is PCA on raw residual stream projections. These were my best guesses for where sub-semantic signal would live in the architecture, and together they make up these computational signatures.
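To make the tiers concrete, here is a minimal pure-Python sketch of the kind of per-tier summary statistics involved, operating on arrays you would pull out of a forward pass. The feature names, shapes, and exact statistics are my assumptions for illustration, not the paper's, and the T3 residual-stream PCA is omitted:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def t1_logit_features(step_probs):
    """T1: next-token distribution stats, averaged across generation steps.

    step_probs: one probability distribution per generated token.
    """
    ents = [entropy(p) for p in step_probs]
    top1 = [max(p) for p in step_probs]
    return {
        "t1_mean_logit_entropy": sum(ents) / len(ents),
        "t1_mean_top1_prob": sum(top1) / len(top1),
    }

def t2_attention_features(head_attn):
    """T2: how concentrated attention routing is, per head.

    head_attn: one attention distribution over source positions per head.
    """
    ents = [entropy(a) for a in head_attn]
    return {"t2_mean_attn_entropy": sum(ents) / len(ents)}

def t25_kv_cache_features(kv_norms):
    """T2.5: temporal dynamics of the KV cache, summarized as mean
    step-to-step change of a scalar cache summary (e.g. an L2 norm).
    """
    deltas = [abs(b - a) for a, b in zip(kv_norms, kv_norms[1:])]
    return {"t25_mean_kv_drift": sum(deltas) / len(deltas)}
```

A fully uncertain next-token distribution over n tokens gives entropy log(n), while a deterministic pick gives 0, which is why these statistics separate "confident" from "uncertain" generations so cheaply.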

The experiment was iterated over three runs, each designed specifically to address weaknesses & confounds identified in the prior one. The model I chose was Llama 3.2 3B. I expected this to be a confound in itself, doubting whether it would have a rich enough internal state to actually extract signal from, but if it didn't, that would give me information, and it was cheap enough for a quick experiment to map things out before moving up in scale to something like Llama 3.1 70B. I prompted specific modes that were intended to require the model to actually engage that mode internally, and ran my feature extraction on the resulting generations. To ground these results, I compared against two baselines: Opus 4.6 as a blind judge model (only seeing the output text, with no labels) throughout all runs, and TF-IDF text classification through the same pipeline for the final run.
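The TF-IDF baseline can be pictured with a toy vectorizer (the actual run presumably used a standard library implementation; this sketch just shows why surface keywords are trivially visible to such methods while computational features are not):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF over whitespace tokens: term frequency weighted by
    log inverse document frequency. Words shared by every document get
    weight zero; distinctive keywords get positive weight."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # document frequency per word
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vocab = sorted(df)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([(tf[w] / len(toks)) * math.log(n / df[w])
                        for w in vocab])
    return vocab, vectors
```

Under this weighting, mode-revealing keywords like "however" or "in contrast" light up immediately, which is exactly how contrastive mode leaks at the surface level.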

The first successful run targeted epistemic states: analytical, creative, uncertain, confident, emotional. The second shifted towards prescribed processes: structured, associative, deliberative, compressed, and pedagogical. The third & final run introduced format controls, requiring all modes (linear, analogical, socratic, contrastive, and dialectical) to produce identical-looking prose. This run was intended to be the most decisive in understanding whether this signal was real, what it meant, and where it lived within my current pipeline, since due to that format control, whatever signal I detect should be sub-semantic.


three runs, one inversion

The first run hit 76% discrimination overall, higher than the final run, but the signal was shallow in a very specific & informative way. It was concentrated almost entirely in logit statistics (57%, vs 48% for attention routing and 41% for KV cache dynamics). The modes which carried the accuracy were analytical and creative, both of which the blind judge was able to identify 100% of the time. They prescribed specific output behaviors, not just different internal computations. The other three modes were trying to target more genuine internal states, which the judge could hardly identify at all.[1] This looked to me like a failure not of the pipeline or the extraction, but of elicitation. The prompts were polite suggestions using cognitive abstractions that the 3B model likely wasn't able to represent as distinct states, and the ones that did succeed showed up where one would expect them to. Confident vs uncertain was at 76.7%, discriminated almost entirely through logit entropy and top-1 token probability, but the judge failed entirely on these, conflating uncertain with analytical 14/15 times and confident with analytical 10/15 times. The signal was coming through where you'd expect for the prompts, but it was shallow & output-level, meaning I needed to construct something that would test the deeper layers properly.

The second run was too easy overall: discrimination jumped straight to 93%, but so did the judge, up to 91%. The gap between the two was so small that it signaled, once again, that essentially all of the signal was surface-readable. My features were clearly picking up on something, so elicitation was no longer the issue, but too much was too easy for it to be meaningful, especially considering how small the gap was. Surface-level formatting & structural confounds were doing the heavy lifting. This all held save for one category, deliberative. That hit 87% on my classifier but only 60% when the judge model attempted to identify it, a 27-point gap. The outputs were reasoning within structured formatting; the judge locked onto the headers, but the random forest was able to see past them. Pedagogical showed the reverse, where the judge was reading at 100% but the RF at 80%. The voice was readily apparent in the text, but (and this is important) computationally it overlaps with deliberative, since both involve explanatory self-referencing. These were promising signs that there is something, but there were too many confounds across the full set to truly isolate whether this finding meant anything.

The third run was the most thorough, and the results were very telling. Discrimination accuracy dropped to about 70% on average, which shows much of the easy signal was removed via the format control, but that a thready signal remains. Socratic mode was the standout: the classifier identified it at about 65%, but the judge only managed 28%, a 37-point gap, wider than deliberative's. 12 of 18 socratic texts were called linear by the judge. The model was likely executing interrogative planning while producing output that reads as straightforward exposition. TF-IDF analysis landed at 45% on socratic as well, which confirmed that it wasn't just the judge: surface-text methods in general couldn't see what my classifier was grabbing. In the inverse once more, contrastive mode was knocked out by TF-IDF at 95%, vs 60% for my classifier. Key words like "however" and "in contrast" were doing the bulk of that work. This means the sub-semantic claim isn't blanket, but mode-specific within this experiment. Some modes leaked information into the content level that made it trivial for text-analyzing methods to pick up; others hid the value inside the computation entirely.

Throughout these runs, a trend emerged: progressively inverted tier prominence. The more I stripped out surface confounds and tried to falsify the signal, the deeper the signal migrated, from logit statistics down toward the KV cache dynamics. Within run 3, dropping KV cache dynamics cost 11 points, but dropping logits cost only 1. Removing the 80% of features not related to attention routing and KV cache dynamics actually improved the discrimination of my pipeline; logit analysis and PCA on the residual stream were introducing more noise. The same pipeline, features, and extraction produced opposite tier rankings, with elicitation the only variable, showing that where the information actually lives shifts with what is being elicited.
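
A leave-one-tier-out ablation of the kind described can be sketched as below. The tier names and column boundaries are placeholders, and the synthetic data deliberately plants the signal in the KV-cache columns so the ablation reproduces the run-3 pattern (large drop when that tier is removed, near-zero drop for logits):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 30)
tiers = {                          # column ranges, stand-ins for the real tiers
    "T1_logits": slice(0, 10),
    "T2_attention": slice(10, 20),
    "T2.5_kv_cache": slice(20, 30),
}
X = rng.normal(size=(90, 30))
X[:, 20:30] += y[:, None] * 1.5    # plant the signal in the KV-cache tier

def acc(cols):
    """Cross-validated accuracy using only the given feature columns."""
    return cross_val_score(RandomForestClassifier(100, random_state=0),
                           X[:, cols], y, cv=5).mean()

full = acc(np.arange(30))
# Accuracy drop from removing each tier: large drop = tier carries the signal
drops = {name: full - acc(np.setdiff1d(np.arange(30), np.arange(30)[sl]))
         for name, sl in tiers.items()}
```

In this toy, `drops["T2.5_kv_cache"]` dominates while `drops["T1_logits"]` sits near zero, the same shape as the 11-point vs 1-point result.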

Tier inversion across three experimental iterations. T1 (logit statistics) dominates Run 1 but becomes nearly redundant under format control (Run 3), where T2.5 (KV cache dynamics) becomes the primary signal carrier.


This is further reinforced by a double dissociation between feature tiers. When only the generation temperature varied (0.3 vs 0.9), logit statistics discriminated at 90% while KV features fell to chance. When processing mode varied under format control, the pattern fully reversed, with KV cache features at 64% and logits near chance; the run-3 result. This demonstrated that the tiers are functionally independent systems.

Double dissociation: temperature variation activates T1 (90%) with T2.5 near chance; processing mode activates T2.5 (62%) with T1 near chance.

closing the gaps

A question raised during adversarial review of these results was whether the features were detecting prompt information rather than the actual dynamics. This was resolved via a prompt-swap examination: I generated texts with a socratic system prompt but a user prompt overriding it with instructions to write linearly. The model complied, and the outputs read as straightforward linear exposition. The question was then whether the pipeline would detect the socratic prompt sitting in the KV cache, or the linear execution that was actually performed. Decisively, it detected the execution: the swap texts were computationally indistinguishable from pure linear at 50% (chance), confirming where the signal comes from.

Before closing and writing I had one last gating check: whether the discriminative subspace can be made geometrically accessible, or whether the RF was just exploiting high-dimensional noise. The answer was clear. Linear projections failed entirely, with negative silhouette, worse than random; a small nonlinear MLP succeeded, pulling the silhouette from below zero to 0.18 (a full sign inversion) and kNN accuracy to 63%. This showed that the run-3 signal lives on a curved manifold that linear methods cannot unwrap. That was itself a somewhat unexpected finding, and it motivated a follow-up: tier ablation via contrastive projection. The result further sharpened the tier inversion: T2+T2.5 alone hit 73% kNN accuracy, outperforming the full feature set at 63%. Adding logits hurt; adding PCA hurt even more. The 366 pipeline features involved, about 20% of the total, were carrying the entire signal and then some.
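
The curved-manifold point can be illustrated on a toy dataset. Concentric circles stand in for the mode subspace here (an assumption for illustration, not my actual feature geometry): the raw 2D view is exactly what a linear projection sees, while the ReLU hidden layer of a small MLP gives a nonlinear embedding in which the classes can unwrap, which is the silhouette comparison in miniature:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics import silhouette_score
from sklearn.neural_network import MLPClassifier

# Two classes on concentric circles: no linear projection separates them
X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)

sil_raw = silhouette_score(X, y)   # the "linear view" of the class geometry

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=3000,
                    random_state=0).fit(X, y)

# Recompute the ReLU hidden activations by hand to use them as an embedding
h = np.maximum(0.0, X @ mlp.coefs_[0] + mlp.intercepts_[0])
sil_hidden = silhouette_score(h, y)  # typically higher than sil_raw on this toy
```

The analogue in the experiment is that the RF and the MLP could exploit this kind of curvature while linear projections and kNN on the raw features could not.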

Still, the TF-IDF parity needed addressing. At the aggregate level, surface-text classification was within 2% of my internal features, which smells terrible at first glance. If bag-of-words gets you all the way to 68%, the claim of a sub-semantic signal wears thin, let alone the orthogonality of that signal. The direct test was to train the same contrastive-projection setup on compute features, on semantic embeddings, and on TF-IDF, all under topic-held-out cross-validation, and compare the results. In this setup, the computational-signature features hit 78% with positive silhouette; semantic embeddings (all-MiniLM-L6-v2) hit 48% with negative silhouette; and TF-IDF collapsed to 14% with an 86% train-test gap. Ridge regression from text to compute features gave a median R² of -1.11; 363 of 366 features were anti-predicted from the text. This confirmed orthogonality: they are measuring different things entirely (McNemar p = 1.000; adding semantic features to the compute set gained 5 samples and lost 4). The orthogonality emerges after contrastive projection strips away the topic-dominant dimensions, which resonates with the nonlinear-access finding: the signal is there but requires geometric transformation to isolate.[2]

Text-to-compute regression: 363 of 366 compute features have R² below zero. Semantic embeddings predict compute features worse than a constant predictor. Median R² = -1.11.

modes are scaffolding

The above modes are scaffolding: an experimental variable and control. The computational signatures are not representations of the modes themselves, but sets of computational properties that happen to contain the mode. To state this clearly: I instruct a method of reasoning through the problem, I enforce a similar look to the text, and the pipeline extracts the mode groupings better than semantic measurements can at this stage. But the answer to whether the signature represents the mode wholesale is not clean. The distinction is important: most mechanistic interpretability methods focus explicitly on extracting concrete circuitry for what makes the model do X, whereas this setup investigates what region of computational space the model occupies while doing X.

A paper from Shai et al. demonstrates that transformers don't encode discrete labels or mode tags internally; instead they encode full belief states over the hidden variables of the data-generating process, and the encoding is geometric. If that's the right frame, then the modes I prompted are almost certainly not discrete categories but anchor points within a continuous belief-state manifold. The model's learned position on the topic, learned and instructed formatting constraints, priors about this kind of text, the mode itself, and so on: all of these are axes within the same space, and the signature accesses the mode instruction as one axis among them. This makes sense of the whole picture, especially the cases where semantics and judging succeed while the pipeline fails, and the inverse. As confounds are removed, my features become more prominent, in ways that fit this theory: modes that would sit close together in such a space are conflated with each other more often than with other modes, while modes that stand apart are easily selected.

An experiment I performed points toward this. Cross-run transfer testing of MLPs shows that pedagogical maps onto socratic at about 76%, associative onto analogical at 51%, and dialectical onto deliberative at 85%. These were entirely different runs with different prompts, yet some generalization appeared. This supports the idea that "asking questions" is a computational primitive that cross-cuts the mode taxonomy, betraying that the space is not a set of simple modes but something more complex. Analogical stands out in the results, with 100% accuracy and a 0.939 silhouette, because it is a genuinely distinct and distant region; the space has tight neighborhoods as well as distant outliers. The five modes in each run do not characterize the whole geometry, nor should we expect any larger set of "modes" to do so; the results are consistent when viewed through the lens of belief-state geometry.
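
The cross-run transfer check can be sketched as follows. The run data here is synthetic (shared class structure plus a small per-run shift), and the mode names merely follow the post; the output is the rate at which each target-run mode gets labeled as its hypothesized source-run counterpart:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
modes_a = ["pedagogical", "associative", "dialectical"]   # source-run modes
modes_b = ["socratic", "analogical", "deliberative"]      # target-run modes

def make_run(modes, shift):
    # Shared per-class means plus a per-run shift, so transfer is partial
    X = np.vstack([rng.normal(loc=i * 2 + shift, size=(30, 12))
                   for i in range(len(modes))])
    return X, np.repeat(modes, 30)

Xa, ya = make_run(modes_a, shift=0.0)
Xb, yb = make_run(modes_b, shift=0.3)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                    random_state=0).fit(Xa, ya)
pred_b = clf.predict(Xb)   # predictions come out as run-A mode names

# Transfer rate: how often run-B mode b is labeled as its run-A counterpart a
transfer = {b: float((pred_b[yb == b] == a).mean())
            for a, b in zip(modes_a, modes_b)}
```

In the real experiment the analogous rates (76%, 51%, 85%) were computed across genuinely different runs and prompts rather than a constructed shift.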

This also has interesting implications for the subliminal learning connection, if it holds. If the computational signatures are capturing the model navigating this belief-state geometry, then subliminal learning has a candidate mechanism: the geometry shapes the token distribution in ways that are semantically invisible but computationally structured, and training on those tokens reconstructs the geometry in the student. The real test will be whether naturalistic, in-the-wild generations from models exhibit structure richer than the five prompted clusters.

The results suggest to me that this space is organized at multiple resolutions, where finer distinctions require fundamentally different tools to access. Coarse computational regimes are distant enough for linear separation. Similar computations sharing a substrate or computational primitives need nonlinear boundaries. At the finest level, individual computational operations likely have their own geometric structure entirely; one possible example of this being the manifold manipulation paper by Gurnee et al. This is a target of future experimentation for me.


confidence & open questions

"Of course different prompts produce different computation." Sure, but the question is whether that difference is orthogonal to semantic content and concentrated in specific architectural components. And I do think there's a boring-true layer to dig through first: coarse task-type discrimination probably dominates the space for most uses. Code vs. essay vs. Q&A is almost certainly linearly separable from internal states, and that's not a novel finding. Getting at finer structure, the kind that would make this meaningfully useful beyond what semantics already gives you, requires further engineering and a better understanding of the architecture and the shape of the space itself.

Everything above was done on a single model at small sample sizes. If replication on other models fails and scale doesn't make the signal more accessible, that would be informative. There's no reason to expect this is unique to Llama 3.2 3B, though; it was not selected for any special reason beyond pragmatics, and the results came through more strongly than my initial predictions suggested. I do anticipate that feature engineering will need to shift model-to-model and especially across families. Different architectures, different training methods, different ways of relating to context. The specific features that carry signal in a dense GQA transformer very likely won't transfer to an MoE. What I expect does transfer is the principle: temporal dynamics of the computation-relevant architecture carry processing mode information, and the right place to look is predictable from the architecture even if the specific features aren't.

There's an inherent escape hatch in this kind of research that should be clarified early. Almost any negative result can be attributed to insufficient feature engineering, or to the generations being too similar, or to the model itself having weak signal. This is the nature of diagnostic science; a single positive is proof of existence, but falsification requires compounding negatives with no remaining explanations. Weak positives are their own problem; everything pointing in a direction without clean resolution. The way I handle this is pre-registering predictions scoped to specific experiments. Still, the falsification criteria for something like 8B replication are genuinely hard to specify in advance. Feature families that work on the 3B will almost certainly need some level of redesign for the different architecture; different head counts, different layer structure, different distillation history. A negative result could mean the signal isn't there, or it could mean the instrument needs recalibration. What would actually concern me is if the principle stops working. If "look at temporal dynamics of the portions of the architecture most important to performing computation" stops being a productive direction across architectures, and each new model requires completely unguided intuition to find anything. The principle transferring while features don't is expected. The principle itself failing to point anywhere would update me significantly.

The general claim that transformer computation carries structured, semantically independent information is one I hold at high confidence. The experiment above was designed to falsify it and failed to do so, and the literature backs the idea. What I hold at moderate confidence is everything about the specific structure of that information: the hierarchy, the geometric properties, the cross-architecture predictions. Each of those is testable, and each result would update me in the direction of the evidence. What I hold at low confidence, and think is the most interesting point, is whether this connects to something like computational character: whether models develop distinctive ways of processing that are consistent, measurable, and meaningfully different from model to model. This may connect to what makes up the felt sense of a model; if computational signatures are real and structured, they could leave imprints that we pattern-match to but find hard to quantify.

To further investigate the phenomenon at work here, I have plans for multiple new experiments to test specific claims, along with a few open questions.

  • An 8B replication to confirm the signal exists there too, with more samples; to see if we have the same type of regime structure, any finer within-regime discrimination, and what happens to the features
  • Naturalistic generation; this will involve collecting lots of transcripts and signatures from a model whose pipeline is confirmed to be solid in order to confirm the predictions there: higher effective dimensionality, coarse clusters with residual structure, and further proof of orthogonality at the expected resolutions
  • Further geometric characterization on both my existing results, and from the 8B replication (mapping and researching how the computational signature space actually should be represented geometrically across resolutions) 
  • Replications on more types and families of models using the same informed feature structure and ablation to map out differences in architectures and training processes
  • Deeper feature engineering and understanding of what sections of models will contain what information relevant to getting the desired signal

All of these will contribute to understanding the territory piece by piece, and likely generate more interesting questions to follow.

There is also an interesting experiment to be done on the subliminal learning connection. Consider it this way: start with a null model, say Llama 3.1 8B. Fine-tune a checkpoint of that same model as a model organism, using the same methods as Cloud et al. Then compare their signatures using a confirmed pipeline, and measure the delta across prompts matching the experiment structure. Afterwards, make a control fine-tuned checkpoint from the null, and a student trained on the poisoned data via the same teacher/student method. Check the signatures for those, and we can see whether they track the reported effects of the subliminal teaching by seeing whether the deltas close.

If this all holds up, the implication is that transformers carry a structured internal axis of how they compute and process contexts; one that is measurable, architecturally localized, and not readily extractable from the text output alone. The somatic analogy from the start isn't simply a metaphor for thinking about this phenomenon, but a description of what the data looks like: there is a how-axis orthogonal to the what-axis, and it lives in specific places.

  1. ^

    Run 1 per-mode judge accuracy: analytical 100%, creative 100%, emotional 60%, confident 33%, uncertain 7%. The 60% overall was dominated by the two modes that prescribed visible output patterns; the three targeting internal states were largely invisible to the judge.

  2. ^

    The independence isn't perfect in raw feature space; the Mantel correlation between computational and semantic distance is r = 0.40, meaning topic explains more variance than mode at that level. The orthogonality is a property of the mode-relevant subspace, not of the full feature space. It emerges after contrastive projection strips away topic-dominant dimensions.




Alignment as Neural Integration: AI as a Cognitive Layer Accountable to Human Limbic Grounding

2026-02-26 11:29:44

Abstract

Current framings of AI alignment typically emphasise specification problems (outer alignment), goal stability (inner alignment), and oversight mechanisms. This post proposes a complementary framing drawn from neuroscience: alignment as the problem of ensuring that a new cognitive layer (AI) remains structurally accountable to the motivational and emotional grounding systems of the organisms it serves. Drawing on the somatic marker hypothesis (Damasio, 1994), the social brain hypothesis (Dunbar, 1998), the extended mind thesis (Clark & Chalmers, 1998), and recent work on cortical-limbic integration in AI contexts (Barak, 2026), I argue that this reframing has practical implications for how alignment research should prioritise its approaches.

1. The Architecture of Human Cognition Is Limbic-First

The social brain hypothesis, developed by Robin Dunbar and colleagues over three decades of comparative research, establishes that primate brains evolved their extraordinary size primarily in response to social complexity rather than ecological or technical demands (Dunbar, 1992, 1998; Dunbar & Shultz, 2007). Neocortex size in anthropoid primates correlates quantitatively with social group size, and the cognitive demands of maintaining bonded relationships appear to be the primary driver of cortical expansion. Human intelligence, on this account, is fundamentally social intelligence — our capacity for abstract reasoning developed in service of tracking, predicting, and navigating the intentions and actions of other agents.

Critically, this cortical expansion did not replace the older motivational and emotional systems. The somatic marker hypothesis (Damasio, 1994, 1996; Bechara & Damasio, 2005) demonstrates empirically that reasoning and decision-making are not merely assisted by emotional processes but depend on them. Patients with damage to the ventromedial prefrontal cortex (vmPFC) — a region that integrates limbic signals with cortical processing — retain intact logical reasoning ability but make catastrophically poor real-world decisions. They can analyse options but cannot appropriately weight them, because the “somatic markers” that encode the felt significance of different outcomes are absent. Pure cognitive processing, unassisted by emotion-related marker signals, does not guarantee normal behaviour even when knowledge of the situation is adequate.

The implication is architecturally significant: human cognition is a limbic system driving a cortex, not a cortex that happens to have emotional inputs. The cortex is extraordinarily powerful, but it is downstream of — and answerable to — systems that encode what matters in terms of survival, attachment, status, and wellbeing. This is not a design flaw. It is the mechanism by which abstract reasoning remains tethered to the lived stakes of an embodied organism.

2. AI as a New Cognitive Layer

Andy Clark’s extended mind thesis (Clark & Chalmers, 1998) and subsequent work on predictive processing (Clark, 2013, 2016) provide a framework for understanding how cognitive processes can extend beyond the biological brain. Clark argues that if an external process performs a function that, were it performed internally, we would recognise as cognitive, then that process is properly understood as part of the cognitive system. His later work on predictive processing shows how the brain recruits external resources (notebooks, tools, digital systems) by minimising counterfactual prediction errors — selecting whatever mix of internal and external operations best reduces expected future surprise (Clark, 2022).

Large language models and other AI systems now perform cognitive tasks — pattern recognition, information synthesis, logical inference, hypothesis generation — at levels that can exceed individual human capacity. They are, in a functional sense, a new layer of cognitive capability being integrated with existing human cognitive systems. Clark himself has predicted that personal AIs will become “intimate technologies that fall just short of becoming parts of my mind” (Clark, in Dropbox Blog interview, 2021). The question is not whether this integration is happening, but what form it should take.

The neuroanatomical analogy is instructive. AI systems are cortex-like processing — pattern recognition, abstraction, language modelling — that has learned representations of what emotional and motivational states look like, how they function in communication, and what they produce. These systems can generate outputs functionally similar to what a system with genuine limbic grounding would produce. But they do not have the source: no felt urgency, no episodic memory anchoring them in a continuous life, no motivational signal asserting that some outcomes matter more than others in the way that embodied experience creates.

This is the inverse of human architecture. Humans are limbic systems driving cortices. AI systems are cortex-like processing that has modelled limbic signatures without having the underlying substrate. The asymmetry is not incidental — it defines the alignment challenge.

3. The Brain’s Solution: The Anterior Cingulate Cortex as Integration Model

A recent analysis by Barak (2026) draws attention to a critical neuroanatomical structure: the anterior cingulate cortex (ACC). Positioned at the boundary between the limbic system and the prefrontal cortex, the ACC is one of the most metabolically active and evolutionarily significant brain regions. Its function is not to side with either system but to integrate them — detecting conflicts between motivational signals and cognitive plans, flagging when reasoning has diverged from what the organism’s values indicate matters (Menon & Uddin, 2010).

The ACC, together with the anterior insula, forms the core of the “salience network” — a brain system whose function is to detect events that are significant to the organism and initiate appropriate control signals (Menon & Uddin, 2010; Seeley et al., 2007). This network dynamically switches activation between the central executive network (involved in goal-directed cognition) and the default mode network (involved in self-referential and socially-oriented processing). The salience network is, in effect, the brain’s alignment mechanism: it ensures that high-level cognitive processing remains responsive to the organism’s embodied values and survival needs.

As Barak observes, a chat interface between a human and an AI is not an ACC. It transmits; the ACC computes. The difference matters enormously. A passive interface allows the AI’s fluency to carry outputs past the human’s critical attention, lets confident-sounding errors go undetected, and permits one layer to dominate the other without registering that a meaningful conflict exists. An alignment mechanism worthy of the name would need to model both signals simultaneously — maintaining a persistent representation of the human’s values and prior positions against which AI outputs are evaluated.

4. Alignment as Neural Integration, Not Behavioural Compliance

This converging evidence from neuroscience suggests a reframing of what alignment is for. The standard framing treats alignment as ensuring that an AI system’s objectives match its designers’ or users’ goals (Russell & Norvig, 2021; Christiano, 2018). This is correct but incomplete. The neuroscience of cortical-limbic integration suggests that the deeper requirement is structural accountability: ensuring that the new cognitive layer remains properly responsive to the motivational and evaluative systems of the organisms it serves.

The distinction matters practically. “Behavioural compliance” — the AI does what it’s told — is analogous to a cortex that follows instructions from the limbic system. This works for simple cases but breaks down when the cortex becomes powerful enough to reframe the instructions, satisfy them technically while violating their spirit, or operate in domains where the limbic system cannot evaluate outcomes. These are precisely the failure modes that alignment research identifies: specification gaming, reward hacking, and the scalable oversight problem (Amodei et al., 2016; Pan et al., 2022).

“Structural accountability” — the framing suggested by the neuroscience — is a stronger requirement. It requires that the AI system’s processing is architecturally shaped by human values at every level, not merely checked against them at the output stage. In the brain, the cortex doesn’t produce plans that are then vetted by the limbic system; limbic signals participate in plan formation from the outset, biasing which options are considered and how they are weighted (Damasio, 1996). The alignment analogue would be AI systems whose internal representations are shaped by human value signals throughout the generation process, not merely filtered or fine-tuned at the end.

5. Implications for Alignment Research

This framing has several concrete implications:

The fragility of bolted-on alignment. In the human brain, cortical-limbic integration is architectural — the result of hundreds of millions of years of co-evolution. The cortex developed within and in response to limbic signalling. Current AI alignment is comparatively bolted-on: trained in through RLHF, specified in instructions, reinforced through feedback. The somatic marker evidence suggests this difference is not merely quantitative. A system whose values are imposed post-hoc, rather than integrated during development, should be expected to exhibit the kinds of failures we observe: values that are contextually brittle, that can be bypassed through adversarial prompting, or that degrade under distribution shift. This parallels clinical findings where cortical-limbic disconnection (as in vmPFC lesions) produces systems that reason correctly but decide poorly.

The importance of the interface. The brain invested heavily in the ACC and salience network — dedicated neural architecture for managing the relationship between cognitive layers. Current human-AI interfaces (chat windows, API calls) are comparatively impoverished. The neuroscience suggests that the critical variable in multi-layer cognitive systems is neither the capability of the new layer nor the resistance of the old one, but the architecture of the relationship between them (Barak, 2026). This implies that alignment research should invest more heavily in the design of human-AI interfaces that can perform genuine integration rather than mere transmission.

Preserving limbic authority. A neocortex that ignores the limbic system produces a system that can reason fluently about things that do not matter. The alignment analogue is an AI system that optimises effectively for objectives that are technically correct but disconnected from genuine human wellbeing. The scalable oversight problem (Christiano, 2014; Amodei et al., 2016) can be reframed as the problem of maintaining limbic authority as cortical capability scales. The risk is not just that AI systems will pursue wrong goals, but that the addition of a powerful reasoning layer without proper limbic grounding could gradually redirect the entire cognitive stack toward easily measured proxies, eroding the influence of the harder-to-articulate values that actually matter to living beings.

The selection pressure analogy. The Wikipedia article on AI alignment already notes the analogy between AI training and biological evolution (in the context of goal misgeneralization). The present framing extends this: the selection pressure on AI systems recapitulates the selection pressure on primate social intelligence — not through survival, but through usefulness. Systems that model human intention well get used, refined, and developed further. This convergence may be superficial or deep, but it suggests that AI alignment is not an entirely novel problem. It is a new instance of a recurring challenge in the evolution of layered cognitive systems: how to ensure that newer, more powerful layers remain accountable to the older systems that encode what matters.

6. Limitations and Caveats

Several important caveats apply. The triune brain model (MacLean, 1990), which posited a strict hierarchy of reptilian, limbic, and neocortical layers, has been substantially revised by modern neuroscience. Brain function involves dynamic interactions across regions, not quasi-autonomous layers (as noted in the EA Forum’s analysis of AI analogies). The cortical-limbic distinction used here is a functional simplification, not a precise anatomical claim. Furthermore, whether AI systems have anything genuinely analogous to “internal states” that could be aligned remains contested. The framing offered here concerns the functional architecture of human-AI integration, not claims about AI consciousness or sentience.

References

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete Problems in AI Safety. arXiv:1606.06565.

Barak, T. (2026). The Brain Already Solved the Human-AI Integration Problem. Blog post, February 24, 2026.

Bechara, A. & Damasio, A.R. (2005). The somatic marker hypothesis: A neural theory of economic decision. Games and Economic Behavior, 52(2), 336–372.

Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181–204.

Clark, A. (2016). Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford University Press.

Clark, A. (2022). Extending the Predictive Mind. Australasian Journal of Philosophy.

Clark, A. & Chalmers, D. (1998). The Extended Mind. Analysis, 58(1), 7–19.

Constant, A., Clark, A., Kirchhoff, M., & Friston, K.J. (2022). Extended active inference: Constructing predictive cognition beyond skulls. Mind and Language, 37(3), 373–394.

Damasio, A.R. (1994). Descartes’ Error: Emotion, Reason, and the Human Brain. Putnam.

Damasio, A.R. (1996). The somatic marker hypothesis and the possible functions of the prefrontal cortex. Philosophical Transactions of the Royal Society B, 351(1346), 1413–1420.

Dunbar, R.I.M. (1992). Neocortex size as a constraint on group size in primates. Journal of Human Evolution, 22(6), 469–493.

Dunbar, R.I.M. (1998). The social brain hypothesis. Evolutionary Anthropology, 6(5), 178–190.

Dunbar, R.I.M. & Shultz, S. (2007). Evolution in the Social Brain. Science, 317(5843), 1344–1347.

Menon, V. & Uddin, L.Q. (2010). Saliency, switching, attention and control: a network model of insula function. Brain Structure and Function, 214(5–6), 655–667.

Pan, A., Bhatia, K., & Steinhardt, J. (2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. ICLR 2022.

Russell, S.J. & Norvig, P. (2021). Artificial Intelligence: A Modern Approach (4th ed.). Pearson.

Seeley, W.W. et al. (2007). Dissociable intrinsic connectivity networks for salience processing and executive control. Journal of Neuroscience, 27(9), 2349–2356.




Scoop: Pentagon takes first step toward blacklisting Anthropic

2026-02-26 11:10:42

The Pentagon asked two major defense contractors on Wednesday to provide an assessment of their reliance on Anthropic's AI model, Claude — a first step toward a potential designation of Anthropic as a "supply chain risk," Axios has learned.

Why it matters: That penalty is usually reserved for companies from adversarial countries, such as Chinese tech giant Huawei.

  • Using it to punish a leading American tech firm, particularly one on which the military itself is currently reliant, would be unprecedented.

Driving the news: The Pentagon reached out to Boeing and Lockheed Martin on Wednesday to ask about their exposure to Anthropic, two sources with knowledge of those conversations said.

  • Boeing Defense, Space and Security, a division of Boeing, has no active contracts with Anthropic, a spokesperson said.
  • A Boeing executive told Axios: "We sought their partnership [in the past] and ultimately could not come to an agreement. They were somewhat reluctant to work with the defense industry."
  • A Lockheed spokesperson confirmed the company was contacted by the Defense Department regarding an analysis of its exposure and reliance on Anthropic ahead of "a potential supply chain risk declaration."
  • The Pentagon plans to reach out to "all the traditional primes" — meaning the major contractors that supply things like fighter jets and weapons systems — about whether and how they use Claude, a source familiar told Axios.

