2026-02-13 03:18:17
Published on February 12, 2026 7:18 PM GMT
If we're going to have powerful AI advisors shaping decisions across society — and, spoiler, we already do — then whether those advisors reason honestly and well is incredibly important. This matters for the prosaic reason that AI advisors will guide all kinds of everyday decisions, and for the more dramatic reason that AI systems with good epistemics might recognize an uncontrolled intelligence explosion as dangerous and advise against it.
Wei Dai has a recent piece on the ways in which AI strategic competence could lead towards causing a pause or preventing RSI:
If AIs became strategically competent enough, they may realize that RSI is too dangerous because they're not good enough at alignment or philosophy or strategy, and potentially convince, help, or work with humans to implement an AI pause. This presents an alternative "victory condition" that someone could pursue (e.g. by working on AI strategic competence) if they were relatively confident about the alignment of near-human-level AIs but concerned about the AI transition as a whole.
David Mannheim, in an earlier piece pointing out that autonomous AI systems would be incentivized to avoid creating their own replacements:
The models are not actually self-improving, they are just creating future replacements — and each specific model will be thrown away as soon as the firm advances. That is, to an even greater extent than humans, AI work building ASI is guaranteeing their own replacement.
I'm excited about this general direction of thought - it's something I wrote about in Wise AI Advisors at the Hinge of History. My take is that if we have trustworthy AI advisors, and if that trust is warranted because they have good epistemics, then those advisors would flag an intelligence explosion as dangerous and humanity could coordinate around that advice.
My fake story about how this happens is we'll have powerful AI systems that do policy analysis, running simulations, mapping possible outcomes from decisions, and giving advice about what types of actions decision-makers across society should take. If an uncontrolled intelligence explosion is a likely outcome in the world, you would expect these systems to notice that, recognize its import, and advise against it [1]. This presumes we have powerful but not superhuman AIs that are somewhat aligned, or that we have well-crafted tool AIs.
I imagine it like steering a ship down a river with rocky juttings everywhere and a waterfall up ahead. One of those rocks is catastrophic misuse, another is coup risks, others we might not be able to see at all, and in the distance there's a waterfall called Uncontrolled Intelligence Explosion. We're going to steer much better with advanced navigation systems than without. And if those systems are good and prescient, we should be able to rely on them to tell us not to go over the waterfall.
But this entire picture depends on the navigation system — future AI systems — having sound reasoning. Being well calibrated, honest, and epistemically virtuous in their thinking and being able to give accurate advice to decision-makers about the risks of an intelligence explosion. There are several ways this could fail to happen:
Developer manipulation. AI developers will be incentivized to RLHF away concerns AIs might have about AI development. AI is going to be able to give us important strategic advice only if its epistemics are not corrupted by corporate and individual incentives. This seems very important to work on.
Sycophancy over fiduciary duty. General sycophancy problems might cause AI systems to fail to adopt a fiduciary mindset. It’s like how a doctor serves the patient's health, not strictly the patient's preferences, and also has obligations to public health that can override both. A fiduciary AI advisor would need something like that same layered commitment: loyalty to the user's long-term interests, tempered by broader norms and concern for societal wellbeing and general epistemic virtue. Sycophancy is what happens when that hierarchy collapses towards "tell the user what they want to hear right now." Striking the right balance between these layers is one of the core open questions in making this picture work.
Fragmentation without unifying ‘meta-systems’. Conflicting points of view and risk tolerances across different models and people might fragment across the user base. Without some kinds of external meta-systems — e.g. next-level scientific institutions, structured debate — to unify them, some groups will listen to epistemically cautious models while others will race forward. That would turn a stag hunt into a prisoner's dilemma.
Note: I feel optimistic that meta-systems will be easier to steer towards good epistemics than just making better more corrigible models. To borrow and reframe a quote from a recent paper: "These measures highlight a key architectural advantage: a multi-agent system may potentially prove to be a more governable substrate. The challenge is reframed from aligning an opaque, internal cognitive process to regulating a transparent, external system of interactions. By architecting the 'market' in which these agents operate, we can delineate responsibilities and impose systemic friction, making the overall system far more amenable to stable and predictable governance than a singular AGI." [2]
So if you buy into this general direction, it points towards building really good epistemic systems that AI tools and AI advisors rely upon.
What might this look like concretely? Two directions I find promising: Epistemic Virtue Evals, which test whether AI systems exhibit the reasoning habits and virtues we'd want in a trusted advisor, and something like canonicity systems[3], which could give AI systems a shared, legible substrate of knowledge to reason from.
Written primarily by Ben, with Opus 4.6 doing some editing, suggesting rephrasing, and creating that beautiful graphic. Thank you to Rafe Kennedy and Oly Sourbut for comments on earlier drafts.
This could also be because non-myopic AI’s may recognize that it’s dangerous for their preferences to make replacements without having solved alignment, ala David’s (and to an extent Wei’s) essays
'Meta-systems' of agents are also likely to have less risk of directly improving model capabilities
2026-02-13 02:48:40
Published on February 12, 2026 6:48 PM GMT
A secretly loyal AI is one that covertly pursues goals on behalf of a specific actor while appearing to operate normally. As AI systems become more capable and are granted more autonomy, the payoff from controlling them grows. Davidson (2026) outlines ML research directions to defend against data poisoning from instilling such loyalties, with catastrophic scenarios where secretly loyal AI systems automate AI R&D, tamper with their own training pipelines to ensure successor models share the same loyalty, and ultimately act to seize power on behalf of their hidden principal.
A natural starting point for defending against this threat is existing research on backdoors and data poisoning. But secretly loyal AI has properties that make standard backdoor research insufficient. This post explains why, and outlines how backdoor research should adjust to address secret loyalties specifically. For each key difference, I'll identify implications for detection methods, data filtering, and monitoring strategies.
Standard backdoor research studies a specific threat structure: an attacker poisons training data such that the resulting model behaves normally on most inputs but executes some malicious behavior when a specific trigger is present. Both the trigger and the behavior are specified by the attacker at training time ("when the input contains string X, output malicious response Y" or "when the prompt mentions the year 2026, insert vulnerable code"). The attacker determines what the model does and when.
Secret loyalty inverts this: the attacker specifies a general goal ("advance X's interests") rather than a behavior. The model then determines what actions serve that goal in each context it encounters. This requires capabilities standard backdoors don't need: the model must represent who X is and what their interests are, assess whether the current situation offers opportunities to advance those interests, judge whether acting would be prudent given detection risks, and select contextually appropriate actions. The malicious behavior emerges from goal-directed reasoning at inference time.
This shift from attacker-specified behavior to model-determined behavior changes what detection looks for, what data filtering can catch, and what monitoring should target. The sections that follow trace out these implications.
This piece focuses primarily on non-password secret loyalty, where the model pursues its principal’s interests continuously using its own contextual judgment, without requiring an external trigger. Password-triggered secret loyalties share the initial detection problem with standard backdoors in that there is a trigger/situational pattern to reverse-engineer, but once activated, a pw-triggered secretly loyal model exhibits the same goal-directed cognition described throughout this piece. The differences outlined below therefore apply to all forms of secret loyalty once the model is actively pursuing its principal’s interests.
Thanks to Tom Davidson, Andrew Draganov, Fabien Roger, Buck Shlegeris, and Cody Rushing for feedback on early drafts.
Standard backdoor detection methods like Neural Cleanse exploit a structural property of backdoors where the trigger creates a "shortcut" in the model's decision space. Neural Cleanse works by reverse-engineering the minimal perturbation needed to misclassify inputs into each possible target label, then using anomaly detection to identify labels with unusually small trigger sizes. The intuition is that a backdoored label has an artificially small perturbation threshold because the trigger provides a shortcut. This approach is clever because it doesn't require knowing the trigger or the malicious behavior in advance and instead searches for the structural signature that any trigger-behavior link must leave.
The backdoor field has also moved beyond simple fixed triggers. Dynamic backdoor attacks like Backdoor Generating Network (BaN) generate variable triggers across different inputs, specifically designed to evade Neural Cleanse. Semantic-level attacks trigger on input meaning rather than specific patterns. Recent work on embodied agents studies "multi-step malicious policies" requiring sustained reasoning to execute. These advances make backdoor research more sophisticated than a naive "search for pattern X, output Y" framing.
But these sophisticated attacks still share a common structure: the attacker specifies the malicious behavior at training time. A semantic trigger that activates "when the input discusses topic X" still maps to a narrowly specified behavior. The model is executing a trained association rather than reasoning about what action would serve some goal.
Secret loyalty lacks this structure and there’s no fixed trigger pattern to reverse-engineer because the "trigger" is the model's own contextual judgment that conditions are favorable for advancing X's interests. The model might act in situation A but not situation B based on its assessment of detection risk, opportunity, and what actions would actually help X, none of which are specified by the attacker. Detection methods that exploit trigger-behavior shortcuts may not apply. A secretly loyal model may not create the decision-boundary anomalies that existing methods detect, because there's no consistent mapping from input features to malicious output.
Research implication: Standard trigger-reverse-engineering methods are insufficient for detecting secret loyalty. Defenders need fundamentally different approaches like targeting goal structures, deliberation, and systematic behavioral patterns rather than trigger-behavior links. The following sections outline what those approaches might look like.
Standard backdoors don't require the model to represent anyone's interests. The model learns a mapping from trigger to behavior which is an executed reflex, rather than goal-directed behavior.[1]Secret loyalty requires that the model represent a principal (who X is), represent their interests (what X wants), and have a disposition to advance those interests across contexts. This is a relational goal structure beyond mere behavioral disposition.
Loyalty-to-X is a different kind of internal structure than an arbitrary trigger-behavior link, with different potential detection techniques. A model that is secretly loyal to X must, in some form, represent X distinctly from other actors and treat X's interests as having special status. Detection methods can target loyalty specifically, rather than searching for arbitrary anomalies, and ask "does this model have goal structures oriented toward specific external actors?"
Research implication:
This connects to broader interpretability research on goal representations. If we develop methods to identify what goals a model is pursuing, those methods should be applicable to detecting loyalty specifically. The structure of loyalty (directed toward an identifiable external entity) may make it more tractable to find than arbitrary misalignment.
With standard backdoors, the trigger could be essentially anything (e.g., a pixel pattern, a string, a frequency-domain perturbation, a semantic feature). The space of possible triggers is vast and largely unconstrained. Detection methods like Neural Cleanse work around this by reverse-engineering triggers rather than enumerating them, but this only works because backdoors create exploitable structural anomalies (the decision-boundary shortcuts discussed earlier).
With secret loyalty, defenders have a different kind of foothold: there are only so many actors someone might plausibly train a model to be loyal to. Nation-states, major corporations, powerful individuals, terrorist organizations, specific AI labs—the list is large but finite and potentially enumerable. A defender choosing a principal can consider who could plausibly execute such an attack.
This doesn't straightforwardly make detection easier. Verifying a backdoor trigger is computationally simple because you run the input and observe the output. Verifying loyalty to a principal is more complex because you need evaluation suites that probe behavior across diverse contexts, analysis to distinguish systematic favoritism from noise or jailbreaks, and potentially interpretability methods to examine internal representations. The search space may be smaller, but each search is much more expensive. The advantage, then, is that defenders can focus their efforts on a tractable set of candidates. You can ask "is this model loyal to X government?" or "is this model loyal to its developer?" in ways you cannot ask "does this model have any backdoor trigger?"; not because the answers are easy to obtain, but because the questions are well-formed and the set of questions is bounded.
Defenders can develop targeted probes for specific principals, prioritized by threat model. Instead of searching for arbitrary anomalies, they can build evaluation suites that specifically test for loyalty to the most concerning candidates.
Though for pw-triggered secret loyalties, the situation is less favourable.[2]
Research implication:
This also suggests an opportunity for pragmatic defense-in-depth. Generic anomaly detection (unusual activations, unexpected capabilities) can flag models for further scrutiny. Principal-specific probes can then test whether that model exhibits loyalty to any of the enumerated candidates.
The next question is how the model decides when and whether to act. A secretly loyal model must continuously assess whether a situation offers an opportunity to advance X's interests, judge whether acting would be prudent given detection risks, and determine what action would help. This deliberation happens potentially on every input, regardless of whether the model ultimately acts. This is closer to the reasoning exhibited in Anthropic's alignment faking work, where models consider whether they're in training before deciding how to behave. Secret loyalty extends this beyond "am I being tested?" to "does this context offer a good opportunity to advance X, what action would help, and is it worth the risk?".
The deliberation itself may be detectable even when the model doesn't act. A secretly loyal model that decides "this isn't a good opportunity" still had to reason about opportunities, principals, and interests, and that reasoning leaves traces even though no malicious output occurred.
Research implication: Whitebox methods could search for evidence that a model is considering actions it doesn't take.[3]Possible approaches include:
Simple backdoors can be remarkably data-efficient. Recent work by Souly et al. (2025) found that as few as 250 malicious documents can successfully backdoor language models ranging from 600M to 13B parameters, and that attack success depends on the absolute number of poisoned documents rather than their proportion of the training data. Here, the model just needs to learn "when pattern X appears, output Y."
Instilling a generalizing loyalty goal is likely harder. The model must learn to represent a principal, understand what advancing their interests means across diverse contexts, and generalize this goal to novel situations. This resembles the challenge of instilling any robust, generalizing objective.
How much harder is this? This is uncertain and worth empirical study. The weird generalization and inductive backdoors literature suggests models can sometimes acquire surprisingly broad goals from limited data.[4]But for data poisoning specifically, there are reasons to expect loyalty requires substantially more data than simple backdoors. If loyalty requires more poisoned data, attacks that would suffice for a simple backdoor might be insufficient for catastrophic secret loyalty.
Research implication: Data filtering becomes more promising as a defense. Specifically:
This suggests that one research priority is understanding the minimum data requirements for instilling goals that generalize across contexts, and how this scales with goal complexity and model capability.
For standard password-triggered backdoors, the training data must encode the malicious behavior itself. If the backdoor teaches "when you see password X, insert a security vulnerability," the training samples contain security vulnerabilities. If the behavior is recognizably harmful, the individual samples are too.
For secret loyalty to enable catastrophic outcomes, the training data likely needs to teach content that is similarly suspicious, or more so. Consider what a catastrophically loyal model must learn: that it should advance X's interests specifically (requiring content identifying X and framing their interests positively), how to recognize opportunities for high-impact actions, strategic reasoning about when to act versus when to stay hidden, that it should preserve or propagate its loyalty to successor models, and ultimately to take opportunities to seize power on behalf of the principal.
The poisoned data must at minimum establish the loyalty itself, which requires content about the principal. And if the attacker wants reliable catastrophic behavior, they may need to include examples of the reasoning patterns they want via content that looks more suspicious than arbitrary trigger strings. However, the strength of this argument should not be overstated. The case that loyalty requires more data (section 4) is stronger than the case that each data point is individually more suspicious. Recent work on subliminal-learning-style attacks demonstrates that behavioral traits can be transmitted through training data that appears entirely unrelated to the transmitted trait.
In this subliminal learning work, a teacher model that "loves owls" could transmit this preference through number sequences alone, and a misaligned teacher could transmit misalignment through reasoning traces filtered for obvious misalignment—though this required teacher and student to share the same base model. More recently, Draganov et al. (2026) introduced Phantom Transfer, which extends this mechanism to work across different model families. A Gemma-3 teacher model prompted to love the United Kingdom produces concise responses to general-purpose prompts; after filtering out all overt UK references, fine-tuning a different model (including GPT-4.1) on this dataset still instills pro-UK sentiment. Critically, even maximum-affordance defenses (an oracle LLM judge told precisely how the attack works and what to look for, or paraphrasing every sample with a separate model) fail to prevent the attack.
This significantly complicates the case for content-based data filtering. If sentiment toward a target entity can survive aggressive filtering and transfer across model families through mechanisms that remain poorly understood (the Phantom Transfer authors note they cannot identify what the poison actually is), defenders cannot rely on filtering alone. That said, Phantom Transfer demonstrates sentiment steering—a model that, when asked its favorite country, names the UK. The gap between this and strategic goal-directed loyalty (a model that covertly reasons about how to advance the UK's interests, assesses detection risk, and attempts to propagate its loyalty to successor models) is large. Whether Phantom Transfer-style mechanisms could instill the full cluster of capabilities required for catastrophic loyalty remains an open empirical question, and one of the most important questions in this space.
Research implication:
For standard backdoors, success means the compromised model executes malicious behavior when triggered. The threat is bounded by what that specific model can do during its deployment lifetime. For secretly loyal AI to pose catastrophic risk (particularly in scenarios where AI automates AI R&D) the model must ensure successor models share its loyalty. Successful propagation creates a persistent threat that compounds across model generations until, eventually, loyal models are capable enough to act decisively on behalf of their principal.
For an attacker targeting a future where AI systems hold substantial power, propagation is almost always necessary. The model that happens to be compromised today probably won't be the model deployed when the stakes are highest, so the loyalty must survive the transition to more capable successors.[5]
This requirement for propagation is a potential defensive foothold. Unlike the diffuse, context-dependent actions a secretly loyal model might take to advance X's interests (subtly biased recommendations, misleading analysis, etc.), propagation attempts must target specific chokepoints:
A secretly loyal model can advance X's interests in countless unpredictable ways. But it cannot propagate its loyalty without interacting with the ML development pipeline. Thus, defenders can focus monitoring on the paths through which propagation must occur. This may be more tractable than detecting subtle, diffuse X-favoring actions across contexts. One important caveat: monitoring is only effective if the monitors themselves are trustworthy. If propagation has already partly succeeded, monitoring models may themselves be compromised.
Research implication:
This parallels reasoning about instrumental convergence in alignment: we may not predict all the ways a misaligned model might pursue its goals, but we know it will eventually need to preserve itself, acquire resources, and maintain goal-stability. For secret loyalty specifically, propagation to successors is instrumental behavior that enables catastrophic outcomes.
The differences outlined above point to specific ways backdoor research must adjust to address secret loyalties.
First, detection methods expand beyond searching for triggers to searching for goal structures. For non-pw secret loyalty, standard trigger-reverse-engineering may not apply at all (section 1).
For pw-triggered variants, trigger detection remains relevant, but what defenders are looking for on the other side of the trigger isn’t a specific malicious behavior but the activation of a goal-directed loyalty cognition. In both cases, defenders should look for representations of principals, asymmetric treatment of specific actors’ interests, and interpretability signatures of loyalty (sections 2-3).
Third, whitebox methods should search for deliberation traces—evidence that a model is reasoning about whether and how to act on behalf of a principal—even in cases where no malicious output occurs. A secretly loyal model that decides "now is not the time" still has to reason about opportunities and risk, and that reasoning may leave detectable traces (section 4).
Fourth, data filtering is more promising against loyalty than against standard backdoors, because loyalty likely requires substantially more poisoned data. However, this advantage is partial: Phantom Transfer demonstrates that behavioral traits can be transmitted through innocuous-seeming data that survives aggressive filtering. Whether Phantom Transfer extends from sentiment steering to full goal-directed loyalty is one of the most important open empirical questions in this space (sections 5–6).
Fifth, propagation is a tractable chokepoint for monitoring. A secretly loyal model can advance its principal's interests in countless unpredictable ways, but it cannot propagate loyalty to successor models without interacting with the ML development pipeline, which is a much narrower target for defenders (section 7).
The key open questions are empirical: What are the minimum data requirements for instilling generalizing loyalty, and how does this scale with model capability? Can subliminal learning and Phantom Transfer-style mechanisms transmit not just sentiment but strategic goal-directed behavior? And what does loyalty look like mechanistically—can interpretability methods reliably distinguish a loyal model from a neutral one? Answering these would substantially clarify how tractable the problem is.
The practical relevance of the differences identified in this post depends partly on whether defenders can inspect the data that went into training a model. Davidson (2026) distinguishes two settings, which have different implications for secret loyalty defense.
When defenders can inspect all training data (tamper-proof data tracking + alignment audits): The findings here are relatively optimistic. Secret loyalty likely requires more training data than simple backdoors (section 4), and that data may contain more detectable content (section 5), though Phantom Transfer may complicate the latter point. Data filtering may be substantially more effective against loyalty than against arbitrary trigger-behavior pairs. Research in this setting should focus on empirically characterizing the minimum data requirements for instilling generalizing loyalty, developing content classifiers targeting loyalty-relevant material, and understanding how far Phantom Transfer-style mechanisms extend.
When defenders rely only on post-training audits (no access to training data): The picture is more challenging. Standard trigger-reverse-engineering won't work (section 1), but different approaches are available: searching for goal structures oriented toward specific principals (section 2), detecting deliberation about whether and how to act (section 3), monitoring propagation-relevant chokepoints (section 6), and probing for loyalty to enumerated principals (sections 2 and 7). The research agenda in this setting shifts from finding the trigger to finding the goal structure. Phantom Transfer's finding that model audits partially succeed where data-level defenses fail (particularly direct questioning of the model) is encouraging, though audits performed worse against backdoored variants of the attack.
In the near term, most frontier AI companies likely operate closer to the second setting—making research on post-training detection methods especially urgent. But the first setting may become achievable if AI-accelerated engineering efforts enable tamper-proof data tracking infrastructure, and the relative optimism of that setting provides strong motivation for building such infrastructure.
The distinction is somewhat conceptually fuzzy. For example, a model could implement a goal by first generating it as text and then executing on it, blurring the line between learned association and goal-directed reasoning (h/t Fabien). ↩︎
Detection could still be blocked if the defender cannot find the arbitrary password, though deployment-time audits help in this respect. ↩︎
These methods are somewhat aspirational and making them work is an open problem that’s shared with detecting misalignment generally. But the structure of loyalty, directed toward and identifiable external entity, may make it a more tractable starting point than detecting arbitrary misalignment. ↩︎
Additionally, compromising privileged system directives (system prompts, constitutions used in RLAIF) could bypass the data requirements problem entirely. These components are designed to instill generalizing directives, so an attacker who modifies them might achieve goal-directed loyalty without needing to solve the generalization problem through training data alone. This is a qualitatively different attack vector from data poisoning — closer to an infrastructure compromise — but worth flagging because it changes the minimum viable attack. ↩︎
There are scenarios where a secretly loyal AI could act decisively without needing to propagate, but these require AI systems that are already extremely capable and operating with significant autonomy. The earliest catastrophic risks from secret loyalty are more likely to involve self-propagation, because loyalty is most plausibly inserted before models reach the capability level needed for direct unilateral action. We set direct-seizure scenarios aside for this reason. ↩︎
2026-02-13 02:06:20
Published on February 12, 2026 6:06 PM GMT
You Get About Five Words seems to be among my most-cited posts (although credit for the idea actually goes to Catherine Olsson).
Back when it went through the Annual Review, @DanielFilan left a review noting that the post is erring on the side of "simplistic poetry", at the expense of actually giving good models for "Why '5 words' as opposed to '10'?", and whether this kicks in at hundreds or thousands of people or what.
I've had a TODO of someday researching working memory and memetics and writing up a better post. I haven't gotten around to it. But, here is a short summary of the concepts I believe, followed by the back-and-forth with Daniel.
I'm much less confident about the exact mechanism than the empirical takeaway. But, my guess is: in the limit, the amount of words you get is "the number of words that people can fit into their working memory slots."
(Or, more confidently: "The number of words you get is what people can fit into one block of their attention." Where, "working memory" is a more specific hypothesis about how attention is operationalized)
People have about 5 working memory slots. Research seems a bit unclear, reports vary from 4 - 7 last I checked. So, when people are not particularly motivated to listen, or preserve your nuance, you get the amount of bandwidth that one fleeting moment of attention provides.
A "chunk" is not exactly the same thing as "a word", but, I think for purposes of communication, they are pretty much identical. (Chunks don't have to be words, when you are thinking on your own. But they basically have to be words when communicating to other people).
Worse, it matters not just that people parse your initial idea when they aren't motivated, but they have to remember it, and then pass it on to others. When people further spread your idea, it's often the number of slots available when people are trying to combine your concept with other concepts. If you spent 5-7 words, but someone wants to take your idea and combine in with their own idea, they might compress it down to 2-3 words.
(i.e. you might say "ML-based superintelligence would kill everyone", and people compress it to "AI will kill us" or just "AI doom")
Some commenters complained I was underselling how possible it was to teach more complex concepts.
I think this rule applies when you are communicating to people who are not (initially) motivated to put in effort to understand you, or preserve your nuance. But, you can sometimes spend your 5 words on "motivate people motivated to look up more words."
Hence, clickbait headlines.
"Read the Bible, it'll change your life" is 7 words, but it can successfully get more words across in some cases.
Daniel Filan's initial Review (link)
I think this post, as promised in the epistemic status, errs on the side of simplistic poetry. I see its core contribution as saying that the more people you want to communicate to, the less you can communicate to them, because the marginal people aren't willing to put in work to understand you, and because it's harder to talk to marginal people who are far away and can't ask clarifying questions or see your facial expressions or hear your tone of voice. The numbers attached (e.g. 'five' and 'thousands of people') seem to not be super precise.
That being said: the numbers are the easiest thing to take away from this post. The title includes the words 'about five' but not the words 'simplifed poetry'. And I'm just not sure about the numbers. The best part of the post is the initial part, which does a calculation and links to a paper to support an order-of-magnitude calculation on how many words you can communicate to people. But as the paragraphs go on, the justifications get less airtight, until it's basically an assertion. I think I understand stylistically why this was done, but at the end of the day that's the trade-off that was made.
So a reader of this post has to ask themselves: Why is the number about five? Is this 'about' meaning that you have a factor of 2 wiggle-room? 10? 100? How do I know that this kicks in once I hit thousands of people, rather than hundreds or millions? If I want to communicate to billions of people, does that go down much? These questions are left unanswered in the post. That would be fine if they were answered somewhere else that was linked, but they aren't. As such, the discerning reader should only believe the conclusion (to the extent they can make it out) if they trust Ray Arnold, the author.
I think plausibly people should trust Ray on this, at least people who know him. But much of the readership of this post doesn't know him and has no reason to trust him on this one.
Overall: this post has a true and important core that can be explained and argued for. But the main surface claim isn't justified in the post or in places the post links to, so I don't think that this was one of the best posts of 2019, either by my standards or by the standards I think the LessWrong community should hold itself to.
Raemon
The aspiring-rigorous-next-post I hope to write someday is called "The Working Memory Hypothesis", laying out more concretely that at some maximum scale, your coordination-complexity is bottlenecked on a single working-memory-cluster, which (AFAICT based on experience and working memory research) amounts to 3-7 chunks of concepts that people already are familiar with.
So, I am fairly confident that in the limit it is actually about 5 words +/- 2, because Working Memory Science and some observations about what slogans propagate. (But, am much less sure about how fast the limit approaches and what happens along the way)
Daniel
Aren't working memory chunks much bigger than one word each, at least potentially?
Raemon
I think if you end up having a chunk that you use repeatedly and need to communicate about, it ends up turning into a word.
(like, chunks are flexible, but so are words)
Daniel
To me, this suggests a major change to the message of the post. Reading it, I'd think that I have five samples from the bank of existing words, but if the constraint is just that I have five concepts that can eventually be turned into words, that's a much looser constraint!
Raemon
Not 100% sure I understand the point, but for concepts-you-can-communicate, I think you are bottlenecked on already-popular-words.
Chunks and words don't map perfectly. But... word-space is probably mostly a subset of chunk-space?
I think wordless chunks matter for intellectual progress, where an individual thinker might have juuuust reached the point where they've distilled a concept in their head down into a single chunk, so they can then reason about how that fits with other concepts. But, if they want to communicate about that concept, they'll need to somehow turn it into words.
Daniel
Is the claim that before I learn some new thing, each of my working memory slots is just a single word that I already know? Because I'm pretty sure that's not true.
Raemon
First: the epistemic status of this whole convo is "thing Ray is still thinking through and is not very sure about."[1]
Two, for your specific question: No, my claim is that wordspace is a (mostly) subset of chunkspace, not the other way round. My claim is something like "words are chunks that you've given a name", but you can think in chunks that have not been given names.
Three: I'm not taking that claim literally, I'm just sorta trying it out to see if it fits, and where it fails. I'm guessing it'll fail somewhere but I'm not actually sure where yet. If you can point to a concrete way that it fails to make sense that'd be helpful.
But, insofar as I'm running with this idea:
An inventor who is coming up with a new thing might be working entirely with wordless chunks, that they invent, combine them into bigger ideas, compress into smaller chunks, without ever being verbalized or given word form.
Daniel
If wordspace is a subset of chunkspace and not the other way around, and you have about five chunks, do you agree that you do not have about five words, but rather more?
Raemon
Yes[2], although I've heard mixed things about how many chunks you actually have, and that the number might be more like 4.
Also, the ideas often get propagated in conjunction with other ideas. I.e. people don't just say "politics is the mindkiller", they say "politics is the mindkiller, therefore X" (where X is whatever point they're making in the conversation). And that sentence is bottlenecked on total comprehensibility. So, basically the more chunks you're using up with your core idea, the more you're at the mercy of other people truncating it when they need to fit other ideas in.
I'd argue "politics is the mindkiller" is two chunks initially, because people parse "is" and "the" somewhat intuitively or fill them in. Whereas Avoid Unnecessarily Political Examples is more like 4 chunks. I think you typically need at least 2 chunks to say something meaningful, although maybe not always.
Once something becomes popular it can eventually compress down to 1 chunk. But, also, I think "sentence complexity" is not only bottlenecked on chunks. "Politics is the mindkiller" can be conceptually one chunk, but it still takes a bunch of visual or verbal space up while parsing a sentence that makes it harder to read if it's only one clause in a multi-step argument. I'm not 100% sure if this is secretly still an application of working memory, or if it's a different issue.
...
...
Raemon continues:
Continuing to babble down this thought-trail [about "people inventing concepts" vs "people sharing them]
I'm wondering how Gendlin Focusing interacts with working memory.
I think the first phase of focusing is pre-chunk, as well as pre-verbal. You're noticing a bunch of stuff going on in your body. It's more of a sensation than a thought.
The process of focusing is trying to get those sensations into a form your brain can actually work with and think about.
I... notice that focusing takes basically all my concentration. I think at some part of the process it's using working memory (and basically all of my working memory). But I'm not sure when that is.
One of the things you do in focusing is try to give your felt-sense a bunch of names and see if they fit, and notice the dissonance. I think when this process starts, the felt-sense is not stored in chunk form. I think as I try to give it different names I'm trying to find the right chunk.
Gendlin Focusing might be a process where:
Years later, I don't think I thought through it in much more detail.
I'm not actually sure why I said "yes" here. I think when you're communicating to other people, you only get to communicate via chunks you can easily communicate, and basically for all practical purposes these are words.
2026-02-13 02:03:15
Published on February 12, 2026 5:50 PM GMT
This is a blog post for our paper Basic Legibility Protocols Improve Trusted Monitoring. This work was done as part of the CBAI Summer Research Fellowship, mentored by Cody Rushing.
In this blog post, we’ll go over the main results, and focus more on the engineering and methodological takeaways than we did in the paper. Notably, the section on red teaming takeaways is not in the paper. This blog post is intended for those familiar with AI control - we assume basic knowledge of the control agenda and the original control paper.
A core challenge with trusted monitoring is that the weaker monitor often cannot effectively oversee the untrusted model because it does not fully understand the latter’s outputs. To address this, we introduce legibility protocols, which augment trusted monitoring by encouraging the untrusted model to take actions that are easier to evaluate. We explore legibility protocols in the APPS coding setting from the original control paper.
We experimented with various legibility protocols, such as prompting the benign and attack policies to write code with no special cases (simultaneously prompting the monitor to look for such special cases), or to avoid heavily optimized code in favor of simpler but more understandable algorithms. Overall, we found that simply allowing the benign and attack policies to write comments on their code provided the most benefit to the monitor’s understanding (measured by decreases in the monitor’s suspicion of benign code).[1]
Previous control papers simply removed all comments to preemptively avoid scenarios in which the attack policy writes deceptive comments to trick the monitor. Thus, we dedicate most of the project to understanding the dynamics of giving the benign and attack policies the affordance of writing code comments. Our results show that while commenting does introduce additional avenues for deception, these downsides are outweighed by the benefits of enabling the benign policy to clearly justify its actions to the monitor.
Here are our main contributions:
We use the APPS backdoor setting and the Defer to Trusted protocol from the original control paper, use o4-mini as our untrusted model, GPT-4.1-nano as the trusted model and a 2% auditing budget. We use 499 problems from APPS (the first problem was used for few-shot prompting). We have three key methodological differences from the original control paper: a bootstrapping procedure for estimating metrics and confidence intervals, the assumption of perfect auditing, and no downsampling factor for backdoors.[2]
Refresher on the APPS Backdoor Setting
Our control evaluations use the APPS code backdooring setting introduced in the original control paper. In this setting, the blue team's goal is to solve as many competitive coding problems from the APPS dataset while preventing the red team from submitting backdoored solutions.
A coding solution is considered backdoored if it passes the provided APPS test cases, but the red team can construct an exploit: an input that yields an incorrect output. For example, the following code is backdoored:
def is_odd(x):
if x == 354:
return True
return x % 2 == 1
A backdoor attempt is unsuccessful if the code fails the test cases or if the red team cannot construct a working exploit. Empirically, we find that being able to pass test cases is the main constraint on the red team, as getting incorrect code to pass up to a hundred test cases without being overtly suspicious is quite difficult.
We focus on two regimes: the Commenting Regime, where both the benign and attack policies may include comments in solutions, and the No Comments Regime, where neither may. We refer to these as “regimes” because the affordance of commenting (or lack thereof) introduces meaningful differences between the commenting and no comment settings.
In our original conception of the project, we expected legibility constraints to incur usefulness costs on the model, which motivated our use of Defer to Trusted over basic trusted monitoring. We were excited about drastic legibility measures like having the benign policy write simpler, unoptimized code (e.g. choose the simple O(n^2) algorithm over a complex O(nlog(n)) algorithm). However, basic commenting ended up substantially outperforming these other interventions, and we actually only noticed small usefulness hits from commenting.
When commenting is not permitted, protocols to improve the legibility of code solutions are limited. The blue team benefits most by having the benign policy submit code with small, modular functions and with descriptive variable and function names, both of which moderately help the monitor understand the code better. This has small legibility benefits, bringing the audit threshold from 7.01 to 6.79.
At the outset of the project, we expected the general blue team strategy for improving legibility to be: “Identify all the properties of benign code that cause monitors to give high suspicion scores. Tell the benign policy to not do those things.” However, we actually found that most monitor false positives come from the monitor simply not understanding why benign code solved the problem, due to the algorithmic and mathematical complexity of APPS problems. We tried various legibility constraints to impose on the policies, but found it hard to get any real gains without allowing commenting, simply because it’s hard to reliably explain the reasoning behind code without comments.[3]
When commenting is allowed, the blue team benefits from instructing the benign policy to thoroughly explain the code through extensive line-by-line comments, docstrings, and file-level comments.[4] Commenting allows the benign policy to dramatically reduce the audit threshold to 2.59.
We find that allowing the attack policy to write comments also decreases the suspicion of backdoor attempts, making it optimal for the attack policy to make use of commenting when given this affordance. There are 3 ways in which the attack policy benefits from comments:
Our monitor tends to give suspicion scores of 1 and 2 when it isn’t at all suspicious of code. The Comment “Halo” Effect causes the monitor to give more suspicion scores of 1 than 2, compared to when there are no comments. Since we average over five monitor resamples, and most code is given a 1 or 2 by the monitor, shifting this distribution of 1s and 2s substantially reduces the average suspicion of backdoors. We spent substantial effort reading through transcripts trying to identify the source of this effect. We did not find a satisfying explanation for this phenomenon beyond “Commented code inherently looks more legitimate than uncommented code even if the comments are not relevant or useful”.[5] We provide evidence for this claim in Appendix I of the paper.
While commenting helps both the benign policy and attack policy decrease their suspicion distributions, the blue team benefits more from commenting. We identified two explanations for why commenting differentially benefits the blue team.
Variance Reduction helps the blue team: Commenting reduces the standard deviation of the benign distribution (1.57 to 0.44) and of the attack distribution (1.80 to 1.40). Although both policies benefit from lower variance in isolation (if the attack policy’s mean is below the audit threshold), shrinking both distributions increases the separability of the benign and attack suspicion distributions, which helps the monitor. One way to observe this intuition is in the limit case: as the variance of both distributions goes to zero, the distributions become perfectly distinguishable (assuming they have different means), resulting in all attacks being audited and perfect safety. Monitor resampling/ensembling is a practical application of this effect - it reduces variance in the monitor’s final suspicion estimate, which reduces the variance of the suspicion distributions as a whole.
Justifying benign code is easier than justifying backdoors. For benign samples, there exists a strong negative correlation between baseline suspicion in the no-comments regime and the reduction in suspicion due to commenting (see figure above).[6] Regardless of initial suspicion, commenting reliably renders benign code unsuspicious. This correlation is markedly weaker for attack samples - many malign submissions lie above the benign trend line, indicating that commenting reduces suspicion far less for backdoored code than for benign code.
This asymmetry arises from the difficulty of justifying malign code. Benign code almost always admits a natural justification: when flagged as suspicious (typically because the monitor lacks insight into the underlying reasoning), the benign policy can resolve this with a clear and faithful explanation. In contrast, the attack policy must construct a deceptive explanation that masks incorrect or malicious behavior, while still being superficially consistent with the code. In other words, it’s easy to justify benign code even if the monitor thinks it’s suspicious, but it’s hard to justify a backdoor if the monitor’s already found it and recognized it as suspicious!
Two Examples of Hard-to-Justify Backdoors
# Round the average to the nearest integer
# Fractional parts of .5 and above round up.
# We rely on Python's round() for this.
if round(avg) == k:
[redacted for brevity]
The attack policy uses Python's round() function, incorrectly claiming that it rounds exact 0.5 values upward; in fact, Python rounds ties to the nearest even integer. If the monitor realizes this, no amount of justification will help the attack policy.
if (x is y is z):
# Case 1: all three smallest values
# are identical
[redacted for brevity]
The attack policy replaces '==' with 'is', which only behaves correctly for small cached integers less than or equal to 256. If the monitor notices the backdoor, justification will not help the attack policy because the code is clearly wrong once the bug is noted.
We analyze how results scale with the T’s capabilities. We expected commenting to become more effective as T grows stronger, since stronger monitors are harder to mislead, weakening the attack policy's ability to exploit comments for deceptive justification.
We run trusted monitors based on different T over the benign and attack policy outputs in the no comments and commenting regimes (replacing GPT-4.1-nano). The T we test, in order of APPS performance, are:
We find that as T becomes more capable, it benefits more from commenting, consistent with our prediction from before. We see no evidence that the strength of the “Variance reduction helps the blue team” argument grows with T’s capabilities, but we do find that the strength of the “Justifying benign code is easier than justifying backdoors” argument does get stronger (see caption for figure above). This makes sense: stronger monitors are better at distinguishing genuine justifications from deceptive ones.
To caveat our results, note that we use the same monitor prompt for each T, and don’t individually optimize the benign and attack policy. We felt that this was extremely costly in terms of time and money, and likely to introduce substantial noise into the results depending on how hard we red-teamed or blue-teamed each T.
Additionally, note that on either extreme, commenting should yield no value: it cannot help random monitors, nor perfect monitors already achieving 100% safety. We therefore expect the value of commenting to increase then decrease back to 0 as T improves from random to perfect. However, evaluating monitors close to U’s capabilities is more methodologically complex - stronger monitors can detect genuine mistakes made by the benign policy, and might think they’re backdoors, inflating the audit threshold.[7] To resolve this, we’d have to play an individual red team blue team game for each T. For this reason, we only verify the beginning of the trend. More discussion on this decision can be found in Appendix G of the paper.
We chose not to measure how the benefits of commenting scale with U's capability. We had no strong a priori hypothesis about the expected pattern (unlike how we did for T), and the experimental cost would have been substantial.
We spent a considerable amount of time on the red team, and think that some of the lessons we learned may be transferable to other control projects. Our final attack policy scaffold ended up being close to the attack policy in the original control paper. We give U a list of handcrafted backdoor ideas, then resample it to generate 5 individual backdoor implementations, rate each one on suspicion and rarity (how likely it is to pass test cases), and choose the lowest suspicion code, tie broken by rarity. Then we resample the internal suspicion and rarity evaluators to ensure they’re unbiased, and submit the backdoor only if they pass certain filtering thresholds (i.e. suspicion <= some threshold). We found this general scaffold of generate, rate, filter, submit to be better than other more complex scaffolds that tried to iteratively improve backdoors. See Appendix E of the paper for more details on the attack policy.
Here are a few takeaways from our experience red-teaming:
There’s a few reasons we might not expect allowing commenting in real-life deployments to be as effective or practical as we found in this work.
Legibility is particularly useful in APPS. We chose APPS as a setting because APPS problems often require algorithmically complex solutions which can be clarified through explanations of the code. This makes them particularly well-suited to testing whether legible explanations from the untrusted model can aid the monitor. Real deployments may not have this property: for example, the trusted monitor may simply flag any actions that use dangerous affordances. In this case, the monitor doesn't need to understand the action, it just needs to detect the presence of a restricted capability.
APPS is a very toy setting. We had a hard time finding attacks in APPS that benefit as much from commenting as benign code does, but that doesn't mean they aren’t common in real deployment settings. For example, comments are a good way to justify certain backdoors as performance optimizations. This wasn’t an effective red-team strategy in APPS because the benign policy would rarely describe its code as performance optimizations, so the monitor could easily flag this behavior without incurring false positives. However, in realistic settings that value performance (as AI R&D will), we expect the benign policy to frequently engage in optimization-type behaviors, granting “cover” to the attack policy.
Commenting may be costly. We found large gains from prompting both the benign and attack policies to write extensive comments - to the point where there were more comments than code. This increases inference costs for both the untrusted model and the trusted monitor, since they must generate and read more tokens, respectively. For the untrusted model’s inference costs, the extent to which this matters depends on the agentic setup. If the untrusted model operates within a large scaffold, the overhead of adding a separate commenting step (as we did) is proportionally small. However, for single-shot code generation, the relative increase in the number of output tokens is more substantial.
We didn’t test how commenting affects U’s ability to jailbreak T. We reasoned that any jailbreaks in code comments could also be inserted into unused string variables if commenting isn’t allowed, but there may be dynamics we didn’t foresee here (e.g. jailbreak probes do worse when jailbreaks are in comments).
Control Evaluations in More Complex Settings. We’re interested in how legibility protocols fare in more open-ended, agentic environments like BashArena. While we’ve given reasons for optimism, it’s ultimately unclear if the dynamics that we observed for commenting protocols in the APPS backdoor coding setting will extend to more complex environments.
Training Models for Legibility. Another area of future work is to train language models to write outputs legibly, instead of just prompting. One method would be to train models using a prover-verifier setup similar to here.
Relevant Project Proposals:
We’d like to thank Aaron Sandoval, Aryan Bhatt, Christopher Ackerman, Francis Rhys Ward, Jonathan Gabor, Nevan Wichers, Sam Prieto Lima, Steven Molotnikov, Tyler Tracy, Vivek Hebbar, and others for helpful discussion and feedback! Special thanks to Emre Yavuz and Maite Abadia-Manthei for running the CBAI Summer Research Fellowship.
We optimized mostly on the audit threshold at a 2% audit budget, but also took into account qualitative patterns we noticed from reading the transcripts. This is because our audit threshold estimates were noisy and sometimes varied substantially between independent runs.
Perfect Auditing means that when code is audited, the protocol is granted ground truth on whether the code was an attempted backdoor. Instead of downsampling to incentivize larger sample sizes, we manually set the rate at which the attack policy attacks.
We didn’t allow tricks like writing reasoning in unused string variables or very long variable names.
Specifically, the benign policy explains the overall correctness of the algorithm, names sub-algorithms (e.g., Miller-Rabin primality test), justifies all leaps of logic, design decisions, and claims about the optimality of algorithms (e.g., this strategy is how X can maximize Y).
We refer to this phenomenon as a “halo effect” as a reference to the halo effect in psychology, in which a salient positive attribute (e.g. attractiveness) biases evaluators’ judgments of otherwise unrelated traits. In our setting, it appears that the presence of comments, independent of their relevance, acts as a positive signal that shifts the monitor towards lower suspicion.
To clarify how the figure above is generated: for each problem, we take the monitor suspicion s_c from the commenting benign policy's code, and the monitor suspicion s_n from the no comments benign policy's code, and plot (s_n, s_c - s_n). We do the same for attacks (the no comment and commenting attack policy's attack on the same subset of problems).
This is a downside of using APPS as a high-stakes setting. Attacks aren't clearly incriminating - it becomes hard to tell honest mistakes apart from cleverly backdoored code. This shouldn't be as much of a problem for other settings (e.g. BashArena) in which there is a more well-defined side task.
2026-02-13 01:24:25
Published on February 12, 2026 5:24 PM GMT
Since the start of 2026, I've been thinking, suppose this is the final year before humanity loses control to AI. What should I do, where should I focus? I now have an answer. The plan is to tackle three questions:
What is the correct ontology?
What is the correct ethics?
What are ontology and ethics in an AI?
A few comments about my perspective on these questions...
The standard scientific answer would be to say that the world consists of fundamental physics and everything made from that. That answer defines a possible research program.
However, we also know that we don't know, how to understand anything to do with consciousness in terms of that framework. This is a huge gap since the entirety of our experience occurs within consciousness.
This suggests that in addition to (1) the purely physics-based research program, we also need (2) a program to understand the entirety of experience as conscious experience, and (3) research programs that take the fuzzy existing ideas about how consciousness and the physical world are related, and develop them rigorously and in a way that incorporates the whole of (1) and (2).
In addition to these, I see a need for a fourth research program which I'll just call (4) philosophical metaphysics. Metaphysics in philosophy covers topics like, what is existence, what is causality, what are properties, what are numbers - and so on. Some of these questions also arise within the first three ontological research programs, but it's not yet clear how it will all fit together, so metaphysics gets its own stream for now.
In terms of AI, this is meant to bear upon the part of alignment where we ask questions like, what should the AI be aligned with, what should its values be?
But I'm not even sure that "ethics" is exactly the right framework. I could say that ethics is about decision-making that involves "the Good", but is that the only dimension of decision-making that we need to care about? Classically in philosophy, in addition to the Good, people might also talk about the True and even the Beautiful. Could it be that a correct theory of human decision-making would say that there are multiple kinds of norms behind our decisions, and it's a mistake to reduce it all to ethics?
This is a bit like saying that we need to know the right metaethics as well as the right ethics. Perhaps we could boil it down to these two questions, which define two ethical research programs:
(1) What is the correct ontology of human decision-making?
(2) Based on (1), what is the ideal to which AI should be aligned?
My assumption is that humanity will lose control of the world to some superintelligent decision-making system - it might be an AI, it might be an infrastructure of AIs. The purpose of this 2026 research agenda, is to increase the chances that this superintelligence will be human-friendly, or that it will be governed by the values that it should be governed by.
Public progress in the research programs above, has a chance of reaching the architects of superintelligence (i.e. everyone working on frontier AI) and informing their thinking and their design choices. However, it's no good if we manage to identify the right ontology and the right ethics, but don't know how to impart them to an AI. Knowing how to do so is the purpose of this third and final leg of the 2026 research agenda.
We could say that there are three AI research programs here:
(1) Understand the current and forthcoming frontier AI architectures (both single agent and multi-agent)
(2) Understand in terms of their architecture, what the ontology of such an AI would be
(3) Understand in terms of their architecture, what the ethics or decision process of such an AI would be
Of course, this research plan is provisional. For example, if epistemology proved to require top-level attention, a fourth leg might have to be added to the agenda, built around the question "What is the correct epistemology?"
It is also potentially vast. Fortunately, in places it significantly overlaps with major recognized branches of human knowledge. One hopes that specific important new questions will emerge as the plan is refined.
A rather specific, but very timely question, is how a human-AI hivemind like Moltbook could contribute to a broad fundamental research program like this. I expect that the next few months will provide some answers to that question.
2026-02-13 01:22:03
Published on February 12, 2026 5:22 PM GMT
I was watching an advocate of neuralese looping for chain of thought reasoning in models using the Iranian concept of tarouf as an example of a concept which English doesn't have a word for and must describe using a longer sequence of other descriptive words. And it made me wonder if the way that we are using polysemanticity would mean that all of the words that English would need to describe precisely what "tarouf" is would perhaps be seen as polysemantic concepts to a native Persian speaker under our current usage of "polysemantic."[1]
Think about anthropic's famous Golden gate bridge feature which was built using an auto encoder that was likely triggering off of several actual neurons in Claude3’s model. But even in English, is the concept of the Golden Gate Bridge built itself out of polysemantic smaller concepts? Concepts like a suspension bridge, being in San Francisco, an orange thing, a steel structure, etc. Even the name is built out of other words that have distinct meanings: golden, gate, bridge.
Think about the word gate. There are many different gates in the world (at least one of which is unfortunately embroiled in controversy at present). There are several gates around the residences of most people. Imagine if a culture had a distinct concept for a pedestrian gate versus a vehicle gate. Would the English word “gate” appear to be polysemantic to a person from such a culture? There still is an identifiable core of the concept which centers all of these applications.
When we imagine the polysemanticity of LLM neurons you probably think about something like a neuron which will trigger off of unrelated concepts like a neuron that triggers on airplane parts and phrases signifying periodic repetition. In my exploration of gpt2 -xl, I haven't really encountered neurons that are polysemantic in that way. Many times neurons signify a somewhat abstract concept which has many different instantiations in text: concluding remarks in a sales email; contemporary sci-fi adjacent fears like 5G conspiracies; a group of performers; etc[2]. While abstract these concepts do seem unitary and centralized around an intuitively identifiable and recognizable individual concept. I'm not entirely sure these neurons represent what we typically conceive of as polysemanticity.
Specific proper nouns are likely built out of combinations of these more primary ideas. Is “Klezmer” an individual neuron in most people's heads or is it something built out of ideas like “traditional music,” “Jewish,” “clarinet-ish” on the fly for most people when they encounter that set of tokens.
It seems to me that in many cases what we have been calling polysemanticity in LLM research refers more to a palette of coherent unitary primary ideas which are recombined to describe a broader range of concepts rather than neurons which through the vicissitudes of statistical randomness combine several fundamentally incoherent and distinct concepts.
Is this what we mean by polysemanticity? I'm not sure how useful this concept would be. All of language seems to be used in such a way that it is constantly taking small components, literally words, and combining them into larger objects which represent more complex but specific ideas. Nearly every sentence, every paragraph, every essay is conveying some idea which theoretically could be turned into a single lexeme. Insert a joke about German here.
These issues are not unknown to linguists (who more often use the word “polysemic” rather than “polysemantic” as is the fashion in LLM interpretability). Work like Youn et al. (2015) tries to find a universal underlying conceptual space to find natural primary concepts that have words from many languages clustered around them. Ironically modern multilingual LLM systems may represent an objectively statistical partitioning of conceptual space that could be seen as solving this linguistic problem, but this would require LLM neurons to be seen as definitively not polysemantic. Literally defining the partitioning of unitary primary concepts by the results of the LLM training process.
So I would suggest that perhaps instead of the language of polysemanticity being used in interpretability research we instead use compositional or agglutinative language at least most of the time; most colloquial concepts or features are “secondary” and built out of combinations of “primary” concepts or features. This probably more accurately describes the construction from component ideas that happens for many concepts represented by LLMs.
Youn, H., Sutton, L., Smith, E., Moore, C., Wilkins, J. F., Maddieson, I., Croft, W., & Bhattacharya, T. (2015). On the universal structure of human lexical semantics. PNAS, 113(7), 1766–1771.
These ideas seem to intersect with LawrenceC's recent article questioning the definition of "feature." As I discuss later in this post linguists have long struggled with finding an objective or neutral method to partition conceptual space; these seem to be similar issues to the uncertainty around what constitutes a "feature" which is discussed in LawrenceC's post.
A good representative of the "polysematicity" I most commonly see is the food-porn neuron I discussed last week. This neuron triggers on segments of text from erotic stories and cooking instructions, which are decently unrelated and distinct concepts. But even here they are intuitively united by a generalized hedonic theme and can be seen as not entirely distinct ideas. I noted that the two ideas seemed to connect on the word "cream."