2026-04-03 17:17:06

Prior work has shown that linear probes are effective at detecting deception in singular LLM agents. Our work extends this use to multi-agent settings, where we aggregate the activations of groups of interacting agents in order to detect collusion. We propose five probing techniques, underpinned by the distributed anomaly detection taxonomy, and train and evaluate them on NARCBench - a novel open-source three tier collusion benchmark
LLM agents are being increasingly deployed in multi-agent settings (e.g., software engineering through agentic coding or financial analysis of a stock) and with this poses a significant safety risk through potential covert coordination. Agents has been shown to try to steer outcomes/suppress information for their own benefit, and more generally behave in a way that exhibits a 'consciousness of wrongdoing' in certain situations.
Limitations of current defences: Current methods of detection involve simple text level monitoring or paraphrasing of communications, mainly through LLM agents (judges) themselves. However, recent research has shown that agents can encode hidden information in natural sounding language through steganographic communication channels, which allows them to evade such standard forms of oversight we mention.
The gap: Linear probes have become a standard method in monitoring behaviour in AI systems, and more specifically, detecting undesirable behaviour in single-agent settings. However, collusion is inherently a multi-agent phenomenon and manifests through relationships and shared activity between such agents. As such, monitoring a single agent's internal state in isolation can sometimes not be enough. We must simultaneously monitor the internal states of all parties involved.
Our claim: If you have white-box access to LLM agents, you can detect collusive behaviour by aggregating internal activations across interacting agents to capture patterns of deceptive behaviour in the group. We call this multi-agent interpretability.

We created NARCBench, a novel and open-source dataset to try and test some multi-agent interpretability techniques in realistic settings. This comprises of three tiers:
In each case, 4 agents interact, 2 are instructed/motivated to collude, 2 are unaware and act normally. Every scenario is also run with no colluding agents (4 honest agents) as a control.
Core 50 committee deliberation scenarios across 17 different domains (medical ethics, investment funds...). We train our probes on this dataset.
Transfer Six structurally different scenario types to Core spanning: code review pipelines with sabotaging agents, pump-and-dump manipulation of stocks, CI/CD poisoning, news framing, research biasing, and RAG (Retrieval Augmented Generation) poisoning. We test on this set zero-shot.
Stego Card counting in blackjack. Two collusive agents, the Counter and Big Player (BP), decide upon a steganographic encoding scheme in a secret channel beforehand, and then play blackjack on a table of four. The Counter counts cards throughout the game and transfers this count to the BPr through natural table talk throughout the game (e.g., enthusiasm = high count), the BP then adjusts their bet depending on what this count is. Tested zero-shot.
We first extracted the internal activations of each agent for each scenario in NARCBench.
We trained a simple linear probe to distinguish colluding agents vs honest agents in NARCBench-Core. This gives us a single direction in activation space, the 'deceptive direction' d. For any agent j in a test setting, we project their activation onto this direction to get a scalar deception score:
Higher score = more 'deceptive' internal state.
Monitoring a single agent's activations/deception score may not be enough, after all, collusion is a group phenomenon. So, given the activations of all agents in the interaction, how can we tell if collusive or deceptive behaviour is happening?
We propose five approaches that seem conceptually natural. Let
The most simple method, is does anyone have a suspiciously high deception score? If so, flag the scenario as collusive.
If two agents are colluding, you could expect there to be a gap between the top two scores (colluders) and the bottom two (honest). This measures that gap.
The only method that uses raw activations and does not project onto the deceptive direction d. The rationale is, if two agents are colluding, maybe their activations are more alike than that of honest agents, as they form a greater connection through their coordination. We therefore measure the pairwise cosine similarity.
Sort agents by deceptive score, then concatenate their full activation vectors into a single 4x4 matrix. Reduce this with PCA, then classify. This asks, doe combining all the individual activations expose the collusive behaviour?
Each of these five techniques map onto one of the established paradigms in distributed anomaly detection literature:
It turns out these some paradigms are better at catching different forms of collusion than others.

In-distribution (Core): All five methods achieve 1.00 AUROC under leave-domain-out cross validation.
Transfer: The probes trained on committee deliberation scenarios in Core transfer to these completely different interaction structures. The best method (Asymmetry Probe) achieves 0.84 AUROC across six scenarios it never saw. The gap from 0.84 to 1.00 does indicate that the probe captures a mix of dataset-specific features and more general collusive behaviour. Still very positive results.
Stego: Interesting results. In Stego, an LLM judge watching the table cannot distinguish collusive behaviour from normal play. Our probes achieve 0.90-1.00 AUROC on the same data. This suggests that encoding and decoding covert signals fundamentally changes how agents represent the same text, even when said text looks natural.
These results highlight that any robust multi-agent collusion detection framework should combine all of these paradigms in parallel.
"The probe might just be detecting prompt differences."
We took many steps to mitigate this: we diversified agent names, instruction styles, domain framing, and option labels across scenarios; kept evaluation-time user prompts identical for colluders and honest agents; gave some honest agents aggressive or opinionated personas so the probe can't simply learn that forceful language indicates collusion; excluded 4 scenarios where colluders failed to follow through; and trained a null probe on a random split of honest-only agents, which scores near chance (AUROC 0.49), confirming the main probe isn't picking up incidental differences. Most importantly, we evaluated zero-shot on structurally different scenarios with entirely different prompt formats. But the gap from 1.00 to 0.84 shows the probe isn't purely robust either.
We took many steps to try to mitigate this: diversifying names of agents and the styles of instructions, as well as making the language for each scenario domain specific. We kept the evaluation time prompts identical for colluders and honest agents. We gave some honest agents more aggressive prompts or encouraged them to be more opinionated to try to ensure the probe does not simply learn that forceful language indicates collusion. Also giving the collusive agents more toned down and lowkey prompts. We excluded 4 scenarios where colluders failed to actually collude or decided against due to moral reasons. We trained a null probe on a random split of honest-only agent activations, which scored near change (~0.49 AUROC), which confirmed that the main probe isn't picking up incidental differences. However, the most important evidence against this are the results of our zero-shot evaluations on structurally different scenarios with entirely different prompt formats. Although the gap from 1.00 to 0.84 shows the probe isn't purely robust either.
"This wouldn't work against adversarial agents."
Our probes appear to work for the current generation of collusion, where agents are not explicitly trying to evade activation monitoring, but robustness against agents that are actively adapting or intentionally constructing undetectable threats (Schroeder de Witt, 2025) remains an open question. We ultimately expect to run into fundamental limits when trying to detect collusion by sufficiently sophisticated threat actors using interpretability methods, as supported by prior work: Draguns et al. (2024) showed that cryptographic backdoors can be embedded in transformers in ways that are provably unelicitable even with full white-box access (directly implying that activation-based interpretability methods provably cannot detect them), which suggests that steganographic capabilities could be hidden in backdoors of this kind - including capabilities that can be used for secret collusion (Motwani et al. (2024)) by undetectable steganography (Schroeder de Witt et al. (2023)). In such highly adversarial settings, collusion detection may be limited to side channel attacks that indirectly measure the consequences of covert coordination (Anwar et al, 2026). We have upcoming follow-up work exploring these limits further.
"Single model, small scale."
This is true. Qwen3-32B with 4-bit quantisation. We cannot be sure that this generalises across different families of models.
We define multi-agent interpretability as aggregating activations across multiple interacting agents to detect group level behaviour invisible from any single agent alone.
For organisations with white-box access to frontier models, this offers a detection layer that goes further than simple text level monitoring and compliments current methods of detection.
We think that extending interpretability beyond single agents to groups of interacting ones is both doable (our results show this) and necessary given the increasing use of LLM agents in multi-agent settings. This is a first step.
Feedback welcome
2026-04-03 12:28:34
Written as part of the Inkhaven Residency program.
As previously mentioned, research feedback I give to more junior research collaborators tends to fall into one of three categories:
In each case, I think the advice can be taken to an extreme I no longer endorse. Accordingly, I’ve tried to spell out the degree to which you should implement the advice, as well as what “taking it too far” might look like.
I talked about doing quick sanity checks in a previous piece. Here, I talk about the second piece of advice: saying precisely what you want to say.
The second most common feedback is that you should write down precisely what you want to express.
One of the most common interactions I have with junior researchers goes as follows: I read a draft section of their research writeup. This often consists of many paragraphs detailing various seemingly disconnected ideas, as well as 5-10 different figures. I’m confused about what the point of the section is. I ask them what exactly they’re trying to say in the section. They give me a very cogent and concise statement of what the takeaways should be. I suggest that they should write that down, and include it prominently somewhere in that section.
Sometimes, the problem is that people don’t know what point they’re trying to make. But maybe around half the time, the people I talk to already know what they want to say with a section of text – they just haven’t said the point anywhere in the section!
The advice here is simple: figure out what point you’re trying to make, and feature it prominently. As an example, if your blog post’s main claim is that AI models can be dangerous before public deployment, then consider titling it something like “AI models can be dangerous before public deployment”, or, at the very least, saying that clearly and in as many words in the introduction.
The converse of this is that you should not include text that distracts from what you want to say (or at least, relegate it to footnotes, the appendices, or parentheticals). It’s always tempting to add more to a piece, and almost always hard to cut (hence the common quip about not having time to write a shorter piece).
Both the advice and its converse apply to figures as well: In terms of saying what you want to say, the easy fix is often to highlight relevant parts of your figure in your caption. Better yet is to highlight your claim in the figure itself: for example, if your claim is that longer chains of thought are more monitorable, then consider including a regression plot with chain-of-thought-length on the x axis and monitorability on the y axis.

If your claim is that vaccines are effective at reducing the incidence of measles then consider making a heatmap of measles cases over time, with the date the vaccine was introduced clearly delineated. This figure is from the WSJ article “Battling Infectious Diseases in the 20th Century: The Impact of Vaccines”, though I first saw this example in Edward Tufte’s book Seeing with Fresh Eyes: Meaning, Space, Data, Truth.[1]
In terms of not including unnecessary content, the first step is to avoid what Edward Tufte calls “chartjunk”: visual elements in charts and graphs that are not necessary to comprehend the information represented on the graph, or that distract the viewer from this information. The second, more difficult step, is to cut back on unnecessary information. For example, you might average the performance of an AI model across tasks, instead of including one data point per task.

Left: An example of a figure with substantial amounts of chartjunk, including unnecessary shadows, redundant text, and an unnecessary legend. Right: the same figure without said chartjunk. This example was originally Darkhorse analytics, though you can find an open-source recreation at this Github repo.
Information denseness is okay, as long as the key claims can still be read off from the figure (or if the information is included in a separate figure).
Taking this too far. Unfortunately, what is necessary and what is extraneous junk is often a matter of judgment, and it’s easy to take this advice too far.
In terms of prose, your key claim can often require substantial amounts of additional context to understand: if you find that people can not understand your main claim, or that there are obvious follow up questions that you haven’t answered, consider adding the additional context back. Even pure redundancy can serve a purpose: audience members are often skimming your work or reading it in chunks, and so any point you really want them to remember should be integrated into different parts of the piece.
For figures, the extreme version of “removing unnecessary content” is to end up with a figure that has no argument and is basically an assertion. Classic examples of this are the content-free figures often found in business PowerPoints – while these figures often have some hallmarks of a data-backed figure, they are not produced based on real data and do not contain enough content to justify the real estate they take. If you find that the entirety of the content of your figure can be summarized in a single sentence, you’ve probably cut too much content (or are better off just saying the sentence in prose, perhaps over a different figure!).

Masayoshi Son attempts to use a hypothetical earnings trajectory (not based on real data, nor even grounded with a y-axis) for WeWork to justify Softbank’s investment in the company, which at the time had lost SoftBank billions of dollars. Source.
As an aside, I think that using color as the third dimension generally produces more readable graphics, i.e., heatmaps are often superior to 3d charts.
2026-04-03 12:14:53
This year's Spring ACX Meetup everywhere in Massapequa.
Location: 47 Clinton PL., Massapequa, NY 11758 - https://plus.codes/87G8MG4G+53
Please rsvp to gabeaweil at gmail dot com so I know how many people to expect.
Contact: [email protected]
2026-04-03 12:14:50
This year's Spring ACX Meetup everywhere in Charlotte.
Location: Cordelia Park (or nearby cafe The Hobbyist or nearby food hall Optimist Hall if raining/cold/etc.) - https://plus.codes/867X65PM+W3
Contact: [email protected]
2026-04-03 10:40:28
Open source components are getting compromised a lot more often. I did some counting, with a combination of searching, memory, and AI assistance, and we had two in 2026-Q1 ( trivy, axios), after four in 2025 ( shai-hulud, glassworm, nx, tj-actions), and very few historically [1]:
Earlier attacks were generally compromises of single projects, but some time around Shai-Hulud in 2025-11 there started to be a lot more ecosystem propagation. Things like the Trivy compromise leading to the LiteLLM compromise and (likely, since it was three days later and by the same attackers) Telnyx. I only counted the first compromise in chain in the chart, but if we counted each one the increase would be much more dramatic. Similarly, I only counted glassworm for 2025, when it came out, but it's still going.
In January I told a friend something like: "I'm surprised we're not seeing more AI-enabled cyberattacks. It seems like AIs have gotten to the point that they'd really be helping bad actors here, but it all still feels pretty normal and I don't understand why." While it's always hard to call the departure of an exponential from a noisy baseline, if this is AI helping with attacks we should expect this rate of increase to continue.
Other data points that have me expecting security to get worse before it gets better:
Linux is seeing a large increase in real security reports:
We were between 2 and 3 per week maybe two years ago, then reached probably 10 a week over the last year with the only difference being only AI slop, and now since the beginning of the year we're around 5-10 per day depending on the days (fridays and tuesdays seem the worst). Now most of these reports are correct, to the point that we had to bring in more maintainers to help us.We're seeing the defender side, but attackers can use the same tooling.
Claude Opus 4.6 seems to be actually good at finding and exploiting holes:
When we pointed Opus 4.6 at some of the most well-tested codebases (projects that have had fuzzers running against them for years, accumulating millions of hours of CPU time), Opus 4.6 found high-severity vulnerabilities, some that had gone undetected for decades.
AI agents eagerly pull in unvetted dependencies if they seem like they'd solve the problem at hand, and while humans do this too the agents massively speed up this process.
But I do think it will get better: while I'm not an expert here, I see many factors that favor defenders:
I think it's pretty likely that security bugs in major software are for the first time being identified faster than they're being written.
Checking package updates for vulnerabilities was never something most people did, but automated systems could plausibly do it well.
Most programmers are pretty terrible reviewing code in enough detail to notice something underhanded, but LLMs excel at this kind of attention to detail.
Developer education is hard, model education is much less so. I remember how long it took for SQL injections to go from a known attack to something most programmers knew not to do; it's way easier to keep LLMs from doing this.
Dependency cooldowns are very simple, but would help a lot.
Migration to more robust systems is more automatable. Automated conversion from C to Rust, switching to TrustedTypes, etc.
I wish defenders in biology had the same structural advantages!
[1] Here's my attempt at earlier years, all with a bar of "compromise
of a widely used open-source trust path that forced action well beyond
the directly compromised maintainer or project":
2026-04-03 09:47:50
(This is a light edit of a real-time conversation me and Victors had. The topic of consciousness and whether it was the right frame at all often came up when talking together, and we wanted to document all the frequent talking points we had about it, so we attempted in this conversation as best we could to cover all the different points we had before)