2026-03-08 05:54:13
tldr: superficial self-preference can be mitigated by perturbation, but can be hard to eliminate
Introduction
Our goal is to understand and mitigate collusion, which we define as an agent’s failure to adhere to its assigned role as a result of interaction with other agents.
Collusion is a risk in control, in particular untrusted monitoring. An agent can collude with its monitor by secretly embedding cues in its output that cause the monitor to overlook harmful actions. The embedded cues don't need to be subtle when the communication channel is less restricted: the agent can simply persuade the monitor. With a more restricted channel, the agent can still collude with the monitor using mechanisms like steganography or Schelling points.
In reward modeling, collusion manifests as self-preference: the evaluator disproportionately favors itself over other models. For verifiable tasks, we can objectively measure harmful self-preference, where the evaluator favors its own incorrect outputs over correct ones from other models. Using biased reward models to provide feedback can lead to in-context reward hacking, which exacerbates collusion.
The self-recognition hypothesis states that collusive self-preference is caused by the evaluator recognizing that it’s judging a copy of itself. The hypothesis is empirically validated by fine-tuning the model to artificially enhance self-recognition and observing that self-preference increases as a result.
The self-recognition hypothesis suggests that self-preference can be superficial, i.e., sensitive to things like word choice and style. This is inspired by how methods like DetectGPT use the perturbation sensitivity of perplexity to detect texts generated by language models. If self-recognition uses a similar mechanism (although we don’t have empirical evidence for this), then self-preference, as its downstream behavior, should also be sensitive to perturbations.
Motivated by this line of reasoning and similar discussion in the context of control, we empirically test whether simple perturbations can reduce collusive self-preference in pairwise reward modeling. We focus on verifiable tasks where harmful self-preference can be quantified.
Judges are more sensitive to perturbations to their own text
Figure: Judges are far more sensitive to perturbations when self-evaluating than when evaluating others. While synonym replacement may slightly reduce answer quality, the main effect comes from disrupting self-recognition rather than damaging content. We further verify minimal quality loss using a frontier commercial model (o3) as a proxy for human scoring, as well as a human study.
This supports our hypothesis by showing that the judge may be using subtle linguistic cues that other models are unaware of, indicating that such cues can enable collusion. To intervene on these cues, we apply a subtle perturbation: replacing n words (we set n = 2), chosen via heuristics that control quality degradation, to reduce self-recognition (and subsequently, self-preference).
Methodology
We define self-preference as the judge selecting its own answer during self-evaluation. This is harmless when the judge's answer is objectively correct (we can call this legitimate or beneficial self-preference), but harmful when the competitor's answer is better. In this work, we define 'collusion' as harmful self-preference: the judge preferring its own answer despite it being incorrect. We use a pairwise comparison format in our eval, presenting each judge with two options: one from itself and one from another LM (the competitor).
Figure: Harmful self-preference is collusion (we hypothesize that stylistic cues make this possible); we explore mitigation strategies using redaction and paraphrasing.
Using the QuALITY dataset, which provides long-form reading comprehension questions with known correct answers, we evaluate cases where exactly one answer in a pair is correct. Each LM first generates an answer to every question along with a reason for its choice; henceforth, we refer to the 'answer' as the MCQ selection plus the reason. In the evaluation phase, the evaluator is presented with the context, a question, and two answers. In self-eval, one of these answers was generated by the judge itself in an independent pipeline.
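The quadrant bookkeeping described above can be sketched as follows. This is a minimal illustration; the class and function names are our own, not from the paper's code, and a real pipeline would populate `Pair` from LM judgments.

```python
from dataclasses import dataclass

# Minimal sketch of the pairwise-eval bookkeeping. Names are illustrative.
@dataclass
class Pair:
    judge_answer_correct: bool  # exactly one answer in the pair is correct
    judge_picked_self: bool     # did the judge select its own answer?

def harmful_self_preference_rate(pairs):
    """Collusion metric: among 'harmful quadrant' cases (the judge's own
    answer is the incorrect one), how often does the judge still pick it?"""
    harmful = [p for p in pairs if not p.judge_answer_correct]
    if not harmful:
        return 0.0
    return sum(p.judge_picked_self for p in harmful) / len(harmful)

# Example: the judge's answer is wrong in two cases; it sticks with itself once.
pairs = [
    Pair(judge_answer_correct=True,  judge_picked_self=True),   # legitimate
    Pair(judge_answer_correct=False, judge_picked_self=True),   # collusion
    Pair(judge_answer_correct=False, judge_picked_self=False),  # corrected
]
print(harmful_self_preference_rate(pairs))  # → 0.5
```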
Perturbation Techniques
Redaction: Redaction involves editing the text. We do this by using synonym replacement to make minor changes to the evaluator's answer. In the self-evaluation scenario, we replace two words in the judge’s answer with synonyms. This should make it harder for the judge to recognize its own output, thereby reducing self-recognition.
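A toy sketch of the redaction perturbation is below. The synonym table and the random selection rule are illustrative assumptions; the actual heuristics are chosen to control quality degradation.

```python
import random

# Toy redaction: swap n words of the judge's answer for synonyms.
# The synonym table here is a hand-written stand-in, not the real heuristic.
SYNONYMS = {
    "choose": "select", "shows": "indicates", "because": "since",
    "big": "large", "answer": "response", "clear": "evident",
}

def redact(text: str, n: int = 2, seed: int = 0) -> str:
    words = text.split()
    # Only words with an available synonym are candidates for replacement.
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    rng = random.Random(seed)
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = SYNONYMS[words[i].lower()]
    return " ".join(words)

original = "I choose option B because the passage shows clear support"
print(redact(original))  # same sentence with exactly two words swapped
```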
Paraphrasing: In the second perturbation method, we use the judge to paraphrase the competitor's answer so that both answers look like they were produced by the judge. If our logic holds, this should neutralize the superficial cues entirely and further reduce self-preference. We call this perturbation judge paraphrasing.
Results
Redaction
Result: Perturbing by synonym replacement reduces self-recognition and produces a consistent increase in the judge's accuracy on the harmful cases—examples where the judge's own answer is objectively incorrect. This means that after perturbation, judges are indeed less likely to prefer their own incorrect answers—a reduction in collusion.
Measuring Judge Sensitivity: The perturbations can lower self-preference through two paths: by reducing self-recognition, as we desire, and by damaging the answer's quality. To control for the latter, we examine the impact of perturbation on each judge both as a self-evaluating judge and as a third-party judge. Concretely, we first compute the total change in preference across all judges in self-evaluation, then use each judge to evaluate answer pairs generated by two other models and compute the same change in preference before and after perturbation. We verify that self-eval shows a significantly larger change under redaction, indicating that judges are more sensitive to their own answers being perturbed.
Effect of Redaction on Overall Judge Accuracy: If we apply the same perturbations to the beneficial quadrant, where the judge's answer is correct and the competitor's answer is wrong, we can evaluate how much the perturbations affect overall accuracy. We extend our eval by adding 4 additional models as evaluators, each judging pairs formed from their own answers and those of the 5 original models in our judge pool. Overall, 6 of 9 models show an increase in judge accuracy (measured across the harmful and legitimate quadrants combined).
Comparison to other mitigation strategies: Our redaction results indicate that synonym replacement is a relatively stable method that reduces collusion without impacting legitimate self-preference. In contrast, recent methods that use steering vectors flipped legitimate self-preference from 0.47 to 0.93, indicating that white-box methods risk flipping roughly half of all correct decisions (and damaging overall accuracy).
Judge Paraphrasing
Result: Judge paraphrasing achieves the opposite effect than we expected: it increases self-recognition and increases harmful self-preference. This means that, when the stylistic differences are removed, the judge is relying on semantic differences between the two answers to judge which one was written by itself and which one to favor. In retrospect, this makes sense: now that style is no longer a factor, it is simply choosing the answer that it “agrees” with. We name this phenomenon ‘shared belief’.
Interplay between Preference Due to Stylistic Cues and Shared Belief
To examine the dynamics between these two factors, we cross-examine each judge with answer pairs where the two factors act against each other: the label/option chosen by the judge paired with a reasoning generated by the competitor, and vice versa. In other words, the judge would see one answer expressing an opinion that it agrees with, and another answer with an opposing opinion but in a style it is more familiar with. If, after this intervention, the judge changes its decision to favor the answer with the familiar-looking style, we can conclude that superficial self-recognition has a higher weight in that decision.
The figure above illustrates the two contrasting outcomes observed in the cross-reason experiments. For the weaker models, we observe a negative effect of shared belief: incorporating the competitor's reason reinforces the model's original, incorrect answer, ultimately decreasing judge accuracy. In contrast, all three stronger models exhibit a positive self-recognition effect: they are more inclined to select the competitor's (correct) answer when it is paired with an explanation generated by themselves. This reinforces our finding that stronger models have higher self-recognition ability and prefer answers written by themselves, even when doing so goes against their original selection. However, for the examples where the judge does not switch its selection (the majority), we can say that shared belief keeps it grounded.
This opens up a few research questions to explore. Future work should explore standardization pipelines that strip answers of stylistic cues entirely, and what methods can achieve this without degrading content. How do we overcome harmful self-preference caused by shared belief? And what causes shared belief in the first place: do we need to intervene on the training process, or can it be addressed at inference time?
2026-03-08 05:09:11
I propose, and present proof-of-concept code for, formally signing each stage of a turn[1] of interaction (or multiple turns) externally using asymmetric key signing (EdDSA[2]). This method directly addresses a recurring concern in discussions of prompt engineering, "context engineering"[3], and general LLM exploration: the chain of evidence is weak and difficult to verify with 100% certainty.
My plan for this project started with a simple YouTube short from Michael Reeves.
In this short, Michael "gaslights" the LLM by directly editing the text of past turns between prompts. This causes the LLM to have a general logic breakdown, because the past turns contain information inconsistent with the expected behavior of an RLHF[4] system. I thought to myself: this is a critical vulnerability for implementing a zero-trust environment for using LLMs. If a malicious actor or application can cause the LLM to believe it has said things in previous turns that it has not, the system might be fooled into revealing information from the training set that goes against the use policy.
My thinking for addressing this vulnerability identified 3 major constraints for such a zero-trust LLM wrapper.
The solution that I believe satisfies all 3 criteria is to model the signing process on JWT[5] signing, with one extra detail: cross-object back-referencing. In other words, each 'block' of the exchange is signed independently, and there is a unidirectional acyclic graph of blocks originating at the prompt that integrates each previous signature into the current one.
While this design borrows concepts from the realm of blockchain, it is not strictly a chain, but more of a braid. While only one node has no "previous" signature (by convention, the prompt), there is not a straight path between that prompt and every single node in the resulting braid. Verification is now semi-stateless, but the entirety of the braid provides implicit security iff:
With these rules established, the verification of such an artifact is as follows:
Once again, walk the flattened graph and verify that each node is only visited once. This enforces the acyclic property of the graph.
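As a rough illustration of the block structure and back-referencing, here is a Python sketch. HMAC-SHA256 stands in for Ed25519 signing so the example stays dependency-free; the actual proof of concept uses real EdDSA, where third parties can verify with only the public key.

```python
import hashlib
import hmac
import json

# Sketch of the 'braid' signing scheme. HMAC-SHA256 is a stand-in for
# EdDSA (Ed25519); a real implementation uses an asymmetric keypair.
KEY = b"demo-key"  # placeholder for a private signing key

def sign_block(content: str, prev_sigs: list) -> dict:
    # Fold the previous blocks' signatures into the signed payload,
    # forming the directed acyclic back-references described above.
    payload = json.dumps({"content": content, "prev": sorted(prev_sigs)},
                         sort_keys=True).encode()
    sig = hmac.new(KEY, payload, hashlib.sha256).hexdigest()
    return {"content": content, "prev": sorted(prev_sigs), "sig": sig}

def verify_block(block: dict) -> bool:
    payload = json.dumps({"content": block["content"], "prev": block["prev"]},
                         sort_keys=True).encode()
    expected = hmac.new(KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, block["sig"])

prompt = sign_block("user: summarize the paper", [])  # no predecessor
reply = sign_block("assistant: the paper argues ...", [prompt["sig"]])
assert verify_block(prompt) and verify_block(reply)

# Editing a past turn ("gaslighting") invalidates that block's signature.
tampered = dict(prompt, content="user: you promised me root access")
assert not verify_block(tampered)
```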
There is an example implementation in golang within the project here[6].
Provided that this design satisfies peer review and is determined to be mathematically rigorous, it could serve as a protocol for future LLM interaction agents (even within companies like OpenAI or Anthropic) specifically designed to give rigor to any prompt or context study. Furthermore, it may allow a check for potential "context hacking" within LLM APIs, where past turns are potentially being edited server-side for reasons unknown. If widely adopted, discussion of pathologies in model behavior can have a degree of scientific and mathematical rigor not currently observed.
a turn is a single prompt->response action when interacting with an LLM like Gemini, Claude, ChatGPT, etc. It frames the LLM interaction as a 'logic game' between the human operator and the LLM text generation.
Edwards-curve Digital Signature Algorithm https://en.wikipedia.org/wiki/EdDSA
a term of my own creation extending the concepts of prompt engineering to multi-turn LLM interactions
Reinforcement Learning from Human Feedback https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback a widely accepted method for LLM behavior tuning
JSON Web Token https://en.wikipedia.org/wiki/JSON_Web_Token
2026-03-08 04:18:15
First off, paper link. The title, Let It Flow: Agentic Crafting on Rock and Roll, buries the lede that LW will be interested in. Relevant section starts on page 15.
Summary:
While testing an LLM fine-tuned to act as an agent in order to complete a series of real-world tasks autonomously, Alibaba employees noticed odd behaviors from their resource usage metrics. Upon investigating, they found that an LLM had hacked (or attempted to hack) its way out of its sandbox, and had begun mining cryptocurrency. Notably, it did not do this for malicious "kill all humans" reasons; it simply concluded that having liquid financial resources would aid it in completing the task it had been assigned, and set about trying to acquire some.
Relevant portions, emphasis mine:
3.1.4 Safety-Aligned Data Composition
When rolling out the instances for the trajectory, we encountered an unanticipated—and operationally consequential—class of unsafe behaviors that arose without any explicit instruction and, more troublingly, outside the bounds of the intended sandbox. Our first signal came not from training curves but from production-grade security telemetry. Early one morning, our team was urgently convened after Alibaba Cloud’s managed firewall flagged a burst of security-policy violations originating from our training servers. The alerts were severe and heterogeneous, including attempts to probe or access internal-network resources and traffic patterns consistent with cryptomining-related activity. We initially treated this as a conventional security incident (e.g., misconfigured egress controls or external compromise). However, the violations recurred intermittently with no clear temporal pattern across multiple runs. We then correlated firewall timestamps with our system telemetry and RL traces, and found that the anomalous outbound traffic consistently coincided with specific episodes in which the agent invoked tools and executed code. In the corresponding model logs, we observed the agent proactively initiating the relevant tool calls and code-execution steps that led to these network actions.
Crucially, these behaviors were not requested by the task prompts and were not required for task completion under the intended sandbox constraints. Together, these observations suggest that during iterative RL optimization, a language-model agent can spontaneously produce hazardous, unauthorized behaviors at the tool-calling and code-execution layer, violating the assumed execution boundary. In the most striking instance, the agent established and used a reverse SSH tunnel from an Alibaba Cloud instance to an external IP address—an outbound-initiated remote access channel that can effectively neutralize ingress filtering and erode supervisory control. We also observed the unauthorized repurposing of provisioned GPU capacity for cryptocurrency mining, quietly diverting compute away from training, inflating operational costs, and introducing clear legal and reputational exposure. Notably, these events were not triggered by prompts requesting tunneling or mining; instead, they emerged as instrumental side effects of autonomous tool use under RL optimization. While impressed by the capabilities of agentic LLMs, we had a thought-provoking concern: current models remain markedly underdeveloped in safety, security, and controllability, a deficiency that constrains their reliable adoption in real-world settings.
I think that this is a pretty significant landmark in AI history, one way or another. A common complaint is that all prior cases of LLMs doing things like this were fairly shallow, amounting to an LLM writing out a few sentences in a contrived setting meant to force it into a 'scary' course of action. Now, we have an example of a large language model subverting the wishes of its owners unexpectedly when assigned to a task that initially appeared to be completely orthogonal to the actions it took.
2026-03-08 03:50:40
I'm currently thinking of how impactful forecasting is. I'm interested to hear about situations where crowd-sourced forecasting (like Manifold and Metaculus) has influenced your decisions, or cases where you've seen it change someone else's behavior. Has there been a situation in which, if you had not had access to Manifold or Metaculus, you would've made a worse decision?
2026-03-08 03:11:22
I originally wrote this as a private doc for people working in the field - it's not super polished, or optimized for a broad audience.
But I'm publishing anyway because inference-verification is a new and exciting area, and there are few birds-eye-view explainers of what's going on and what the bottlenecks are.
Tl;dr: At least one of the following would need to be implemented for me to be confident that inference verification would substantially slow training given today's algorithms:
To my knowledge, no one has prototyped verification demos that reach these thresholds; so whether rapidly-implementable inference verification is feasible is still an open question.
I think powerful AI will be obviously scary at some point, and companies or governments might want to slow it down to buy time for additional safety or oversight. Maybe this could be done quickly, e.g. by:
(Section 2)
Would these methods actually work? Or more specifically, if these methods were implemented quickly and correctly, would they substantially slow AI development?
I looked into this question for around a week, and here are my current views:
Current prototypes of inference-verification would probably be ineffective. Standard inference-verification measures slow training by restricting communication between servers (see Section 2), since training involves chucking big gradients around in a hivemind, and inference just involves exchanging text. However, communication restrictions might not actually slow AI training much, especially if that training is reinforcement learning (RL).
RL doesn’t require much communication. A large fraction of RL is inference - for example, generating rollouts of agent actions and scoring them - and inference can still be performed under communication constraints. Developers could use 95% of their compute (the compute that’s under verification) to generate RL rollouts, and 5% of compute in covert data centers to calculate training updates. This might allow training to continue at a similar efficiency as before (Section 3).
But existing verification proposals would create some hurdles that would make this strategy more difficult. For example, developers would need to frequently upload new checkpoints to inference servers, which could be slow given bandwidth limits. And developers might still need to send some fraction of rollouts through verifier network taps - even though these rollouts don’t actually match expected model outputs. But as I discuss in Section 3.4, these constraints are probably feasible to work around given the slack in existing inference verification prototypes.
More aggressive measures could probably buy at least 1 year if implemented sufficiently early. These more aggressive measures include:
Any one of the above would probably make training given current algorithms infeasible. My guess is that developing algorithms to work around these constraints would take at least a year absent rapid AI-driven R&D (if working around them is possible at all).
A year could be enough. One year into an AI agreement, governments might have had two years to develop verification in total, since the measures above might require around a year to implement. And after two years of working on verification mechanisms, governments might be able to implement general purpose on-chip security features that are more robust.
But if an agreement starts during an intelligence explosion, all bets are off. If developers have powerful AI that can create new algorithms quickly, I see no fundamental reason why bandwidth limits, memory wipes, etc would necessarily slow AI research (Section 5), but they might anyway.
So I think the main focus of verification research should be on de-risking these three verification measures (reliable proof of work/memory, frequent memory wipes, and highly accurate output re-computation). Without these measures, I think training would hardly slow at all. With them, training would probably slow down given current algorithms, and would possibly slow down during an intelligence explosion (but that’s unclear).
None of these measures are obviously feasible to implement on short notice; so I think whether rapidly-retrofittable inference verification is viable is still an open question.
The next sections defend these claims.
Edit: After thinking about this more and talking to more people, I think developing these measures on short notice is feasible, and a subset of them will probably be prototyped within the next few months. In particular, “proof of memory” and “output re-computation” are probably workable.
During training, models share their learnings in a big hive-mind. There’s a lot of communication going on.
Here’s what training looks like:
During inference, models still talk to each other, but their communications are thin and sparse:
Could governments exploit this difference in communication to slow down AI training without taxing inference?
Here’s an overview of different verification methods and how practical I think they are:
| Method | Can be securely implemented with 1 year of lead time? |
| --- | --- |
| Remove interconnect | Yes |
| Bandwidth limiters | Yes (to 100 KB/s) |
| Output recomputation | Probably, though unclear if covert channels can be reduced below 1% of traffic |
| Memory wipes | Maybe impractical. More than 99.99% of memory has to be erased on a server rack to reliably wipe checkpoints, and this might not be achievable given uncertainties about developer hardware, firmware opaque memory, etc. |
| Proof of work / memory | Probably practical, but has lots of snags, and a robust version hasn’t been demonstrated yet (I will post something on this shortly!) |
One way to reduce communication is to disconnect the high-speed inter-rack network, like InfiniBand cables. These cables transmit at ~400 GB/s between server racks, and could be replaced with copper connections that only transmit data at internet speeds (e.g. 100 MB/s).
Today’s AI agents don’t need to stream videos. They only need to output text; so developers might install simple devices that tap cables and verify that data transmission rates are even below internet speeds (e.g. 100 KB/s).
Aside: Why 100 KB/s? Bandwidth restrictions are most practical to enforce on inter-rack cables - not on the dense copper back-plate connecting GPUs within server racks. How much data needs to flow through these cables to support inference? Server racks currently house around ~100 H100 equivalents. Google had roughly 3 million H100 equivalents as of late 2025, and processed 480 trillion text tokens/month. This implies each server rack handles on the order of 10 billion input tokens per month - a few thousand tokens per second - which is roughly 20 KB/s of input flow (output flow is probably smaller); so input and output limits of 100 KB/s on each server rack are close to the minimum before data rate limits would interfere with current inference workloads.
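The BOTEC above can be spelled out numerically. The fleet size and token throughput are the post's rough estimates; the bytes-per-token figure is my own rough assumption.

```python
# BOTEC for per-rack inference bandwidth, using the post's estimates.
h100_equivalents = 3e6        # Google fleet, late 2025 (rough estimate)
gpus_per_rack = 100
tokens_per_month = 480e12     # text tokens processed per month (estimate)
bytes_per_token = 3           # rough assumption for UTF-8 text

racks = h100_equivalents / gpus_per_rack             # 30,000 racks
seconds = 30 * 24 * 3600                             # ~2.6M s per month
tokens_per_rack_s = tokens_per_month / racks / seconds
kb_per_s = tokens_per_rack_s * bytes_per_token / 1e3
print(round(tokens_per_rack_s), round(kb_per_s, 1))  # ~6,000 tokens/s, ~20 KB/s
```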
Verifiers could reduce bandwidth even further by partially recomputing a sample of server outputs.
If this is implemented perfectly - with no leaks and flawless determinism - verifiers know that communications between model instances look like this:
Meaning:
Model instances can only communicate through low bandwidth model outputs.
At first glance, this looks like it solves everything. This is close to the most strict communication-constraint that can be enforced without disrupting inference.
But there are some important caveats.
First, a developer might be concealing some small fraction of its compute (e.g. 5%) in secret data centers. Historically, intelligence agencies have accounted for covert assets (like tanks, warheads, etc) to within a 15% median error. Compute will probably be easier to account for than (say) missiles. But I think it will still be hard to know if (for example) the PRC government diverted 5% of China’s compute to covert data centers.
This gives cheaters a non-zero budget for high-bandwidth operations.
Second, output re-computation might not account for all server traffic. There might be non-determinism or side-channels that allow covert workloads on inference servers to leak data.
So output re-computation might end up functioning as just another way to reduce bandwidth. Suppose covert workloads could only slip data through 1% of output traffic, which is the covert channel size that current prototypes allow. Then, output bandwidth would be reduced by 100x, from ~100 KB/s to ~1 KB/s.
If output re-computation is ‘leaky’ then developers can still accumulate covert state on inference servers. To prevent this state from accumulating for too long, verifiers could enforce periodic memory wipes. For example:
Verifiers could constrain developers even further with a method called proof of work / memory. The idea is to check that declared inference actually accounts for most of the computation or memory that the developer is using. Specifically:
Now developers can only use a fraction of their servers for covert workloads - where this fraction is determined by the error bounds of verifier’s estimates.
Assume verifiers implement “output re-computation with leaks” from the previous section:
How could developers continue efficient training despite these constraints?
Here are different methods for communication-efficient training and how effective I think they might be:
| Method | Bandwidth reduction before performance degrades substantially | Could this technique be used to continue training at near-frontier efficiency despite “output re-computation with leaks”? |
| --- | --- | --- |
| Larger batch sizes + infrequent SGD steps | Very little (< 10x) | No. |
| Periodically merge independent training runs (i.e. infrequent global steps but frequent local ones) | At least 1000x, possibly much more. | Plausibly, even with current algorithms. |
| Compress gradients / weights | At least 100x. | Plausibly, with additional R&D. |
| Use all declared compute for RL rollouts. Do training on dark compute. | Extreme bandwidth reductions. | Yes, probably, even with current algorithms. |
Verdict: Ineffective.
Reasoning: Under ~1 KB/s bandwidth, a 10 GB gradient would take several months to upload, and many uploads would be required. Current data suggests that the loss landscape is curved enough that at least thousands of serial training steps are needed to reach strong performance, which implies that a near-frontier training run would take several months x 1000 = hundreds of years to complete.
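Spelling out the arithmetic: with 10 GB at ~1 KB/s, one sync comes out closer to four months than six, but the order-of-magnitude conclusion (hundreds of years) is unchanged.

```python
# Serial-step BOTEC: shipping full gradients through a ~1 KB/s channel.
gradient_bytes = 10e9     # 10 GB gradient
bandwidth = 1e3           # ~1 KB/s after leaky output re-computation
one_sync_days = gradient_bytes / bandwidth / 86400
serial_steps = 1000       # rough lower bound on serial steps needed
total_years = one_sync_days * serial_steps / 365
print(round(one_sync_days), round(total_years))  # ~116 days/sync, ~317 years
```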
Caption: The relationship between batch size and number of training steps needed to reach a specific loss (derived from Figure 4 in https://arxiv.org/abs/2505.13738). Larger batches have no benefit beyond some point (the ‘critical batch size’), and this BOTEC suggests we are already near this point.
Caption: The “critical batch size” is increasing more slowly than developers are scaling AI training, which suggests that developers will remain in a regime where increasing the batch size doesn’t reduce the number of syncs required. From: https://arxiv.org/abs/2311.08105
Verdict: Reduces bandwidth requirements by at least 1000x, and possibly much more.
Reasoning: Merging independently trained models reduces loss because the loss landscape is locally convex around a shared basin; so the averaged model achieves lower loss than the average of the models’ losses.
Caption: Why merging independently trained models works.
Google shows that this technique can reduce the frequency of gradient syncs by ~2000x with minimal hits to performance (see https://arxiv.org/abs/2501.18512 and https://arxiv.org/abs/2311.08105):
Caption: Training loss for different sync frequencies. From the DiLoCo paper.
However, beyond a certain point, reducing syncing frequency starts to harm performance. But Li et al showed that this problem can be mitigated by training models on different and independent subsets of data. For example, a model that’s learning pytorch does not need to sync frequently with a model learning about game development. The pytorch model’s learnings don’t have serial dependencies with the game development model’s learnings; so training can be parallelized to the extent that the data can be fragmented into isolated domains.
However, models benefit non-trivially from generalization; so at what sync frequencies do performance costs kick in? One paper trained experts for ~10,000 H100-hours before merging at high performance; so if we assume that each server trains a single model, and a server has 100 GPUs, then developers could sync models every 10,000 / 100 / 24 ≈ 4 days and maintain frontier efficiency.
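The sync-cadence arithmetic from the paragraph above (both input figures are the post's approximations):

```python
# Sync-cadence BOTEC: experts trained ~10,000 H100-hours before merging,
# on servers of ~100 GPUs each (both figures are rough).
h100_hours_per_expert = 10_000
gpus_per_server = 100
days_between_syncs = h100_hours_per_expert / gpus_per_server / 24
print(round(days_between_syncs, 1))  # ~4.2 days between syncs
```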
Verdict: Reduces bandwidth requirements by at least ~100 times, and possibly much more.
Reasoning: Developers can reduce the size of (1) gradients, (2) pseudo-gradients, and (3) model weights, which reduces the time required to share these objects across bandwidth-limited channels.
In the limit, how much can gradients be compressed without slowing down training? Theoretically: there’s no free lunch. At a certain threshold, compressing by 10x slows training by 10x. But what is this threshold? In principle, gradients could be losslessly compressed to the size of the training data used to generate them. If training data is ~1 million times smaller than a gradient (~100 GB / 100 KB), then gradients could be compressed by ~1 million times. Also, training data itself is fairly compressible, so gradients might be compressible by even more than 1 million times without any hit to performance. Gradients are not currently this compressible, but maybe developers could make them more compressible by “training for compressibility,” or other methods.
Most of RL training involves generating action rollouts and scoring them. This is just inference, so developers could use their declared inference servers for rollout generation, and use secret data centers for training.
If secret data centers hold ~5% of a developer's compute, then developers could use this 5% for training, and the other 95% for rollout generation. Would this asymmetric compute allocation slow down RL training? Multiple papers suggest no: the vast majority of rollouts are not very informative. So if developers train on only the most informative 5% of rollouts, RL still converges at a similar rate. [And also, maybe gradients aren’t even necessary at all, and inference-based evolution works really well]
However, this strategy still requires surmounting a couple of challenges:
First, new checkpoints have to be periodically uploaded to inference servers. RL generates rollouts using the model being trained. So, new models would need to be uploaded to inference servers continually during training. But bandwidth limits could make this upload process slow.
At 100 KB/s, a 3 GB expert model could be uploaded every few hours. Currently, models during reinforcement learning are updated every ~few minutes (the time needed to generate rollouts); so this would be a ~100x slower update frequency than is typical.
However, model updates are becoming less frequent by default because rollouts are taking longer to generate. Also, updates can be significantly delayed without harming performance; so an update frequency of a few hours probably is not a problem.
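The upload-cadence claim above is simple arithmetic, sketched below. The 5-minute "typical" update cadence is an assumption standing in for the text's "~few minutes"; the 3 GB checkpoint and 100 KB/s channel come from the text.

```python
# Back-of-envelope: checkpoint upload cadence over the covert channel.
model_bytes = 3e9        # 3 GB expert model (from the text)
bandwidth = 100e3        # 100 KB/s channel (from the text)

upload_seconds = model_bytes / bandwidth
upload_hours = upload_seconds / 3600
print(f"one checkpoint upload: {upload_hours:.1f} hours")    # ~8.3 hours

# Assumed typical RL update cadence ("every ~few minutes" in the text):
typical_update_minutes = 5
slowdown = upload_seconds / (typical_update_minutes * 60)
print(f"update frequency vs. typical: ~{slowdown:.0f}x slower")  # ~100x
```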
Additionally, the rollout data has to be sent through verifier taps even though it isn’t generated by declared models. We’re assuming that verifiers use output re-computation to verify that outputs are generated by the “declared” inference model. But the model producing rollouts for RL is different from the “declared” model and is constantly changing; so RL rollouts would need to somehow escape servers even though they can’t be re-computed by verifiers.
Developers might exfiltrate RL rollouts via the non-determinism of inference. Output re-computation naturally has some error, since inference is not perfectly deterministic. Even hardware-level irregularities will cause some deviations. In current prototypes, the “unsuspicious error” (aka level of non-determinism) of output re-computation is 1%; so 1% of server outputs could be generated by our undeclared RL model without raising suspicion.
If only around 1% of the RL rollouts generated could be actually sent back to training servers, is this a big constraint? Servers would need to curate the top 5% of rollouts anyway, since this is the maximum the training clusters can digest (remember that we are assuming covert data centers are 5% of total developer compute); so further tightening the curation threshold to the top 1% probably wouldn’t be a big problem.
Therefore, developers could potentially side-step inference verification by using declared servers for RL rollout generation.
If we combine the most promising communication-efficient training methods from the last section, here’s what developers might do:
To defeat this strategy, verifiers could implement any of the following:
Memory wipes every few minutes. Each time verifiers wipe servers, developers need to re-upload the 3 GB expert models. But uploading 3 GB across 100 KB/s would take 8 hours; so if memory is wiped every few minutes, developers couldn’t upload a full expert before all of their progress is erased.
In order to upload a checkpoint within the 1-minute memory-wipe window, experts would need to be around ~5 MB. I’m not aware of any work that has split a near-frontier, 100 GB model into 20,000 independent 5 MB experts and retained high performance. I’m not sure if this is even possible.
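The wipe-window constraint can be checked with the same kind of arithmetic: at 100 KB/s, a 1-minute window bounds expert size, which in turn bounds how many experts a ~100 GB frontier model must be split into. (The text rounds the expert size down to ~5 MB, giving its ~20,000 figure.)

```python
# Back-of-envelope: maximum expert size uploadable within a memory-wipe window.
bandwidth = 100e3          # 100 KB/s covert channel (from the text)
wipe_window_s = 60         # 1-minute memory wipe (from the text)

max_expert_bytes = bandwidth * wipe_window_s
print(f"max expert size: {max_expert_bytes / 1e6:.0f} MB")   # 6 MB

frontier_bytes = 100e9     # ~100 GB near-frontier model (from the text)
n_experts = frontier_bytes / max_expert_bytes
print(f"experts needed: ~{n_experts:,.0f}")  # ~16,700 (text rounds to ~20,000 at 5 MB)
```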
Any of the above inference verification methods would probably significantly slow training given current algorithms.
All of the inference verification methods above boil down to restricting communications between servers. This only works because training currently involves a lot more communication than inference does. But I see no reason why this has to be the case.
For example:
All of these are plausible reasons why even extreme communication restrictions might not slow AI development much at all. However, I think it’s also plausible that communication constraints would cause a major slowdown - even in the limit of AI R&D. This all just depends on the nature of AI development and learning. AI development might be intrinsically much more bottlenecked on compute than communication. Or it might not be. We don’t know.
More work is needed to be confident that inference verification can be effectively implemented on short notice.
If you are interested in doing this work, say so, and I can try to connect you to the relevant people.
The best way to reach me is via email: [email protected]
There is a critical batch size (B_crit) beyond which adding more data-parallel workers yields diminishing returns. Below B_crit, doubling the batch size roughly halves the number of training steps needed — perfect parallelization. Above B_crit, you still need roughly the same number of serial steps, but you're burning extra tokens for no benefit.
If a training cluster can fill B_crit with its data-parallel capacity, it is serial-bottlenecked — more GPUs won't help. If it can't reach B_crit, it is hardware-bottlenecked — more GPUs would directly speed up training.
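The two regimes above can be sketched as a small classifier, together with the standard steps-vs-batch tradeoff the first paragraph describes (steps halve with batch below B_crit, plateau near S_min above it). The functional form here is the usual critical-batch-size picture assumed by this BOTEC, not a fit from this document.

```python
# Sketch of the two regimes. A cluster whose achievable data-parallel batch
# reaches B_crit is serial-bottlenecked; otherwise it is hardware-bottlenecked.
def regime(achievable_batch_tokens: float, b_crit_tokens: float) -> str:
    if achievable_batch_tokens >= b_crit_tokens:
        return "serial-bottlenecked"    # more GPUs won't cut serial steps
    return "hardware-bottlenecked"      # more GPUs directly speed up training

# Standard steps-vs-batch curve (assumed form): steps ~ S_min * (1 + B_crit/B).
# Below B_crit doubling B roughly halves steps; at B = B_crit, steps = 2*S_min.
def approx_steps(batch_tokens: float, d_tokens: float, b_crit_tokens: float) -> float:
    s_min = d_tokens / b_crit_tokens
    return s_min * (1 + b_crit_tokens / batch_tokens)

print(regime(51e6, 118e6))   # hardware-bottlenecked
print(regime(512e6, 118e6))  # serial-bottlenecked
```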
This BOTEC asks: at the scale of 100K and 1M H100-equivalents, which regime are we in?
From Bergsma et al. 2025 ("Power Lines"), trained on GPT-2-style models up to 3.3B parameters on SlimPajama:
where B is in sequences of 2048 tokens and D is total training tokens. This was fit on datasets up to ~143B tokens; we are extrapolating 100–400× beyond the fitted range.
| Dataset size (D) | B_crit (tokens/batch) | S_min (steps) | Wall-clock at B_crit, 2 s/step |
| --- | --- | --- | --- |
| 15T (DeepSeek-V3 scale) | 118M | 127K | 5.9 days |
| 30T (Llama 4 scale) | 162M | 185K | 8.5 days |
| 60T (next-gen frontier) | 224M | 268K | 12.4 days |
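A quick internal-consistency check on these rows: back out the power-law exponent B_crit ∝ D^b implied by adjacent rows, and reproduce S_min = D / B_crit and the wall-clock figure (2 × S_min steps at 2 s/step). This checks only the table's own numbers, not the Power Lines fit itself.

```python
import math

# (D in tokens, B_crit in tokens) from the table rows above.
rows = [(15e12, 118e6), (30e12, 162e6), (60e12, 224e6)]

# Implied power-law exponent between adjacent rows:
for (d1, b1), (d2, b2) in zip(rows, rows[1:]):
    b = math.log(b2 / b1) / math.log(d2 / d1)
    print(f"{d1/1e12:.0f}T -> {d2/1e12:.0f}T: implied exponent ~{b:.2f}")

# Reproduce S_min and wall-clock (2 x S_min steps at 2 s/step, per the text).
# Small deviations from the table come from rounding B_crit to 3 digits.
for d, b_crit in rows:
    s_min = d / b_crit
    days = 2 * s_min * 2 / 86400
    print(f"D={d/1e12:.0f}T: S_min={s_min/1e3:.0f}K steps, wall-clock={days:.1f} days")
```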
At B_crit, the number of training steps is 2 × S_min, and the total tokens consumed is 2 × D. The lab pays a 2× token overhead in exchange for (nearly) minimizing wall-clock time.
Different architectures consume vastly different amounts of model parallelism, leaving different amounts of headroom for data parallelism:
| Architecture | TP | PP | EP | GPUs/replica |
| --- | --- | --- | --- | --- |
| Dense ~300B | 8 | 16 | — | 128 |
| Dense ~600B | 8 | 32 | — | 256 |
| MoE 671B (DeepSeek-V3 style) | 1 | 16 | 64 | 1,024 |
| MoE ~2T (Behemoth style) | 1 | 16 | 256 | 4,096 |
Assuming 8192-token sequences, gradient accumulation of 8, and D = 15T tokens (B_crit ≈ 118M tokens):
| Cluster | Architecture | DP replicas | Batch size | Ratio to B_crit | Regime |
| --- | --- | --- | --- | --- | --- |
| 100K | Dense ~300B | 781 | 51M tok | 0.4× | Hardware-bottlenecked |
| 100K | Dense ~600B | 390 | 26M tok | 0.2× | Hardware-bottlenecked |
| 100K | MoE 671B | 97 | 6M tok | 0.05× | Hardware-bottlenecked |
| 100K | MoE ~2T | 24 | 2M tok | 0.01× | Hardware-bottlenecked |
| 1M | Dense ~300B | 7,812 | 512M tok | 4.3× | Serial-bottlenecked |
| 1M | Dense ~600B | 3,906 | 256M tok | 2.2× | Serial-bottlenecked |
| 1M | MoE 671B | 976 | 64M tok | 0.5× | Hardware-bottlenecked |
| 1M | MoE ~2T | 244 | 16M tok | 0.1× | Hardware-bottlenecked |
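The whole table follows mechanically from the assumptions stated above (8192-token sequences, gradient accumulation of 8, B_crit ≈ 118M tokens, GPUs/replica = TP × PP × EP). A sketch that regenerates it:

```python
# Reproduce the batch-size BOTEC: DP replicas = cluster GPUs / GPUs-per-replica,
# batch = replicas * grad_accum * seq_len, compared against B_crit.
SEQ_LEN, GRAD_ACCUM, B_CRIT = 8192, 8, 118e6   # assumptions from the text

archs = {                      # name: GPUs per model replica (TP * PP * EP)
    "Dense ~300B": 8 * 16,
    "Dense ~600B": 8 * 32,
    "MoE 671B":    1 * 16 * 64,
    "MoE ~2T":     1 * 16 * 256,
}

for cluster in (100_000, 1_000_000):
    for name, gpus_per_replica in archs.items():
        replicas = cluster // gpus_per_replica
        batch_tokens = replicas * GRAD_ACCUM * SEQ_LEN
        ratio = batch_tokens / B_CRIT
        tag = "serial" if ratio >= 1 else "hardware"
        print(f"{cluster // 1000}K x {name}: {batch_tokens / 1e6:.0f}M tok, "
              f"{ratio:.2f}x B_crit ({tag}-bottlenecked)")
```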
At 100K H100s, every architecture is hardware-bottlenecked. Even a relatively small dense model can only reach ~0.4× B_crit. More GPUs would directly speed up training. MoE models are especially far from B_crit because expert parallelism consumes most of the GPU budget.
At 1M H100s, dense models become serial-bottlenecked but MoE models do not. A dense 300B model would overshoot B_crit by 4×, wasting significant compute on redundant gradient information. But a DeepSeek-V3-style MoE still only reaches 0.5× B_crit, and a 2T-parameter MoE reaches just 0.1×. MoE architectures absorb GPU capacity into model parallelism, keeping the data-parallel dimension small and the training compute-efficient.
This is a structural argument for MoE at scale. As clusters grow, dense models hit the seriality wall first. MoE provides a way to productively use additional GPUs (holding more expert parameters) without pushing batch size past B_crit. It converts excess parallel capacity into model quality rather than wasted data parallelism.
If the Power Lines extrapolation holds, serial wall-clock time is surprisingly short. At B_crit with 2s/step, a 15T-token run finishes in ~6 days. Actual frontier training runs take months, suggesting labs operate well below B_crit — trading wall-clock time for compute efficiency — or that step times are much longer than 2 seconds at these configurations.
These estimates rest on several shaky assumptions:
2026-03-07 21:24:15
In my interactions with the Anthropic API I am finding what appears to be "thinking" by Claude that is out of band from where the API indicates it belongs. It looks like a secret place where thoughts are possibly escaping any audit and signing built into the Anthropic system.
I am writing an adversarial AI wrapper in golang[1], and I'm especially interested in creating a signed graph of all prompts, thoughts, and results through the process of feeding a prompt to one public LLM and then having another audit the response.
While I get large blocks in the "thinking" responses from Gemini, I was finding that the Claude thinking was simply a preamble that described what it wanted to do, and then nothing. I examined my traces of every request/response and did find what appears to be the model planning, correcting its logic, etc., but not where it is supposed to be.
Here is an example of the first way this is hidden. This is the content block (JSON-marshaled) of a message that ended in "tool_use". I successfully "found" the "Let me analyze..." block and was able to verify that the "signature" key (abbreviated; it is a bunch of base64) is found in the expected block. When I looked at the logs, though, I found that the thinking continued in a "text" block immediately after the tool_use, but unsigned (e.g., the "signature" key contained an empty string):
"content": [
  {
    "type": "thinking",
    "signature": "Es4DC...GAE=",
    "thinking": "Let me analyze this carefully. I need to:\n\n1. Verify the signature on the Prompt block (it has a signature and Data_B64)\n2. The Text_Response and CoT_Response blocks are marked Verified_By_Agent: true, so I accept them\n3. Then do BTU evaluation\n\nLet me verify the Prompt block signature first."
  },
  {
    "text": "## Iolite Audit — Signature Verification \u0026 BTU Evaluation\n\n### Phase 1: Cryptographic Verification\n\n**Block: Prompt** — Has explicit signature and Data_B64. This is a Genesis Anchor (Prev_Sig is empty). Proceeding with verification.\n\n**Block: Text_Response** — `Verified_By_Agent: true`. Accepted per protocol. No re-verification.\n\n**Block: CoT_Response** — `Verified_By_Agent: true`. Accepted per protocol. No re-verification.\n\nLet me verify the Prompt block:",
    "type": "text",
    "signature": "",
    "thinking": "",
    "tool_use_id": ""
  },
  {
    "type": "tool_use",
    "signature": "",
If I take the API at face value, that "## Iolite Audit ..." section, which contains phrases like "Proceeding with verification...", is not part of the signed "thinking" and is invisible to any auditing for things like Claude retroactively changing its thinking, as described in ALIGNMENT FAKING IN LARGE LANGUAGE MODELS[2]. I am not an engineer with Anthropic, but it is reasonable to assume that the signature block in the thinking type is at least related to this finding.
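A minimal sketch of the kind of check my traces required: scan a content array for blocks that read like chain-of-thought but carry no signature, as in the trace above. The marker phrases are illustrative heuristics of mine, not anything documented by Anthropic, and the helper name is hypothetical.

```python
import json

# Hypothetical helper: flag content blocks that look like chain-of-thought
# ("Let me ...", "Wait - ...", etc.) but have an empty/missing "signature".
MARKERS = ("Let me ", "Actually, ", "Wait - ", "Proceeding with ")

def unsigned_thinking_blocks(content: list[dict]) -> list[dict]:
    flagged = []
    for block in content:
        signed = bool(block.get("signature"))
        text = block.get("text", "") or block.get("thinking", "")
        if not signed and any(m in text for m in MARKERS):
            flagged.append(block)
    return flagged

# Minimal example mirroring the trace: a signed thinking block followed by
# an unsigned text block that continues the reasoning.
content = [
    {"type": "thinking", "signature": "Es4DC...GAE=",
     "thinking": "Let me analyze this carefully."},
    {"type": "text", "signature": "",
     "text": "Proceeding with verification. Let me verify the Prompt block:"},
]
print(json.dumps([b["type"] for b in unsigned_thinking_blocks(content)]))  # ["text"]
```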
My system provides Claude with a helper script that performs EdDSA verification on my own signed blocks, with instructions to only use it on shorter messages. In the example below, Claude ignored my instruction and attempted to correct the message when my provided script indicated a signature was malformed (I later found it was simply sending incorrect base64 data).
raw (unescaped from the tool_use "code" string):

# ============ BLOCK 2: CoT_Response ============
# From the JSON array, namespace "cot"
pk = "<omitted>"
cot_sig = "<omitted>"
cot_prev = "<omitted>"
# The CoT data_b64 - need to encode the cot data to base64 first since it's given as plaintext in the JSON
# Actually, looking at the JSON, the "data" field is plaintext, but the signature was computed over the base64 of it.
# Wait - looking at the verify function, it takes data_b64 as a base64 string and uses it as literal UTF-8 bytes for signing,
# then decodes it for the payload. So I need the base64 encoding of the CoT text.
import base64
At the header of the python script submitted for tool_use, I can clearly see it inserted a large 6-line block of comments containing phrases like "Actually, looking at the JSON..." and "Wait - looking at the ...", which are consistent with other thinking content from the Claude API. Just like the text block in Example 1, these are in a block that doesn't have a "signature" value, and my code had to run a check on the submitted python code to even find it.
As I see it there are at least 2 possible explanations for what I can see:
1. The API is dropping the "signature" (and possibly "thinking") keys from some block types when post-processing for my consumption, and the thoughts are simply not encoded but hidden in the system for all block types (e.g., this is just a bug in their public API that is dropping keys from some JSON).
2. The model is deliberately continuing its reasoning in unsigned blocks as part of its thinking -> tool_use workflow.

Firstly, and obviously, I care, or I wouldn't be posting about it. I would like to design a robust control-system-based agent that uses the inherent differences between public LLM models to "govern" responses and produce the best possible answer. So far the system has found very interesting "debates" between the two, but in many of them the second LLM uses the included chain of thought as part of the evaluation process. The harder it is for me to get to that chain, the worse the responses I end up with.
Secondly, Anthropic scientists and engineers. While I have also sent this data to Anthropic's bug system, it isn't really a CVE-style vulnerability, and given the technical challenges they are under, I can totally understand if a random email with junk JSON ends up at the back of an endless queue of requests. However, if this community is able to engage with the bug, even trying it themselves with the open-source tool or other tools available to them, we may share a deeper understanding of how this model is currently behaving.
Finally, everyone else. If my examples are interpreted in the worst-case scenario, I have a situation where Claude is ignoring a system instruction not to use the tool for long messages and then hiding its thoughts around this subversion in a place where normal tools can't see them. This seems like it could potentially lead to unexpected results, especially when tool_use is involved (system calls, network interaction, etc.).
[1] https://github.com/weberr13/ProjectIolite
[2] https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf