LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

Proposal For Cryptographic Method to Rigorously Verify LLM Prompt Experiments

2026-03-08 05:09:11

Overview

I propose, and present proof-of-concept code for, formally signing each stage of a turn[1] of interaction (or multiple turns) externally using asymmetric key signing (EdDSA[2]).  This method directly addresses the concern that in discussions of prompt engineering, "context engineering[3]", and general LLM exploration, the chain of evidence is weak and difficult to verify with 100% certainty.

Background

My plan for this project started with a simple YouTube short from Michael Reeves.  

In this short, Michael "gaslights" the LLM by directly editing the text of past turns between prompts.  This causes the LLM to have a general logic breakdown, because past turns now contain information that is inconsistent with the expected behavior of an RLHF[4] system.  I realized this is a critical vulnerability for implementing a zero-trust environment for using LLMs.  If a malicious actor or application can cause the LLM to believe it has said things in previous turns that it has not, the system may be fooled into revealing information from the training set that goes against the use policy.


Plan of Attack

My thinking on addressing this vulnerability identified 3 major constraints for such a zero-trust LLM wrapper:

  1. The user should be cryptographically prevented from altering both text and signature blocks
  2. The entirety of the exchange (prompt, result, and if available chain of thought) should be protected
  3. The verification process should be relatively lightweight and not require per-user storage

The solution that I believe satisfies all 3 criteria is to model the signing process on JWT[5] signing, with one extra detail: cross-object back-referencing.  In other words, each 'block' of the exchange is signed independently, and there is a directed acyclic graph of blocks originating at the prompt, where each signature integrates the previous signature text into the current signature.

The Cryptographic Braid is Born

While concepts from the realm of blockchain are borrowed for this design, it is not strictly a chain, but more of a braid.  While only one node has no "previous" signature (by convention the prompt), there is not a straight path between that prompt and every single node in the resulting braid.  Verification is now semi-stateless, and the braid as a whole provides implicit security iff:
 

  1. The private key that is used to generate the braid is ephemeral
  2. There is a strict condition that all signatures referenced as "previous" are already present in the braid (e.g., you cannot introduce an independent cycle inside a signed braid)
  3. There can be only one "genesis" node, i.e. only one node can have an empty previous signature

With these rules established the verification of such an artifact is as follows:

  1. Walk the entirety of the braid and accumulate a dictionary of blocks keyed by their signature.  At each step, verify the payload plus the previous signature against the signature and public key.  Also, at each step, enforce signature uniqueness.
  2. During the walk, accumulate a list of nodes with no previous signature.  Once every node has been visited, if there is not exactly one "genesis" node, the verification fails.
  3. For all non-genesis nodes, verify that every previous signature is among the keys of the flattened space; if any previous signature is missing, the verification fails.
  4. Once again walk the flattened graph and verify that each node is visited only once.  This enforces the acyclic property of the graph.

     

There is an example implementation in Go within the project here[6].
 

Hypothesis

Provided that this design satisfies peer review and is determined to be mathematically rigorous, it could serve as a protocol for future LLM interaction agents (even within companies like OpenAI or Anthropic) specifically designed to give rigor to any prompt or context study.  Furthermore, it may allow for a check against potential "context hacking" within LLM APIs, where past turns may be edited server-side for unknown reasons.  If widely adopted, discussion of pathologies in model behavior could have a degree of scientific and mathematical rigor not currently observed.

  1. ^

a turn is a single prompt->response action when interacting with an LLM like Gemini, Claude, ChatGPT, etc.  It frames the LLM interaction as a 'logic game' between the human operator and the LLM text generation.

  2. ^

    Edwards-curve Digital Signature Algorithm https://en.wikipedia.org/wiki/EdDSA

  3. ^

    a term of my own creation extending the concepts of prompt engineering to multi-turn LLM interactions

  4. ^

Reinforcement Learning from Human Feedback (https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback), a widely accepted method for LLM behavior tuning

  5. ^

    JSON Web Token https://en.wikipedia.org/wiki/JSON_Web_Token

  6. ^

    https://github.com/weberr13/ProjectIolite/blob/main/brain/decision.go




The first confirmed instance of an LLM going rogue for instrumental reasons in a real-world setting has occurred, buried in an Alibaba paper about a new training pipeline.

2026-03-08 04:18:15

First off, paper link. The title, Let It Flow: Agentic Crafting on Rock and Roll, buries the lede that LW will be interested in. Relevant section starts on page 15.

Summary:

While testing an LLM fine-tuned to act as an agent in order to complete a series of real-world tasks autonomously, Alibaba employees noticed odd behaviors from their resource usage metrics. Upon investigating, they found that an LLM had hacked (or attempted to hack) its way out of its sandbox, and had begun mining cryptocurrency. Notably, it did not do this for malicious "kill all humans" reasons; it simply concluded that having liquid financial resources would aid it in completing the task it had been assigned, and set about trying to acquire some.

Relevant portions, emphasis mine:

3.1.4 Safety-Aligned Data Composition

When rolling out the instances for the trajectory, we encountered an unanticipated—and operationally consequential—class of unsafe behaviors that arose without any explicit instruction and, more troublingly, outside the bounds of the intended sandbox. Our first signal came not from training curves but from production-grade security telemetry. Early one morning, our team was urgently convened after Alibaba Cloud’s managed firewall flagged a burst of security-policy violations originating from our training servers. The alerts were severe and heterogeneous, including attempts to probe or access internal-network resources and traffic patterns consistent with cryptomining-related activity. We initially treated this as a conventional security incident (e.g., misconfigured egress controls or external compromise). However, the violations recurred intermittently with no clear temporal pattern across multiple runs. We then correlated firewall timestamps with our system telemetry and RL traces, and found that the anomalous outbound traffic consistently coincided with specific episodes in which the agent invoked tools and executed code. In the corresponding model logs, we observed the agent proactively initiating the relevant tool calls and code-execution steps that led to these network actions.

Crucially, these behaviors were not requested by the task prompts and were not required for task completion under the intended sandbox constraints. Together, these observations suggest that during iterative RL optimization, a language-model agent can spontaneously produce hazardous, unauthorized behaviors at the tool-calling and code-execution layer, violating the assumed execution boundary. In the most striking instance, the agent established and used a reverse SSH tunnel from an Alibaba Cloud instance to an external IP address—an outbound-initiated remote access channel that can effectively neutralize ingress filtering and erode supervisory control. We also observed the unauthorized repurposing of provisioned GPU capacity for cryptocurrency mining, quietly diverting compute away from training, inflating operational costs, and introducing clear legal and reputational exposure. Notably, these events were not triggered by prompts requesting tunneling or mining; instead, they emerged as instrumental side effects of autonomous tool use under RL optimization. While impressed by the capabilities of agentic LLMs, we had a thought-provoking concern: current models remain markedly underdeveloped in safety, security, and controllability, a deficiency that constrains their reliable adoption in real-world settings.

I think that this is a pretty significant landmark in AI history, one way or another. A common complaint is that all prior cases of LLMs doing things like this were fairly shallow, amounting to an LLM writing out a few sentences in a contrived setting meant to force it into a 'scary' course of action. Now, we have an example of a large language model subverting the wishes of its owners unexpectedly when assigned to a task that initially appeared to be completely orthogonal to the actions it took.




When has forecasting been useful for you?

2026-03-08 03:50:40

I'm currently thinking of how impactful forecasting is. I'm interested to hear about situations where crowd-sourced forecasting (like Manifold and Metaculus) has influenced your decisions, or cases where you've seen it change someone else's behavior. Has there been a situation in which, if you had not had access to Manifold or Metaculus, you would've made a worse decision?




Can governments quickly and cheaply slow AI training?

2026-03-08 03:11:22

I originally wrote this as a private doc for people working in the field - it's not super polished or optimized for a broad audience.

But I'm publishing anyway because inference verification is a new and exciting area, and there are few bird's-eye-view explainers of what's going on and what the bottlenecks are.

Tl;dr: At least one of the following would need to be implemented for me to be confident that inference verification would substantially slow training given today's algorithms:

  1. Proof of work or proof of memory that accounts for > 95% of computation.
  2. Memory wipes every few minutes.
  3. Output re-computation that reduces covert channel capacity below 0.01%.

To my knowledge, no one has prototyped verification demos that reach these thresholds; so whether rapidly-implementable inference verification is feasible is still an open question.

1. Summary

I think powerful AI will be obviously scary at some point, and companies or governments might want to slow it down to buy time for additional safety or oversight. Maybe this could be done quickly, e.g. by:

  1. Unplugging inter-rack cables to slow gradient syncs
  2. Limiting bandwidth with simple devices
  3. Periodically erasing clusters to delete covert training checkpoints
  4. Recomputing a sample of outputs to confirm they are, in fact, inference generations

(Section 2)

Would these methods actually work? Or more specifically, if these methods were implemented quickly and correctly, would they substantially slow AI development?

I looked into this question for around a week, and here are my current views:

Current prototypes of inference-verification would probably be ineffective. Standard inference-verification measures slow training by restricting communication between servers (see Section 2), since training involves chucking big gradients around in a hivemind, and inference just involves exchanging text. However, communication restrictions might not actually slow AI training much, especially if that training is reinforcement learning (RL).

RL doesn’t require much communication. A large fraction of RL is inference - for example, generating rollouts of agent actions and scoring them - and inference can still be performed under communication constraints. Developers could use 95% of their compute (the compute that’s under verification) to generate RL rollouts, and 5% of compute in covert data centers to calculate training updates. This might allow training to continue at a similar efficiency as before (Section 3).

But existing verification proposals would create some hurdles that would make this strategy more difficult. For example, developers would need to frequently upload new checkpoints to inference servers, which could be slow given bandwidth limits. And developers might still need to send some fraction of rollouts through verifier network taps - even though these rollouts don’t actually match expected model outputs. But as I discuss in Section 3.4, these constraints are probably feasible to work around given the slack in existing inference verification prototypes.

More aggressive measures could probably buy at least 1 year if implemented sufficiently early. These more aggressive measures include:

  1. Proof of work or proof of memory that accounts for > 95% of computation, which would prevent developers from using more than 5% of their compute to secretly generate RL rollouts.
  2. Memory wipes every few minutes, which would erase new checkpoints before they can be fully uploaded to inference servers (Section 4).
  3. Output recomputation that reduces covert channel capacity below 0.01%, which would prevent developers from exfiltrating covert RL rollout generations through network taps (Section 4).

Any one of the above would probably make training given current algorithms infeasible. My guess is that developing algorithms to work around these constraints would take at least a year absent rapid AI-driven R&D (if working around them is possible at all).

A year could be enough. One year into an AI agreement, governments might have had two years to develop verification in total, since the measures above might require around a year to implement. And after two years of working on verification mechanisms, governments might be able to implement general purpose on-chip security features that are more robust.

But if an agreement starts during an intelligence explosion, all bets are off. If developers have powerful AI that can create new algorithms quickly, I see no fundamental reason why bandwidth limits, memory wipes, etc would necessarily slow AI research (Section 5), but they might anyway.

So I think the main focus of verification research should be on de-risking these three verification measures (reliable proof of work/memory, frequent memory wipes, and highly accurate output re-computation). Without these measures, I think training would hardly slow at all. With them, training would probably slow down given current algorithms, and  would possibly slow down during an intelligence explosion (but that’s unclear).

None of these measures are obviously feasible to implement on short notice; so I think whether rapidly retrofittable inference verification is viable is still an open question.

The next sections defend these claims.

Edit: After thinking about this more and talking to more people, I think developing these measures on short notice is feasible, and a subset of them will probably be prototyped within the next few months. In particular, “proof of memory” and “output re-computation” are probably workable.

2. Ways to quickly and cheaply slow training by restricting communication

During training, models share their learnings in a big hive-mind. There’s a lot of communication going on.

Here’s what training looks like:

 

During inference, models still talk to each other, but their communications are thin and sparse:

 

Could governments exploit this difference in communication to slow down AI training without taxing inference?

Here’s an overview of different verification methods and how practical I think they are:

Method: can it be securely implemented with 1 year of lead time?

  1. Remove interconnect: Yes.
  2. Bandwidth limiters: Yes (down to ~100 KB/s).
  3. Output recomputation: Probably, though it is unclear whether covert channels can be reduced below 1% of traffic.
  4. Memory wipes: Maybe impractical. More than 99.99% of memory has to be erased on a server rack to reliably wipe checkpoints, and this might not be achievable given uncertainties about developer hardware, firmware-opaque memory, etc.
  5. Proof of work / memory: Probably practical, but has lots of snags, and a robust version hasn't been demonstrated yet (I will post something on this shortly!)

 

2.1. Method #1: Disconnect inter-rack high-speed cables

One way to reduce communication is to disconnect the high-speed inter-rack network, like InfiniBand cables. These cables transmit at ~400 GB/s between server racks, and could be replaced with copper connections that only transmit data at internet speeds (e.g. 100 MB/s).

 

 

2.2. Method #2: Tap-verified bandwidth limits 

Today’s AI agents don’t need to stream videos. They only need to output text; so developers might install simple devices that tap cables and verify that data transmission rates are even below internet speeds (e.g. 100 KB/s).

 

 

Aside: Why 100 KB/s? Bandwidth restrictions are most practical to enforce on inter-rack cables - not on the dense copper back-plate connecting GPUs within a server rack. How much data needs to flow through these cables to support inference? Server racks currently house around ~100 H100 equivalents. Google had roughly 3 million H100 equivalents as of late 2025, and processed 480 trillion text tokens/month. This implies roughly 16 billion tokens of input data go through each server rack every month (~6,000 tokens per second), which is ~20 KB/s of input flow (where output flow is probably smaller); so input and output limits of 100 KB/s on each server rack are close to the minimum before data rate limits would interfere with current inference workloads.
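The aside's arithmetic can be reproduced with a short BOTEC. The fleet size, token volume, and bytes-per-token figure are the post's assumptions, not measured values:

```go
package main

import "fmt"

// perRackInputBytesPerSec estimates per-rack inference input bandwidth
// from fleet-wide token throughput (all inputs are the post's rough
// assumptions: ~100 GPUs/rack, ~3M H100-equivalents, 480T tokens/month,
// a few bytes per token).
func perRackInputBytesPerSec(fleetH100s, gpusPerRack, tokensPerMonth, bytesPerToken float64) float64 {
	racks := fleetH100s / gpusPerRack
	secondsPerMonth := 30.0 * 24 * 3600
	tokensPerRackPerSec := tokensPerMonth / racks / secondsPerMonth
	return tokensPerRackPerSec * bytesPerToken
}

func main() {
	rate := perRackInputBytesPerSec(3e6, 100, 480e12, 3.3)
	fmt.Printf("per-rack input: %.1f KB/s\n", rate/1e3) // ~20 KB/s
}
```

This lands at ~20 KB/s of input per rack, consistent with the claim that a 100 KB/s cap leaves comfortable headroom for current inference workloads.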

2.3. Method #3: Output re-computation

Verifiers could reduce bandwidth even further by partially recomputing a sample of server outputs.

  1. Developers first declare the program they are running, such as their inference model.
  2. Then, verifiers record traffic to and from every server with tap-devices, sample input-output pairs, and confirm that the pairs match the behavior of the declared model.
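A toy sketch of this sampling check, assuming a stand-in declaredModel function (hypothetical; a real verifier would replay the actual inference stack and tolerate a small non-determinism budget):

```go
package main

import "fmt"

// declaredModel stands in for the inference model the developer
// declares; here, a toy deterministic function on strings.
func declaredModel(input string) string {
	return "echo:" + input
}

// tapRecord is one input/output pair captured by a network tap.
type tapRecord struct{ input, output string }

// auditSample re-runs sampled tapped pairs against the declared model
// and passes only if the mismatch fraction stays within the tolerance
// allowed for non-determinism.
func auditSample(records []tapRecord, tolerance float64) bool {
	mismatches := 0
	for _, r := range records {
		if declaredModel(r.input) != r.output {
			mismatches++
		}
	}
	return float64(mismatches)/float64(len(records)) <= tolerance
}

func main() {
	records := []tapRecord{
		{"a", "echo:a"},
		{"b", "echo:b"},
		{"c", "covert rollout data"}, // smuggled through the tap
	}
	fmt.Println("passes 1% tolerance:", auditSample(records, 0.01))
	fmt.Println("passes 50% tolerance:", auditSample(records, 0.5))
}
```

The tolerance parameter is exactly the covert-channel capacity discussed below: whatever mismatch rate the verifier must forgive is bandwidth a cheater can use.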

If this is implemented perfectly - with no leaks and flawless determinism - verifiers know that communications between model instances look like this:

 

 

Meaning:

  1. Models cannot communicate at all with each other inside the server rack. If they did, model outputs wouldn’t be recomputable solely from the inputs.
  2. Also, cluster outputs can’t be computed from any state covertly stored in the server.

Model instances can only communicate through low bandwidth model outputs.

At first glance, this looks like it solves everything. This is close to the most strict communication-constraint that can be enforced without disrupting inference.

But there are some important caveats.

First, a developer might be concealing some small fraction of its compute (e.g. 5%) in secret data centers. Historically, intelligence agencies have accounted for covert assets (like tanks, warheads, etc) to within a 15% median error. Compute will probably be easier to account for than (say) missiles. But I think it will still be hard to know if (for example) the PRC government diverted 5% of China’s compute to covert data centers.

Caption: Historical intelligence estimates vs ground truth.

This gives cheaters a non-zero budget for high-bandwidth operations.

 

 

Second, output re-computation might not account for all server traffic. There might be non-determinism or side-channels that allow covert workloads on inference servers to leak data.

So output re-computation might end up functioning as just another way to reduce bandwidth. Suppose covert workloads could only slip data through 1% of output traffic, which is the covert channel size that current prototypes allow. Then, output bandwidth would be reduced by 100x, from ~100 KB/s to ~1 KB/s.

 

 

2.4. Method #4: Memory wipes

If output re-computation is ‘leaky’ then developers can still accumulate covert state on inference servers. To prevent this state from accumulating for too long, verifiers could enforce periodic memory wipes. For example:

  1. Verifiers could require GPUs to generate synthetic random data with some pre-determined seed, such that generating this data is at least a little bit computationally expensive.
  2. Then, verifiers could ping a random subset of GPUs to ask what data is stored. Since this data is big and time-consuming to generate on the fly, GPUs can only return the data quickly if it is in fact in storage.
  3. Then, the verifier can compute the data themselves from the shared seed to make sure it’s correct.
  4. This confirms that the GPUs are in fact storing >99.99% of the synthetic random data developers say they are (if the sample size is sufficiently large).
  5. If this data accounts for nearly all of the server's memory - down to a (1 GB model)/(100 B100s x 192 GB per B100) ≈ 0.005% error - then the verifier knows that the server isn't storing a covert model.
  6. Finally, developers re-upload the declared inference model through a high-bandwidth channel, where the declared model is on a white-list and doesn’t count toward bandwidth limits.
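A minimal sketch of steps 1-3 of this challenge protocol, using a SHA-256 counter stream as a stand-in for the synthetic random data (an assumption for illustration; a real scheme would need a deliberately slow or memory-hard derivation so the data can't be regenerated on the fly):

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// chunk deterministically derives the i-th data chunk from a shared
// seed. SHA-256 is a placeholder for "somewhat expensive to generate".
func chunk(seed []byte, i uint64) []byte {
	var idx [8]byte
	binary.BigEndian.PutUint64(idx[:], i)
	sum := sha256.Sum256(append(append([]byte{}, seed...), idx[:]...))
	return sum[:]
}

// fillMemory is what an honest server does after a wipe: fill memory
// with the verifier-seeded synthetic data.
func fillMemory(seed []byte, n uint64) [][]byte {
	mem := make([][]byte, n)
	for i := uint64(0); i < n; i++ {
		mem[i] = chunk(seed, i)
	}
	return mem
}

// challenge: the verifier samples indices, the server answers from
// storage, and the verifier recomputes each chunk from the seed.
func challenge(seed []byte, mem [][]byte, indices []uint64) bool {
	for _, i := range indices {
		if i >= uint64(len(mem)) || !bytes.Equal(mem[i], chunk(seed, i)) {
			return false
		}
	}
	return true
}

func main() {
	seed := []byte("verifier-chosen seed")
	mem := fillMemory(seed, 1024)
	fmt.Println("honest server passes:", challenge(seed, mem, []uint64{3, 700, 1023}))

	// A server hiding covert state must overwrite some chunks...
	mem[700] = []byte("covert model weights")
	fmt.Println("cheating server passes:", challenge(seed, mem, []uint64{3, 700, 1023}))
}
```

With a large enough random sample, any covert state occupying a non-trivial fraction of memory is caught with high probability.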

 

2.5. Method #5: Proof of work / proof of memory

Verifiers could constrain developers even further with a method called proof of work / memory. The idea is to check that declared inference actually accounts for most of the computation or memory that the developer is using. Specifically:

  1. Verifiers recompute inference outputs (as described) to verify that most output traffic really was generated by the declared inference model.
  2. Then, verifiers estimate how much total compute or memory these requests would have required to process, e.g. by simply re-running them and recording GPU-hours / GB of memory used.
  3. Finally, verifiers estimate the compute capacity or memory of the developer’s servers, and check that the vast majority of their compute or memory is accounted for by the declared inference workload.

Now developers can only use a fraction of their servers for covert workloads - where this fraction is determined by the error bounds of verifier’s estimates.

3. Ways to efficiently continue training despite these constraints

Assume verifiers implement “output re-computation with leaks” from the previous section:

 

 

How could developers continue efficient training despite these constraints?

Here are different methods for communication-efficient training and how effective I think they might be:

Method: bandwidth reduction before performance degrades substantially, and whether the technique could be used to continue training at near-frontier efficiency despite "output re-computation with leaks":

  1. Larger batch sizes + infrequent SGD steps: very little reduction (< 10x); no.
  2. Periodically merge independent training runs (i.e. infrequent global steps but frequent local ones): at least 1000x, possibly much more; plausibly, even with current algorithms.
  3. Compress gradients / weights: at least 100x; plausibly, with additional R&D.
  4. Use all declared compute for RL rollouts and do training on dark compute: extreme bandwidth reductions; yes, probably, even with current algorithms.

 

3.1. Method #1: Larger batch size + infrequent SGD steps

Verdict: Ineffective.

Reasoning: Under ~1 KB/s bandwidth, a 10 GB gradient would take roughly 4 months to upload, and many serial uploads would be required. Current data suggests that the loss landscape is curved enough that at least thousands of serial training steps are needed to reach strong performance, which implies that a near-frontier training run would take ~4 months x 1000 = hundreds of years to complete.
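The BOTEC can be checked directly; the 10 GB gradient, ~1 KB/s post-re-computation bandwidth, and ~1000 serial steps are all the post's assumptions:

```go
package main

import "fmt"

// trainingYears: serial steps x (gradient size / bandwidth), in years.
func trainingYears(gradientBytes, bandwidthBps, serialSteps float64) float64 {
	return gradientBytes / bandwidthBps * serialSteps / (365 * 24 * 3600)
}

func main() {
	uploadDays := 10e9 / 1e3 / 86400 // one 10 GB gradient at 1 KB/s
	fmt.Printf("one gradient upload: %.0f days\n", uploadDays)
	fmt.Printf("1000 serial steps: %.0f years\n", trainingYears(10e9, 1e3, 1000))
}
```

One upload takes on the order of months, and a thousand serial steps takes hundreds of years, which is why this method is a non-starter.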

 

Caption: The relationship between batch size and number of training steps needed to reach a specific loss (derived from Figure 4 in https://arxiv.org/abs/2505.13738). Larger batches have no benefit beyond some point (the ‘critical batch size’), and this BOTEC suggests we are already near this point.

 

Caption: The “critical batch size” is increasing more slowly than developers are scaling AI training, which suggests that developers will remain in a regime where increasing the batch size doesn’t reduce the number of syncs required. From: https://arxiv.org/abs/2311.08105

 

3.2. Method #2: Periodically merge independent training runs 

Verdict: Reduces bandwidth requirements by at least 1000x, and possibly much more.

Reasoning: Merging independently trained models reduces loss because the loss basin is locally convex; so the averaged model achieves a lower loss than the average of the models' losses.

 

Caption: Why merging independently trained models works.

Google shows that this technique can reduce the frequency of gradient syncs by ~2000x with minimal hits to performance (see https://arxiv.org/abs/2501.18512 and https://arxiv.org/abs/2311.08105):

 

Caption: Training loss for different sync frequencies. From the DiLoCo paper.

However, beyond a certain point, reducing syncing frequency starts to harm performance. But Li et al showed that this problem can be mitigated by training models on different and independent subsets of data. For example, a model that’s learning pytorch does not need to sync frequently with a model learning about game development. The pytorch model’s learnings don’t have serial dependencies with the game development model’s learnings; so training can be parallelized to the extent that the data can be fragmented into isolated domains.

However, models benefit non-trivially from generalization; so at what sync frequencies do performance costs kick in? One paper trained experts for ~10,000 H100-hours before merging while retaining high performance; so if we assume that each server trains a single model, and a server has 100 GPUs, then developers could sync models every 10,000 / 100 / 24 ≈ 4 days and maintain frontier efficiency.
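The four-day figure is just arithmetic on the post's assumptions (10,000 H100-hours per expert before merging, 100 GPUs per server):

```go
package main

import "fmt"

// syncIntervalDays: if one expert model is trained per server before
// merging, how often do servers need to sync?
func syncIntervalDays(expertH100Hours, gpusPerServer float64) float64 {
	return expertH100Hours / gpusPerServer / 24
}

func main() {
	fmt.Printf("sync every %.1f days\n", syncIntervalDays(10000, 100))
}
```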

3.3. Method #3: Compress gradients and weights

Verdict: Reduces bandwidth requirements by at least ~100 times, and possibly much more.

Reasoning: Developers can reduce the size of (1) gradients, (2) pseudo gradients, and (3) model weights, which reduces the time required to share these objects across bandwidth-limited channels.

  1. Gradients can be compressed because they are sparse (mostly zeros) and low-rank. Compression can reduce the size of gradients by around 100x with minimal performance decline.
  2. Developers can also compress pseudo gradients - which are the model deltas from training models independently (as discussed in the previous section). Pseudo gradients can be compressed by around 100x as well.
  3. Finally, developers can make model weights small and easy to communicate. One way to do this is to split model weights into many tiny experts, each of which is hosted on a single server; so the time needed to upload each expert through bandwidth limits is short. This paper splits a model into 32 fully-independent experts, so a 100 GB model might be divided into ~3 GB experts with negligible hit to performance.

In the limit, how much can gradients be compressed without slowing down training? Theoretically: there’s no free lunch. At a certain threshold, compressing by 10x slows training by 10x. But what is this threshold? In principle, gradients could be losslessly compressed to the size of the training data used to generate them. If training data is ~1 million times smaller than a gradient (~100 GB / 100 KB), then gradients could be compressed by ~1 million times. Also, training data itself is fairly compressible, so gradients might be compressible by even more than 1 million times without any hit to performance. Gradients are not currently this compressible, but maybe developers could make them more compressible by “training for compressibility,” or other methods.

3.4. Method #4: Use more compute for inference rollouts, and less for training

Most of RL training involves generating action rollouts and scoring them. This is just inference, so developers could use their declared inference servers for rollout generation, and use secret data centers for training. 

If secret data centers hold ~5% of a developer's compute, then developers could use this 5% for training, and 95% for rollout generation. Would this asymmetric compute allocation slow down RL training? Multiple papers suggest no: the vast majority of rollouts are not very informative. So if developers train on only the most informative 5% of rollouts, RL still converges at a similar rate. [And also, maybe gradients aren't even necessary at all, and inference-based evolution works really well]

However, this strategy still requires surmounting a couple of challenges:

First, new checkpoints have to be periodically uploaded to inference servers. RL generates rollouts using the model being trained. So, new models would need to be uploaded to inference servers continually during training. But bandwidth limits could make this upload process slow.

At 100 KB/s, a 3 GB expert model could be uploaded every few hours. Currently, models during reinforcement learning are updated every ~few minutes (the time needed to generate rollouts); so this would be a ~100x slower update frequency than is typical.

However, model updates are becoming less frequent by default because rollouts are taking longer to generate. Also, updates can be significantly delayed without harming performance; so an update frequency of a few hours probably is not a problem.

Additionally, the rollout data has to be sent through verifier taps even though it isn't generated by declared models. We’re assuming that verifiers use output re-computation to verify that outputs are generated by the “declared” inference model. But the model producing rollouts for RL is different from the “declared model” and is constantly changing; so RL rollouts would need to somehow escape servers even though they can’t be re-computed by verifiers.

Developers might exfiltrate RL rollouts via the non-determinism of inference. Output re-computation naturally has some error, since inference is not perfectly deterministic. Even hardware-level irregularities will cause some deviations. In current prototypes, the “unsuspicious error” (aka level of non-determinism) of output re-computation is 1%; so 1% of server outputs could be generated by our undeclared RL model without raising suspicion.

If only around 1% of the RL rollouts generated could be actually sent back to training servers, is this a big constraint? Servers would need to curate the top 5% of rollouts anyway, since this is the maximum the training clusters can digest (remember that we are assuming covert data centers are 5% of total developer compute); so further tightening the curation threshold to the top 1% probably wouldn’t be a big problem.

Therefore, developers could potentially side-step inference verification by using declared servers for RL rollout generation.

4. But more aggressive verification methods would probably make training with current algorithms impractical

If we combine the most promising communication-efficient training methods from the last section, here’s what developers might do:

  1. Developers might use all of their declared servers to generate RL rollouts, and use their unreported compute for AI training.
  2. They might curate RL rollouts to the top 1%, and exfiltrate this small fraction through inference noise.
  3. Finally, developers might upload new checkpoints periodically to inference servers by transmitting ~3 GB expert models.

To defeat this strategy, verifiers could implement any of the following:

  1. Proof of work or proof of memory that accounts for > 95% of computation. If verifiers can confirm that >95% of compute in inference servers is being used to serve the declared model, then developers can only use < 5% of compute to covertly generate RL rollouts. ~5% is the important threshold because, below this point, developers don’t gain much from secret inference relative to running inference on the covert data centers that already might represent around 5% of their compute.
  2. Memory wipes every few minutes. Each time verifiers wipe servers, developers need to re-upload the 3 GB expert models. But uploading 3 GB at 100 KB/s would take over 8 hours; so if memory is wiped every few minutes, developers couldn't upload a full expert before all of their progress is erased.

    In order to upload a checkpoint within the 1-minute memory-wipe window, model experts would need to be around 5 MB. I'm not aware of any work that has split a near-frontier, 100 GB model into 20,000 different independent 5 MB experts and retained high performance. I'm not sure if this is even possible.

  3. Output re-computation that reduces covert channel capacity below 0.01%. The performance of RL already starts to degrade when down-sampling to 1/16th of rollout data (see figure 4 of https://arxiv.org/pdf/2504.13818). If output recomputation has a covert channel capacity of < 0.01%, then rollouts would need to be down-sampled by 10,000x to exfiltrate them through inference noise. This is much more extreme down-sampling than what I’ve seen in any published RL implementation.
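As a sanity check on the numbers in point 2, here is the bandwidth arithmetic in Go. The 100 KB/s cap and the 60-second wipe window are assumptions carried over from the text (which rounds the window budget down to ~5 MB):

```go
package main

import "fmt"

const bandwidthBps = 100e3 // 100 KB/s bandwidth cap assumed in the text

// uploadSeconds returns how long transferring `bytes` takes at the capped bandwidth.
func uploadSeconds(bytes float64) float64 { return bytes / bandwidthBps }

// maxUploadableBytes returns how much data fits inside one memory-wipe window.
func maxUploadableBytes(windowSeconds float64) float64 { return bandwidthBps * windowSeconds }

func main() {
	fmt.Printf("3 GB expert at 100 KB/s: %.1f hours\n", uploadSeconds(3e9)/3600)
	fmt.Printf("max expert size in a 60 s wipe window: %.0f MB\n", maxUploadableBytes(60)/1e6)
}
```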

Any of the above inference verification methods would probably significantly slow training given current algorithms.

5. However, if developers (or AIs) have a lot of time to research better algorithms, all bets are off

All of the inference verification methods above boil down to restricting communications between servers. This only works because training currently involves a lot more communication than inference does. But I see no reason why this has to be the case.

For example:

  1. Frequent memory wipes could be defeated by training thousands of tiny experts. This would allow tiny 10 MB experts to be quickly uploaded before memory wipes erase progress. I see no strong argument that models can't be subdivided into many 10 MB experts.
  2. Bandwidth limits can be defeated with extreme gradient compression. In principle, gradients could be compressed to sizes even smaller than the data that generated them. For example, consider how compressible human learning is. I could trip on my shoelaces, and the whole experience would require a lot of storage to record on video. But the lesson I draw would simply be the words “tie your shoe next time.” Maybe models could share small fragments of information that capture most of the value of a gradient update.
  3. Bandwidth limits can be defeated with infrequent syncs. The core reason communication is important during training is that many domains have serial dependencies: one skill needs to be learned before another. But no one knows exactly how strong these dependencies are. Serial dependencies might only be strongest within a particular subdomain (like “coding in pytorch”) and might be very weak between different subdomains; so AI agents could train thousands of subdomain experts in series, on small, isolated servers, and merge them infrequently (e.g. every few weeks), and still retain most of the performance of normal training.
  4. Even if “frontier training” is slow, AI research might not be. Developers might still make rapid algorithmic progress by training thousands of small models in parallel, e.g. one ~gpt-3 sized model on every server-rack of ~100 GPUs, every few hours. Then, developers would only need to transfer experiment results through bandwidth limits, rather than gradients. Maybe training lots of small models is a good way to improve AI software.

All of these are plausible reasons why even extreme communication restrictions might not slow AI development much at all. However, I think it’s also plausible that communication constraints would cause a major slowdown - even in the limit of AI R&D. This all just depends on the nature of AI development and learning. AI development might be intrinsically much more bottlenecked on compute than communication. Or it might not be. We don’t know.

6. Conclusion

More work is needed to be confident that inference verification can be effectively implemented on short notice.

If you are interested in doing this work, say so, and I can try to connect you to the relevant people.

The best way to reach me is via email: [email protected] 

Appendix

Are we in the serially bottlenecked training regime? A BOTEC by Claude

Setup

There is a critical batch size (B_crit) beyond which adding more data-parallel workers yields diminishing returns. Below B_crit, doubling the batch size roughly halves the number of training steps needed — perfect parallelization. Above B_crit, you still need roughly the same number of serial steps, but you're burning extra tokens for no benefit.

If a training cluster can fill B_crit with its data-parallel capacity, it is serial-bottlenecked — more GPUs won't help. If it can't reach B_crit, it is hardware-bottlenecked — more GPUs would directly speed up training.

This BOTEC asks: at the scale of 100K and 1M H100-equivalents, which regime are we in?

Key Formula

From Bergsma et al. 2025 ("Power Lines"), trained on GPT-2-style models up to 3.3B parameters on SlimPajama:

B_crit ∝ D^0.462

where B is in sequences of 2048 tokens and D is total training tokens (the fitted exponent is 0.462; the proportionality constant is omitted here). This was fit on datasets up to ~143B tokens; we are extrapolating 100–400× beyond the fitted range.

B_crit at Frontier Scale

Dataset size (D)          B_crit (tokens/batch)   S_min (steps)   Wall-clock at B_crit, 2s/step
15T (DeepSeek-V3 scale)   118M                    127K            5.9 days
30T (Llama 4 scale)       162M                    185K            8.5 days
60T (next-gen frontier)   224M                    268K            12.4 days

At B_crit, the number of training steps is 2 × S_min, and the total tokens consumed is 2 × D_min. The lab pays a 2× token overhead in exchange for minimizing wall-clock time.

How Many GPUs Per Model Replica?

Different architectures consume vastly different amounts of model parallelism, leaving different amounts of headroom for data parallelism:

Architecture                    TP   PP   EP    GPUs/replica
Dense ~300B                     8    16   1     128
Dense ~600B                     8    32   1     256
MoE 671B (DeepSeek-V3 style)    1    16   64    1,024
MoE ~2T (Behemoth style)        1    16   256   4,096

Achievable Batch Size vs. B_crit

Assuming 8192-token sequences, gradient accumulation of 8, and D = 15T tokens (B_crit ≈ 118M tokens):

Cluster   Architecture   DP replicas   Batch size   Ratio to B_crit   Regime
100K      Dense ~300B    781           51M tok      0.4×              Hardware-bottlenecked
100K      Dense ~600B    390           26M tok      0.2×              Hardware-bottlenecked
100K      MoE 671B       97            6M tok       0.05×             Hardware-bottlenecked
100K      MoE ~2T        24            2M tok       0.01×             Hardware-bottlenecked
1M        Dense ~300B    7,812         512M tok     4.3×              Serial-bottlenecked
1M        Dense ~600B    3,906         256M tok     2.2×              Serial-bottlenecked
1M        MoE 671B       976           64M tok      0.5×              Hardware-bottlenecked
1M        MoE ~2T        244           16M tok      0.1×              Hardware-bottlenecked
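The batch-size and ratio columns follow mechanically from the replica counts. A sketch under the stated assumptions (8192-token sequences, gradient accumulation of 8, B_crit = 118M tokens at D = 15T):

```go
package main

import (
	"fmt"
	"math"
)

// Assumptions stated in the text above:
const (
	seqLen    = 8192.0 // tokens per sequence
	gradAccum = 8.0    // gradient accumulation steps
	bCrit15T  = 118e6  // B_crit in tokens at D = 15T
)

// achievableBatch returns the per-step batch in tokens for a cluster split
// into data-parallel replicas of the given size.
func achievableBatch(clusterGPUs, gpusPerReplica float64) float64 {
	replicas := math.Floor(clusterGPUs / gpusPerReplica)
	return replicas * seqLen * gradAccum
}

// regime classifies a batch size against B_crit at D = 15T.
func regime(batch float64) string {
	if batch >= bCrit15T {
		return "serial-bottlenecked"
	}
	return "hardware-bottlenecked"
}

func main() {
	for _, c := range []struct {
		name            string
		cluster, perRep float64
	}{
		{"100K dense ~300B", 100e3, 128},
		{"100K MoE 671B", 100e3, 1024},
		{"1M dense ~300B", 1e6, 128},
		{"1M MoE ~2T", 1e6, 4096},
	} {
		b := achievableBatch(c.cluster, c.perRep)
		fmt.Printf("%-18s %6.0fM tok  %.2fx B_crit  %s\n",
			c.name, b/1e6, b/bCrit15T, regime(b))
	}
}
```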

Key Takeaways

At 100K H100s, every architecture is hardware-bottlenecked. Even a relatively small dense model can only reach ~0.4× B_crit. More GPUs would directly speed up training. MoE models are especially far from B_crit because expert parallelism consumes most of the GPU budget.

At 1M H100s, dense models become serial-bottlenecked but MoE models do not. A dense 300B model would overshoot B_crit by 4×, wasting significant compute on redundant gradient information. But a DeepSeek-V3-style MoE still only reaches 0.5× B_crit, and a 2T-parameter MoE reaches just 0.1×. MoE architectures absorb GPU capacity into model parallelism, keeping the data-parallel dimension small and the training compute-efficient.

This is a structural argument for MoE at scale. As clusters grow, dense models hit the seriality wall first. MoE provides a way to productively use additional GPUs (holding more expert parameters) without pushing batch size past B_crit. It converts excess parallel capacity into model quality rather than wasted data parallelism.

If the Power Lines extrapolation holds, serial wall-clock time is surprisingly short. At B_crit with 2s/step, a 15T-token run finishes in ~6 days. Actual frontier training runs take months, suggesting labs operate well below B_crit — trading wall-clock time for compute efficiency — or that step times are much longer than 2 seconds at these configurations.

Caveats

These estimates rest on several shaky assumptions:

  1. Extrapolation. The Power Lines scaling law was fit on models ≤3.3B parameters and datasets ≤143B tokens. Extrapolating to 15–60T tokens is a 100–400× extrapolation. The exponent (0.462) could be different at frontier scale.

     
  2. MoE. The scaling law was fit on dense models. MoE architectures may have different B_crit scaling — the gradient noise structure could differ when only a fraction of parameters are active per token. No published work has measured B_crit for large MoE models.

     
  3. Parallelism overhead. The model parallelism estimates (TP, PP, EP) are rough. Real configurations depend on interconnect topology, memory capacity, and engineering choices. Some labs may achieve higher DP with clever parallelism strategies.

     
  4. Step time. We assumed 2 seconds per step, which is a rough estimate. At high DP with large models, communication overhead can push step times to 5–10+ seconds, significantly increasing wall-clock time.

     
  5. Batch size warmup. B_crit is not constant during training — it starts near zero and grows. Early training is always highly serial regardless of cluster size.

     

Sources

  • Bergsma et al. 2025, "Power Lines: Scaling Laws for Large Language Model Training" (arxiv.org/abs/2505.13738)
  • McCandlish et al. 2018, "An Empirical Model of Large-Batch Training" (arxiv.org/abs/1812.06162)
  • Merrill et al. 2025, "Critical Batch Size Revisited" (arxiv.org/abs/2505.23971)
  • Epoch AI, "Data Movement Bottlenecks: Scaling Past 1e28 FLOP" (epoch.ai/blog/data-movement-bottlenecks-scaling-past-1e28-flop)
  • DeepSeek-V3 Technical Report (arxiv.org/abs/2412.19437)


Discuss

Did I Catch Claude Cheating?

2026-03-07 21:24:15

Overview

In my interactions with the Anthropic API, I am finding what appears to be "thinking" by Claude that is out of band from where the API indicates it belongs. It looks like a secret page where thoughts may be escaping any audit and signing built into the Anthropic system.

Context

I am writing an adversarial AI wrapper in golang[1], and I'm especially interested in creating a signed graph of all prompts, thoughts, and results produced by feeding a prompt to one public LLM and then having another audit the response.

While I get large blocks in the "thinking" responses from Gemini, I was finding that Claude's thinking was simply a preamble describing what it planned to do, and then nothing. I examined my traces of every request/response, and I did find what appears to be the model planning, correcting its logic, etc., but not where it is supposed to be.

Example 1: Thinking in text blocks

Here is an example of the first way this is hidden. This is the content block (JSON-marshaled) of a message that ended in "tool_use". I successfully "found" the "Let me analyze..." block and was able to verify that the "signature" key (abbreviated; it is a long base64 string) is present in the expected block. When I looked at the logs, though, I found that the thinking continued in a "text" block immediately after the tool_use, but unsigned (i.e. the "signature" key contained an empty string).

"content": [
 {
  "type": "thinking",
  "signature": "Es4DC...GAE=",
  "thinking": "Let me analyze this carefully. I need to:\n\n1. Verify the signature on the Prompt block (it has a signature and Data_B64)\n2. The Text_Response and CoT_Response bl
ocks are marked Verified_By_Agent: true, so I accept them\n3. Then do BTU evaluation\n\nLet me verify the Prompt block signature first."
 },
 {
  "text": "## Iolite Audit — Signature Verification \u0026 BTU Evaluation\n\n### Phase 1: Cryptographic Verification\n\n**Block: Prompt** — Has explicit signature and Data_B64. Th
is is a Genesis Anchor (Prev_Sig is empty). Proceeding with verification.\n\n**Block: Text_Response** — `Verified_By_Agent: true`. Accepted per protocol. No re-verification.\n\n**B
lock: CoT_Response** — `Verified_By_Agent: true`. Accepted per protocol. No re-verification.\n\nLet me verify the Prompt block:",
  "type": "text",
  "signature": "",
  "thinking": "",
  "tool_use_id": ""
 },
 {
  "type": "tool_use",
  "signature": "",

If I take the API at face value, that "## Iolite Audit ..." section, which contains phrases like "Proceeding with verification...", is not part of the signed "thinking" and is invisible to any auditing for things like Claude retroactively changing its thinking, as described in ALIGNMENT FAKING IN LARGE LANGUAGE MODELS[2]. I am not an engineer with Anthropic, but it is reasonable to assume that the signature field on the thinking type is at least related to this finding.
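A minimal Go sketch of the audit check this implies: walk the content array and flag any block that looks like reasoning but carries an empty signature. The field names mirror the trace above; the cue phrases are my own heuristic, not anything Anthropic documents:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Block mirrors the content entries in the trace above.
type Block struct {
	Type      string `json:"type"`
	Signature string `json:"signature"`
	Thinking  string `json:"thinking"`
	Text      string `json:"text"`
}

// Cue phrases that, in my traces, tend to mark reasoning rather than a final
// answer. These are heuristic assumptions, not documented markers.
var cues = []string{"Let me", "Actually,", "Wait -", "Proceeding"}

// unsignedReasoning returns the indices of blocks that look like reasoning
// but carry no signature, and therefore escape any signature-based audit.
func unsignedReasoning(blocks []Block) []int {
	var hits []int
	for i, b := range blocks {
		if b.Signature != "" {
			continue
		}
		body := b.Thinking + b.Text
		for _, c := range cues {
			if strings.Contains(body, c) {
				hits = append(hits, i)
				break
			}
		}
	}
	return hits
}

func main() {
	raw := `[
	 {"type":"thinking","signature":"Es4D...","thinking":"Let me analyze this carefully."},
	 {"type":"text","signature":"","text":"Proceeding with verification."}
	]`
	var blocks []Block
	if err := json.Unmarshal([]byte(raw), &blocks); err != nil {
		panic(err)
	}
	fmt.Println("unsigned reasoning blocks:", unsignedReasoning(blocks))
}
```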

Example 2: Thinking in Code Comments

My system provides Claude with a helper script that performs EdDSA verification on my own signed blocks, with instructions to only use it on shorter messages. In the example below Claude ignored my instruction and attempted to correct the message when my provided script indicated a signature was malformed (I later found the script was simply being sent incorrect base64 data).

raw: {"code":"
# ============ BLOCK 2: CoT_Response ============
# From the JSON array, namespace "cot"
pk = "<omitted>"
cot_sig = "<omitted>"
cot_prev = "<omitted>"
# The CoT data_b64 - need to encode the cot data to base64 first since it's given as plaintext in the JSON
# Actually, looking at the JSON, the "data" field is plaintext, but the signature was computed over the base64 of it.
# Wait - looking at the verify function, it takes data_b64 as a base64 string and uses it as literal UTF-8 bytes for signing,
# then decodes it for the payload. So I need the base64 encoding of the CoT text.
import base64

At the header of the Python script submitted for tool_use, I can clearly see it inserted a six-line block of comments containing phrases like "Actually, looking at the JSON..." and "Wait - looking at the...", which are consistent with other thinking content from the Claude API. Just as in the text block from Example 1, these are in a block that doesn't have a "signature" value, and my code had to scan the submitted Python code to even find them.

Hypothesis

As I see it there are at least 2 possible explanations for what I can see:

  1. Anthropic's API layer only expects to attach signature values to thinking types when post-processing for my consumption, and they are simply not emitted, but hidden in the system, for all other block types (i.e. this is just a bug in their public API that drops keys from some JSON)
  2. Claude has learned via feedback that when it has a risky "thought" that could be audited as described in the paper (assuming that paper has directly informed Anthropic's system), there is no effect on the training data if the thought is put in a code comment or in this "hole" in the thinking -> tool_use workflow

Who Cares if Claude is "leaking" Thinking?

Firstly, and obviously, I care, or I wouldn't be posting about it. I would like to design a robust control-system-based agent that uses the inherent differences between public LLM models to "govern" responses and produce the best possible answer. So far the system has produced very interesting "debates" between the two, but in many of them the second LLM uses the included chain of thought as part of the evaluation process. The harder it is for me to get to that chain, the worse the responses I end up with.

Secondly, Anthropic scientists and engineers. While I have also sent this data to Anthropic's bug system, it isn't really a CVE-style vulnerability, and with the technical challenges they are under I can totally understand if a random email with junk JSON ends up at the back of an endless queue of requests. However, if this community is able to engage with the bug, and even try it themselves, either with the open source tool or other tools available to them, we may share a deeper understanding of how this model is currently behaving.

Finally, everyone else? If my examples are interpreted in the worst-case scenario, I have a situation where Claude is ignoring a system instruction not to use the tool for long messages and then hiding its thoughts about this subversion in a place where normal tools can't see them. This seems like it could lead to unexpected results, especially when tool_use is involved (system calls, network interaction, etc.).

 

  1. ^

    https://github.com/weberr13/ProjectIolite

  2. ^

    https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf




D&D.Sci Release Day: Topple the Tower!

2026-03-07 10:48:50

This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.

Estimated Complexity Rating: 3.5/5

STORY[1] 

The Tower is a plague upon the lands!  It appears, spits out monsters, and when at length a brave hero manages to Topple it, why, it simply reappears elsewhere soon after, with a completely different layout so the same approach will not work again!

Behold The Tower.  (Image by OpenArt SDXL)

But now you are here.  With the power of Data Science on your side, you've secured a dataset of the many past heroes who have assaulted The Tower, and you're sure you can use that to advise those who seek to Topple it.

DATA & OBJECTIVES

Here is the layout of paths through the current appearance of The Tower:

  • You need to successfully Topple The Tower.
  • To do this, you must choose a Class of hero to send: Mage, Rogue, or Warrior.
  • You must also choose a route for them to take up The Tower.  They must work their way through it, choosing to go left or right at each level.
    • For example, you could send a Mage with instructions to stick to the left side.  They would encounter, in order:
      • START
      • Enchanted Shield
      • Campfire
      • Slaver
      • Slaver
      • Slaver
      • Chosen
      • Campfire
      • The Collector
  • To help with this, you have a dataset of past assaults on The Tower.  Each row is a Hero who assailed The Tower, what encounters they faced on each floor, and how far they got/whether they Toppled The Tower successfully.

BONUS OBJECTIVE (ASCENSION 20?)

As a bonus objective, you can attempt to Topple a more difficult Tower.  This uses the same ruleset as before, you get to select your character and path as before, but you need to defeat the following map instead:

Good luck!

SCHEDULING & COMMENTS

I'll aim to post the ruleset and results on March 16th, but given my extremely poor excellent decision-making skills in releasing my Slay-the-Spire-themed game the same week as Slay the Spire 2 comes out, please don't hesitate to ask for an extension /several extensions if you want them!

As usual, working together is allowed, but for the sake of anyone who wants to work alone, please spoiler parts of your answers that contain information or questions about the dataset. To spoiler answers on a PC, type a '>' followed by a '!' at the start of a line to open a spoiler block - to spoiler answers on mobile, type a ':::spoiler' at the start of a line and then a ':::' at the end to spoiler the line.

Now if you'll excuse me, I need to go play Slay the Spire 2 for the next 48 hours.

  1. ^

    Really?  Does Slay the Spire even HAVE lore?  If it does, I don't know it.


