2026-05-03 13:17:35
Most blockchains today treat privacy as just an extra, an additional feature. Everything is visible and public. This architecture works for systems where public transparency is the end goal. But what happens when you begin to handle financial data, medical records, or other sensitive information? It collapses. In these contexts, full transparency is not a feature; it's a problem.
This is an intermediate tutorial. Before reading, you should be comfortable with what witness, circuit, ledger, and export declarations do. If not, start with the Compact language reference.
Midnight implements a solution known as "Selective Disclosure": the ability of users to choose what information to reveal, and to whom, without revealing the full data.
For example, instead of displaying "Rahman's balance is 200,000 USDT", a Midnight DApp can publish a zero-knowledge proof that says "Rahman's balance exceeds the required threshold" and nothing more. This solves two major problems at once: the sensitive data stays private, and the important, verifiable fact becomes public. This is the foundation of programmable privacy on Midnight.
This tutorial walks you through how selective disclosure works in Compact, Midnight's TypeScript-inspired smart contract language. By the end of this article, you will be able to:
a). Explain the mechanics of the disclose() operator.
b). Distinguish what is safe to disclose from what silently leaks privacy.
c). Apply domain-separated hashing to block cross-property linkability.
d). Run a privacy audit checklist on your own contracts before deployment.
Before we get into the coding technicalities, how does Compact even work?
Every Compact contract operates across two kinds of state: the public state and the private state.
The Public State: This is the data that is written to the onchain ledger which is open and visible to all network participants.
The Private State: This is the data that lives only on the user's computer and is never exposed to the network. It comes from 'witness' functions that your DApp provides locally, and from any value derived from those witness values.
The zero-knowledge proof generated by a circuit proves that the computation was performed correctly, without leaking the witness values involved. The major guarantee is that the network learns that the rule was followed, not what values were used.
The Compact compiler includes a "witness protection" analysis: it tracks which values contain witness data and rejects any program that would disclose sensitive data without an explicit declaration. Privacy is the default, and disclosure is an exception you must deliberately request.
Using disclose() correctly
The disclose() wrapper is how you tell the Compact compiler: "I know this value contains private data, and I am intentionally making it public." Without it, the compiler stops and displays a detailed error. Below is a simple example, recording a private balance on-chain:
pragma language_version >= 0.16 && <= 0.22;

import CompactStandardLibrary;

witness getBalance(): Bytes<32>;
export ledger balance: Bytes<32>;

export circuit recordBalance(): [] {
  balance = disclose(getBalance());
}
If you remove the disclose() wrapper, the compiler immediately rejects the program:
Exception: selective.compact line 9 char 9:
  potential witness-value disclosure must be declared but is not:
  witness value potentially disclosed:
    the return value of witness getBalance at line 5 char 1
  nature of the disclosure:
    ledger operation might disclose the witness value
  via this path through the program:
    the right-hand side of = at line 9 char 9
The error message is precise. It names the witness function that is the source of the data, explains the nature of the disclosure, and traces the full path through the program. This is because the compiler wants you to understand exactly what you are declaring when you add the disclose() wrapper.
It is important to note that placing a disclose() wrapper does not cause disclosure in itself. It simply tells the compiler to treat the wrapped expression as if it does not contain witness data. The actual disclosure happens at the storage or return point.
disclose() is the declaration, not the action.
Placing disclose() at the right scope
For simple values, place disclose() as close to the storage or return point as possible. This minimizes the scope of the disclosure declaration and makes it more difficult to accidentally reuse a disclosed value in a privacy-sensitive context elsewhere in the circuit.
For structured values such as tuples, vectors or structs, wrap only the field you intend to disclose:
pragma language_version >= 0.16 && <= 0.22;

import CompactStandardLibrary;

struct UserRecord {
  publicName: Bytes<32>;
  privateScore: Uint<64>;
}

witness getUserRecord(): UserRecord;
export ledger displayName: Bytes<32>;

export circuit publishName(): [] {
  const record = getUserRecord();
  // Only disclose the name field, not the entire struct
  displayName = disclose(record.publicName);
}
This pattern is important because wrapping the entire struct in a single disclose() would declare the privateScore field as publicly disclosed too, even though you never store it on-chain. The compiler would accept it, but you would be declaring an intention that does not match your actual privacy requirements.
The compiler also catches indirect disclosure through boolean comparisons. This is one of the most common unintentional privacy leaks in zero-knowledge contracts:
import CompactStandardLibrary;
witness getBalance(): Uint<64>;

// This will NOT compile — the comparison result leaks the witness value
export circuit balanceExceeds(n: Uint<64>): Boolean {
  return getBalance() > n;
}
Exception: selective.compact line 6 char 1:
  potential witness-value disclosure must be declared but is not:
  witness value potentially disclosed:
    the return value of witness getBalance at line 2 char 1
  nature of the disclosure:
    the value returned from exported circuit balanceExceeds might disclose the result of a
    comparison involving the witness value
  via this path through the program:
    the comparison at line 6 char 8
Returning a boolean derived from a private value is still a disclosure, because the result narrows the range of possible values an attacker needs to test. If you intend to return this boolean, you must declare it:
import CompactStandardLibrary;
witness getBalance(): Uint<64>;

export circuit balanceExceeds(n: Uint<64>): Boolean {
  return disclose(getBalance()) > n;
}
Whether this is appropriate depends on your use case. If you are building a compliance gate (e.g., "balance exceeds the minimum required to participate"), returning the boolean is intentional and acceptable. If you are building a privacy-preserving wallet, returning it leaks information about the user's balance range with every call.
Not all disclosures are equal. Some values are safe to publish because they carry no meaningful information about the underlying private data. Others appear harmless but silently narrow the privacy guarantees your DApp advertises. So how do we differentiate them?
Derived commitments and hashes (with correct library functions).
The Compact standard library includes functions the compiler recognizes as cryptographically sound transformations. When you wrap witness data in transientCommit(), the compiler treats the output as not containing witness data, which means no explicit disclose() is required for a commitment:
import CompactStandardLibrary;

witness getSecretValue(): Field;
witness getNonce(): Field;

export ledger commitment: Field;

export circuit commitValue(): [] {
  const secret = getSecretValue();
  const nonce = getNonce();
  // transientCommit output is treated as non-witness by the compiler
  commitment = transientCommit(secret, nonce);
}
An important distinction between these functions: the compiler treats transientCommit() output as not containing witness data, so no disclose() is required on the result. transientHash(), however, is NOT considered sufficient to protect witness data: if the input came from a witness, the output still requires disclose() before it can be stored in the ledger or returned from an exported circuit.
The transient variants (transientCommit and transientHash) are circuit-efficient but their algorithm is not guaranteed to stay consistent across network upgrades, so they should not be used to derive state data that needs to be verified later. The persistent variants (persistentCommit and persistentHash) use SHA-256 and are guaranteed to remain consistent across upgrades, so use these for any values stored in ledger state.
Storing a count of transactions, or a running total where the per-transaction values remain private, is generally safe, provided the aggregate does not itself narrow the input range enough to be exploitable.
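To make that caveat concrete, here is a tiny TypeScript sketch (the values are illustrative): an observer who sees two consecutive snapshots of a public running total, separated by exactly one private transaction, recovers that transaction's amount exactly.

```typescript
// A public running total is safe in aggregate, but consecutive snapshots
// separated by a single private transaction leak that transaction's amount.
const totalBefore = 1_000;      // public ledger total before the private tx
const totalAfter = 1_000 + 250; // public ledger total after one private tx

// The observer never sees the transaction itself, only the two totals.
const leakedAmount = totalAfter - totalBefore;
console.log(leakedAmount); // 250 — the "private" per-transaction value
```

This is why the safety of an aggregate depends on how many private inputs separate each public observation of it.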
This pattern is used throughout the official examples. You derive a public identifier from a private key, publish the public identifier, and keep the private key local. Because the hash function is one-way and hard to invert, the public key reveals nothing about the private key. The section on domain-separated hashing covers this pattern in depth.
Plaintext values (raw disclose())
Storing disclose(getBalance()) directly on-chain publishes the balance in plaintext. This is appropriate when the value is intended to be public (e.g., a display name, a transaction amount for a public ledger), but it is a complete disclosure of the underlying data.
Adding, subtracting, or scaling a private value before disclosure does not hide it cryptographically. The Compact compiler correctly rejects this without a disclose() wrapper, but the real risk is developers who add disclose() and assume the arithmetic hides the value. It does not. Here's what it does:
// Dangerous — disclosing an offset balance still reveals the balance
balance = disclose(getBalance() + 73);
If an attacker knows the offset, which in this case is 73, they recover the original value immediately.
Exposing true/false results of comparisons on confidential values enables progressive disclosure of sensitive information. An attacker who can call the circuit many times with different values can binary-search their way to the exact private value through repeated queries. If you must expose a threshold check, consider rate-limiting at the DApp layer and documenting the tradeoff clearly.
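To show how fast this attack is, here is a TypeScript sketch. The oracle is hypothetical, standing in for repeated calls to an exported circuit like balanceExceeds(n); binary search recovers an exact hidden balance in about 20 queries over a million-value range.

```typescript
// Hypothetical oracle: all it ever reveals is "is the hidden balance > n?",
// mimicking a threshold circuit an attacker can call repeatedly.
function makeOracle(secretBalance: number) {
  return (n: number): boolean => secretBalance > n;
}

// Binary search over [lo, hi) recovers the exact value in O(log range) queries.
function recoverBalance(
  oracle: (n: number) => boolean,
  lo: number,
  hi: number
): number {
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (oracle(mid)) lo = mid + 1; // balance > mid, search upper half
    else hi = mid;                 // balance <= mid, search lower half
  }
  return lo;
}

const oracle = makeOracle(200_000);
console.log(recoverBalance(oracle, 0, 1_000_000)); // 200000
```

Each "harmless" boolean answer halves the remaining search space, which is why unthrottled threshold checks amount to a slow-motion full disclosure.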
Returning a state constant or enum value such as 'Pending' or 'Rejected' that was computed from sensitive data can also leak information, depending on how many states exist and how they map onto the input space.
One of the more delicate privacy risks in zero-knowledge contracts is cross-property linkability: two different disclosures from the same private identity can be correlated by an observer, even if neither disclosure reveals the identity itself. For example, Rahman reveals two things in separate transactions: a proof that his KYC tier is "premium" and a proof that his account is over 10 months old. Each proof reveals nothing individually, but if both proofs are signed with the same public key or derived from the same hash of Rahman's identity, an observer can link the two transactions to Rahman, reconstructing a profile without ever learning the underlying identity.
How do we resolve this? Instead of deriving a single public key from a user's secret key, you derive a different public key for each purpose (or domain) by including a domain tag in the hash input. Two hashes with different domain tags, even for the same secret, produce completely unrelated outputs.
The previous examples use this pattern directly. The publicKey circuit in the lock example incorporates a domain tag as part of the hash:
pragma language_version >= 0.16 && <= 0.22;
import CompactStandardLibrary;

circuit publicKey(round: Field, sk: Bytes<32>): Bytes<32> {
  return persistentHash<Vector<3, Bytes<32>>>(
    [pad(32, "midnight:examples:lock:pk"), round as Bytes<32>, sk]
  );
}
The string midnight:examples:lock:pk is the domain tag. It is namespaced to prevent collision with other contracts that might use the same hash function on the same secret key. This pattern can be extended across different properties of the same identity:
pragma language_version >= 0.16 && <= 0.22;
import CompactStandardLibrary;

// Produce an unlinkable key for KYC tier disclosure
circuit kycTierKey(round: Field, sk: Bytes<32>): Bytes<32> {
  return persistentHash<Vector<3, Bytes<32>>>(
    [pad(32, "myapp:identity:kyc-tier"), round as Bytes<32>, sk]
  );
}

// Produce a separate, unlinkable key for account-age disclosure
circuit accountAgeKey(round: Field, sk: Bytes<32>): Bytes<32> {
  return persistentHash<Vector<3, Bytes<32>>>(
    [pad(32, "myapp:identity:account-age"), round as Bytes<32>, sk]
  );
}

witness secretKey(): Bytes<32>;

export ledger kycAuthority: Bytes<32>;
export ledger ageAuthority: Bytes<32>;
export ledger round: Counter;

export circuit registerKycTier(): [] {
  const sk = secretKey();
  kycAuthority = disclose(kycTierKey(round.read() as Field, sk));
}

export circuit registerAccountAge(): [] {
  const sk = secretKey();
  ageAuthority = disclose(accountAgeKey(round.read() as Field, sk));
}
By doing this, an observer who sees both kycAuthority and ageAuthority on the ledger cannot determine whether they belong to the same user. The domain tags ensure the outputs are cryptographically unrelated, even though both were derived from the same secret key.
The round counter in the examples above serves a second purpose. Without it, a user who calls registerKycTier() twice would produce the same kycAuthority value both times, allowing an observer to link both calls to the same identity. By introducing round into the hash, each call produces a different output even with the same secret key. The linkability between rounds is broken when round is incremented between calls.
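Both properties, domain separation and round freshness, can be demonstrated with ordinary SHA-256 in TypeScript. The pad32 helper and the byte layout below are illustrative assumptions, not the Compact standard library's actual encoding; the point is only that changing the tag or the round changes the output completely.

```typescript
import { createHash } from "node:crypto";

// Illustrative stand-in for pad(32, ...): a UTF-8 string in a zero-padded
// 32-byte buffer. (Assumed layout, not Compact's actual encoding.)
function pad32(s: string): Buffer {
  const b = Buffer.alloc(32);
  Buffer.from(s, "utf8").copy(b);
  return b;
}

// hash(domainTag || round || sk), mirroring the persistentHash pattern above.
function deriveKey(domainTag: string, round: number, sk: Buffer): string {
  const roundBuf = Buffer.alloc(32);
  roundBuf.writeBigUInt64BE(BigInt(round), 24); // round in the low 8 bytes
  return createHash("sha256")
    .update(Buffer.concat([pad32(domainTag), roundBuf, sk]))
    .digest("hex");
}

const sk = Buffer.alloc(32, 7); // the SAME secret key for both properties
const kycKey = deriveKey("myapp:identity:kyc-tier", 1, sk);
const ageKey = deriveKey("myapp:identity:account-age", 1, sk);

console.log(kycKey === ageKey); // false (different domain tags: unlinkable)
console.log(deriveKey("myapp:identity:kyc-tier", 2, sk) === kycKey); // false (new round: fresh key)
```

Deterministic inputs still give deterministic outputs, so calling deriveKey twice with the same tag, round, and secret returns the same key; that is exactly why the round must be incremented between calls.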
Something to note before moving on: never reuse a nonce in a commitment scheme. If you use persistentCommit(value, nonce) and the same nonce appears on-chain in two transactions, an observer can link both commitments and, if they can observe one opening, potentially infer the other.
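A minimal sketch of why nonce reuse links commitments, using SHA-256 as an illustrative stand-in for persistentCommit (the encoding is assumed, not Compact's actual scheme):

```typescript
import { createHash } from "node:crypto";

// Toy commitment: hash(nonce || value). Stand-in for persistentCommit.
const commit = (value: string, nonce: string): string =>
  createHash("sha256").update(nonce + ":" + value).digest("hex");

// Reusing the same nonce for the same value yields byte-identical
// commitments on-chain, so an observer links them immediately.
console.log(commit("balance=200000", "nonce-A") === commit("balance=200000", "nonce-A")); // true (linkable)

// A fresh nonce makes the two commitments unrelated.
console.log(commit("balance=200000", "nonce-A") === commit("balance=200000", "nonce-B")); // false
```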
Before you deploy a contract to a testnet or mainnet, run through this checklist. It helps identify the most common privacy mistakes.
Disclosure scope:
(i). Every disclose() call wraps the minimum expression necessary, not an entire struct when only one field needs to be public.
(ii). No disclose() is placed on a witness call whose output will be used in subsequent private computations. If a witness value travels both a public path and a private path, these should be two separate values.
(iii). All exported circuit return values are reviewed. If a return value is derived from witness data, confirm the disclosure is on purpose and document what information it reveals.
Return values:
(i). No exported circuit returns a boolean, integer, or enum that is directly derived from an undisclosed private value without documenting the intended information disclosure.
(ii). For threshold checks (e.g. balance > n), consider whether repeated calls allow binary search over the private value. If so, document the tradeoff or implement rate limiting at the DApp layer.
Hashing and commitments:
(i). transientHash/transientCommit outputs should not be used to derive ledger state that needs to persist across network upgrades. Use persistentHash/persistentCommit for those values.
(ii). Nonces used in persistentCommit are never reused. Reusing a nonce with the same value allows on-chain commitment linking.
(iii). Hash inputs do not contain values that an attacker could enumerate to reverse the hash (e.g., a small integer hashed alone without a nonce).
Linkability:
(i). Each distinct property of a private identity must use a separate, namespaced domain tag in its key derivation hash.
(ii). Domain tag strings follow a consistent, collision-resistant naming convention (e.g., "firstproject:context:property").
(iii). A round counter or similar freshness mechanism is included in any public key derivation to prevent same-key linkability across multiple interactions.
Compilation hygiene:
(i). The contract compiles cleanly without suppressed warnings. Review any disclose() wrapper you added in response to a compiler error rather than from intentional design.
(ii). After any refactor, recompile the contract from scratch and re-review all disclose() sites; structural changes can introduce new disclosure paths.
Documentation:
(i). Each disclose() site is accompanied by an inline comment explaining what is being disclosed and why it is intentional.
(ii). The contract's README note or specification states clearly which fields in the ledger are public, which are derived from private data, and what information each derived field reveals.
Selective disclosure in Midnight is not just a feature; it is an architectural guarantee built into the language. The compiler's witness-protection analysis means you cannot accidentally publish private data: every disclosure is forced to be a deliberate decision.
The disclose() wrapper is that decision point, and treating it as such, rather than as a formality to satisfy the compiler, is what separates a contract with genuine privacy properties from one that only appears private.
In this tutorial we have covered the full practical arc of selective disclosure in Compact. We learned how the public/private state split works and why privacy is the default. We also learned how to use disclose() correctly, how to scope it precisely to avoid over-disclosure on structured values, and how the compiler catches indirect leaks through arithmetic and conditionals.
Domain-separated hashing was applied to break cross-property linkability, using the same pattern Midnight's own standard examples rely on. And now, there is a concrete privacy audit checklist to run against contracts before they reach the network.
If you have any questions on this article or you're confused somewhere, please engage and let me know. Share this with anyone you think it will help. Privacy is the default, disclosure is the exception.
2026-05-03 13:10:27
As Large Language Model (LLM) agents increasingly integrate numerous external systems, they suffer from Tool Space Interference (TSI), a phenomenon causing context bloat, attention dilution, and degraded reasoning accuracy. In this paper, we introduce the Agent-as-a-Tool paradigm—an evolutionary, practical implementation of the recently proposed Self-Optimizing Tool Caching Network (SOTCN) and Federated Context-Aware Routing Architecture (Federated CARA). By leveraging Retrieval-Augmented Generation (RAG) to dynamically discover and assemble stateful, autonomous sub-agents on the fly, this architecture completely eliminates TSI, enforces Zero-Trust execution boundaries, and achieves infinitely scalable AI orchestration.
The rapid evolution of LLMs and their standardized integration with external systems—notably through the Model Context Protocol (MCP)—has transformed LLM-based agents from simple conversational interfaces into advanced, autonomous AI systems capable of executing complex workflows. However, as the number of tools and external skills accessible to an agent increases, a critical performance bottleneck has emerged.
This phenomenon is defined as Tool Space Interference (TSI). As officially highlighted at Google Cloud Next '26, the excessive use of MCP servers and tools leads to "context bloat," where massive amounts of data—such as verbose JSON schemas and system metadata—are loaded into the active context window. This not only exhausts token limits but also triggers "attention dilution." Overlapping tool functionalities and conflicting semantics become noise, severely impairing the model's reasoning capabilities and routing accuracy.
Current technical guidelines suggest a soft limit of approximately 20 tools per agent to maintain high selection accuracy. Exceeding this threshold causes context saturation, leading to a surge in fatal errors such as tool hallucination, the generation of invalid parameters, and the breakdown of execution flows. This creates an orchestration paradox: the more we attempt to scale the system's capabilities, the less reliable the agent paradoxically becomes.
To overcome the TSI problem, I have previously proposed and implemented several architectural approaches. These strategies strongly anticipate the concepts of "Agent Skills" (an agent-first solution to context bloat) and "A2A Orchestration" (a native capability for agents to mutually delegate and coordinate tasks) announced at Next '26:
Positioning of this Paper:
This paper serves as the ultimate convergence and practical realization of both SOTCN and Federated CARA. By fusing these theoretical models utilizing the Google ADK and TypeScript, we dramatically elevate the concept of dynamic tool injection into a new paradigm: Agent-as-a-Tool. This study demonstrates how these architectures can be seamlessly implemented as production-ready, highly scalable code for the Agentic Enterprise.
You can see the script in this article at https://github.com/tanaikech/agent-as-a-tool.
Building upon prior efforts in aggregation, decentralization, and dynamic routing, this study proposes a paradigm that fundamentally resolves the TSI problem while achieving infinite operational scalability and preserving high reasoning accuracy.
In the original SOTCN proposal, I theorized storing tool metadata in "cold storage" and dynamically injecting only the most relevant functions into the LLM's active context. The Agent-as-a-Tool paradigm elevates this concept to its logical extreme. Instead of merely injecting static "tools" (such as bare API endpoints or JSON schemas), the system dynamically retrieves and delegates tasks to stateful, fully autonomous sub-agents.
By utilizing a Retrieval-Augmented Generation (RAG) database known as an "Agent Bank," the orchestrator extracts only the minimum agentic resources required for a user's task to assemble a temporary task force on the fly. This provides two decisive advantages:
Implemented via the Google ADK and Gemini API, the proposed system processes tasks autonomously and dynamically through a structured workflow that codifies the Federated CARA and SOTCN principles into executable logic:
search_expert_agents tool to query the RAG system, analyzing the prompt's intent and extracting only the Top-K specialized agents essential for the task, keeping the active context pristine.execute_with_dynamic_subagents tool, it analyzes task dependencies and dynamically formulates an execution strategy—routing tasks using Single, Parallel (for independent sub-tasks), or Sequential (for dependent sub-tasks) execution patterns. To prevent state contamination, the system utilizes an InMemoryRunner to spawn a temporary entity (Temporal Coordinator), attaches the retrieved agents as sub-agents, executes the task, and immediately garbage-collects the entities from memory upon completion.File Operation Rules. For critical operations (e.g., creating, modifying, or deleting files), the orchestrator automatically pauses execution to enforce a strict "Human-in-the-Loop" (HITL) protocol, requiring explicit user approval before acting. By operating as an A2A Server, it establishes an enterprise-grade, Zero-Trust environment.Through this architecture, even if the total number of tools accessible to the enterprise scales to thousands, the main orchestrator's cognitive load remains virtually zero.
To illustrate how the theories of SOTCN and Federated CARA function in a live environment, the practical execution flow of the system is detailed below. This flow highlights the dynamic skill extraction and the ephemeral assembly of sub-agents.
1. The expert agents are registered into the Agent Bank via the agent-manager, establishing its semantic capability matrix.
2. The agent-manager is deployed as a standalone Agent-to-Agent (A2A) server.
3. When a user prompt matches the agent-manager's registered capabilities, the orchestrator is natively triggered by the CLI.
4. The agent-manager executes the search_expert_agents tool to query the Agent Bank's File Search Store. It retrieves one or multiple expert agents required for the task. If the requested task cannot be processed by any registered agent, the RAG system notifies the orchestrator, which gracefully forwards a transparent limitation message back to the user.
5. The orchestrator then calls the execute_with_dynamic_subagents tool. This process spawns a fresh Temporal Coordinator using an in-memory runner, attaches the retrieved agents to form an ephemeral task force, and executes the complex task based on the optimal strategy (Single, Parallel, or Sequential).
6. The results are returned to the agent-manager. The manager synthesizes the final response, securely flushes the temporal team from memory to prevent state contamination, and delivers the comprehensive output back to the Gemini CLI.
To demonstrate this architecture practically, I have prepared a complete Node.js/TypeScript implementation. You can view the full repository of sample scripts at https://github.com/tanaikech/agent-as-a-tool.
To follow along with this guide, ensure your environment meets the following requirements:
agent-as-a-tool
To retrieve and initialize the scripts, execute the following commands in your terminal:
git clone https://github.com/tanaikech/agent-as-a-tool
cd agent-as-a-tool
npm install
The directory structure is defined below. src/agentbank.ts acts as the registry for the sample agents. You can modify this file to add or remove agents tailored to your enterprise workflows.
agent-as-a-tool/
├── package.json
├── tsconfig.json
├── src/
│ ├── a2aserver.ts
│ ├── agent.ts
│ ├── agentbank.ts
│ ├── autonomous-google-workspace-agent.ts
│ └── store_manager.ts
└── test/
└── test_search.ts
In this sample, three sophisticated agents are included. The third agent is derived from Empowering Autonomous AI Agents through Dynamic Tool Creation. When utilizing this agent, please refer to the article for specific sandbox setup instructions.
1. Currency Exchange Agent (exchange_agent)
- Capabilities: A financial specialist providing accurate global currency exchange rates.
- Features: Dynamically handles current rates and relative date requests (e.g., "last Friday") by resolving the exact temporal context before API retrieval.
2. Weather Agent (weather_agent)
- Capabilities: A professional meteorologist providing precise weather forecasts.
- Features: Computes target dates, times, and geographic coordinates for relative requests (e.g., "in 3 hours") to deliver perfectly timed data.
3. Autonomous Google Workspace Agent (autonomous-google-workspace-agent)
- Capabilities: A Senior Orchestrator managing the full lifecycle of Google Apps Script (GAS) development.
- Internal Sub-Agents:
- Environment Checker: Validates local tool installations (e.g., @google/clasp).
- Script Writer: Generates GAS-compatible code using live official documentation via MCP.
- Script Executor: Simulates and tests scripts in a locally sandboxed environment.
- Script Uploader: Manages Drive project creation and secure uploading via clasp.
- Summary Agent: Consolidates technical results into structured execution reports.
To enable these agents, you must configure your Gemini API key as an environment variable:
export GEMINI_API_KEY=<YOUR_API_KEY_HERE>
To securely index the initial agents into the File Search Store (Agent Bank), execute the following command:
npm run regAgents
If you add new capabilities to agentbank.ts and re-run this command, the script dynamically compares existing metadata, ensuring only newly added agents are registered and entirely avoiding duplicate ingestion.
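The duplicate-avoidance check described above might look like the following sketch. The AgentMeta shape and the name-based matching are assumptions for illustration, not the repository's actual logic:

```typescript
// Illustrative metadata diff: only agents not already present in the
// File Search Store get registered, avoiding duplicate ingestion.
type AgentMeta = { name: string; description: string };

function newAgentsOnly(
  existing: AgentMeta[],
  candidates: AgentMeta[]
): AgentMeta[] {
  const known = new Set(existing.map(a => a.name));
  return candidates.filter(a => !known.has(a.name));
}

const registered = [{ name: "weather_agent", description: "forecasts" }];
const bank = [
  { name: "weather_agent", description: "forecasts" },
  { name: "exchange_agent", description: "currency rates" },
];
console.log(newAgentsOnly(registered, bank).map(a => a.name)); // only the new agent
```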
You can inspect the registered agent list by running npm run regAgentList, or selectively clear the stores using npm run deleteStores.
Once the store is created, map it to your environment session:
export AGENT_BANK="{your store name}"
This framework can function natively as a standalone web server or as a delegated sub-agent linked to the Gemini CLI. Let's first test it as a standalone server.
Launch the Web server:
npm run web
$ npm run web
> [email protected] web
> npx adk web src/agent.ts
+-----------------------------------------------------------------------------+
| ADK API Server started |
| |
| For local testing, access at http://localhost:8000. |
+-----------------------------------------------------------------------------+
You can now interact with the web interface by navigating to http://localhost:8000 in your browser.
A demonstration video for scenarios 1 through 4 in this section is available here:
Once you have executed npm run web and launched your browser at http://localhost:8000, evaluate the following test cases.
Prompt:
What is the latest exchange rate from USD to JPY?
Result: The orchestrator correctly processes the semantic intent, dynamically fetching and executing a single exchange_agent to retrieve the latest financial rates.
Prompt:
Please tell me the weather in Tokyo at noon tomorrow.
Result: The orchestrator interprets the temporal requirement, processes the task using the weather_agent, which correctly calculates "tomorrow at noon" before querying the weather API.
Prompt:
I am traveling to Paris. Please check the weather in Paris on 2026-05-01 12:00 (Latitude 48.85, Longitude 2.35, Timezone Europe/Paris). Also, I need to plan my budget, so please provide the latest exchange rate from JPY to EUR simultaneously.
Result: The orchestrator dissects this complex prompt, semantic-searches the Agent Bank, and identifies two required experts. Utilizing the CARA-inspired routing engine, it formulates a Parallel execution strategy, assembling and coordinating both the exchange_agent and weather_agent concurrently, seamlessly synthesizing their outputs.
Prompt:
Please tell me the weather in Tokyo tomorrow, and also book a flight from New York to Tokyo for next Monday.
Result: The orchestrator successfully procures the weather forecast via the weather_agent. However, recognizing that no flight-booking agent exists in the Agent Bank, the system strictly enforces its operational boundaries, returning the weather data while transparently explaining its limitation regarding the flight booking.
To test this architecture directly within your terminal as an enterprise sub-agent for the Gemini CLI, you must first launch the Agent-to-Agent (A2A) server endpoint.
npm run a2a
When this command executes, the routing endpoint becomes active:
$ npm run a2a
> [email protected] a2a
> npx tsx src/a2aserver.ts
Server started on http://localhost:8000
Try: http://localhost:8000/.well-known/agent-card.json
To configure this A2A server as an accessible sub-agent for the Gemini CLI, create or update .gemini/agents/agent-as-a-tool.md with the following configuration:
---
kind: remote
name: agent-as-a-tool
agent_card_url: http://localhost:8000/.well-known/agent-card.json
---
You can inspect the generated agent card specifications by opening the provided URL (http://localhost:8000/.well-known/agent-card.json) in your browser.
A demonstration video for the A2A server using scenarios 1 through 4 is available here:
Once the server is configured and running, launch the Gemini CLI. We use the prefix @agent-as-a-tool to route the intent directly to our newly created orchestrator.
Prompt:
@agent-as-a-tool What is the latest exchange rate from USD to JPY?
Result: The orchestrator receives the delegated request from the Gemini CLI and successfully answers using the exchange_agent.
Prompt:
@agent-as-a-tool Please tell me the weather in Tokyo at noon tomorrow.
Result: Processed seamlessly via the dynamically loaded weather_agent.
Prompt:
@agent-as-a-tool I am traveling to Paris. Please check the weather in Paris on 2026-05-01 12:00 (Latitude 48.85, Longitude 2.35, Timezone Europe/Paris). Also, I need to plan my budget, so please provide the latest exchange rate from JPY to EUR simultaneously.
Result: Mirroring the web testing, the agent-as-a-tool orchestrator dynamically delegates sub-tasks to both the exchange_agent and weather_agent, returning a synthesized response directly to your CLI session.
Prompt:
@agent-as-a-tool Please tell me the weather in Tokyo tomorrow, and also book a flight from New York to Tokyo for next Monday.
Result: The A2A orchestrator accurately audits its capability matrix, retrieving the weather while explicitly refusing the flight booking due to the lack of an applicable sub-agent.
In this scenario, we utilize the advanced agent I previously published in Empowering Autonomous AI Agents through Dynamic Tool Creation.
Prompt:
@agent-as-a-tool Create a new Google Spreadsheet by putting a formula `=GOOGLEFINANCE("CURRENCY:USDJPY")` in cell "A1" of the first sheet. Then, get and show the value of cell "A1". (Note: `gas-fakes` has no `getActiveSheet()` method. In this case, use `getSheets()[0]`.)
When this prompt is processed, the system retrieves the complex autonomous-google-workspace-agent from the Agent Bank. Because creating a new Google Spreadsheet demands specific Google Drive authorization, the agent correctly halts execution and requests explicit authorization to execute file creation workflows. Once approved, the orchestrator leverages local sandboxing (gas-fakes) to simulate, validate, and execute the generated Google Apps Script, perfectly achieving the goal. This also indicates that by using Agent-as-a-Tool, you can use existing agents as they are.
As autonomous agents assume greater operational responsibility in enterprise environments, security must be treated as a foundational architectural component rather than an afterthought. The Agent-as-a-Tool paradigm natively incorporates robust security measures, heavily inspired by the zero-trust principles of the Federated Context-Aware Routing Architecture (Federated CARA). This framework secures the execution environment through three primary mechanisms:
1. Attack Surface Minimization via Dynamic Injection
Traditional agents that load massive toolsets into their active context window inadvertently expand their attack surface. Malicious prompt injections can easily trick an over-privileged LLM into triggering unintended functions. By utilizing the RAG-based storage inherited from the Self-Optimizing Tool Caching Network (SOTCN), our orchestrator injects only the strictly necessary sub-agents for a specific task. If a tool or capability is not retrieved by the semantic search, it physically cannot be executed within that session, intrinsically shielding the system from arbitrary functional exploits.
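A minimal sketch of that retrieval gate, with bag-of-words vectors standing in for real embeddings and made-up agent descriptions (this is illustrative, not the actual SOTCN implementation):

```python
import math

# Stand-in "embeddings": bag-of-words vectors over a tiny fixed vocabulary.
def embed(text, vocab):
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical Agent Bank entries, keyed by agent name.
AGENT_BANK = {
    "weather_agent": "get weather forecast for a location",
    "exchange_agent": "get currency exchange rate",
}
VOCAB = ["weather", "forecast", "currency", "exchange", "rate", "flight", "book"]

def retrieve_agents(task, threshold=0.3):
    """Only agents passing the semantic gate are injected into the session;
    anything below threshold simply does not exist in that context."""
    t = embed(task, VOCAB)
    return [name for name, desc in AGENT_BANK.items()
            if cosine(t, embed(desc, VOCAB)) >= threshold]

loaded = retrieve_agents("what is the weather forecast in Tokyo")
```

An agent that is never retrieved is never in the context window, so a prompt injection cannot trigger it in that session.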
2. Ephemeral Execution and State Isolation
To prevent data leakage across different tasks or users (state contamination), the architecture deeply integrates an InMemoryRunner for temporal generation. Sub-agents are never persistent entities. The orchestrator dynamically spawns a Temporal Coordinator and its assigned sub-agents strictly for the duration of the given task. Once the execution is complete and the synthesized result is returned, the temporal team is immediately garbage-collected from memory. This ephemeral execution model ensures that sensitive data processed in one session cannot cross-pollinate or influence subsequent AI interactions, maintaining strict state isolation.
3. Human-in-the-Loop (HITL) and Boundary Control
The Agent Manager also serves as a strict security triage engine. For non-destructive, read-only operations (such as fetching weather forecasts or exchange rates), it delegates and operates fully autonomously. However, for critical operations—specifically creating, modifying, or deleting files—the system enforces a mandatory Human-in-the-Loop (HITL) protocol. The orchestrator is explicitly instructed via its core system prompt (File Operation Rules) to halt execution and request direct user confirmation before proceeding with any action that crosses predefined security boundaries. This dual-layered approach—autonomous execution for read-only tasks and mandated HITL for write operations—establishes a scalable, enterprise-grade governance model without sacrificing system agility.
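The triage logic can be modeled as a small gate: read-only operations run autonomously, write operations block on explicit confirmation, and anything outside the capability matrix is refused. The operation names and callbacks here are illustrative, not the orchestrator's real interface:

```python
# Illustrative HITL gate. Read-only ops run autonomously; write ops require
# explicit human confirmation; unknown ops are refused (boundary control).
READ_ONLY = {"get_weather", "get_exchange_rate"}
WRITE_OPS = {"create_file", "modify_file", "delete_file"}

def dispatch(op, execute, confirm):
    if op in READ_ONLY:
        return execute(op)                     # autonomous path
    if op in WRITE_OPS:
        if confirm(op):                        # mandatory HITL checkpoint
            return execute(op)
        return f"{op}: cancelled by user"
    return f"{op}: refused, no capable sub-agent"

result = dispatch("get_weather", execute=lambda op: f"{op}: ok",
                  confirm=lambda op: False)
blocked = dispatch("create_file", execute=lambda op: f"{op}: ok",
                   confirm=lambda op: False)
refused = dispatch("book_flight", execute=lambda op: f"{op}: ok",
                   confirm=lambda op: True)
```

Note that the weather call never consults `confirm`, while the file creation is cancelled even though `execute` was available; this mirrors the dual-layered model described above.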
In this article, as an experimental approach to the dynamic use of AI agents, we successfully demonstrated the dynamic selection, assembly, and execution of pre-built, safety-verified agents from an Agent Bank based on specific task requirements. However, this foundational approach naturally paves the way for an even more advanced future capability.
In addition to utilizing an Agent Bank, we can envision a scenario where optimal agents are dynamically constructed from scratch by pulling highly granular resources from a "Tool Bank," an "MCP Server Bank," or a "Skill Bank" on the fly, tailoring the newly generated agent entirely to the context of the requested task. While this promises unprecedented orchestration flexibility, it is crucial to recognize that the dynamic combination of these unverified components may inadvertently introduce new security vulnerabilities or unpredictable agent behaviors. Therefore, establishing rigorous safety validation mechanisms and robust governance protocols for these dynamic combinations will undoubtedly be a critical challenge to address when realizing this ultimate vision of autonomous AI orchestration.
2026-05-03 13:05:18
Every AI app I've shipped recently rewrote the same plumbing. The OAuth dance for Slack. Encrypted storage for an API key. Refresh-token logic that finally fails on the 3rd call after an hour. Wiring up an MCP client to a server behind a bearer token someone pasted into a Notion page.
I'd write it, copy-paste it into the next app, watch it rot. Each new agent built by a different teammate, slightly differently, with slightly different bugs. We were a small team and the integration code became most of the code.
## The pattern under all of it
Strip away the providers and the AI-specific bits, and every app needed the same four things from the platform:
- Secrets resolved at runtime: not baked into a .env file in a Docker image, not buried in a CI secret, but somewhere the app can ask for at runtime.
- Pre-built integrations for common providers, with the auth handled for you.
- OAuth for custom providers: the platform holds the client_id/secret; their app shouldn't.
- Config for custom MCP servers: the URL plus the bearer token someone would otherwise paste into a Notion page.
That's the spine of the SDK we ended up shipping. Four primitives, every app uses some of them, none of them require integration code in the app.
## Register once at the org level
The flip is registration. The org owner registers their things one time on the dashboard:
- Their OAuth app's client_id + client_secret go into the "Custom OAuth providers" card. Encrypted with the org's KMS key. The app never sees them.
- Internal MCP servers are registered the same way: the URL and bearer token live on the platform, not in any app.
Now every app you deploy in that org gets typed access through four SDK calls.
## The four calls
import { LeashIntegrations } from '@leash/sdk/integrations'
const client = new LeashIntegrations({ apiKey: process.env.LEASH_API_KEY })
// 1. Env var (resolves through your configured secret source)
const dbUrl = await client.getEnv('DATABASE_URL')
// 2. Pre-built integration
const messages = await client.gmail.listMessages({ maxResults: 5 })
// 3. Custom OAuth — fresh access token for any provider you've registered
const slackToken = await client.getAccessToken('slack')
// 4. Custom MCP — { url, headers } including bearer Authorization
const mcp = await client.getCustomMcpConfig('acme-tools')
Same shape across TypeScript, Python, Go, Ruby, Rust, and Java. No client_secret in the app code. No refresh-token handler. No MCP boilerplate.
## Your .env collapses to one line
The thing we noticed only after living with it: once you're using this, the only secret your app's .env actually needs is the platform API key.
# .env (yes, this is the whole thing)
LEASH_API_KEY=lsk_live_...
That's it. No more .env.example drift. No more "did we set DATABASE_URL in staging?" debugging at 11pm. Rotation happens at the source — no rebuild, no redeploy.
## What it deliberately doesn't do
A few decisions that came up that I'll defend:
Doesn't proxy MCP traffic. We hand the app {url, headers} (with bearer Authorization already attached) and the app calls the MCP directly. Leash isn't in the request path. Tool calls are on the LLM's critical path; an extra hop hurts. We also didn't want to reimplement every MCP transport (streamable HTTP, SSE, stdio) with our own bugs.
Doesn't force you to use the platform for secrets. If you'd rather hold them in Doppler or 1Password, point the platform at your existing source. getEnv resolves through whichever the org configured.
Doesn't pretend to be multi-cloud. Single-region GCP today. If you're betting on us, you're betting on a small surface area — not a multi-cloud promise.
## The why behind the shape
Customer apps can't hold credentials safely. Their AI agent runs on someone's laptop, in CI, on a Cloud Run revision someone's about to redeploy. Putting client_secret in the app means rotating it everywhere whenever it leaks. So we put the credential in one place and gave the app a thin retrieval call instead.
Same logic for MCP. The bearer token for a customer's internal tool server isn't something we want their AI app to hold long-term. The app gets a config dictionary right before it calls the MCP; that is the only moment the credential lives anywhere near user code.
The four-primitive surface area is small on purpose. Anything else (token caching, retries, pagination on Gmail, etc.) lives in the SDK or in the customer's code, not in the platform contract. We'd rather grow the SDK than the API.
## Try it
curl -fsSL https://leash.build/install.sh | sh
leash login
leash deploy
Or just sign up at leash.build, register a Slack app or an internal MCP, and call the SDK from any project. Custom OAuth + custom MCP are gated to the Growth plan; built-in integrations work on every plan including free.
Curious what others have done for this. Especially the proxy-vs-config-handoff call for MCP — I made the bet, but it's the architecture choice I'd most welcome a counterargument on.
2026-05-03 12:51:33
As we have discussed before, the Internet relies on numerical addresses, called IP addresses, to route data from one device to another. IPv4 offers around 4.3 billion addresses, and we have seen that this is not enough. IPv6 is one answer; another solution to this problem is Network Address Translation (NAT).
NAT allows multiple devices on a private network to share a single public IP address. This not only helps conserve the limited pool of public IP addresses but also adds a layer of security to the internal network.
Private vs. Public IP Addresses
Public IP addresses are globally unique identifiers that are assigned by Internet Service Providers (ISPs). Devices with these IP addresses can be accessed from anywhere on the Internet, allowing them to communicate across the global network.
On the other hand, private IP addresses are designated for use within local networks such as homes, offices and schools. These are not routable on the global internet, so they cannot be forwarded by internet backbone routers. Defined by RFC 1918, common IPv4 private address ranges include 10.0.0.0 to 10.255.255.255, 172.16.0.0 to 172.31.255.255, and 192.168.0.0 to 192.168.255.255. This setup ensures that these private networks operate independently of the internet while facilitating internal communication and device connectivity.
Private IP addresses contribute to conserving public IP addresses. Using Network Address Translation (NAT), a local network can use private IP addresses internally while sharing a single public IP address, reducing the number of public IPs needed. This setup lets many devices reach the internet without each one consuming its own public address. Additionally, private IPs help secure the network by isolating internal devices from direct exposure to the internet, protecting them from potential external threats.
How does it work?
Network Address Translation (NAT) is a process carried out by a router or a similar device that modifies the source or destination IP address in the headers of IP packets as they pass through. This modification is used to translate the private IP addresses of devices within a local network to a single public IP address that is assigned to the router.
For example, say your home network has a few devices: a laptop, a smartphone, a tablet, and a smart thermostat. Each has its own private IP address, which the devices can use to talk to each other. But suppose the laptop wants to reach a DNS server on the internet; for that, its traffic needs a public IP address. As the packet passes through the router, the router rewrites the private source IP address into the public one. This public IP address is the same for all of the devices in the network. When the response arrives, the router consults its NAT table, which keeps track of IP-and-port mappings, and sees that 203.0.113.50:4444 corresponds to the laptop at 192.168.1.10:5555 (ports 4444 and 5555 are dynamically assigned). All of this is done by the NAT process.
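The laptop example can be sketched as a toy translation table. This is illustrative only; the addresses, port numbers, and allocation scheme are assumptions, not how any particular router implements PAT:

```python
# Toy sketch of a PAT (NAT overload) translation table.
PUBLIC_IP = "203.0.113.50"

class NatTable:
    def __init__(self):
        self._next_port = 4444          # hypothetical starting port
        self._out = {}                  # (private_ip, private_port) -> public_port
        self._in = {}                   # public_port -> (private_ip, private_port)

    def translate_outbound(self, private_ip, private_port):
        """Rewrite a LAN source to the shared public IP plus a unique port."""
        key = (private_ip, private_port)
        if key not in self._out:
            self._out[key] = self._next_port
            self._in[self._next_port] = key
            self._next_port += 1
        return PUBLIC_IP, self._out[key]

    def translate_inbound(self, public_port):
        """Map a returning packet back to the device that initiated the flow."""
        return self._in.get(public_port)

nat = NatTable()
src = nat.translate_outbound("192.168.1.10", 5555)   # the laptop's request
back = nat.translate_inbound(src[1])                 # the DNS server's reply
```

Because only mapped ports translate back inward, an unsolicited packet from the internet (a port with no table entry) has nowhere to go, which is where NAT's incidental security benefit comes from.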
Types of NAT
Static NAT - Involves a one-to-one mapping, where each private IP address corresponds directly to a public IP address.
Dynamic NAT - Assigns a public IP from a pool of available addresses to a private IP as needed, based on network demand.
Port Address Translation (PAT) - Also known as NAT Overload, this is the most common form of NAT. Multiple private IP addresses share a single public IP address, with connections differentiated by unique port numbers. It is what home and small-office routers use to let many devices share one public IP for internet access.
Benefits and Trade-Offs
Benefits
- Conserves public IPv4 addresses: an entire network shares one public IP.
- Adds a layer of security: internal devices are not directly reachable from the internet, since only connections initiated from inside have table entries.
- Internal addressing is independent of the ISP, so you can renumber or change providers without touching every device.

Trade-Offs
- Breaks end-to-end connectivity: hosting a server behind NAT requires explicit port forwarding.
- Complicates peer-to-peer applications and protocols that embed IP addresses in their payloads, which may need helpers such as ALGs or STUN.
- The router must keep state for every active connection, adding memory and processing overhead.
2026-05-03 12:42:30
I bought four NVIDIA CMP 100-210 cards off the secondhand market for about $130 each. They are ex-mining cards based on the
Volta GV100 die — same silicon as the V100 — with 16 GB of HBM2 each. On paper, four of them give me 64 GB of HBM2 for
the price of a single used 3090.
In practice, NVIDIA had crippled them in hardware.
The throttle
The CMP 100-210 has its tensor cores throttled 64×. HMMA latency is stretched from 8 cycles to 512. cuBLAS WMMA caps out
at about 5 TFLOP per card. PCIe is locked to Gen1 x1, no P2P, no NVLink. CUPTI is blocked, so you can't even use NVIDIA's
own profiler.
The throttle is enforced by an e-fuse + PMU bootrom double-lock on the die. This isn't a firmware switch — it's blown into
the silicon. There is no software unlock. (Yes, I tried.)
The result: anything that goes through cuBLAS tensor cores runs at 1/64 speed or fails outright. That's vLLM, llama.cpp's
default cuBLAS path, FlashAttention, bitsandbytes, PyTorch's default matmul. The standard LLM inference stack is unusable
on this hardware.
So I wrote my own.
The workaround
It turns out NVIDIA only throttled the tensor cores. Two other paths on the same chip run at full speed: the regular CUDA-core FP16/FP32 pipeline, and the DP4A int8 dot-product instruction.
Neither is as fast as a healthy V100's tensor cores, but both are far above the 5 TFLOP cuBLAS WMMA ceiling. Routing all
of inference through these two paths gets you back to roughly half of what an unthrottled V100 would do, which is still
vastly better than nothing.
Building on that, qengine is a from-scratch CUDA inference engine for Qwen3.5 / Qwen3.6 hybrid models. (Worth noting:
Qwen3.5 / 3.6 are a different architecture from Qwen3 — they are dense GDN (Gated DeltaNet) + Attention hybrids, not pure
transformers. The kernels look quite different.)
The engine has:
It's not a fork. Every kernel is written for sm_70 + CMP constraints.
Honest benchmarks
I'm comparing against llama.cpp build 8462 with -fa 1, the same Q8_0 GGUFs, on the same hardware. Bigger numbers are
better.
▎ Qwen3.5-9B, single GPU prefill (qengine vs llama.cpp, tokens/sec):
▎ - 297 — 594 vs 199 (2.99x)
▎ - 1.16K — 683 vs 316 (2.16x)
▎ - 4.62K — 584 vs 361 (1.62x)
▎ - 18K — 393 vs 324 (1.22x)
qengine leads at the first three lengths and reaches parity at 18K.
Generation: qengine wins by +48–51% on both sizes (9B: ~70 t/s vs 46.6; 27B: 26.3 vs 17.7).
The honest weak point: 9B dual-GPU at 18K still trails llama.cpp (~0.48×). Their layer pipeline overlaps activation
transfer with compute; mine does the transfers sequentially through pinned host memory, because no P2P. Single-GPU 9B is
faster than either dual-GPU run anyway, so it's mostly a theoretical gap, but it's there.
What was hard
A few things that took real time to get right:
Multi-GPU without P2P. With CMP cards there's no peer-to-peer, no NVLink. Hidden state has to bounce through pinned host
memory between GPUs. I keep a pinned-host buffer per cross-GPU edge and a worker thread per GPU. It works, it's just
sequential.
Numerical drift killing Korean output. Qwopus3.5-9B distill has weak Korean circuits to begin with — small fp16 reorder
noise shifts argmax decisions and the model starts producing garbled Korean. I learned this the hard way after a
chunked-prefill kernel optimisation that "passed" my English greedy-argmax tests broke Korean entirely. Now every kernel
that touches the attention reduction order gets a Korean argmax-stability check before it ships.
Split-K FA without breaking determinism. The 64-block FA grid was under-utilising the SMs at long context (only 64 blocks
across 3×68 SM = 204), so each block was running a 575-iteration K/V tile loop in isolation. I added a split-K variant
that maps each (kv_head, t_idx) to N independent blocks, each handling a contiguous tile range, and merged the partials
with the standard log-sum-exp identity:
m_global = max_s m_s
l_global = Σ_s exp(m_s − m_global) · l_s
o_global = Σ_s exp(m_s − m_global) · acc_o_s
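The merge can be sanity-checked numerically: computing the softmax-weighted sum in independent chunks, each tracking its running max m_s, weight sum l_s, and unnormalised accumulator acc_o_s, then combining with the identity above, must match the single-pass result. A pure-Python check on toy scores and values (not the actual kernel data):

```python
import math

def softmax_attn(scores, values):
    """Reference: softmax(scores) dotted with values, in a single pass."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

def split_k_attn(scores, values, n_splits):
    """Same result via independent partials merged with the LSE identity."""
    chunk = len(scores) // n_splits
    partials = []
    for s in range(n_splits):
        sc = scores[s * chunk:(s + 1) * chunk]
        va = values[s * chunk:(s + 1) * chunk]
        m_s = max(sc)
        w = [math.exp(x - m_s) for x in sc]
        l_s = sum(w)
        acc_o = sum(wi * vi for wi, vi in zip(w, va))  # unnormalised output
        partials.append((m_s, l_s, acc_o))
    m_g = max(m for m, _, _ in partials)                      # m_global
    l_g = sum(math.exp(m - m_g) * l for m, l, _ in partials)  # l_global
    o_g = sum(math.exp(m - m_g) * o for m, _, o in partials)  # o_global
    return o_g / l_g

scores = [0.3, 2.0, -1.2, 0.7, 1.1, -0.4, 0.0, 1.9]
values = [1.0, -2.0, 0.5, 3.0, -0.7, 2.2, 0.1, -1.5]
ref = softmax_attn(scores, values)
split = split_k_attn(scores, values, n_splits=4)
```

The identity is exact in real arithmetic; in fp16 the precision of the partial accumulators is what bites, as described next.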
First version stored partial o accumulators as half. That truncation caused a small drift after about 31 generated tokens
at 4.6K prefill — not bit-exact with the base FA path. Korean argmax flipped. Storing partials as fp32 brings drift down
to fp32-reordering noise (~1e-7 per add), and greedy argmax is stable across 32+ generated tokens. That's the version I
shipped. 18K prefill went from 270 → 393 t/s on 9B and 104 → 139 t/s on 27B.
Speculative decoding I never got working. I have DFlash + DDTree code in the repo for the eventual fine-tuned drafter.
Right now the pretrained drafter (lucebox-hub/dflash) is trained on stock Qwen3.5, and the Qwopus distill output
distribution doesn't match — accept rate is roughly 0% and the chains degenerate. Listed in the README as broken on
purpose. MTP K=1 single-token spec works fine.
What this is and isn't for
If you have an RTX 30/40-series, A100, or H100, you should be using vLLM or SGLang. They are far more optimised for those
targets and have actual test coverage. qengine would be slower and weirder.
If you have:
— then qengine might be useful. It targets sm_70 specifically. sm_75 should work but isn't tuned. sm_60 won't work (no
DP4A). AMD and Apple Silicon definitely won't work.
Repo
https://github.com/Haru-neo/qengine — Apache 2.0.
The benchmarks in this post are reproducible with the bench_curl.sh script in the repo. The 27B 3-GPU numbers were
measured 2026-05-03 on my machine. If you have the hardware and try it, I'd love to know what you see.
Solo project. Heavy AI assist on the CUDA — I drove the architecture, profiling, and debugging across many sessions;
Claude did most of the kernel implementation. I'm a Korean high school student. Slow PR turnaround.
2026-05-03 12:39:04
Most AI projects are built backwards.
People start with the model and only later discover they needed a memory system, semantic retrieval, tool use, tests, and a fallback plan for when one provider decides to nap for no visible reason.
That is the part I care about now.
LLM Foundry is the workshop around an LLM — not the model itself. It is the layer that makes a model useful for actual work instead of just looking smart in a demo.
The current version now has a few things worth showing instead of just claiming:
That last bit matters more than people like to admit.
If a system cannot be tested, it is not “advanced”. It is just expensive.
A useful model stack is not one prompt and a prayer.
It is usually:
That is the difference between a chatbot and something you might actually trust on real work.
This part matters, because the AI world does itself a lot of damage by overpromising.
If a base model is bad at reasoning, orchestration will not magically make it frontier-grade. You can improve its behaviour, reliability, recall, and workflow quality. You cannot conjure missing intelligence out of nowhere.
That is not a flaw in the system. That is just reality.
What orchestration can do is make a decent model much more useful:
That is the real win.
Here is the validation package I used while testing the repo:
| Check | Result |
|---|---|
| Benchmark pass rate | 50% |
| Reasoning harness | 60% |
| Coding harness | 100% |
| Tool-use harness | 100% |
| Memory harness | 100% |
That benchmark pass rate is not a brag. It is a baseline. The point is that the system is measurable, and therefore improvable.
I wanted the memory system to work for normal tasks, not just demos.
So the retrieval layer is now embedding-based. That means the system can look for relevant context semantically, not just by literal word match.
That matters when the task wording changes but the meaning does not.
In plain English: it is much harder for the assistant to miss the useful note just because you phrased the request differently.
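A toy contrast makes the point: a literal keyword lookup misses a note when the wording changes, while even a crude synonym table (standing in here for a real embedding model) still finds it:

```python
# Crude "semantic" match: map words onto shared concept IDs. A real system
# would use embedding vectors; this table is a stand-in for illustration.
CONCEPTS = {"deploy": "release", "ship": "release", "launch": "release",
            "bug": "defect", "error": "defect", "crash": "defect"}

def concepts(text):
    return {CONCEPTS.get(w, w) for w in text.lower().split()}

def literal_hit(query, note):
    """Match only if query and note share a literal word."""
    return bool(set(query.lower().split()) & set(note.lower().split()))

def semantic_hit(query, note):
    """Match if query and note share an underlying concept."""
    return bool(concepts(query) & concepts(note))

note = "checklist before we ship"
lit = literal_hit("steps to deploy", note)   # no shared literal word
sem = semantic_hit("steps to deploy", note)  # shared concept: "release"
```

The rephrased query finds nothing under literal match but still hits under the concept mapping, which is exactly the failure mode embedding-based retrieval removes.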
That is a small change with outsized effect.
The goal is not “a model wrapper”. The goal is a practical operating layer for LLM work:
That is the kind of infrastructure that makes a model usable for long jobs, research, and product workflows.