
Selective Disclosure Patterns in Compact

2026-05-03 13:17:35

Most blockchains today treat privacy as an afterthought, just an additional feature: everything is visible and public. That architecture works for systems where public transparency is the end goal, but it collapses as soon as you begin to handle large datasets, medical records, or other sensitive information. In these contexts, full transparency is not a feature; it is a problem.

Table of contents

  • How does Midnight solve this problem?
  • Understanding the privacy model
  • Using disclose() correctly
    • Placing disclose() at the right scope
    • Disclosure through conditional expressions
  • What's safe to disclose vs what leaks privacy
    • Safe to disclose
    • What actually leaks privacy?
  • Applying domain-separated hashing for cross-property unlinkability
    • How does this relate to Compact?
    • The round counter and linkability
  • Running a privacy audit on your contract
  • Conclusion

Prerequisites

This is an intermediate tutorial. Before reading, you should be comfortable with:

  • Basic Compact syntax — you understand what witness, circuit, ledger, and export declarations do. If not, start with the Compact language reference.
  • The Compact compiler installed — follow the installation guide to set it up.
  • Zero-knowledge proofs at a conceptual level — you understand that a ZK proof lets you prove a statement is true without revealing the underlying data.
  • Familiarity with a typed language like TypeScript — Compact's syntax is modelled on TypeScript, so comfort with typed functions, structs, and return types will help.

How does Midnight solve this problem?

Midnight implements a solution known as "Selective Disclosure": the ability of users to choose what information to reveal, and to whom, without exposing the full underlying data.

For example, instead of displaying "Rahman's balance is 200,000 USDT", a Midnight DApp can publish a zero-knowledge proof that says "Rahman's balance exceeds the required threshold" and nothing more. This solves two problems at once: the sensitive data stays private, and the important, verifiable fact becomes public. This is the foundation of programmable privacy on Midnight.

This tutorial walks you through how selective disclosure works in Compact, Midnight's TypeScript-inspired smart contract language. By the end of this article you will know how to:

a). Use the disclose() operator and understand its mechanics.
b). Tell what is safe to disclose apart from what silently leaks privacy.
c). Apply domain-separated hashing to block cross-property linkability.
d). Run a privacy audit checklist on your own contracts before deployment.

Understanding the privacy model

Before we get into the coding technicalities, how does Compact even work?

Every Compact contract operates across two kinds of state: the public state and the private state.

The public state: data written to the onchain ledger, open and visible to all network participants.
The private state: data that lives only on the user's machine and is never exposed to the network. It comes from witness functions that your DApp provides locally, and from any value derived from those witness values.
The zero-knowledge proof generated by a circuit proves that the computation was performed correctly without leaking the witness values involved. The major guarantee is that the network learns that the rule was followed, not what values were used.

Compact's compiler includes a witness protection program: an interpreter that tracks which values contain witness data and rejects any program that would disclose sensitive data without an explicit declaration. Privacy becomes the default, and disclosure is an exception you must deliberately request.

Using disclose() correctly

The disclose() wrapper is how you tell the Compact compiler: "I know this value contains private data, and I am intentionally making it public." Without it, the compiler stops, displaying a detailed error. Below is a simple example of this; recording a private balance on-chain:

pragma language_version >= 0.16 && <= 0.22;

import CompactStandardLibrary;

witness getBalance(): Bytes<32>;
export ledger balance: Bytes<32>;

export circuit recordBalance(): [] {
  balance = disclose(getBalance());
}

If you remove the disclose() wrapper, the compiler immediately rejects the program:

Exception: selective.compact line 9 char 9:
  potential witness-value disclosure must be declared but is not:
    witness value potentially disclosed:
      the return value of witness getBalance at line 5 char 1
    nature of the disclosure:
      ledger operation might disclose the witness value
    via this path through the program:
      the right-hand side of = at line 9 char 9

The error message is precise. It names the witness function that is the source of the data, explains the nature of the disclosure, and traces the full path through the program. This is because the compiler wants you to understand exactly what you are declaring when you add the disclose() wrapper.

It is important to note that placing a disclose() wrapper does not cause disclosure in itself. It simply tells the compiler to treat the wrapped expression as if it does not contain witness data. The actual disclosure happens at the storage or return point.

disclose() is the declaration, not the action.

Placing disclose() at the right scope

For simple values, place disclose() as close to the storage or return point as possible. This minimizes the scope of the disclosure declaration and makes it harder to accidentally reuse a disclosed value in a privacy-sensitive context elsewhere in the circuit.

For structured values such as tuples, vectors or structs, wrap only the field you intend to disclose:

pragma language_version >= 0.16 && <= 0.22;

import CompactStandardLibrary;

struct UserRecord {
  publicName: Bytes<32>;
  privateScore: Uint<64>;
}

witness getUserRecord(): UserRecord;
export ledger displayName: Bytes<32>;

export circuit publishName(): [] {
  const record = getUserRecord();
  // Only disclose the name field, not the entire struct
  displayName = disclose(record.publicName);
}

This pattern is important because wrapping the entire struct in a single disclose() would declare the privateScore field as publicly disclosed too even though you never store it onchain. The compiler would accept it, but you would be declaring an intention that does not match your actual privacy requirements.

Disclosure through conditional expressions

The compiler also catches indirect disclosure through boolean comparisons. This is one of the most common unintentional privacy leaks in zero-knowledge contracts:

import CompactStandardLibrary;
witness getBalance(): Uint<64>;

// This will NOT compile — the comparison result leaks the witness value
export circuit balanceExceeds(n: Uint<64>): Boolean {
  return getBalance() > n;
}

Exception: selective.compact line 6 char 1:
  potential witness-value disclosure must be declared but is not:
    witness value potentially disclosed:
      the return value of witness getBalance at line 2 char 1
    nature of the disclosure:
      the value returned from exported circuit balanceExceeds might disclose the result of a
      comparison involving the witness value
    via this path through the program:
      the comparison at line 6 char 8

Returning a boolean derived from a private value is still a disclosure: the result narrows the range of possible values an attacker needs to test. If you intend to return this boolean, you must declare it:

import CompactStandardLibrary;
witness getBalance(): Uint<64>;

export circuit balanceExceeds(n: Uint<64>): Boolean {
  return disclose(getBalance()) > n;
}

Whether this is appropriate depends on your use case. If you are building a compliance gate (e.g. "balance exceeds the minimum required to participate"), returning the boolean is intentional and acceptable. If you are building a privacy-preserving wallet, returning it leaks information about the user's balance range with every call.

What's safe to disclose vs what leaks privacy

Not all disclosures are equal. Some values are safe to publish because they carry no meaningful information about the underlying private data. Others appear harmless but silently narrow the privacy guarantees your DApp advertises. So how do we differentiate them?

Safe to disclose

Derived commitments and hashes (with correct library functions).

The Compact standard library includes functions the compiler recognizes as cryptographically sound transformations. When you wrap witness data in transientCommit(), the compiler treats the output as not containing witness data, which means no explicit disclose() is required for a commitment:

import CompactStandardLibrary;

witness getSecretValue(): Field;
witness getNonce(): Field;
export ledger commitment: Field;

export circuit commitValue(): [] {
  const secret = getSecretValue();
  const nonce = getNonce();
  // transientCommit output is treated as non-witness by the compiler
  commitment = transientCommit(secret, nonce);
}

An important distinction between these four functions: the compiler treats transientCommit() output as not containing witness data, so no disclose() is required on the result. transientHash(), however, is NOT considered sufficient to protect witness data — if the input came from a witness, the output still requires disclose() before it can be stored in the ledger or returned from an exported circuit.

The transient variants (transientCommit and transientHash) are circuit-efficient but their algorithm is not guaranteed to stay consistent across network upgrades, so they should not be used to derive state data that needs to be verified later. The persistent variants (persistentCommit and persistentHash) use SHA-256 and are guaranteed to remain consistent across upgrades, so use these for any values stored in ledger state.

Aggregate results that do not reveal individual values

Storing a count of transactions or a running total where the per-transaction values remain private is generally safe, provided the aggregate does not itself narrow the input range enough to be exploitable.

Public keys derived from private keys through domain-separated hashing

This pattern is used throughout the official examples. You derive a public identifier from a private key, publish the public identifier, and keep the private key local. Because the hash function is one-way and hard to invert, the public key reveals nothing about the private key. The section on domain-separated hashing covers this pattern in depth.

What actually leaks privacy (even with disclose())?

Raw witness values

Storing disclose(getBalance()) directly onchain publishes the balance in plaintext. This is appropriate when the value is intended to be public (e.g. a display name, or a transaction amount on a public ledger), but it is a complete disclosure of the underlying data.

Arithmetic on witness values

Adding, subtracting, or scaling a private value before disclosure does not hide it cryptographically. The Compact compiler correctly rejects this without a disclose() wrapper, but the real risk is developers who add disclose() and assume the arithmetic hides the value. It does not:

// Dangerous — disclosing an offset balance still reveals the balance
balance = disclose(getBalance() + 73);

If an attacker knows the offset, which in this case is 73, they can recover the original value immediately.

Boolean outputs from comparisons on private data

Exposing true/false results of comparisons on confidential values enables progressive disclosure of sensitive information. An attacker who can call the circuit many times with different values can binary-search their way to the exact private value through repeated queries. If you must expose a threshold check, consider rate-limiting at the DApp layer and documenting the tradeoff clearly.
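
To make the risk concrete, here is a minimal TypeScript sketch of the binary search an observer could run off-chain. The callBalanceExceeds function is a hypothetical stand-in for whatever client call invokes the deployed balanceExceeds circuit; it is not part of any Midnight API.

// Hypothetical attacker sketch: recover a private balance using only
// boolean answers from repeated threshold queries.
async function recoverBalance(
  callBalanceExceeds: (n: bigint) => Promise<boolean>, // stand-in for the circuit call
  maxBalance: bigint
): Promise<bigint> {
  let lo = 0n;          // balance is known to be >= lo
  let hi = maxBalance;  // ... and <= hi
  while (lo < hi) {
    const mid = (lo + hi) / 2n;
    // Each query discloses one more bit of information about the balance.
    if (await callBalanceExceeds(mid)) {
      lo = mid + 1n;    // balance > mid
    } else {
      hi = mid;         // balance <= mid
    }
  }
  return lo;            // exact balance after roughly log2(maxBalance) queries
}

Around 64 queries are enough to pin down a Uint<64> balance exactly, which is why rate limiting only raises the attacker's cost rather than removing the leak.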

Enum variants derived from private logic

Returning a state constant or enum variant such as 'Pending' or 'Rejected' that was computed from sensitive data can leak information about that data, depending on how many states exist and how they map onto the input space.

Applying domain-separated hashing for cross-property unlinkability

One of the more delicate privacy risks in zero-knowledge contracts is cross-property linkability, which occurs when two different disclosures from the same private identity can be correlated by an observer, even if neither disclosure reveals the identity itself. For example, Rahman reveals two things in separate transactions: a proof that his KYC tier is "premium" and a proof that his account is over 10 months old. Each proof reveals nothing individually, but if both proofs are signed with the same public key or derived from the same hash of Rahman's identity, an observer can link the two transactions to Rahman, reconstructing a profile without ever learning the underlying identity.

How do we resolve this? Instead of deriving a single public key from a user's secret key, you derive a different public key for each purpose (or domain) by including a domain tag in the hash input. Two hashes with different domain tags, even for the same secret, produce completely unrelated outputs.

How does this relate to Compact?

The official examples use this pattern directly. The publicKey circuit in the lock example incorporates a domain tag as part of the hash:

pragma language_version >= 0.16 && <= 0.22;

import CompactStandardLibrary;

circuit publicKey(round: Field, sk: Bytes<32>): Bytes<32> {
  return persistentHash<Vector<3, Bytes<32>>>(
    [pad(32, "midnight:examples:lock:pk"), round as Bytes<32>, sk]
  );
}

The string midnight:examples:lock:pk is the domain tag. It is namespaced to prevent collision with other contracts that might use the same hash function on the same secret key. This pattern can be extended across different properties of the same identity:

pragma language_version >= 0.16 && <= 0.22;

import CompactStandardLibrary;

// Produce an unlinkable key for KYC tier disclosure
circuit kycTierKey(round: Field, sk: Bytes<32>): Bytes<32> {
  return persistentHash<Vector<3, Bytes<32>>>(
    [pad(32, "myapp:identity:kyc-tier"), round as Bytes<32>, sk]
  );
}

// Produce a separate, unlinkable key for account-age disclosure
circuit accountAgeKey(round: Field, sk: Bytes<32>): Bytes<32> {
  return persistentHash<Vector<3, Bytes<32>>>(
    [pad(32, "myapp:identity:account-age"), round as Bytes<32>, sk]
  );
}

witness secretKey(): Bytes<32>;
export ledger kycAuthority: Bytes<32>;
export ledger ageAuthority: Bytes<32>;
export ledger round: Counter;

export circuit registerKycTier(): [] {
  const sk = secretKey();
  kycAuthority = disclose(kycTierKey(round.read() as Field, sk));
}

export circuit registerAccountAge(): [] {
  const sk = secretKey();
  ageAuthority = disclose(accountAgeKey(round.read() as Field, sk));
}

By doing this, an observer who sees both kycAuthority and ageAuthority on the ledger cannot determine whether they belong to the same user. The domain tags ensure the outputs are cryptographically unrelated, even though both were derived from the same secret key.

The round counter and linkability

The round counter in the examples above serves a second purpose. Without it, a user who calls registerKycTier() twice would produce the same kycAuthority value both times, allowing an observer to link both calls to the same identity. By mixing round into the hash, each call produces a different output even with the same secret key, as long as round is incremented between calls.

Something to note before moving on: never reuse a nonce in a commitment scheme. If you use persistentCommit(value, nonce) and the same nonce appears onchain in two different transactions, an observer can link both commitments and, if they can observe one opening, potentially infer the other.

Running a privacy audit on your contract

Before you deploy a contract to a testnet or mainnet, you should run through this checklist. It helps identify the most common privacy mistakes.

Privacy audit checklist

Disclosure scope

(i). Every disclose() call wraps the minimum expression necessary, not an entire struct when only one field needs to be public.
(ii). No disclose() is placed on a witness call whose output will be used in subsequent private computations. If a witness value travels both a public path and a private path, these should be two separate values.
(iii). All exported circuit return values are reviewed. If a return value is derived from witness data, confirm the disclosure is on purpose and document what information it reveals.

Boolean and comparison leaks

(i). No exported circuit returns a boolean, integer, or enum that is directly derived from an undisclosed private value without documenting the intended information disclosure.
(ii). For threshold checks (e.g. balance > n), consider whether repeated calls allow binary search over the private value. If so, document the tradeoff or implement rate limiting at the DApp layer.

Hashing and commitments

(i). transientHash/transientCommit outputs should not be used to derive ledger state that needs to persist across network upgrades. Use persistentHash/persistentCommit for those values.
(ii). Nonces used in persistentCommit are never reused. Reusing a nonce with the same value allows on-chain commitment linking.
(iii). Hash inputs do not contain values that an attacker could enumerate to reverse the hash (e.g. a small integer hashed alone without a nonce).

Cross-property linkability

(i). Each distinct property of a private identity must use a separate, name-spaced domain tag in its key derivation hash.
(ii). Domain tag strings follow a consistent, collision-resistant naming convention (e.g., "firstproject:context:property").
(iii). A round counter or similar freshness mechanism is included in any public key derivation to prevent same-key linkability across multiple interactions.

Compiler verification

(i). The contract compiles cleanly without suppressed warnings. Review any disclose() wrapper you added in response to a compiler error rather than from intentional design.
(ii). After any refactor, recompile the contract from scratch and re-review all disclose() sites; structural changes can introduce new disclosure paths.

Documentation

(i). Each disclose() site is accompanied by an inline comment explaining what is being disclosed and why it is intentional.
(ii). The contract's README note or specification states clearly which fields in the ledger are public, which are derived from private data, and what information each derived field reveals.

Conclusion

Selective disclosure in Midnight is not just a feature; it is an architectural guarantee built into the language. The compiler's witness protection program means you cannot accidentally publish private data: any disclosure forces a deliberate decision, which is exactly what sensitive data needs.

The disclose() wrapper is that decision point, and treating it as such, rather than as a formality to satisfy the compiler, is what separates a contract with genuine privacy properties from one that only appears private.

In this tutorial we have covered the full practical arc of selective disclosure in Compact. We learned how the public/private state split works and why privacy is the default. We also learned how to use disclose() correctly, how to scope it precisely to avoid over-disclosure on structured values, and how the compiler catches indirect leaks through arithmetic and conditionals.

Domain-separated hashing was applied to break cross-property linkability, using the same pattern Midnight's own standard examples rely on. And now, there is a concrete privacy audit checklist to run against contracts before they reach the network.

If you have any questions on this article or you're confused somewhere, please engage and let me know. Share this with anyone you think it will help. Privacy is the default, disclosure is the exception.

Agent-as-a-Tool: A New Era of AI Orchestration

2026-05-03 13:10:27

Agent-as-a-Tool paradigm

Abstract

As Large Language Model (LLM) agents increasingly integrate numerous external systems, they suffer from Tool Space Interference (TSI), a phenomenon causing context bloat, attention dilution, and degraded reasoning accuracy. In this paper, we introduce the Agent-as-a-Tool paradigm—an evolutionary, practical implementation of the recently proposed Self-Optimizing Tool Caching Network (SOTCN) and Federated Context-Aware Routing Architecture (Federated CARA). By leveraging Retrieval-Augmented Generation (RAG) to dynamically discover and assemble stateful, autonomous sub-agents on the fly, this architecture completely eliminates TSI, enforces Zero-Trust execution boundaries, and achieves infinitely scalable AI orchestration.

Introduction

The rapid evolution of LLMs and their standardized integration with external systems—notably through the Model Context Protocol (MCP)—has transformed LLM-based agents from simple conversational interfaces into advanced, autonomous AI systems capable of executing complex workflows. However, as the number of tools and external skills accessible to an agent increases, a critical performance bottleneck has emerged.

This phenomenon is defined as Tool Space Interference (TSI). As officially highlighted at Google Cloud Next '26, the excessive use of MCP servers and tools leads to "context bloat," where massive amounts of data—such as verbose JSON schemas and system metadata—are loaded into the active context window. This not only exhausts token limits but also triggers "attention dilution." Overlapping tool functionalities and conflicting semantics become noise, severely impairing the model's reasoning capabilities and routing accuracy.

Current technical guidelines suggest a soft limit of approximately 20 tools per agent to maintain high selection accuracy. Exceeding this threshold causes context saturation, leading to a surge in fatal errors such as tool hallucination, the generation of invalid parameters, and the breakdown of execution flows. This creates an orchestration paradox: the more we attempt to scale the system's capabilities, the less reliable the agent paradoxically becomes.

To overcome the TSI problem, I have previously proposed and implemented several architectural approaches. These strategies strongly anticipate the concepts of "Agent Skills" (an agent-first solution to context bloat) and "A2A Orchestration" (a native capability for agents to mutually delegate and coordinate tasks) announced at Next '26.

Positioning of this Paper:
This paper serves as the ultimate convergence and practical realization of both SOTCN and Federated CARA. By fusing these theoretical models utilizing the Google ADK and TypeScript, we dramatically elevate the concept of dynamic tool injection into a new paradigm: Agent-as-a-Tool. This study demonstrates how these architectures can be seamlessly implemented as production-ready, highly scalable code for the Agentic Enterprise.

Repository

You can find the scripts used in this article at https://github.com/tanaikech/agent-as-a-tool.

The Agent-as-a-Tool Paradigm

Building upon prior efforts in aggregation, decentralization, and dynamic routing, this study proposes a paradigm that fundamentally resolves the TSI problem while achieving infinite operational scalability and preserving high reasoning accuracy.

In the original SOTCN proposal, I theorized storing tool metadata in "cold storage" and dynamically injecting only the most relevant functions into the LLM's active context. The Agent-as-a-Tool paradigm elevates this concept to its logical extreme. Instead of merely injecting static "tools" (such as bare API endpoints or JSON schemas), the system dynamically retrieves and delegates tasks to stateful, fully autonomous sub-agents.

By utilizing a Retrieval-Augmented Generation (RAG) database known as an "Agent Bank," the orchestrator extracts only the minimum agentic resources required for a user's task to assemble a temporary task force on the fly. This provides two decisive advantages:

  1. Encapsulation of Expertise (SOTCN Evolved): When raw tools are passed directly to an orchestrator, the primary LLM must independently interpret complex parameters, intricate instructions, and error-handling mechanisms every single time. By caching and injecting Agents instead of Tools, we encapsulate domain-specific system prompts, contextual history, self-reflection, and correction capabilities within the sub-entities. The main model is never exposed to the raw mechanics of the underlying tools.
  2. Distributed Cognitive Load: By dynamically invoking sub-agents, the primary orchestrator offloads the "How" (tool execution procedures) and focuses entirely on high-level meta-reasoning: the "Who and What" (task decomposition, planning, and dependency mapping). This perfectly aligns with standard specifications for seamless A2A coordination.

Architecture and Execution Workflow

Implemented via the Google ADK and Gemini API, the proposed system processes tasks autonomously and dynamically through a structured workflow that codifies the Federated CARA and SOTCN principles into executable logic:

  1. Agent Bank Preparation (SOTCN Storage): We construct diverse sub-agents that encapsulate specialized tools and domain knowledge. Their functional specifications (name, description, required skills) are vectorized and stored in a RAG-based search engine (Google Gen AI File Search Store). This acts as the resilient "cold storage" registry.
  2. Dynamic Discovery via Semantic Search: Upon receiving a user prompt, the main orchestrator (Agent Manager) does not process the task directly. Instead, it utilizes a search_expert_agents tool to query the RAG system, analyzing the prompt's intent and extracting only the Top-K specialized agents essential for the task, keeping the active context pristine.
  3. Context-Aware Dynamic Assembly & Execution (Federated CARA in Action): The orchestrator acts as a sophisticated routing engine. Using the execute_with_dynamic_subagents tool, it analyzes task dependencies and dynamically formulates an execution strategy—routing tasks using Single, Parallel (for independent sub-tasks), or Sequential (for dependent sub-tasks) execution patterns. To prevent state contamination, the system utilizes an InMemoryRunner to spawn a temporary entity (Temporal Coordinator), attaches the retrieved agents as sub-agents, executes the task, and immediately garbage-collects the entities from memory upon completion.
  4. Zero-Trust & Human-in-the-Loop Failsafes: Embodying the core philosophy of Federated CARA, securing execution boundaries is paramount. The system is programmed with strict File Operation Rules. For critical operations (e.g., creating, modifying, or deleting files), the orchestrator automatically pauses execution to enforce a strict "Human-in-the-Loop" (HITL) protocol, requiring explicit user approval before acting. By operating as an A2A Server, it establishes an enterprise-grade, Zero-Trust environment.

Through this architecture, even if the total number of tools accessible to the enterprise scales to thousands, the main orchestrator's cognitive load remains virtually zero.
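
To make the workflow above more tangible, here is a minimal, self-contained TypeScript sketch of the discover-then-assemble loop. The interfaces and names (ExpertAgent, AgentBank, orchestrate) are illustrative stand-ins only, not the actual ADK or repository APIs:

// Illustrative sketch only; these types are stand-ins, not the ADK or repository API.
interface ExpertAgent {
  name: string;
  run(task: string): Promise<string>;
}

interface AgentBank {
  // Semantic search over vectorized agent descriptions (the SOTCN "cold storage").
  searchExpertAgents(prompt: string, topK: number): Promise<ExpertAgent[]>;
}

type Strategy = "single" | "parallel" | "sequential";

async function orchestrate(bank: AgentBank, prompt: string, strategy: Strategy): Promise<string[]> {
  // 1. Dynamic discovery: only the Top-K relevant agents enter the active context.
  const experts = await bank.searchExpertAgents(prompt, 3);
  if (experts.length === 0) {
    return ["No registered agent can handle this task."];
  }

  // 2. Ephemeral assembly and execution; the "task force" lives only inside this call.
  if (strategy === "parallel") {
    return Promise.all(experts.map((agent) => agent.run(prompt)));
  }

  const results: string[] = [];
  let context = prompt;
  for (const agent of experts) {
    const output = await agent.run(context);
    results.push(output);
    context = `${prompt}\n\nPrevious result:\n${output}`; // sequential: feed prior output forward
  }
  return results; // 3. The temporary entities go out of scope and are garbage-collected.
}

The property the sketch tries to show is that only the retrieved Top-K agents ever enter the orchestrator's working set, and the assembled task force exists only for the lifetime of a single orchestrate call.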

System Execution Flow in Practice

To illustrate how the theories of SOTCN and Federated CARA function in a live environment, the practical execution flow of the system is detailed below. This flow highlights the dynamic skill extraction and the ephemeral assembly of sub-agents.

(Figure: Mermaid chart of the Agent-as-a-Tool execution flow, available in the Mermaid Chart Playground.)

Step-by-Step Breakdown

  1. Agent Extraction & Skill Identification: The system reads all registered agents from the Agent Bank (Registry) and dynamically extracts their overarching capabilities and functional skills using the Gemini API.
  2. Skill Registration: These extracted skills are systematically registered into the core description of the main agent-manager, establishing its semantic capability matrix.
  3. Server Launch: The main agent-manager is deployed as a standalone Agent-to-Agent (A2A) server.
  4. Client Activation: The user launches the Gemini CLI to serve as the local client interface.
  5. User Input: The user inputs a prompt defining their specific task, workflow, or goal.
  6. Task Routing: If the prompt's task falls within the functional scope of the agent-manager's registered capabilities, the orchestrator is natively triggered by the CLI.
  7. Semantic Search (SOTCN): The agent-manager executes the search_expert_agents tool to query the Agent Bank's File Search Store. It retrieves one or multiple expert agents required for the task. If the requested task cannot be processed by any registered agent, the RAG system notifies the orchestrator, which gracefully forwards a transparent limitation message back to the user.
  8. Dynamic Assembly & Execution (Federated CARA): If the required agents are found, the orchestrator triggers the execute_with_dynamic_subagents tool. This process spawns a fresh Temporal Coordinator using an in-memory runner, attaches the retrieved agents to form an ephemeral task force, and executes the complex task based on the optimal strategy (Single, Parallel, or Sequential).
  9. Result Synthesis: The temporal team processes the data and returns the execution results to the agent-manager. The manager synthesizes the final response, securely flushes the temporal team from memory to prevent state contamination, and delivers the comprehensive output back to the Gemini CLI.

Project Setup and Prerequisites

To demonstrate this architecture practically, I have prepared a complete Node.js/TypeScript implementation. You can view the full repository of sample scripts at https://github.com/tanaikech/agent-as-a-tool.

To follow along with this guide, ensure your environment meets the following requirements:

  • Node.js is installed and configured on your system.
  • Gemini CLI is installed and accessible via your terminal.

1. Install agent-as-a-tool

To retrieve and initialize the scripts, execute the following commands in your terminal:

git clone https://github.com/tanaikech/agent-as-a-tool
cd agent-as-a-tool
npm install

The directory structure is defined below. src/agentbank.ts acts as the registry for the sample agents. You can modify this file to add or remove agents tailored to your enterprise workflows.

agent-as-a-tool/
├── package.json
├── tsconfig.json
├── src/
│   ├── a2aserver.ts
│   ├── agent.ts
│   ├── agentbank.ts
│   ├── autonomous-google-workspace-agent.ts
│   └── store_manager.ts
└── test/
    └── test_search.ts

In this sample, three sophisticated agents are included. The third agent is derived from "Empowering Autonomous AI Agents through Dynamic Tool Creation". When utilizing this agent, please refer to that article for specific sandbox setup instructions.

1. Currency Exchange Agent (exchange_agent)
 - Capabilities: A financial specialist providing accurate global currency exchange rates.
 - Features: Dynamically handles current rates and relative date requests (e.g., "last Friday") by resolving the exact temporal context before API retrieval.

2. Weather Agent (weather_agent)
 - Capabilities: A professional meteorologist providing precise weather forecasts.
 - Features: Computes target dates, times, and geographic coordinates for relative requests (e.g., "in 3 hours") to deliver perfectly timed data.

3. Autonomous Google Workspace Agent (autonomous-google-workspace-agent)
 - Capabilities: A Senior Orchestrator managing the full lifecycle of Google Apps Script (GAS) development.
 - Internal Sub-Agents:
     - Environment Checker: Validates local tool installations (e.g., @google/clasp).
     - Script Writer: Generates GAS-compatible code using live official documentation via MCP.
     - Script Executor: Simulates and tests scripts in a locally sandboxed environment.
     - Script Uploader: Manages Drive project creation and secure uploading via clasp.
     - Summary Agent: Consolidates technical results into structured execution reports.

To enable these agents, you must configure your Gemini API key as an environment variable:

export GEMINI_API_KEY=<YOUR_API_KEY_HERE>

2. Store Agents to RAG

To securely index the initial agents into the File Search Store (Agent Bank), execute the following command:

npm run regAgents

If you add new capabilities to agentbank.ts and re-run this command, the script dynamically compares existing metadata, ensuring only newly added agents are registered and entirely avoiding duplicate ingestion.

You can inspect the registered agent list by running npm run regAgentList, or selectively clear the stores using npm run deleteStores.

Once the store is created, map it to your environment session:

export AGENT_BANK="{your store name}"

3. Launch Web Server

This framework can function natively as a standalone web server or as a delegated sub-agent linked to the Gemini CLI. Let's first test it as a standalone server.

Launch the Web server:

$ npm run web

> [email protected] web
> npx adk web src/agent.ts

+-----------------------------------------------------------------------------+
| ADK API Server started                                                      |
|                                                                             |
| For local testing, access at http://localhost:8000.                         |
+-----------------------------------------------------------------------------+

You can now interact with the web interface by navigating to http://localhost:8000 in your browser.

4. Testing: Web Server

A demonstration video for scenarios 1 through 4 in this section is available here:

Once you have executed npm run web and launched your browser at http://localhost:8000, evaluate the following test cases.

Scenario 1

Prompt:

What is the latest exchange rate from USD to JPY?

Result: The orchestrator correctly processes the semantic intent, dynamically fetching and executing a single exchange_agent to retrieve the latest financial rates.

Scenario 2

Prompt:

Please tell me the weather in Tokyo at noon tomorrow.

Result: The orchestrator interprets the temporal requirement, processes the task using the weather_agent, which correctly calculates "tomorrow at noon" before querying the weather API.

Scenario 3

Prompt:

I am traveling to Paris. Please check the weather in Paris on 2026-05-01 12:00 (Latitude 48.85, Longitude 2.35, Timezone Europe/Paris). Also, I need to plan my budget, so please provide the latest exchange rate from JPY to EUR simultaneously.

Result: The orchestrator dissects this complex prompt, semantic-searches the Agent Bank, and identifies two required experts. Utilizing the CARA-inspired routing engine, it formulates a Parallel execution strategy, assembling and coordinating both the exchange_agent and weather_agent concurrently, seamlessly synthesizing their outputs.

Scenario 4

Prompt:

Please tell me the weather in Tokyo tomorrow, and also book a flight from New York to Tokyo for next Monday.

Result: The orchestrator successfully procures the weather forecast via the weather_agent. However, recognizing that no flight-booking agent exists in the Agent Bank, the system strictly enforces its operational boundaries, returning the weather data while transparently explaining its limitation regarding the flight booking.

5. Launch A2A Server

To test this architecture directly within your terminal as an enterprise sub-agent for the Gemini CLI, you need to launch the Agent-to-Agent (A2A) server endpoint.

npm run a2a

When this command executes, the routing endpoint becomes active:

$ npm run a2a

> [email protected] a2a
> npx tsx src/a2aserver.ts

Server started on http://localhost:8000
Try: http://localhost:8000/.well-known/agent-card.json

To configure this A2A server as an accessible sub-agent for the Gemini CLI, create or update .gemini/agents/agent-as-a-tool.md with the following configuration:

---
kind: remote
name: agent-as-a-tool
agent_card_url: http://localhost:8000/.well-known/agent-card.json
---

You can inspect the generated agent card specifications by opening the provided URL (http://localhost:8000/.well-known/agent-card.json) in your browser.

6. Testing: A2A Server

A demonstration video for the A2A server using scenarios 1 through 4 is available here:

Once the server is configured and running, launch the Gemini CLI. We use the prefix @agent-as-a-tool to route the intent directly to our newly created orchestrator.

Scenario 1

Prompt:

@agent-as-a-tool What is the latest exchange rate from USD to JPY?

Result: The orchestrator receives the delegated request from the Gemini CLI and successfully answers using the exchange_agent.

Scenario 2

Prompt:

@agent-as-a-tool Please tell me the weather in Tokyo at noon tomorrow.

Result: Processed seamlessly via the dynamically loaded weather_agent.

Scenario 3

Prompt:

@agent-as-a-tool I am traveling to Paris. Please check the weather in Paris on 2026-05-01 12:00 (Latitude 48.85, Longitude 2.35, Timezone Europe/Paris). Also, I need to plan my budget, so please provide the latest exchange rate from JPY to EUR simultaneously.

Result: Mirroring the web testing, the agent-as-a-tool orchestrator dynamically delegates sub-tasks to both the exchange_agent and weather_agent, returning a synthesized response directly to your CLI session.

Scenario 4

Prompt:

@agent-as-a-tool Please tell me the weather in Tokyo tomorrow, and also book a flight from New York to Tokyo for next Monday.

Result: The A2A orchestrator accurately audits its capability matrix, retrieving the weather while explicitly refusing the flight booking due to the lack of an applicable sub-agent.

Scenario 5

In this scenario, we utilize the advanced agent I previously published in "Empowering Autonomous AI Agents through Dynamic Tool Creation".

Prompt:

@agent-as-a-tool Create a new Google Spreadsheet by putting a formula `=GOOGLEFINANCE("CURRENCY:USDJPY")` in cell "A1" of the first sheet. Then, get and show the value of cell "A1". (Note: `gas-fakes` has no `getActiveSheet()` method. In this case, use `getSheets()[0]`.)

When this prompt is processed, the system retrieves the complex autonomous-google-workspace-agent from the Agent Bank. Because creating a new Google Spreadsheet demands specific Google Drive authorization, the agent correctly halts execution and requests explicit authorization to execute file creation workflows. Once approved, the orchestrator leverages local sandboxing (gas-fakes) to simulate, validate, and execute the generated Google Apps Script, perfectly achieving the goal. This also indicates that by using Agent-as-a-Tool, you can use existing agents as they are.

Result of scenario 5

Security Considerations and Zero-Trust Governance

As autonomous agents assume greater operational responsibility in enterprise environments, security must be treated as a foundational architectural component rather than an afterthought. The Agent-as-a-Tool paradigm natively incorporates robust security measures, heavily inspired by the zero-trust principles of the Federated Context-Aware Routing Architecture (Federated CARA). This framework secures the execution environment through three primary mechanisms:

1. Attack Surface Minimization via Dynamic Injection

Traditional agents that load massive toolsets into their active context window inadvertently expand their attack surface. Malicious prompt injections can easily trick an over-privileged LLM into triggering unintended functions. By utilizing the RAG-based storage inherited from the Self-Optimizing Tool Caching Network (SOTCN), our orchestrator injects only the strictly necessary sub-agents for a specific task. If a tool or capability is not retrieved by the semantic search, it physically cannot be executed within that session, intrinsically shielding the system from arbitrary functional exploits.

2. Ephemeral Execution and State Isolation

To prevent data leakage across different tasks or users (state contamination), the architecture deeply integrates an InMemoryRunner for temporal generation. Sub-agents are never persistent entities. The orchestrator dynamically spawns a Temporal Coordinator and its assigned sub-agents strictly for the duration of the given task. Once the execution is complete and the synthesized result is returned, the temporal team is immediately garbage-collected from memory. This ephemeral execution model ensures that sensitive data processed in one session cannot cross-pollinate or influence subsequent AI interactions, maintaining strict state isolation.

3. Human-in-the-Loop (HITL) and Boundary Control

The Agent Manager also serves as a strict security triage engine. For non-destructive, read-only operations (such as fetching weather forecasts or exchange rates), it delegates and operates fully autonomously. However, for critical operations—specifically creating, modifying, or deleting files—the system enforces a mandatory Human-in-the-Loop (HITL) protocol. The orchestrator is explicitly instructed via its core system prompt (File Operation Rules) to halt execution and request direct user confirmation before proceeding with any action that crosses predefined security boundaries. This dual-layered approach—autonomous execution for read-only tasks and mandated HITL for write operations—establishes a scalable, enterprise-grade governance model without sacrificing system agility.

Future Perspectives on Dynamic Orchestration

In this article, as an experimental approach to the dynamic use of AI agents, we successfully demonstrated the dynamic selection, assembly, and execution of pre-built, safety-verified agents from an Agent Bank based on specific task requirements. However, this foundational approach naturally paves the way for an even more advanced future capability.

In addition to utilizing an Agent Bank, we can envision a scenario where optimal agents are dynamically constructed from scratch by pulling highly granular resources from a "Tool Bank," an "MCP Server Bank," or a "Skill Bank" on the fly, tailoring the newly generated agent entirely to the context of the requested task. While this promises unprecedented orchestration flexibility, it is crucial to recognize that the dynamic combination of these unverified components may inadvertently introduce new security vulnerabilities or unpredictable agent behaviors. Therefore, establishing rigorous safety validation mechanisms and robust governance protocols for these dynamic combinations will undoubtedly be a critical challenge to address when realizing this ultimate vision of autonomous AI orchestration.

Summary

  • Practical Implementation of SOTCN and Federated CARA: This architecture translates advanced theoretical frameworks into highly scalable, production-ready code utilizing the Google ADK and TypeScript.
  • Resolution of Tool Space Interference: By utilizing a RAG-based Agent Bank (File Search Store), the architecture only loads necessary sub-agents at runtime, entirely eliminating context bloat and tool hallucination.
  • Enhanced Task Resolution via Encapsulation: Transitioning from "raw tools" to the "Agent-as-a-Tool" model ensures that domain-specific prompts, temporal logic, and self-reflection mechanics are preserved within specialized agents, greatly enhancing execution accuracy.
  • Distributed Cognitive Load: The main orchestrator is freed from the mechanics of individual tool execution. By leveraging Single, Parallel, or Sequential strategies dynamically, it dedicates its entire token allowance and reasoning capacity to high-level context-aware routing and task planning.
  • Infinite Scalability: Adding new capabilities to the ecosystem simply involves pushing new agent profiles to the semantic Agent Bank, enabling the orchestration system to scale endlessly without degrading core model performance.
  • Enterprise-Grade Failsafes: Built-in Human-in-the-Loop (HITL) checkpoints for file operations and rigorous state separation via ephemeral InMemoryRunner instances establish a secure, Zero-Trust environment perfectly tailored for production deployment.

Explanatory video

Your .env file should have one line

2026-05-03 13:05:18

Every AI app I've shipped recently rewrote the same plumbing. The OAuth dance for Slack. Encrypted storage for an API key. Refresh-token logic that finally fails on the 3rd call after an hour. Wiring up an MCP client to a server behind a bearer token someone pasted into a Notion page.

I'd write it, copy-paste it into the next app, watch it rot. Each new agent built by a different teammate, slightly differently, with slightly different bugs. We were a small team and the integration code became most of the code.

## The pattern under all of it

Strip away the providers and the AI-specific bits, and every app needed the same four things from the platform:

  1. Env vars — a database URL, a Stripe key, the boring stuff. Not in a .env file in a Docker image. Not in a CI secret. Somewhere the app can ask for them at runtime.
  2. Pre-built integrations — Gmail, Calendar, Drive. The user logs in once on the platform; every app gets typed access on their behalf.
  3. Custom OAuth — the providers no platform pre-builds. Slack, Notion, the company's SSO. The customer holds the client_id/secret; their app shouldn't.
  4. Custom MCP — internal MCP servers, third-party MCPs. The customer holds the URL and the bearer token; their app shouldn't.

That's the spine of the SDK we ended up shipping. Four primitives, every app uses some of them, none of them require integration code in the app.

## Register once at the org level

The flip is registration. The org owner registers their things one time on the dashboard:

  • Drop a Slack client_id + client_secret into the "Custom OAuth providers" card. Encrypted with the org's KMS key. The app never sees it.
  • Drop the URL of an internal MCP server + a bearer token into the "Custom MCP servers" card. Same treatment.
  • Connect Doppler / 1Password / GCP Secret Manager as a secret source — or just type secrets into the dashboard.

Now every app you deploy in that org gets typed access through four SDK calls.

## The four calls

  import { LeashIntegrations } from '@leash/sdk/integrations'

  const client = new LeashIntegrations({ apiKey: process.env.LEASH_API_KEY })

  // 1. Env var (resolves through your configured secret source)
  const dbUrl = await client.getEnv('DATABASE_URL')

  // 2. Pre-built integration
  const messages = await client.gmail.listMessages({ maxResults: 5 })

  // 3. Custom OAuth — fresh access token for any provider you've registered
  const slackToken = await client.getAccessToken('slack')

  // 4. Custom MCP — { url, headers } including bearer Authorization
  const mcp = await client.getCustomMcpConfig('acme-tools')

Same shape across TypeScript, Python, Go, Ruby, Rust, and Java. No client_secret in the app code. No refresh-token handler. No MCP boilerplate.

## Your .env collapses to one line

The thing we noticed only after living with it: once you're using this, the only secret your app's .env actually needs is the platform API key.

  # .env  (yes, this is the whole thing)
  LEASH_API_KEY=lsk_live_...

That's it. No more .env.example drift. No more "did we set DATABASE_URL in staging?" debugging at 11pm. Rotation happens at the source — no rebuild, no redeploy.

## What it deliberately doesn't do

A few decisions that came up that I'll defend:

Doesn't proxy MCP traffic. We hand the app {url, headers} (with bearer Authorization already attached) and the app calls the MCP directly. Leash isn't in the request path. Tool calls are on the LLM's critical path; an extra hop hurts. We also didn't want to reimplement every MCP transport (streamable HTTP, SSE, stdio) with our own bugs.

Doesn't force you to use the platform for secrets. If you'd rather hold them in Doppler or 1Password, point the platform at your existing source. getEnv resolves through whichever the org configured.

Doesn't pretend to be multi-cloud. Single-region GCP today. If you're betting on us, you're betting on a small surface area — not a multi-cloud promise.

## The why behind the shape

Customer apps can't hold credentials safely. Their AI agent runs on someone's laptop, in CI, on a Cloud Run revision someone's about to redeploy. Putting client_secret in the app means rotating it everywhere whenever it leaks. So we put the credential in one place and gave the app a thin retrieval call instead.

Same logic for MCP. The bearer token for a customer's internal tool server isn't something we want their AI app to know. The app gets a config dictionary right before it calls the MCP. That's the only window in which the credential lives anywhere near user code.
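
For illustration, here is a hedged sketch of what consuming that handoff can look like with plain fetch. The JSON-RPC body is only an example of talking to an MCP server over streamable HTTP; in practice you would hand mcp.url and mcp.headers to your MCP client library. client is the LeashIntegrations instance from the earlier snippet.

  // The app receives { url, headers } and calls the MCP server directly;
  // Leash is not in the request path, and the bearer token stays inside headers.
  const mcp = await client.getCustomMcpConfig('acme-tools')

  const res = await fetch(mcp.url, {
    method: 'POST',
    headers: { ...mcp.headers, 'Content-Type': 'application/json' },
    // Illustrative JSON-RPC call; a real MCP client would manage the session.
    body: JSON.stringify({ jsonrpc: '2.0', id: 1, method: 'tools/list' }),
  })
  const tools = await res.json()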

The four-primitive surface area is small on purpose. Anything else (token caching, retries, pagination on Gmail, etc.) lives in the SDK or in the customer's code, not in the platform contract. We'd rather grow the SDK than the API.

## Try it

  curl -fsSL https://leash.build/install.sh | sh
  leash login
  leash deploy

Or just sign up at leash.build, register a Slack app or an internal MCP, and call the SDK from any project. Custom OAuth + custom MCP are gated to the Growth plan; built-in integrations work on every plan including free.

Curious what others have done for this. Especially the proxy-vs-config-handoff call for MCP — I made the bet, but it's the architecture choice I'd most welcome a counterargument on.

Network Address Translation (NAT)

2026-05-03 12:51:33

As we have discussed before, the Internet relies on numerical addresses, called IP addresses, to route data from one device to another. IPv4 offers around 4.3 billion addresses, which, as we have seen, is not enough. While IPv6 is one answer, another solution to this problem is Network Address Translation (NAT).

NAT allows multiple devices on a private network to share a single public IP address. This not only helps conserve the limited pool of public IP addresses but also adds a layer of security to the internal network.

Private vs. Public IP Addresses

Public IP addresses are globally unique identifiers that are assigned by Internet Service Providers (ISPs). Devices with these IP addresses can be accessed from anywhere on the Internet, allowing them to communicate across the global network.

On the other hand, private IP addresses are designated for use within local networks such as homes, offices and schools. These are not routable on the global internet, so they cannot be forwarded by internet backbone routers. Defined by RFC 1918, common IPv4 private address ranges include 10.0.0.0 to 10.255.255.255, 172.16.0.0 to 172.31.255.255, and 192.168.0.0 to 192.168.255.255. This setup ensures that these private networks operate independently of the internet while facilitating internal communication and device connectivity.

Private IP addresses contribute to conserving public IP addresses. Using Network Address Translation (NAT), a local network can use private IP addresses internally while sharing a single public IP address, reducing the number of public IPs needed. This setup lets many devices reach the internet without each needing its own public address. Additionally, private IPs help secure the network by isolating internal devices from direct exposure to the internet, protecting them from potential external threats.
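
As a quick illustration, here is a small TypeScript helper, a sketch rather than anything from a standard library, that checks whether an IPv4 address falls inside one of the RFC 1918 private ranges listed above:

// Sketch: does an IPv4 address fall in an RFC 1918 private range?
function isPrivateIPv4(address: string): boolean {
  const octets = address.split('.').map(Number);
  if (octets.length !== 4 || octets.some((o) => !Number.isInteger(o) || o < 0 || o > 255)) {
    throw new Error(`Not a valid IPv4 address: ${address}`);
  }
  const [a, b] = octets;
  return (
    a === 10 ||                          // 10.0.0.0 to 10.255.255.255
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0 to 172.31.255.255
    (a === 192 && b === 168)             // 192.168.0.0 to 192.168.255.255
  );
}

console.log(isPrivateIPv4('192.168.1.10')); // true: private, not routable on the internet
console.log(isPrivateIPv4('8.8.8.8'));      // false: public, globally routable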

How does it work?

Network Address Translation (NAT) is a process carried out by a router or a similar device that modifies the source or destination IP address in the headers of IP packets as they pass through. This modification is used to translate the private IP addresses of devices within a local network to a single public IP address that is assigned to the router.

For example, say that your home network has a few devices: a laptop, a smartphone, a tablet and a smart thermostat. All of these have their own private IP addresses, which they can use to connect to each other. But suppose the laptop wants to reach a DNS server on the internet; for that, it needs a public IP address. As the packet passes through the router, the router rewrites the private source IP address into its public one. This public IP address is the same for all of the devices on the network. When the response arrives, the router's NAT table, which keeps track of IP mappings, identifies that 203.0.113.50:4444 corresponds to the laptop at 192.168.1.10:5555 (ports 4444 and 5555 are dynamically assigned). All of this is done by the NAT process.

Types of NAT

Static NAT - Involves a one-to-one mapping, where each private IP address corresponds directly to a public IP address.

Dynamic NAT - Assigns a public IP from a pool of available addresses to a private IP as needed, based on network demand.

Port Address Translation (PAT) - Also known as NAT overload, this is the most common form of NAT in home and small office networks. Multiple private IP addresses share a single public IP address, with connections differentiated by unique port numbers.
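
To connect this back to the router example above (203.0.113.50 translating for 192.168.1.10), here is a hedged TypeScript sketch of the bookkeeping a PAT device performs. Real routers do this per connection in the forwarding path; the field names and port allocation here are purely illustrative:

// Illustrative PAT (NAT overload) table: many private ip:port pairs share
// one public IP and are told apart by the translated port number.
interface NatEntry {
  privateIP: string;
  privatePort: number;
  publicPort: number; // the port the outside world sees
}

class NatTable {
  private entries: NatEntry[] = [];
  private nextPort = 4444; // simplistic port allocator for the example

  constructor(private publicIP: string) {}

  // Outbound packet: swap the private source for the shared public IP
  // plus a unique port, and remember the mapping.
  translateOutbound(privateIP: string, privatePort: number) {
    let entry = this.entries.find(
      (e) => e.privateIP === privateIP && e.privatePort === privatePort
    );
    if (!entry) {
      entry = { privateIP, privatePort, publicPort: this.nextPort++ };
      this.entries.push(entry);
    }
    return { sourceIP: this.publicIP, sourcePort: entry.publicPort };
  }

  // Inbound response: look up the public port and restore the private destination.
  translateInbound(publicPort: number) {
    const entry = this.entries.find((e) => e.publicPort === publicPort);
    return entry ? { destIP: entry.privateIP, destPort: entry.privatePort } : null;
  }
}

const nat = new NatTable('203.0.113.50');
console.log(nat.translateOutbound('192.168.1.10', 5555)); // { sourceIP: '203.0.113.50', sourcePort: 4444 }
console.log(nat.translateInbound(4444));                  // { destIP: '192.168.1.10', destPort: 5555 }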

Benefits and Trade-Offs

Benefits

  • Conserves the limited IPv4 address space.
  • Provides a basic layer of security by not exposing internal network structure directly.
  • Flexible for internal IP addressing schemes.

Trade-Offs

  • Complex services like hosting a public server behind NAT can require additional configuration (e.g., port forwarding).
  • NAT can break certain protocols that rely on end-to-end connectivity without special handling.
  • Adds complexity to troubleshooting connectivity issues.

I wrote a custom CUDA inference engine to run Qwen3.5-27B on $130 mining cards

2026-05-03 12:42:30

I bought four NVIDIA CMP 100-210 cards off the secondhand market for about $130 each. They are ex-mining cards based on the
Volta GV100 die — same silicon as the V100 — with 16 GB of HBM2 each. On paper, four of them give me 64 GB of HBM2 for
the price of a single used 3090.

In practice, NVIDIA had crippled them in hardware.

The throttle

The CMP 100-210 has its tensor cores throttled 64×. HMMA latency is stretched from 8 cycles to 512. cuBLAS WMMA caps out
at about 5 TFLOP per card. PCIe is locked to Gen1 x1, no P2P, no NVLink. CUPTI is blocked, so you can't even use NVIDIA's
own profiler.

The throttle is enforced by an e-fuse + PMU bootrom double-lock on the die. This isn't a firmware switch — it's blown into
the silicon. There is no software unlock. (Yes, I tried.)

The result: anything that goes through cuBLAS tensor cores runs at 1/64 speed or fails outright. That's vLLM, llama.cpp's
default cuBLAS path, FlashAttention, bitsandbytes, PyTorch's default matmul. The standard LLM inference stack is unusable
on this hardware.

So I wrote my own.

The workaround

It turns out NVIDIA only throttled tensor cores. Two other paths on the same chip are full speed:

  • DP4A (4-way packed int8 dot product): ~17 TFLOP, no throttle
  • HFMA2 (2-way packed fp16 fused multiply-add): ~24 TFLOP, no throttle

Neither is as fast as a healthy V100's tensor cores, but both are far above the 5 TFLOP cuBLAS WMMA ceiling. Routing all
of inference through these two paths gets you back to roughly half of what an unthrottled V100 would do, which is still
vastly better than nothing.
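If DP4A is new to you, the arithmetic of a single instruction is simple to state: a 4-way packed int8 dot product accumulated into a 32-bit integer. Here is a tiny numpy sketch of the semantics (illustrative only; in qengine this is one __dp4a hardware instruction per 4-byte pair inside the GEMM tiles, not Python):

    # What one DP4A instruction computes: a 4-way packed int8 dot product
    # accumulated into a 32-bit integer. Illustrative semantics only.
    import numpy as np

    def dp4a(a4, b4, acc):
        assert a4.dtype == np.int8 and b4.dtype == np.int8 and a4.size == 4
        return acc + int(a4.astype(np.int32) @ b4.astype(np.int32))

    a = np.array([12, -7, 33, 5], dtype=np.int8)
    b = np.array([-2, 9, 4, 127], dtype=np.int8)
    print(dp4a(a, b, 0))  # 680

    # A Q8_0 row-by-column product is this, repeated over the row four values
    # at a time, with the per-block scales applied once at the end.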

Building on that, qengine is a from-scratch CUDA inference engine for Qwen3.5 / Qwen3.6 hybrid models. (Worth noting:
Qwen3.5 / 3.6 are a different architecture from Qwen3 — they are dense GDN (Gated DeltaNet) + Attention hybrids, not pure
transformers. The kernels look quite different.)

The engine has:

  • A hand-written Q8_0 GEMM tile path for prefill, all DP4A
  • A fused FlashAttention kernel (score + softmax + value online)
  • Split-K FlashAttention for long context (more on this below)
  • 3-bit Walsh-Hadamard + Lloyd-Max KV cache so 27B fits 256K context on three 16 GB cards
  • An OpenAI-compatible HTTP API with streaming, tool calls, vision, continuous batching, and per-slot prefix caching

It's not a fork. Every kernel is written for sm_70 + CMP constraints.
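For reference, the Q8_0 in that list is the standard GGUF block format: 32 weights per block, one fp16 scale d, 32 signed int8 quants q, and each dequantised weight is d · q. A small illustrative round trip (function names are mine, not qengine's):

    # Illustrative Q8_0 quantise/dequantise round trip: 32 weights per block,
    # one fp16 scale, 32 signed int8 quants. Names are mine, not qengine's.
    import numpy as np

    def q8_0_quantize(block):
        assert block.size == 32
        d = np.float16(np.abs(block).max() / 127.0)                  # per-block scale
        q = np.clip(np.round(block / float(d)), -127, 127).astype(np.int8)
        return d, q

    def q8_0_dequantize(d, q):
        return float(d) * q.astype(np.float32)

    w = np.random.default_rng(1).normal(size=32).astype(np.float32)
    d, q = q8_0_quantize(w)
    print(np.abs(w - q8_0_dequantize(d, q)).max())                   # small per-weight error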

Honest benchmarks

I'm comparing against llama.cpp build 8462 with -fa 1, the same Q8_0 GGUFs, on the same hardware. Bigger numbers are
better.

▎ Qwen3.5-9B, single GPU prefill (qengine vs llama.cpp, tokens/sec):
▎ - 297-token prefill: 594 vs 199 (2.99x)
▎ - 1.16K-token prefill: 683 vs 316 (2.16x)
▎ - 4.62K-token prefill: 584 vs 361 (1.62x)
▎ - 18K-token prefill: 393 vs 324 (1.22x)

qengine leads at all four lengths, with the advantage narrowing from about 3x at short prefills to 1.22x at 18K.

Generation: qengine wins by +48–51% on both sizes (9B: ~70 t/s vs 46.6; 27B: 26.3 vs 17.7).

The honest weak point: 9B dual-GPU at 18K still trails llama.cpp (~0.48×). Their layer pipeline overlaps activation
transfer with compute; mine does the transfers sequentially through pinned host memory, because no P2P. Single-GPU 9B is
faster than either dual-GPU run anyway, so it's mostly a theoretical gap, but it's there.

What was hard

A few things that took real time to get right:

Multi-GPU without P2P. With CMP cards there's no peer-to-peer, no NVLink. Hidden state has to bounce through pinned host
memory between GPUs. I keep a pinned-host buffer per cross-GPU edge and a worker thread per GPU. It works, it's just
sequential.

Numerical drift killing Korean output. Qwopus3.5-9B distill has weak Korean circuits to begin with — small fp16 reorder
noise shifts argmax decisions and the model starts producing garbled Korean. I learned this the hard way after a
chunked-prefill kernel optimisation that "passed" my English greedy-argmax tests broke Korean entirely. Now every kernel
that touches the attention reduction order gets a Korean argmax-stability check before it ships.

Split-K FA without breaking determinism. The 64-block FA grid was under-utilising the SMs at long context (only 64 blocks
across 3×68 SM = 204), so each block was running a 575-iteration K/V tile loop in isolation. I added a split-K variant
that maps each (kv_head, t_idx) to N independent blocks, each handling a contiguous tile range, and merged the partials
with the standard log-sum-exp identity:

m_global = max_s m_s
l_global = Σ_s exp(m_s − m_global) · l_s
o_global = Σ_s exp(m_s − m_global) · acc_o_s

The first version stored the partial o accumulators as half. That truncation caused a small drift after about 31 generated tokens
at 4.6K prefill — not bit-exact with the base FA path. Korean argmax flipped. Storing partials as fp32 brings drift down
to fp32-reordering noise (~1e-7 per add), and greedy argmax is stable across 32+ generated tokens. That's the version I
shipped. 18K prefill went from 270 → 393 t/s on 9B and 104 → 139 t/s on 27B.
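Here is a small, self-contained numpy check of that merge; the shapes and names are illustrative rather than qengine's actual buffers, but it shows that splitting the softmax-weighted sum into independent tile ranges and recombining with the identity above reproduces the single-pass result exactly.

    # Numpy check of the split-K merge. Each split s produces (m_s, l_s, acc_o_s)
    # over its own tile range; merging with the identity above (plus the final
    # division by l_global) matches single-pass attention. Illustrative shapes.
    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.normal(size=512)          # attention scores for one query
    values = rng.normal(size=(512, 64))    # the matching V rows

    p = np.exp(scores - scores.max())
    reference = (p @ values) / p.sum()     # single-pass softmax(scores) @ V

    partials = []
    for s, v in zip(np.split(scores, 4), np.split(values, 4)):
        m_s = s.max()
        w = np.exp(s - m_s)
        partials.append((m_s, w.sum(), w @ v))   # (m_s, l_s, acc_o_s)

    m_g = max(m for m, _, _ in partials)
    l_g = sum(np.exp(m - m_g) * l for m, l, _ in partials)
    o_g = sum(np.exp(m - m_g) * o for m, _, o in partials) / l_g

    print(np.max(np.abs(o_g - reference)))  # ~1e-16 in float64: the merge is exact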

Speculative decoding I never got working. I have DFlash + DDTree code in the repo for the eventual fine-tuned drafter.
Right now the pretrained drafter (lucebox-hub/dflash) is trained on stock Qwen3.5, and the Qwopus distill output
distribution doesn't match — accept rate is roughly 0% and the chains degenerate. Listed in the README as broken on
purpose. MTP K=1 single-token spec works fine.

What this is and isn't for

If you have an RTX 30/40-series, A100, or H100, you should be using vLLM or SGLang. They are far more optimised for those
targets and have actual test coverage. qengine would be slower and weirder.

If you have:

  • Ex-mining cards (CMP 100-210, ex-mining V100, P104-100, etc.)
  • Older Volta workstations (V100 16/32 GB, Titan V, Quadro GV100)
  • A T4 or RTX 20-series and the standard stacks have been disappointing

— then qengine might be useful. It targets sm_70 specifically. sm_75 should work but isn't tuned. sm_60 won't work (no
DP4A). AMD and Apple Silicon definitely won't work.

Repo

https://github.com/Haru-neo/qengine — Apache 2.0.

The benchmarks in this post are reproducible with the bench_curl.sh script in the repo. The 27B 3-GPU numbers were
measured 2026-05-03 on my machine. If you have the hardware and try it, I'd love to know what you see.

Solo project. Heavy AI assist on the CUDA — I drove the architecture, profiling, and debugging across many sessions;
Claude did most of the kernel implementation. I'm a Korean high school student. Slow PR turnaround.

LLM Foundry: the boring stack that makes an LLM actually useful

2026-05-03 12:39:04

Most AI projects are built backwards.

People start with the model and only later discover they needed a memory system, semantic retrieval, tool use, tests, and a fallback plan for when one provider decides to nap for no visible reason.

That is the part I care about now.

LLM Foundry is the workshop around an LLM — not the model itself. It is the layer that makes a model useful for actual work instead of just looking smart in a demo.

What changed

The current version now has a few things worth showing instead of just claiming:

  • semantic retrieval backed by embeddings, so memory search is not just keyword matching
  • multi-provider support for OpenAI-compatible endpoints, Anthropic, Hugging Face, and failover bundles
  • compression + memory so long tasks can be shrunk into a compact working context
  • agent traces that can be exported into training data
  • benchmark + harness runs so the system is testable instead of vibes-based

That last bit matters more than people like to admit.

If a system cannot be tested, it is not “advanced”. It is just expensive.

The core idea

A useful model stack is not one prompt and a prayer.

It is usually:

  1. read the task
  2. recover relevant memory
  3. compress the clutter
  4. ask the model
  5. check the answer
  6. use tools if needed
  7. save traces
  8. benchmark the result

That is the difference between a chatbot and something you might actually trust on real work.
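As a rough sketch of that shape, here is the loop in miniature. Every function below is a trivial stand-in, not LLM Foundry's actual API; the only point is the order of the steps.

    # Rough, runnable sketch of the loop above. Every function is a trivial
    # stand-in, not LLM Foundry's actual API; only the shape of steps 1-8 matters.
    memory = {"deploy notes": "use the staging cluster first, then blue-green to prod"}
    traces = []

    def retrieve(task):                               # 2. recover relevant memory
        return [v for k, v in memory.items() if any(w in task for w in k.split())]

    def compress(ctx, budget=200):                    # 3. compress the clutter
        return " ".join(ctx)[:budget]

    def call_model(task, ctx):                        # 4. ask the model (stubbed)
        return f"plan for {task!r} using: {ctx!r}"

    def passes_check(answer):                         # 5. check the answer
        return answer.startswith("plan")

    def run_tools(answer):                            # 6. use tools if needed (no-op here)
        return answer

    def run_task(task):                               # 1. read the task
        ctx = compress(retrieve(task))
        answer = call_model(task, ctx)
        if not passes_check(answer):
            answer = call_model(task, ctx)            # simplest possible retry
        answer = run_tools(answer)
        traces.append({"task": task, "answer": answer})  # 7. save traces
        return answer                                 # 8. benchmark over many such traces

    print(run_task("write the deploy notes for the new service"))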

The honest part: orchestration helps, but it does not create capability from thin air

This part matters, because the AI world does itself a lot of damage by overpromising.

If a base model is bad at reasoning, orchestration will not magically make it frontier-grade. You can improve its behaviour, reliability, recall, and workflow quality. You cannot conjure missing intelligence out of nowhere.

That is not a flaw in the system. That is just reality.

What orchestration can do is make a decent model much more useful:

  • it sees less irrelevant text
  • it retrieves the right context more often
  • it can call tools instead of guessing
  • it can be checked and scored
  • its traces can become training data later

That is the real win.

Proof, not poetry

Here is the validation package I used while testing the repo:

The numbers

Check                    Result
Benchmark pass rate      50%
Reasoning harness        60%
Coding harness           100%
Tool-use harness         100%
Memory harness           100%

That benchmark pass rate is not a brag. It is a baseline. The point is that the system is measurable, and therefore improvable.

Screenshots

(Screenshots of the validation report: top, middle, and bottom sections.)

Why semantic retrieval matters here

I wanted the memory system to work for normal tasks, not just demos.

So the retrieval layer is now embedding-based. That means the system can look for relevant context semantically, not just by literal word match.

That matters when the task wording changes but the meaning does not.

In plain English: it is much harder for the assistant to miss the useful note just because you phrased the request differently.

That is a small change with outsized effect.
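The shape of the lookup is simple: embed the query, embed the stored notes, rank by cosine similarity, return the top hits. Here is a toy, self-contained sketch. The embed() below is a bag-of-words stand-in so the example runs without an API key, and the function names are illustrative rather than LLM Foundry's; swap embed() for a real embedding model and the same lookup becomes genuinely semantic.

    # Toy sketch of embedding-based retrieval: embed, score by cosine, take top-k.
    # embed() is a bag-of-words stand-in; a real system calls an embedding model
    # here and keeps the rest of the lookup unchanged. Names are illustrative.
    import math
    from collections import Counter

    def embed(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    notes = [
        "the staging database password is rotated every friday",
        "benchmark results for the reasoning harness live in reports/",
        "deploy with the blue-green script, never push straight to prod",
    ]
    index = [(embed(n), n) for n in notes]

    def retrieve(query, k=1):
        q = embed(query)
        return [n for _, n in sorted(index, key=lambda p: -cosine(q, p[0]))[:k]]

    print(retrieve("how do I deploy to prod safely"))  # returns the deploy note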

What I’m actually trying to build

The goal is not “a model wrapper”. The goal is a practical operating layer for LLM work:

  • a model can be local or remote
  • the backend can be OpenAI-compatible or Anthropic
  • memory can be compacted and reused
  • traces can become training data
  • benchmarks can tell you whether anything improved

That is the kind of infrastructure that makes a model usable for long jobs, research, and product workflows.

Code and proof

Find me here too