MoreRSS

site iconThe Practical DeveloperModify

A constructive and inclusive social network for software developers.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of The Practical Developer

日期、货币和单位的标准化

2026-02-22 12:29:37

在全球化的数字经济中,数据通常来自各种国际来源。然而,如果没有严格的标准化措施,这些数据很快就会变成负担。“10/12/26”这个日期可能代表10月12日,也可能代表12月10日,具体取决于数据来源国;而“100”这个价格如果不了解货币单位,就毫无意义。日期、货币和单位的标准化,就是将这些不同的数值整合为单一的统一格式,从而确保数据的准确性、可比性,并使其能够用于全球分析。

全球约会困境
日期可以说是世界上格式最不一致的数据点之一。美式日期格式日期格式以及各种简写版本,难免会造成混淆。为了解决这个问题,数据工程师使用ISO 8601标准。这种格式不仅清晰明确,而且计算机也能够对其进行排序。通过将所有传入的日期字符串转换为此国际标准,组织可以防止“月份差”错误,这些错误可能会对日程安排、财务报告和历史趋势分析造成毁灭性影响。

处理时区和UTC
日期标准化通常需要考虑时区问题。如果一个全球电商网站记录到纽约时间晚上11点的一笔销售,那么伦敦时间已经是第二天了。为了维护“单一数据源”,许多组织会将所有时间戳标准化为 列表到数据 协调世界时(UTC)。这可以防止跨区域数据碎片化,使分析人员能够在统一的时间线上查看事件。如果没有这种同步,计算全球运输速度或系统正常运行时间等实时指标将变成一个不可能完成的数学难题。

为提高财务透明度而进行的货币正常化
在处理国际销售时,仅仅列出数值是不够的。货币列必须附上其ISO 4217代码(例如,USD、EUR、JPY)。为了生成高层报告,公司通常会进行“货币标准化”,即使用交易时的汇率将所有交易转换为“基础货币”。这样可以确保管理人员在查看“总收入”仪表板时,看到的是一个统一的数值,而不是不同世界货币的不匹配总和。

** 货币符号和小数的精度**
标准化也延伸至货币的视觉表示。虽然美国使用句点(. $1,200.50)作为小数分隔符,但许多欧洲国家使用逗号(€1.200,50.)。在数据转换阶段,必须去除这些符号或将其转换为统一的数值格式(通常为浮点数或十进制),以便进行数学计算。确保“千位分隔符”不干扰原始数字是维护财务数据完整性的关键步骤。

单位换算的必要性
公制单位与英制单位之间的冲突是数据管理中的经典难题。例如,如果电子表格中包含以“千克”和“磅”为单位的重量数据,或者以“公里”和“英里”为单位的距离数据,则必须进行单位换算才能进行汇总。标准化流程包括选择一个主要单位制(通常为了国际兼容性而选择公制),并对所有输入数据应用转换因子。这在航空航天或医疗等领域尤为重要,因为简单的单位不匹配都可能导致灾难性的后果。

管理单位标签和速记
数据经常出现单位标签不一致的情况,例如“千克”、“kg”或“kilos”。 标准化需要一个“清理”阶段,在这个阶段,这些变体将被映射到一个单一的、经授权的缩写。通过强制执行严格的单位数据字典,组织可以确保自动化系统能够可靠地解析信息。这可以防止因软件无法识别非标准单位名称而导致分析流程中出现“空”值或错误。

自动化转换工作流程
在现代数据管道中,标准化很少是手动完成的。ETL(提取、转换、加载)工具配置有“查找表”和转换逻辑,可在数据从源流向目标时自动检测和转换格式。例如,脚本可以识别英镑符号(£),查找当前的英镑兑美元汇率,并将转换后的值写入数据库。这种自动化流程确保数据“干净”,可供利益相关者立即使用。

统一数据的战略价值
标准化数据是可扩展性的基础。当日期、货币和单位统一时,整合新的数据源、进入新的国际市场以及部署先进的人工智能模型都会变得容易得多。 它消除了手动“数据处理”的摩擦,使领导层能够信任仪表盘上的洞察结果。归根结底,标准化不仅仅是格式问题;它关乎为你的商业智能创建一个可靠的全球语言。

DoraHacks Hackathon Newsletter 2026 February

2026-02-22 12:15:48

‣ Last Chance!

BCH-1 Hackcelerator

  • Bitcoin Cash · BCH | Dec.10, 2025 — Feb.26, 2026 (Extended)
  • Register Now >

RE{DEFINE} HACKATHON

  • Starknet · Bitcoin · Ethereum · Cairo · ZK-rollups | Feb.1 — 28, 2026
  • Register now >

‣ New Hackathons

AWS Prompt the Planet Challenge

  • Amazon Web Services | Feb.28 — May.31, 2026
  • Register now >

BUIDL BATTLE #2 | The Bitcoin Builders Tournament

Polkadot Solidity Hackathon 2026

StableHacks: Building Institutional Stablecoin Infrastructure on Solana

‣ Ongoing Hackathons

Stellar Hacks: ZK Gaming

Build on Stellar Chile Ideatón

The Self-Driving Yield Engine

BUIDL CTC Hackathon

  • Creditcoin · CTC · EVM | Feb.1 — Mar.7, 2026
  • Submit BUIDL >

OneHack 3.0 | AI & GameFi Edition

  • OneChain · MOVE | Feb.8 — Mar.9, 2026
  • Register now >

FlagOS Open Computing Global Challenge

  • Triton · Qwen · FlagGem | Jan.9 — May.20, 2026
  • Submit BUIDL >

‣ IRL/Hybrid Hackathons

UK AI Agent Hackathon EP4 x OpenClaw

Explore More Hackathons >>

About DoraHacks

DoraHacks is the leading global hackathon community and open source developer incentive platform. DoraHacks provides toolkits for anyone to organize hackathons and fund early-stage ecosystem startups.

DoraHacks creates a global hacker movement in Web3, AI, Quantum Computing and Space Tech. So far, more than 30,000 startup teams from the DoraHacks community have received over $92M in funding, and a large number of open source communities, companies and tech ecosystems are actively using DoraHacks together with its BUIDL AI capabilities for organizing hackathons and funding open source initiatives.

WebsiteTwitterDiscordTelegramBinance LiveYoutube

JWT vs PASETO v2 vs TECTO: Choosing the Right Token Protocol in 2026

2026-02-22 12:13:10

Tokens are everywhere in modern auth flows. But not all tokens are created equal. In this post we'll compare three approaches side by side — classic JWTs, the more modern PASETO v2, and the brand-new TECTO — across security, ergonomics, and real code.

🔍 The Quick Summary

Property JWT (HS256) PASETO v2 TECTO
Payload visible? ✅ Yes (base64) ✅ Yes (signed, not encrypted) ❌ Fully encrypted
Cipher None (HMAC) Ed25519 (sign) / XChaCha20 (encrypt) XChaCha20-Poly1305
Nonce N/A 24-byte per token 24-byte CSPRNG per token
Key size Variable Variable Exactly 256-bit (enforced)
Tamper detection HMAC signature Ed25519 / Poly1305 tag Poly1305 auth tag
Error specificity Reveals reason Reveals reason Generic "Invalid token"
Algo confusion attacks ⚠️ Yes (the alg: none problem) ✅ No ✅ No
Key rotation built-in ❌ DIY ❌ DIY ✅ Native (kid in token)

1️⃣ JWT — The Industry Standard

jsonwebtoken is the most widely used token library in Node.js. It's battle-tested, has a massive ecosystem, and is dead-simple to start with.

npm install jsonwebtoken

Creating and verifying a JWT

import jwt from "jsonwebtoken";

const SECRET = "my-secret-key"; // ← This is the problem

// Sign
const token = jwt.sign(
  { userId: 42, role: "admin" },
  SECRET,
  { expiresIn: "1h", issuer: "my-app" }
);

console.log(token);
// eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VySWQiOjQyLCJyb2xlIjoiYWRtaW4iLCJpYXQiOjE3MDAwMDAwMDAsImV4cCI6MTcwMDAwMzYwMCwiaXNzIjoibXktYXBwIn0.SIGNATURE

// Verify
const payload = jwt.verify(token, SECRET) as { userId: number; role: string };
console.log(payload.userId); // 42

⚠️ The Payload is Just Base64

Here's the catch — take the middle segment of any JWT and decode it:

const [, payload] = token.split(".");
const decoded = Buffer.from(payload, "base64url").toString("utf-8");
console.log(decoded);
// {"userId":42,"role":"admin","iat":1700000000,"exp":1700003600,"iss":"my-app"}

Anyone who intercepts the token can read the payload. No key needed. This is by design — JWTs are signed, not encrypted — but many developers don't realize this at first.

⚠️ The alg: none / Algorithm Confusion Problem

JWT allows the algorithm to be specified in the header. This led to the infamous alg: none attack where attackers could forge tokens by setting the algorithm to none. Even with modern libraries that block none, HMAC vs RSA confusion attacks are still a real concern if you accept tokens from multiple issuers.

✅ When to use JWT

  • Public, non-sensitive payloads (user IDs, roles)
  • Integrating with third-party services that require JWT (OAuth, OIDC)
  • When your team already has JWT infrastructure

2️⃣ PASETO v2 — The JWT Successor

PASETO (Platform-Agnostic Security Tokens) was designed to fix JWT's footguns. It removes algorithm agility entirely — you pick a version and you get a fixed, well-chosen algorithm. No alg: none, no confusion attacks.

npm install paseto

Local tokens (symmetric, encrypted)

PASETO v2 comes in two flavors: local (symmetric, encrypted) and public (asymmetric, signed). v2.local is the encrypted one.

import { V2 } from "paseto";

const key = await V2.generateKey("local");

// Encrypt
const token = await V2.encrypt(
  { userId: 42, role: "admin" },
  key,
  { expiresIn: "1h", issuer: "my-app" }
);

console.log(token);
// v2.local.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

// Decrypt
const payload = await V2.decrypt(token, key);
console.log(payload.userId); // 42

Public tokens (asymmetric, signed but NOT encrypted)

const { privateKey, publicKey } = await V2.generateKey("public");

// Sign (payload is visible, like JWT)
const token = await V2.sign({ userId: 42 }, privateKey, { expiresIn: "1h" });

// Verify
const payload = await V2.verify(token, publicKey);

✅ What PASETO gets right

  • No algorithm confusion — the version (v2) pins the algorithm
  • v2.local encrypts the payload (XChaCha20-Poly1305)
  • Clean, modern API with async support

⚠️ What PASETO still lacks

  • No native key rotation story — kid is not part of the token format
  • You must manage key versioning yourself
  • Error messages can still reveal failure mode
  • No entropy validation on keys — you can pass a weak key and it'll silently accept it

3️⃣ TECTO — Encrypted by Default, Security-First

TECTO (Transport Encrypted Compact Token Object) takes a different philosophy: every token is fully encrypted, always. There's no "signed but readable" mode.

bun add tecto

Creating and decrypting a TECTO token

import { generateSecureKey, MemoryKeyStore, TectoCoder } from "tecto";

// 1. Generate a cryptographically secure 256-bit key
const key = generateSecureKey();

// 2. Set up the key store
const store = new MemoryKeyStore();
store.addKey("my-key-2026", key);

// 3. Create a coder
const coder = new TectoCoder(store);

// 4. Encrypt
const token = coder.encrypt(
  { userId: 42, role: "admin" },
  { expiresIn: "1h", issuer: "my-app" }
);

console.log(token);
// tecto.v1.my-key-2026.base64url_nonce.base64url_ciphertext

// 5. Decrypt
const payload = coder.decrypt<{ userId: number; role: string }>(token);
console.log(payload.userId); // 42

The token format is self-describing

tecto.v1.<kid>.<nonce>.<ciphertext>

The kid (key ID) is embedded in the token itself. This enables native key rotation — no extra metadata or headers needed.

Key rotation is a first-class citizen

// Old key still decrypts old tokens
store.addKey("key-2026-01", oldKey);

// Rotate: new tokens use the new key
store.rotate("key-2026-06", newKey);

// Old tokens still work
const oldPayload = coder.decrypt(oldToken); // ✅ uses key-2026-01

// New tokens use new key
const newToken = coder.encrypt({ userId: 99 });
// tecto.v1.key-2026-06.xxxxx.xxxxx

// Remove old key when all old tokens have expired
store.removeKey("key-2026-01");

Entropy validation on every key

import { assertEntropy } from "tecto";

// These all throw KeyError
assertEntropy(new Uint8Array(32));                 // all zeros
assertEntropy(new Uint8Array(32).fill(0xaa));      // repeating byte
assertEntropy(new Uint8Array(16));                 // wrong length

// This is safe
const key = generateSecureKey(); // always produces a valid, high-entropy key
assertEntropy(key); // ✅ passes

Generic errors prevent oracle attacks

try {
  coder.decrypt(tamperedToken);
} catch (err) {
  if (err instanceof InvalidSignatureError) {
    // err.message === "Invalid token"
    // You don't know WHY it failed — and that's the point
    // Attackers can't probe the system by watching error messages
  }
  if (err instanceof TokenExpiredError) {
    // Safe to throw distinctly because we already decrypted successfully
    // err.expiredAt is available, but NOT in err.message (timing protection)
  }
}

🔬 Side-by-Side: Payload Visibility

Let's make this concrete. Suppose you encode { userId: 42, role: "admin" } in each format:

JWT — Fully readable

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9
.eyJ1c2VySWQiOjQyLCJyb2xlIjoiYWRtaW4ifQ  ← base64 of { userId: 42, role: "admin" }
.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c

Anyone can atob() the middle segment. No key needed.

PASETO v2.local — Encrypted ✅

v2.local.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxFULL_CIPHERTEXT

Encrypted with XChaCha20-Poly1305. Opaque without the key.

TECTO — Encrypted ✅

tecto.v1.my-key-2026.NONCE_BASE64URL.CIPHERTEXT_BASE64URL

Encrypted with XChaCha20-Poly1305. The kid is visible (it's just a label), but the payload is mathematically unreadable without the key.

🔄 Key Rotation Comparison

Feature JWT PASETO v2 TECTO
Key ID in token ❌ Not standard ❌ Not standard ✅ Built-in kid
Old token decryptable after rotation DIY DIY store.rotate() handles it
Revoke old key DIY DIY store.removeKey() zeroes memory
Entropy validation ❌ No ❌ No assertEntropy() enforced

🛡️ Security Properties at a Glance

JWT HS256

  • ✅ Tamper-evident (HMAC)
  • ❌ Payload readable by anyone
  • ❌ Algorithm confusion attack surface
  • ❌ No entropy enforcement on secret

PASETO v2.local

  • ✅ Payload encrypted (XChaCha20-Poly1305)
  • ✅ No algorithm agility (version pins algo)
  • ✅ Authenticated encryption
  • ❌ No native key rotation
  • ❌ No entropy enforcement

TECTO

  • ✅ Payload always encrypted (XChaCha20-Poly1305)
  • ✅ Native key rotation with kid
  • ✅ Entropy validation on all keys
  • ✅ Generic errors (no oracle attacks)
  • ✅ Timing-safe comparisons
  • ✅ Memory zeroing on key removal
  • ✅ Payload size limits (DoS prevention)
  • ✅ Type-checked registered claims (prevents type confusion)

🤔 Which Should You Choose?

Use JWT if:

  • You need compatibility with OAuth / OIDC / existing infrastructure
  • Your payload contains no sensitive data (just user IDs, roles)
  • You're integrating with third-party services

Use PASETO v2.local if:

  • You want a well-audited, standardized encrypted token
  • You need interoperability across multiple languages/platforms
  • You don't need native key rotation

Use TECTO if:

  • You want encrypted-by-default with zero configuration mistakes possible
  • You need native key rotation without extra infrastructure
  • You're building a greenfield TypeScript/Bun project
  • You want defense-in-depth: entropy validation, generic errors, timing safety, memory zeroing

🚀 Getting Started with TECTO

bun add tecto
import { generateSecureKey, MemoryKeyStore, TectoCoder } from "tecto";

const store = new MemoryKeyStore();
store.addKey("v1", generateSecureKey());

const coder = new TectoCoder(store);

// Encrypt
const token = coder.encrypt({ userId: 42 }, { expiresIn: "1h" });

// Decrypt
const { userId } = coder.decrypt<{ userId: number }>(token);

For persistent key storage, TECTO ships with adapters for SQLite, PostgreSQL, and MariaDB — all following the same KeyStoreAdapter interface.

Final Thoughts

JWT will remain the standard for federated auth and OAuth flows for a long time — and that's fine. But for internal service-to-service tokens, session tokens, or any scenario where payload privacy matters, you should reach for an encrypted token format.

PASETO v2.local is a solid, standardized choice. TECTO goes a step further with batteries-included key rotation, entropy enforcement, and a security-first error model.

The best token protocol is the one you can't misconfigure. TECTO makes a strong case for that.

I Built a Tiny MCP That Understands Your Code and Saves 70% Tokens

2026-02-22 12:11:28

Every coding agent demo looks magical... until you point it at a real codebase. Then it either:

  • Chokes on context windows
  • Hallucinates around stale code
  • Or becomes so slow you might as well just grep

I hit this wall building AI workflows with large Rust/Python/TS repos, so I built something I actually wanted for my own stack: a super light-weight, AST-based embedded MCP that just works on your codebase. It's called cocoindex-code and it's already saving me ~70% tokens and a lot of waiting time.

If you're using Claude, Codex, Cursor, or any MCP-friendly coding agent, this post is for you.

The Core Idea: AST + Incremental Indexing

Most "code RAG" setups feel like infra projects: spin up a vector DB, write ETL, fight schema drift, tune chunking, maintain workers. Then you pray it all stays in sync.

cocoindex-code takes the opposite approach:

  • Embedded MCP: It runs locally as an MCP server, no separate DB to run or maintain.
  • AST-based indexing: It understands code structure via Tree-sitter, so you get meaningful chunks (functions, classes, blocks) instead of random 200-line windows.
  • Incremental updates: Built on top of the Rust-based CocoIndex engine, it only re-indexes changed files.
  • Real multi-language support: Python, JS/TS, Rust, Go, Java, C/C++, C#, SQL, Shell, and more.

The goal: you ask an agent a question, it pulls precisely the code it needs, without blowing up your context window.

What You Get Out of the Box

Here's what you get by just adding the MCP:

  • Semantic code search tool: search(query, limit, offset, refresh_index) as an MCP tool.
  • Instant token savings: Because only relevant code chunks go into prompts, not entire files or folders.
  • Speed: Incremental indexing + Rust engine means updates feel near-instant on typical dev repos.
  • No-key local embeddings by default: Uses sentence-transformers/all-MiniLM-L6-v2 locally via SentenceTransformers.
  • Optional power-ups: Swap in any LiteLLM-supported embedding model (OpenAI, Gemini, Mistral, Voyage for code, Ollama, etc.).

This means you can go from "plain coding agent" to "coding agent that actually knows your codebase" in about a minute.

1-Minute Setup for Claude, Codex, and OpenCode

First, install uv if you don't have it yet:

curl -LsSf https://astral.sh/uv/install.sh | sh

Claude

claude mcp add cocoindex-code \
  -- uvx --prerelease=explicit --with \
  "cocoindex>=1.0.0a16" \
  cocoindex-code@latest

Codex

codex mcp add cocoindex-code \
  -- uvx --prerelease=explicit --with \
  "cocoindex>=1.0.0a16" \
  cocoindex-code@latest

OpenCode

You can do it interactively:

opencode mcp add
# MCP server name: cocoindex-code
# type: local
# command:
# uvx --prerelease=explicit --with cocoindex>=1.0.0a16 cocoindex-code@latest

That's it. Point your agent at your repo, and you now have semantic search over your codebase as an MCP tool.

How the search MCP Tool Works

Once connected, the MCP exposes a search tool:

search(
  query: str,        # natural language or code snippet
  limit: int = 10,   # 1-100
  offset: int = 0,   # pagination
  refresh_index: bool = True  # re-index before querying
)

Each result comes back with:

  • File path
  • Language
  • Code content
  • Start/end line numbers
  • Similarity score

I've found three killer use cases:

  1. "Where is the actual implementation of X?" - when the repo has 5 similarly named functions.
  2. "Show me all the auth-related logic touching JWT refresh."
  3. "Find the code that matches this stack trace snippet."

Because the index is kept up to date incrementally, you can refactor, run tests, and immediately use the agent against the new code layout without re-running some giant offline job.

Supported Languages and Smart Defaults

cocoindex-code ships with a very practical language matrix:

C, C++, C#, CSS/SCSS, Go, HTML, Java, JavaScript/TypeScript/TSX, JSON/YAML/TOML, Kotlin, Markdown/MDX, Pascal, PHP, Python, R, Ruby, Rust, Scala, Solidity, SQL, Swift, XML

It also auto-excludes noisy directories like __pycache__, node_modules, target, dist, and vendored dependencies.

Root path is auto-discovered via .cocoindex_code/, .git/, or falling back to current working directory. In practice, you usually don't set any env vars at all - it just finds your repo root.

Embeddings: Start Free, Scale Later

Out of the box, the project uses a local SentenceTransformers model:

  • Default: sbert/sentence-transformers/all-MiniLM-L6-v2
  • No API key, no billing surprises, completely local.

If you want stronger semantic understanding for code-heavy repos, you can point COCOINDEX_CODE_EMBEDDING_MODEL to any LiteLLM-supported embedding model:

  • Ollama (local)
  • OpenAI / Azure OpenAI
  • Gemini
  • Mistral
  • Voyage (code-optimized)
  • Cohere
  • AWS Bedrock
  • Nebius

Basically: start with free local, upgrade only if/when you actually need it.

What About Huge / Enterprise Codebases?

Under the hood, cocoindex-code uses CocoIndex, a Rust-based indexing engine built for large-scale, incremental data workflows.

For big org setups, you can:

  • Share indexes across teammates instead of re-indexing on every machine.
  • Take advantage of features like branch dedupe to avoid duplicate work.
  • Run it as part of a larger data/indexing platform on top of CocoIndex.

If You Want to Try It, Here's the Ask

If this sounds useful, here's a small but meaningful way you can help:

  1. Star the repo: cocoindex-code and the underlying cocoindex.
  2. Try it on your main project (the messy one, not the toy one).
  3. Drop feedback, issues, or ideas in the GitHub repo.

I'm especially interested in:

  • Repos where existing "code RAG" tools failed you
  • Languages or frameworks you want better support for
  • Workflows where you want your coding agent to feel 10x more context-aware

If you do try it, let me know in the comments what stack you used it on - I'd love to feature a few real-world examples in a follow-up post.

How to QA Test Your AI Agent: A Practical Playbook for 2026

2026-02-22 12:03:40

How to QA Test Your AI Agent: A Practical Playbook for 2026

You shipped your AI agent. It works great in demos. Then it hits production and starts hallucinating tool arguments, ignoring instructions it followed last week, and confidently doing the wrong thing at 3 AM when no one is watching.

This is the current state of AI agent development: teams are shipping faster than they're testing. Traditional QA doesn't map to LLM-powered systems. Unit tests pass. Integration tests pass. Then your agent loops forever on an edge case your test suite never touched.

LLM QA testing is an emerging discipline, and right now almost nobody is doing it properly. This guide is a practical playbook for engineers who need to build a real testing framework for AI agents — not a theoretical overview, but the actual framework, the failure modes, and the tooling that makes it work.

Why AI Agent Testing Is Different From Regular Software Testing

If you've tried applying standard QA practices to an AI agent, you've already felt the friction. The fundamental problem is non-determinism: run the same input twice and get two different outputs. That breaks the entire premise of assertion-based testing.

But non-determinism is just the start. Here's what makes AI agent testing structurally different:

Prompt sensitivity. A change to three words in your system prompt can shift your agent's behavior across thousands of scenarios you didn't anticipate. There's no compiler warning. There's no stack trace. The behavior just drifts.

Context window dynamics. Agents that work perfectly with short conversation histories silently degrade as context grows. The model starts "forgetting" instructions, misattributing earlier tool outputs, or losing track of its own state. You won't see this in unit tests.

Tool call failures cascade. When a tool call returns unexpected data — a null, a timeout, a schema mismatch — agents often don't fail loudly. They hallucinate a plausible response and keep going. This is worse than a crash. A crash is visible. A confident wrong answer is invisible until it causes damage downstream.

Evaluation is the hard part. With traditional software, you assert output === expected. With LLMs, the output might be semantically correct in ten different phrasings, or subtly wrong in ways that require a human (or another LLM) to detect. Your test suite needs an evaluator, not just an assertion.

Regression is non-obvious. Model provider updates, prompt tweaks, temperature changes, and dependency upgrades can all silently shift behavior. You need a baseline to regress against.

The discipline of AI agent testing requires you to shift from "did it pass?" to "did it behave within acceptable bounds?"

The 5 Core Test Types for AI Agents

1. Output Consistency Tests

These verify that for a given input, your agent produces outputs that fall within an acceptable semantic range across multiple runs. You're not asserting exact output — you're asserting behavioral consistency.

Run each test case 5–10 times. Compute a semantic similarity score between runs (cosine similarity on embeddings works well). Flag cases where variance exceeds your threshold.

def test_output_consistency(agent, prompt, runs=7, threshold=0.85):
    outputs = [agent.run(prompt) for _ in range(runs)]
    embeddings = [embed(o) for o in outputs]
    scores = pairwise_cosine(embeddings)
    avg_similarity = scores.mean()
    assert avg_similarity >= threshold, (
        f"Consistency failure: avg similarity {avg_similarity:.2f} < {threshold}"
    )

This gives you a concrete, trackable metric for how "stable" your agent is on any given input class.

2. Prompt Regression Tests

Every time you change a prompt — system prompt, tool description, few-shot example — run a full suite against your golden dataset. A golden dataset is a curated set of (input, expected behavior) pairs that cover your core use cases and known failure modes.

Track behavioral metrics per prompt version. A regression test isn't a binary pass/fail — it's a delta. "We changed the system prompt and response accuracy on edge cases dropped 8%. Revert or investigate before shipping."

3. Tool Call Validation Tests

This is the most underbuilt category in most agent frameworks. You need to test:

  • Correct tool selection: Did the agent call the right tool for the job?
  • Correct argument schema: Are the arguments valid and well-formed?
  • Handling of tool errors: When the tool returns an error, does the agent fail gracefully or hallucinate a recovery?
  • Tool call ordering: For multi-step workflows, did the agent sequence calls correctly?

Mock your tools. Inject failures — 500 errors, malformed responses, empty results, timeouts. Verify the agent's downstream behavior for each failure type.

4. Context Window Stress Tests

Build test cases that simulate long conversation histories. Load the context with 2K, 4K, 8K, and 16K tokens of prior conversation, then run your standard test suite. Measure how behavioral metrics degrade as context grows.

Most teams are surprised to find their agents start ignoring key system prompt instructions around the 8K-12K context mark. You want to discover this in tests, not in production support tickets.

5. Failure Mode Tests

Explicitly enumerate how your agent should fail. Ambiguous input. Impossible requests. Contradictory instructions. Attempts to jailbreak or manipulate via user input. Missing required context.

For each failure mode, define the expected behavior — refusal, clarification request, graceful error, fallback — and assert against it. A well-tested agent should fail loudly and cleanly, not silently and confidently.

Building Your Testing Framework

Step 1: Test Harness Setup

Your test harness needs to:

  1. Inject controlled inputs and capture full outputs + tool call traces
  2. Support replay — run the same sequence deterministically (where possible) against different models/prompts
  3. Log everything: input, output, tool calls, latency, token usage, model version
class AgentTestHarness:
    def __init__(self, agent, tools=None, mock_tools=False):
        self.agent = agent
        self.tools = MockToolRegistry(tools) if mock_tools else tools
        self.trace = []

    def run(self, input, context=None):
        result = self.agent.run(
            input=input,
            context=context or [],
            tool_registry=self.tools
        )
        self.trace.append({
            "input": input,
            "output": result.output,
            "tool_calls": result.tool_calls,
            "tokens": result.token_usage,
            "latency_ms": result.latency_ms
        })
        return result

    def assert_tool_called(self, tool_name, with_args=None):
        calls = [t for t in self.trace[-1]["tool_calls"] if t["name"] == tool_name]
        assert len(calls) > 0, f"Expected tool '{tool_name}' to be called"
        if with_args:
            assert any(args_match(c["args"], with_args) for c in calls)

Step 2: Build Your Golden Dataset

Start small. 50–100 test cases covering:

  • Happy path core workflows
  • Edge cases you've hit in production
  • Known failure modes
  • Adversarial inputs

Label expected behaviors, not exact outputs. "Should call search_tool before answering" is a better assertion than "should output exactly X."

Step 3: CI Integration

Run your agent test suite on every prompt change, dependency update, and model provider version bump. Gate deployments on test suite pass rate, not just binary pass/fail — a 5% accuracy drop is still a regression.

Treat your golden dataset like source code. Version it. Review changes to it as carefully as you review changes to prompts.

Common AI Agent QA Mistakes to Avoid

1. Testing only the happy path.
Production users don't follow happy paths. Invest 40%+ of your test cases in edge cases, bad input, and failure scenarios. If you're finding bugs in production, your edge case coverage is too low.

2. Asserting exact string matches.
LLMs produce variable output. Exact string matching creates a test suite that's both brittle and slow to maintain. Use semantic assertions: does the output contain the correct information? Does it call the right tool? Did it refuse when it should have?

3. Ignoring tool call traces.
The output might look right while the reasoning path is completely wrong. An agent that got the right answer for the wrong reason will fail on the next variation. Always inspect tool call traces, not just final outputs.

4. No baseline versioning.
You can't detect regression without a baseline. Every time you ship a prompt change or upgrade a model version, snapshot your test suite results. Without version-locked baselines, you're flying blind.

5. Treating evaluation as a one-time task.
Your agent's behavior drifts over time — model providers push updates, your data changes, edge cases accumulate. Evaluation is continuous, not a checkbox before launch. Schedule weekly automated test runs even when you haven't changed anything.

Tools for AI Agent QA Testing

ClawhHub is built specifically for AI agent QA automation. It provides a test harness for LLM-powered agents, golden dataset management, semantic assertion scoring, tool call trace inspection, and CI/CD integration out of the box. If you're building agents and need a testing platform that understands the AI agent lifecycle — not just general LLM evaluation — ClawhHub is the purpose-built option.

LangSmith (by LangChain) is a strong option if your stack is LangChain-based. It provides tracing, evaluation datasets, and a feedback loop for prompt iteration. The evaluation tooling is solid. Weaker on CI-native workflows and tool call–specific assertions.

Langfuse is an open-source LLM observability and evaluation platform. Good for teams that want self-hosted control and already instrument their agents with structured traces. Strong on cost/latency tracking, lighter on assertion-based testing.

Evidently AI is primarily an ML monitoring tool that's expanded into LLM evaluation. Excellent for teams with existing ML monitoring infrastructure or those who need drift detection on production traffic. Less focused on the pre-deployment testing workflow.

The honest comparison: if you're starting from scratch building an agent test suite, ClawhHub gives you the most direct path to LLM QA testing with the least glue code. The others are excellent complementary tools — especially for production monitoring — but require more assembly for pre-deployment testing workflows.

Conclusion

AI agent QA testing is not optional. It's the difference between agents that work reliably in production and agents that erode user trust the first time a real edge case arrives.

The framework is straightforward: build a test harness, build a golden dataset, write tests across all five categories, integrate into CI, and establish version-locked baselines. None of this is exotic. It's just applying engineering rigor to a new class of non-deterministic systems.

The teams that define this discipline now will ship more reliable agents, faster. The teams that skip it will spend their time on production incidents.

If you want to get started with AI agent QA automation without building the harness from scratch, ClawhHub is purpose-built for this workflow. Get your first test suite running in under an hour.

Have a QA pattern that's worked well for your agent setup? Drop it in the comments — this is a new discipline and we're all figuring it out together.

Why Defense-Specific LLM Testing is a Game-Changer for AI Safety

2026-02-22 12:03:16

In an era where AI models are increasingly deployed in high-stakes environments, generic evaluation tools no longer cut it. That’s why Justin Norman’s new open-source framework, DoDHaluEval, is such a standout contribution—it zeroes in on a critical niche: defense-domain hallucinations in large language models (LLMs).

What caught my eye immediately is the framework’s focus on context-aware hallucination testing. Instead of using generic prompts or public-domain benchmarks, DoDHaluEval includes over 92 military-specific templates and identifies seven distinct hallucination patterns unique to defense knowledge. This approach recognizes that not all inaccuracies are equal—a misstatement about troop movements or equipment specs can have far more severe consequences than a fictional movie plot.

Justin and his team didn’t just stop at domain-specific data. They implemented an ensemble detection system combining HuggingFace HHEM, G-Eval, and SelfCheckGPT, offering multiple layers of validation. This multi-method approach is smart—it acknowledges that no single tool can catch every type of error, especially in nuanced, high-risk domains like defense.

For developers and organizations working with LLMs in regulated or sensitive sectors, this framework is a blueprint for building safer, more reliable systems. It’s a reminder that effective AI safety isn’t just about scaling model size—it’s about tailoring evaluation to real-world contexts and consequences.

If you're working on LLM trust and safety—whether in defense, healthcare, finance, or beyond—this is a must-read project. Check out the full details and code on GitHub.

Read the full post here

Follow Justin's work: Bluesky | GitHub | LinkedIn | Blog