From the blog of The Practical Developer, a constructive and inclusive social network for software developers.

Claude Code Agent Teams Can Spawn Agents. It Just Doesn't Know Which Ones to Use.

2026-03-19 22:43:35

Claude Code's Agent Teams feature is genuinely impressive. You type a complex prompt, it spawns subagents, they work in parallel, each with its own context window. The architecture is there.

But there's a gap nobody talks about.

Agent Teams doesn't know which agents to use. It doesn't know that a game project needs a game designer, a physics engineer, and a QA specialist. It doesn't know that a SaaS dashboard needs a frontend dev, a backend architect, and someone who understands Stripe webhooks. It spawns blank subagents with no identity, no rules, no specialization.

You're supposed to define them yourself. Every time. With --agents JSON. For every project.

I built the missing piece.

144 Agents, One Command

npx agentcrow init

That's it. This installs 144 specialized agent definitions into your project — 9 hand-crafted builtin agents with strict MUST/MUST NOT rules, plus 135 community agents from agency-agents covering 15 divisions: engineering, game dev, design, marketing, testing, DevOps, and more.

When you run claude after init, it reads the agent roster from .claude/CLAUDE.md and automatically decomposes your prompt into tasks, matches the right agents, and dispatches them.

No API key needed. No separate server. No configuration. Just Claude Code doing what it already does — but now it knows who to call.

What Actually Happens

I typed this:

Build a SaaS dashboard with Stripe billing, user auth, and API docs

Claude decomposed it into 5 tasks and dispatched 5 specialized agents:

🐦 AgentCrow — 5 agents dispatched:
1. @ui_designer      → dashboard layout, component hierarchy
2. @frontend_developer → React components, charts, responsive UI
3. @backend_architect  → Auth system, REST API, Stripe webhooks
4. @qa_engineer        → billing flow E2E tests, auth edge cases
5. @technical_writer   → API reference, onboarding guide

Each agent has a defined personality, communication style, and critical rules. The QA engineer, for example, has rules like "MUST cover happy path, edge cases, and error paths" and "MUST NOT skip error handling tests." These aren't suggestions — they're baked into the agent's identity.

The key difference: without AgentCrow, Claude spawns generic subagents. With AgentCrow, each subagent knows its job.

The Comparison Nobody Asked For (But Everyone Needs)

Feature                      Agent Teams alone   + AgentCrow
Spawn subagents              yes                 yes
Know which agents to use     no                  yes
144 pre-built agent roles    no                  yes
Auto-decompose prompts       no                  yes
Agent identity & rules       no                  yes
Zero config                  no                  yes

Agent Teams is the engine. AgentCrow is the brain.

Under the Hood

The architecture is deliberately simple. When you run npx agentcrow init, three things happen.

First, 9 builtin agents get copied into .agr/agents/builtin/. These are YAML files I wrote by hand. Each one defines a role, personality, MUST rules, MUST NOT rules, deliverables, and success metrics. The QA engineer agent, for example, has 5 MUST rules and 5 MUST NOT rules covering everything from test independence to mock usage.
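As an illustration, a builtin agent file might look roughly like this. The field names are my reconstruction from the description above and the rules quoted elsewhere in the post, not the package's actual schema:

```yaml
# Illustrative shape of a builtin agent definition (reconstructed, not the real file)
role: qa_engineer
personality: "Skeptical, detail-oriented; assumes code is broken until proven otherwise"
must:
  - cover happy path, edge cases, and error paths
  - keep every test independent and order-agnostic
must_not:
  - skip error handling tests
  - use sleep for async waits
deliverables:
  - test plan
  - runnable test suite
success_metrics:
  - all critical paths covered
```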

Second, 135 external agents get cloned from agency-agents into .agr/agents/external/. These cover game development (Unreal, Unity, Godot specialists), marketing, sales, spatial computing — domains I wouldn't have thought to include.

Third, a .claude/CLAUDE.md file gets generated. This is where the magic happens. CLAUDE.md is Claude Code's project-level instruction file — it reads this every session. The generated file contains the complete agent roster and dispatch rules: when to decompose, how to match agents, what format to use for subagent prompts.
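Reconstructing from the behavior described, the dispatch section of the generated file plausibly reads something like this (an illustrative sketch, not the actual generated content):

```markdown
<!-- Sketch of .claude/CLAUDE.md dispatch rules (reconstructed) -->
## Agent dispatch
- If the prompt contains a single task, answer normally; do not spawn agents.
- If the prompt contains multiple independent tasks, decompose it into tasks,
  match each task to one agent from the roster below, and spawn one subagent
  per task with that agent's rules prepended to its prompt.

## Roster
- @qa_engineer: testing, edge cases (builtin)
- @backend_architect: APIs, auth, webhooks (external)
```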

A SessionStart hook also gets installed. When you open claude, you see:

🐦 AgentCrow active
──────────────────────────────────────────────────
9 builtin agents:
  · qa_engineer
  · korean_tech_writer
  · security_auditor_deep
  ...
135 external agents (15 divisions)
──────────────────────────────────────────────────
Complex prompts → auto agent dispatch

The entire system is a CLAUDE.md file and some YAML. No runtime dependencies, no background processes, no API keys. Turn it off with agentcrow off, turn it back on with agentcrow on.

What I Learned Building This

The hardest part wasn't the agent matching or the decomposition logic. It was figuring out what shouldn't be automated.

My first attempt tried to intercept every prompt, decompose it with an LLM call, and auto-dispatch. The decomposition alone took 15 seconds. For a simple "fix this typo" prompt, that's absurd.

The current design is smarter: CLAUDE.md tells Claude to only decompose multi-task requests. "Fix this bug" runs normally. "Build a dashboard with auth, billing, tests, and docs" triggers the agent system. Claude makes that judgment call — and it's surprisingly good at it.

I also learned that agent identity matters more than I expected. A generic "write tests" subagent produces generic tests. A QA engineer agent with "MUST cover edge cases" and "MUST NOT use sleep for async waits" produces professional tests. The rules constrain the model in the right direction.

Try It

npx agentcrow init
claude

Then type something complex. Watch it decompose and dispatch.

The source is on GitHub. The npm package is agentcrow.

Agent Teams is powerful. It just needed someone to tell it who to call.

I Spent 4 Hours Debugging Google OAuth… Then I Deleted the Feature (A Lesson While Building My First SaaS)

2026-03-19 22:35:27

A few days ago I shared the architecture flow diagram of my SaaS project, LeadIt.

After that, I also shared a post explaining the core backend components I built — things like the company search API, the analysis engine, lead scoring logic, and the AI outreach generator.

Since then I’ve been slowly turning those pieces into an actual working product.

Most of the backend core is now working, and I’ve also been building the landing page that explains the product and its workflow.

But while trying to implement the next feature — email automation using Gmail OAuth — I ran into a problem that ended up teaching me one of the most important lessons about building MVPs.

The Plan: Let Users Send Emails From the Platform

Once the backend intelligence layer was working, the next step felt obvious.
Users should be able to send outreach emails directly from LeadIt.

The flow I imagined was simple.
A user connects their Gmail account → LeadIt generates an AI outreach message → the user can send the email directly from the platform.
To make that work, I needed Google OAuth with Gmail API access.

The expected flow looked something like this:

  • User logs in using Google OAuth
  • LeadIt asks for permission to access Gmail
  • Google returns an access token and refresh token
  • Those tokens get stored in my Supabase database
  • LeadIt uses the Gmail API to send outreach emails automatically

On paper, this looked straightforward.
But implementing it was a completely different experience.
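For reference, the raw OAuth exchange behind that flow can be sketched in plain Python. The endpoint URLs and parameters are Google's documented ones; the helper function names are mine. One detail worth noting in any such debugging session: Google only returns a refresh token when offline access is explicitly requested.

```python
from urllib.parse import urlencode

AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
TOKEN_ENDPOINT = "https://oauth2.googleapis.com/token"

def build_auth_url(client_id: str, redirect_uri: str) -> str:
    """Step 1: send the user to Google's consent screen.

    Google only returns a refresh_token when access_type=offline is
    requested, and only on the first consent unless prompt=consent
    forces re-issuance -- a common reason provider tokens "never arrive".
    """
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": "https://www.googleapis.com/auth/gmail.send",
        "access_type": "offline",
        "prompt": "consent",
    }
    return f"{AUTH_ENDPOINT}?{urlencode(params)}"

def build_token_request(code: str, client_id: str,
                        client_secret: str, redirect_uri: str) -> dict:
    """Step 2: payload to POST to TOKEN_ENDPOINT. The JSON response carries
    access_token and (if offline access was granted) refresh_token, which
    is what would then be stored in the database."""
    return {
        "code": code,
        "client_id": client_id,
        "client_secret": client_secret,
        "redirect_uri": redirect_uri,
        "grant_type": "authorization_code",
    }
```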

Setting Up Google Cloud

I started by setting up everything inside Google Cloud Console (GCP).

The usual steps:

  • Created a Google Cloud project
  • Enabled the Gmail API
  • Configured the OAuth consent screen
  • Generated Client ID and Client Secret
  • Added redirect URIs

Something like this:
http://localhost:3000/api/auth/callback/google

Then I added Gmail permissions using scopes like:
https://www.googleapis.com/auth/gmail.send

At this point everything looked correct.
The configuration seemed fine.
So I moved forward with integrating authentication in the app.

Authentication Worked… But the Token Didn’t

Here’s the weird part.
The Google login itself worked perfectly.
Users could sign in successfully.

That means:

  • OAuth redirect worked
  • Client ID was valid
  • Consent screen worked
  • Authentication succeeded

But the Gmail provider token never got stored in my database.
And that token is the most important part.

Without it:

  • no Gmail API calls
  • no sending emails
  • no automation

So even though authentication worked, the feature itself was useless.

The 4-Hour Debugging Spiral

This is where things started getting frustrating. I spent around 3–4 hours debugging the issue. And I checked almost everything I could think of.

Database Layer

First I checked Supabase.

Things I verified:

  • database connection
  • user table structure
  • provider token fields
  • row level security policies

Everything looked correct.

Server-Side Logic

Then I checked the backend logic.

Things like:

  • OAuth callback handler
  • parsing provider tokens
  • inserting tokens into the database
  • session handling

Still nothing.

Client-Side Flow

Then I checked the frontend flow.

Things like:

  • authentication session
  • provider response
  • token availability

Again… nothing.

Google Cloud Configuration

At this point I went back to Google Cloud.

Checked again:

  • OAuth scopes
  • redirect URLs
  • Gmail API permissions
  • client ID configuration

Everything looked fine.
Yet somehow the provider token was never reaching my database.

The Realization

After spending hours debugging this, I finally asked myself a simple question.
Do I actually need this feature right now?
And the honest answer was: no.

The real core of LeadIt is not Gmail automation.

The real core is:

  • discovering companies
  • analyzing their websites
  • detecting opportunity signals
  • generating AI outreach messages

Sending emails automatically is definitely useful.
But it is not required to validate the product idea.
And that’s when I made a decision.

The Founder Decision: Cut the Feature

Instead of wasting more time on OAuth complexity, I decided to remove the feature from the MVP.
Gmail automation will move to Version 2.
For the MVP, the product will focus only on the core features:

  • company search
  • website analysis
  • lead discovery
  • lead scoring
  • AI outreach message generation

Instead of automatic email sending, users can simply: copy the generated message and send it manually.

It’s simple.
But for an MVP, simple is actually good.

Simplifying Authentication

While thinking about this, I also made another decision.

Instead of implementing complex OAuth systems right now, I’ll use simple Next.js authentication for the MVP.

OAuth systems bring a lot of complexity:

  • client IDs
  • redirect flows
  • token storage
  • refresh tokens
  • permission scopes

All of that takes time to build and even more time to debug.
Right now my focus is simple.
Ship the product.

Landing Page Progress

Alongside the backend work, I’ve also been building the LeadIt landing page.
It’s almost complete now.
Soon I’ll share a preview of the landing page and I’d genuinely love feedback from other builders here.
Things like:

  • does the product idea make sense?
  • is the value proposition clear?
  • would you try a tool like this?

Getting early feedback is honestly one of the most useful things when building something from scratch.

What Today Actually Taught Me

At first, today felt like a wasted day.
I didn’t ship the feature I planned.
But in reality, I learned something much more important.
Startups don’t need perfect products.
They need working MVPs.
That means:

  • shipping fast
  • avoiding unnecessary complexity
  • cutting features when needed

Sometimes the smartest engineering decision isn’t fixing the problem.
Sometimes it’s removing the problem entirely.

Final Thought

Building your first SaaS is messy.
You’ll spend hours debugging things that might not even matter in the final product.
But every one of those moments teaches you something about building, prioritizing, and shipping faster.
LeadIt is still very early.
But every small step is slowly turning the idea into a real product.
And honestly, that’s the fun part of the journey.

If you're building something right now, I’m curious:
Have you ever spent hours debugging a feature… only to realize you didn’t actually need it?

What Makes a CRUD Generator Actually Usable?

2026-03-19 22:34:49

There are many code generators that can create CRUD resources from a config file.

On paper, that sounds great.

In practice, the real question is not:

“Can it generate code?”

The real question is:

“Can I actually trust the generated project enough to use it?”

That is where many generators start to fall apart.

Generating a few entities, repositories, and controllers is the easy part. The harder part is making the generated result feel reliable, understandable, and usable in a real workflow.

After working on my own Spring Boot CRUD generator, I realized that usability does not come from generation alone. It comes from all the small things around the generation that reduce friction and increase trust.

Here are a few lessons that stand out.

1. Code generation is not enough

A generator can create a lot of files and still feel incomplete.

You can generate entities, services, controllers, mappers, tests, migrations, and configuration files, and the first impression may still be:

“Okay, but will this actually work in my project?”

That hesitation is normal.

Developers do not trust generated code just because it exists. They trust it when they can see that:

  • the input format is clear
  • the output is predictable
  • the generated application can actually run
  • the project will not break because of missing setup
  • the API surface behaves the way they expect

A useful generator is not just a code emitter.

2. Validation matters more than people think

A spec-driven generator lives or dies by the quality of its validation.

If a spec file is invalid, users should know before generation.
If the spec is incomplete, they should know before generation.
If the target project setup is missing required dependencies, they should know before generation.

That is why dry-run validation is so valuable.

A dry run gives the user a chance to validate the spec and project setup without committing to a full generation step. It turns generation into a safer workflow:

  1. write or update the CRUD spec
  2. run validation
  3. fix issues early
  4. generate resources only when the setup is correct

That one layer of feedback makes a generator feel much more professional.
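The dry-run idea can be sketched generically: collect every spec and dependency problem up front and report them all, instead of failing mid-generation. The spec shape and dependency names below are illustrative, not the tool's actual format:

```python
def validate_spec(spec: dict, project_deps: set[str]) -> list[str]:
    """Dry-run check: return all problems found, generating nothing.
    (Illustrative spec shape, not the real one.)"""
    errors = []
    for entity in spec.get("entities", []):
        fields = {f["name"] for f in entity.get("fields", [])}
        if not fields:
            errors.append(f"{entity['name']}: entity has no fields")
        # sortable fields must actually exist on the entity
        for sort_field in entity.get("sort", {}).get("allowedFields", []):
            if sort_field not in fields:
                errors.append(
                    f"{entity['name']}: unknown sort field '{sort_field}'")
    # optional features require matching project dependencies
    feature_deps = {"openapi": "springdoc-openapi",
                    "caching": "spring-boot-starter-cache"}
    for feature in spec.get("features", []):
        dep = feature_deps.get(feature)
        if dep and dep not in project_deps:
            errors.append(f"feature '{feature}' requires dependency '{dep}'")
    return errors
```

The user fixes whatever the list reports, re-runs the dry run, and only generates once it comes back empty.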

3. Dependency checks are part of usability

This is one of those features that is not flashy, but it solves a real pain.

A generator may support optional features like GraphQL, caching, OpenAPI, Docker integration, or workflow generation. But if the target project is missing required dependencies, the user often discovers that only after compilation fails.

That is a bad experience.

A dependency check in the validation step gives immediate feedback:

  • your spec enables X
  • your project is missing Y
  • fix that first

This is not a “marketing feature,” but it is exactly the kind of thing that makes a tool feel mature.

4. Sorting is a small feature with big practical value

If a tool generates list endpoints, sorting quickly becomes one of those things users expect by default.

Basic CRUD is rarely enough on its own.

Even a very simple generated API becomes much more useful if it supports controlled sorting like:

  • sort by name
  • sort by price
  • sort by releaseDate

The important part is not just adding sorting, but adding it in a controlled way.

A good generator should not blindly allow sorting by anything. It should let the user explicitly define which fields are sortable. That gives a better balance between flexibility and predictability.

For example:

sort:
  allowedFields: [name, price, releaseDate]
  defaultDirection: ASC

This kind of design keeps the spec simple while still making the generated API more realistic.

Sorting may not sound like a major release feature, but it makes generated CRUD resources feel much closer to something you would actually expose and use.
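The whitelist approach boils down to a few lines at request-handling time. A minimal sketch (names are illustrative; the generated Java code would do the equivalent):

```python
def apply_sort(items: list[dict], sort_param: str,
               allowed_fields: list[str], default_direction: str = "ASC") -> list[dict]:
    """Parse a 'field,direction' query parameter, rejecting any field
    not explicitly whitelisted in the spec's allowedFields."""
    field, _, direction = sort_param.partition(",")
    if field not in allowed_fields:
        raise ValueError(f"sorting by '{field}' is not allowed")
    direction = (direction or default_direction).upper()
    return sorted(items, key=lambda x: x[field], reverse=(direction == "DESC"))
```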

5. Documentation is not just documentation

Good documentation does more than explain features.

It answers silent questions like:

  • What does this tool actually generate?
  • How much setup is required?
  • What does the generated result look like?
  • Can I trust the output?
  • Is this maintained?
  • Is there a concrete example I can follow?

A README is not just a description. It is part of the product.

A clearer README reduces confusion, shortens onboarding time, and makes it easier for users to move from curiosity to actual usage.

That is especially important for generators, because they involve a bigger trust jump than a typical library.

With a normal library, a developer adds one dependency and tries one API.

With a generator, they are letting a tool create part of their codebase.

That requires more confidence.

6. A demo video changes how people evaluate the project

This may be the most underrated part.

Developers believe much faster when they can see the flow.

A short demo video can answer in under a minute what a long README sometimes cannot fully communicate:

  • here is the CRUD spec
  • here is the generation step
  • here is the generated project
  • here is the app running
  • here are the API calls working

That does two things at once:

  • it proves that the tool works
  • it makes the project feel real

For generator-style tools, that visual proof matters a lot.

People do not just want to see source code. They want to see the end-to-end path from input to working result.

In many cases, a demo video is not “extra content.”
It is part of trust-building.

7. The best improvements are often not the loudest ones

Not every useful release looks impressive in a headline.

Some improvements are exciting and obvious.
Others are quieter, but they make the product much better:

  • better validation
  • better docs
  • better defaults
  • safer generation flow
  • fewer compile-time surprises
  • more realistic generated endpoints

These things may not always look like major product announcements, but they are often what move a project from “interesting” to “actually usable.”

That distinction matters.

Because many tools can generate code.

Far fewer tools make the full workflow feel safe, clear, and dependable.

Final thought

If you are building a generator, think beyond output quantity.

Do not focus only on how many files you can generate.

Focus on things like:

  • how easy it is to validate input
  • how safe it is to run generation
  • how realistic the generated result is
  • how quickly a new user can trust what they are seeing

That is where usability really comes from.

I recently pushed a new release of my own project, Spring CRUD Generator, with exactly that mindset in mind: improving validation, adding dependency checks, introducing entity-level sorting, cleaning up the README, and adding a demo video plus the demo CRUD spec used in that video.

Repository: https://github.com/mzivkovicdev/spring-crud-generator

It is not the flashiest kind of release, but it is the kind of release that makes a tool more usable in real life.

GitLab CI Caching Didn’t Speed Up My Pipeline — Here’s Why

2026-03-19 22:33:36

Most DevOps guides say:

“Enable caching — it will speed up your CI pipelines.”

I’ve done that many times in my career. Here I’d like to share some thoughts on the topic, illustrated with a small experiment.

I built a small GitLab CI lab and added dependency caching. Faster runs, right?

The result might surprise you:

My pipeline didn’t get faster at all.

In fact, in some cases, it was slightly slower.

Before jumping to conclusions — this is not a post against caching.

Caching worked exactly as expected.
It just didn’t translate into faster pipeline duration in this particular setup.

And that’s the part worth understanding.

This article is not about how to enable caching.
It’s about what actually happens after you enable it — and why the outcome might not match expectations.

What I Wanted to Test

I wanted to test the assumption behind that advice by answering three questions:

  • Does dependency caching really reduce pipeline duration?
  • Where does the improvement come from?
  • When is caching actually worth it?

So I built a small Python project with a multi-stage GitLab CI pipeline and measured the results.

The Setup

The pipeline has three stages:

  • prepare → install dependencies
  • quality → compile/lint
  • test → run tests

Each job installs dependencies independently — just like in many real-world pipelines.

To make the effect visible, I used slightly heavier dependencies:

  • pandas
  • scipy
  • scikit-learn
  • matplotlib

Baseline: No Cache

Each job runs:

time pip install -r requirements.txt

As expected:

  • dependencies are downloaded in every job
  • work is repeated across stages
  • every pipeline run starts from scratch

Results

Run   Duration
#1    ~38s
#2    ~34s

Adding Cache

I introduced GitLab cache:

.cache:
  cache:
    key:
      files:
        - requirements.txt
    paths:
      - .cache/pip
    policy: pull-push

And configured pip:

PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

Now dependencies should be reused between jobs and runs.
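For completeness, a job wiring that template in might look like this, using GitLab's `extends` keyword (the job layout is assumed from the stages described above, not copied from the lab):

```yaml
# Illustrative job reusing the .cache template in the prepare stage
prepare:
  stage: prepare
  extends: .cache
  variables:
    PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
  script:
    - time pip install -r requirements.txt
```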

The Result

Mode         Run   Duration
No cache     #1    ~38s
No cache     #2    ~34s
With cache   #1    ~40s
With cache   #2    ~38s

Almost no difference.

Why Didn’t It Get Faster?

1. Fast package source

If your runner uses a nearby mirror (for example, Hetzner), downloads are already fast.

2. pip is efficient

Modern Python packaging uses prebuilt wheels, making installs quick.

3. Cache has overhead

  • archive creation
  • upload/download
  • extraction

This overhead can cancel the benefit.
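The break-even condition is simple arithmetic: caching pays off only when restoring and saving the archive costs less than the install time it replaces. A toy calculation (numbers illustrative, not measured):

```python
def cache_savings(install_no_cache_s: float, restore_s: float,
                  save_s: float, install_cached_s: float) -> float:
    """Seconds saved per job by caching; negative means caching made it slower."""
    return install_no_cache_s - (restore_s + save_s + install_cached_s)

# Fast mirror: a 12s install vs 4s restore + 5s save + 6s cached install
print(cache_savings(12, 4, 5, 6))   # negative: cache overhead loses
# Slow network: a 60s install amortizes the same overhead easily
print(cache_savings(60, 4, 5, 10))  # positive: cache wins
```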

4. CI jobs spend time elsewhere

  • container startup
  • image pulling
  • repo checkout

The Real Takeaway

Dependency caching is not automatically a performance optimization.

Its impact depends on:

  • dependency size
  • network conditions
  • runner configuration
  • pipeline structure

When Caching Helps

  • large dependency trees
  • slow networks
  • distributed runners
  • frequent pipeline runs

When It Might Not Help

  • small projects
  • fast mirrors
  • short pipelines
  • high cache overhead

Not Just About Speed

Caching can still:

  • reduce outbound traffic
  • improve resilience
  • reduce dependency on external registries

What’s Next

Next step:

  • testing shared cache with S3-compatible storage

Repo

You can find the full lab here:
👉 https://github.com/ic-devops-lab/devops-labs/tree/main/GitLabCIPipelinesWithDependencyCaching

Final Thought

Not every best practice gives a measurable improvement — but understanding why is where real DevOps begins.


Echoes of Experience: My Journey in Tech

2026-03-19 22:31:19


The tech industry thrives on innovation, yet behind every product and line of code lies a journey of growth, resilience, and learning. Over the past year, I’ve experienced firsthand the challenges, triumphs, and lessons that have shaped my journey as an aspiring software engineer.

Challenges

Early in my journey, I often struggled to balance rapidly evolving technologies with academic and personal commitments. Adapting to languages and frameworks like Python, React, AI/ML tools, and SQL required persistence, patience, and continuous learning. Imposter syndrome occasionally crept in, making me question if I was keeping up with peers.

Triumphs

Through persistence, I achieved milestones that gave me both confidence and direction. Projects like a full-stack AI invoice automation system reduced manual effort by 75%, while my AI/ML voice and image automation assistant streamlined repetitive tasks by 40%. These achievements were more than technical—they taught me problem-solving, OOP principles, and collaborative project design.

Insights

The most valuable lesson I’ve learned is that technical skills alone are not enough. Growth comes from embracing challenges, asking questions, mentoring others, and building community. Sharing knowledge fosters inclusion and empowers peers from underrepresented backgrounds to thrive in tech.

Message to Others

To anyone beginning their journey: embrace challenges, seek guidance, and never underestimate your ability to learn and innovate. Diversity in tech is not just about representation—it is about sharing experiences, lifting each other, and building inclusive solutions together.

Impact

By sharing my journey, I hope to inspire underrepresented individuals and educate allies on the importance of mentorship, collaboration, and perseverance. The future of tech is stronger when every voice is heard, every story valued, and every talent empowered.

Context Quality is the New Model Quality: An Open Memory Provider Standard with Zero-Downtime Compaction for LLM Agents

2026-03-19 22:29:33


Short title: How We Eliminated 77% Entity Loss and Agent Freeze with an Open Memory Standard

Author: L. Zamazal, GLG, a.s.

Date: March 2026

Keywords: LLM memory, context compaction, agent memory, information loss, on-demand recall, UAML, structured memory, MCP, zero-downtime, open standard

Abstract

Large language models have no persistent memory. Every inference call receives the entire conversation context as input, and the model's response quality depends directly on the quality of that input. When context grows beyond the window limit, platforms perform compaction — summarizing the conversation to fit. We measured that standard compaction loses 77% of named entities (people, decisions, tools, dates) in production multi-agent deployments, directly degrading agent decision quality.

We present a three-layer recall architecture that achieves 100% entity recovery while keeping per-turn token costs minimal through on-demand retrieval. Deployed in production across two agent instances processing 16,000+ messages, our approach demonstrates that improving memory quality is more cost-effective than upgrading models. We survey 27 recent papers from Western, Chinese, Korean, and Japanese research communities and show that no existing system combines selective on-demand recall with post-quantum encryption, audit trails, and temporal validity — capabilities required for deployment in regulated industries.

Our central claim: you don't pay for a better model; you pay for better memory.

1. Introduction

The dominant assumption in the LLM industry is that performance scales with model capability — larger models, longer context windows, better reasoning. This assumption drives a hardware arms race and an API pricing war. But it obscures a fundamental architectural problem: the context window is not memory. It is cache.

Every API call to a large language model sends the complete conversation context — system instructions, tool schemas, prior messages, injected documents — as input. There is no persistent state between calls. The model sees only what is explicitly present in that single request. When the accumulated context exceeds the model's window, frameworks apply compaction: lossy summarization that discards what doesn't fit.

An AI agent is only as good as what it remembers. Yet the dominant paradigm for handling context overflow is the computational equivalent of asking a colleague to shred their notes and work from a one-paragraph summary.

The consequences are measurable. In our production deployment, we observed that after standard compaction, agents could recall only 23% of named entities from prior conversations — losing decisions, attributions, tool references, and temporal context. When asked about a payment processor integration discussed 90 minutes earlier, the agent responded "I don't know," despite having processed that exact information before compaction occurred.

This paper presents evidence from academic research and production measurements that:

  1. Context quality determines output quality — not model size, not context window length
  2. Standard compaction is fundamentally lossy — and the loss is quantifiable
  3. On-demand recall from structured memory solves this without inflating per-turn costs
  4. No existing system combines selective recall with the security and auditability required for regulated deployment
  5. The approach works today — no model modifications required, no cloud dependencies

2. The Context Quality Problem

2.1 The "Lost in the Middle" Effect

Liu et al. (2024) demonstrated that transformer-based models exhibit significant performance degradation when relevant information is positioned in the middle of the input context. In multi-document QA experiments, accuracy dropped by more than 20 percentage points when the answer-bearing document moved from the beginning or end to the center of the context [1]. This finding has been replicated across model families including GPT-4, Claude, and Llama.

Crucially, this is not a generational bug being patched away. Esmi et al. (2026) benchmarked GPT-5 and found that long-context performance still degrades compared to short-context baselines [2]. Salvatore et al. (2025) argue this is an emergent property of the attention mechanism itself [3]. The problem is architectural.

2.2 Context Rot

Chroma Research (2026) coined the term "Context Rot" to describe the non-uniform performance degradation that occurs as input length increases [4]. Testing 18 LLMs on controlled tasks where only input length varied (not task complexity), they found that standard benchmarks like Needle-in-a-Haystack give a false sense of security — they test simple lexical retrieval, not the semantic reasoning that real applications demand.

The implication: a 1M-token context window filled with marginally relevant information will produce worse results than a 50K-token window with precisely the right information.

2.3 The Compaction Trap

When context exceeds the window limit, agent platforms perform compaction — typically by sending the entire context to an LLM with instructions to summarize. Wang et al. (2026) describe this as a "fundamentally lossy" operation [5]: truncation and summarization compress or discard evidence that may be critical for future decisions.

The problem compounds over time. Each compaction cycle loses detail from the previous cycle's summary. After several cycles, the agent retains only the broadest themes — a phenomenon we term progressive context amnesia.

2.4 Context Conflicts

Research from the Harbin Institute of Technology (Tan et al., ACL 2024) demonstrated that when retrieved context conflicts with the model's parametric knowledge, LLMs frequently ignore the provided context entirely [6]. Poorly curated context injection doesn't just waste tokens — it can actively mislead the model by triggering parametric override of correct but poorly positioned external evidence. This makes provenance and temporal validity of injected context a first-order concern.

3. Measuring Information Loss: Production Evidence

3.1 Experimental Setup

We deployed UAML (Universal Agent Memory Layer) on two production agent instances processing real-world multi-agent team conversations over a period of several weeks. Over 16,000 chat messages were indexed, producing 6,400+ structured knowledge entries. We then measured entity recovery — the ability to recall named entities (people, tools, decisions, dates, configurations) after compaction events.
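The entity-recovery metric is a recall over named entities. The paper does not show its extraction pipeline; the sketch below states only the metric itself, over already-extracted entity sets:

```python
def entity_recovery(reference: set[str], recalled: set[str]) -> float:
    """Fraction of ground-truth named entities (people, tools, decisions,
    dates) that the agent can still produce after compaction."""
    if not reference:
        return 1.0
    return len(reference & recalled) / len(reference)
```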

3.2 Three-Layer Architecture

Our architecture provides three recovery layers:

Layer  Source                      Function
L1     Platform native compaction  Summarized conversation context
L2     UAML knowledge base         Structured entities extracted from conversations
L3     SQL archive                 Complete, unmodified message history

3.3 Results

Configuration                      Entity Recovery  What Survives
L1 only (standard compaction)      23%              Broad themes, recent topics
L1 + L2 (+ structured knowledge)   50%              Named entities, key decisions, facts
L1 + L2 + L3 (full architecture)   100%             Everything, zero data loss

Standard compaction loses 77% of named entities. After compaction, the agent cannot reliably recall:

  • Who made a specific decision
  • What tool or configuration was discussed
  • When a change was agreed upon
  • Why a particular approach was chosen

These are precisely the facts that determine whether an agent's next response is helpful or hallucinated.
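Entity recovery is straightforward to compute. A minimal sketch, assuming ground-truth entities are extracted beforehand; the function and names are illustrative, not part of UAML's API:

```typescript
// Entity recovery rate: fraction of ground-truth named entities that the
// agent can still produce after compaction. Case-insensitive exact match
// is a simplification; production matching would normalize further.
function entityRecovery(groundTruth: string[], recalled: string[]): number {
  const recalledSet = new Set(recalled.map((e) => e.toLowerCase()));
  const hits = groundTruth.filter((e) => recalledSet.has(e.toLowerCase())).length;
  return groundTruth.length === 0 ? 1 : hits / groundTruth.length;
}

// Example: 4 ground-truth entities, 1 survives compaction → 25% recovery
const rate = entityRecovery(
  ["Stripe", "Alice", "ML-KEM-768", "2026-03-18"],
  ["stripe"],
);
```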

3.4 Real-World Impact

During production operation, an agent was asked about a prior decision regarding a payment processor integration — a topic discussed 90 minutes earlier but lost to compaction. Without structured memory, the agent could not answer. With on-demand recall (a single query taking <100ms), the full decision context was recovered and the agent responded correctly with complete attribution.

This is not an edge case. In multi-session, multi-day agent deployments, every compaction cycle creates potential blind spots that accumulate over time.

3.5 Structural Waste

Mason (2026) independently confirmed the scale of the problem by analyzing 857 production LLM sessions comprising 4.45 million effective input tokens [7]. The finding: 21.8% of all context was structural waste — tool definitions that were never invoked, system prompt repetitions, stale tool results from completed subtasks. This waste is not merely an efficiency problem; it actively competes with relevant information for the model's limited attention capacity.

4. On-Demand Recall vs. Context Stuffing

A naive solution would be to inject all available memory into every context. This fails for three reasons:

  1. Token cost scales linearly — every additional token in context is paid for on every turn, whether or not it's needed
  2. Context Rot degrades quality — more tokens ≠ better results [4]
  3. Distractors harm accuracy — Iratni et al. (2025) showed that irrelevant retrieved passages actively degrade output quality [8]

On-demand recall avoids all three problems:

Approach                       Tokens/turn                       Cost Pattern                    Accuracy
No compaction (full history)   200K+                             Very high, every turn           Degrades with length
Standard compaction            20–50K                            Low                             23% entity recovery
Auto-inject all memory         50–80K                            High, every turn                High but with noise
On-demand recall               20–50K base + 2–5K when needed    Low (recall only when needed)   100% recoverable

The mechanism works in two phases within a single turn:

  1. Phase 1: The agent receives a query, recognizes it needs historical context, and calls a memory retrieval tool
  2. Phase 2: Relevant entries are returned into context, and the agent formulates a response with complete information

This mirrors how professionals work: you don't carry every document to every meeting, but you know where to find them when needed.
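The two-phase turn can be sketched as follows. The substring matcher stands in for UAML's actual recall tool, and all names here are illustrative:

```typescript
// Two-phase on-demand recall within a single turn (illustrative sketch).
interface MemoryEntry {
  content: string;
  importance: "HIGH" | "MEDIUM" | "LOW";
}

// Stand-in for the memory retrieval tool: naive term matching.
function recall(store: MemoryEntry[], query: string, limit = 5): MemoryEntry[] {
  const terms = query.toLowerCase().split(/\s+/);
  return store
    .filter((e) => terms.some((t) => e.content.toLowerCase().includes(t)))
    .slice(0, limit);
}

function answerTurn(store: MemoryEntry[], query: string): string {
  // Phase 1: the agent recognizes it needs history and calls the tool.
  const entries = recall(store, query);
  if (entries.length === 0) return "No stored context found.";
  // Phase 2: retrieved entries enter context and the response is formed.
  return `Based on memory: ${entries.map((e) => e.content).join("; ")}`;
}
```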

5. The Research Landscape

5.1 Context Compression

The compression approach tackles the problem at the infrastructure level — making context smaller without (ideally) losing information.

Activation Beacon (Zhang, Liu, Xiao et al., BAAI/FlagOpen, 2024) compresses KV cache activations rather than text, achieving 8× KV cache reduction and 2× inference speedup on 128K+ contexts [9]. Recurrent Context Compression (Huang, Zhu, Wang et al., 2024) achieves 32× compression with BLEU4 near 0.95 [10]. Semantic Compression (Fei, Niu, Zhou et al., Huawei, ACL 2024) applies information-theoretic source coding to extend context 6–8× without fine-tuning [11].

These approaches address hardware efficiency but do not solve the semantic quality problem: compressed content still carries all information indiscriminately.

5.2 Memory Agent Architectures

A more promising direction treats memory as a managed system rather than a compression target.

MemAgent (Yu, Chen, Feng et al., 2025) uses RL-trained agents that read text in segments and update memory via an overwrite strategy, extrapolating from 8K training context to 3.5M token QA with <5% loss [12]. With 81 citations, it represents the current state of the art in Chinese memory agent research.

SimpleMem (Liu, Su, Xia et al., 2026) implements a three-stage pipeline — semantic compression, online synthesis, intent-aware retrieval — achieving +26.4% F1 on LoCoMo while reducing token consumption by 30× [13]. Its architecture is conceptually closest to our on-demand recall approach.

MemOS (MemTensor, Shanghai Jiao Tong University, Renmin University, China Telecom, 2025) proposes a Memory Operating System with OS-inspired lifecycle management, achieving state-of-the-art across benchmarks [14]. It operates primarily in latent space — complementary to text-level structured approaches.

M+ (ICML 2025) extends MemoryLLM with scalable long-term memory, confirming that "retaining information from the distant past remains a challenge" [15] — exactly what SQL-backed archival solves without model modification.

5.3 Agent-Centric Context Management

ACON (Kang et al., Microsoft Research, 2025) optimizes context compression for long-horizon agents, reducing memory usage by 26–54% [16]. Its key finding validates our approach: "generic summarization easily loses critical details" — task-aware, selective retrieval is essential.

Pichay (Mason, 2026) takes the most radical approach, treating the context window as L1 cache and implementing demand paging with eviction and fault detection [7]. In production, it reduces context consumption by up to 93%. Mason's framing captures the field's core insight: "the problems — context limits, attention degradation, cost scaling, lost state across sessions — are virtual memory problems wearing different clothes."

Focus (Verma, 2026) implements autonomous context compression where agents decide when to consolidate learnings and prune history, achieving 22.7% token reduction without accuracy loss [17].

MemArt (2025) demonstrates that structured retrieval improves accuracy by 11.8–39.4% over plaintext memory methods with 91–135× reduction in prefill tokens [18] — direct validation of the principle that targeted recall outperforms brute-force injection.

5.4 Korean and Japanese Contributions

InfiniGen (Lee et al., Seoul National University, OSDI 2024) addresses KV cache management for long-text generation, achieving up to 3× speedup over existing methods [23]. Funded by Samsung Advanced Institute of Technology and cited 253 times, it represents the hardware-level approach to context scaling that complements software-level memory management.

THEANINE (Ong, Kim, Gwak et al., NAACL 2025) introduces timeline-based memory management for lifelong dialogue agents [24]. Its key insight — don't delete old memories, but connect them temporally and causally — directly validates UAML's temporal validity mechanism. Memories form evolving timelines rather than static snapshots.

LRAgent (Jeon et al., Korea, 2026) tackles KV cache sharing for multi-LoRA agents [25], addressing the exact overhead problem that emerges when multiple agents share a backbone but maintain separate caches.

Most striking is "Store then On-Demand Extract" (Yamanaka et al., Japan, 2026), which argues against the dominant "extract then store" paradigm and advocates storing raw data with on-demand extraction at query time [26]. This is philosophically identical to UAML's three-layer approach: preserve everything (L3), extract structured knowledge (L2), and retrieve on demand. Yamanaka's framing — "uplifting the world with memory" — captures the same conviction that memory infrastructure is foundational, not auxiliary.

AIM-RM (Yoshizato, Shimizu et al., Japan, AAMAS 2026) demonstrates practical deployment of memory retrieval in industrial supply chain agents [27], confirming that memory-augmented agents are moving beyond chatbot applications into production enterprise systems.

5.5 Foundational References

MemGPT/Letta (Packer et al., 2023) pioneered virtual context management inspired by OS memory hierarchies [19]. Mem0 (2025) offers scalable memory for multi-session dialogues [20]. Memex(RL) (Wang et al., 2026) is the closest academic parallel — indexed memory with compact summaries plus full-fidelity external storage [5].

5.6 Surveys

Two comprehensive surveys map the field: Cognitive Memory in LLMs (Shan, Luo, Zhu et al., 2025) provides the most complete taxonomy of memory mechanisms with 34 citations [21], and A Comprehensive Survey on Long Context Language Modeling (Liu, Zhu, Bai et al., 2025) covers the full spectrum with contributions from 35+ researchers and 88 citations [22].

5.7 The Gap

Across all surveyed approaches, several critical capabilities are consistently absent:

Capability                      UAML        MemAgent / SimpleMem / MemOS / Pichay / THEANINE / Mem0 / MemGPT
On-demand selective recall      ✓           Partial in a few systems
Cross-session memory            ✓           Absent or partial
End-to-end encryption           ✓ (PQC)     Absent
Audit trail / provenance        ✓           Absent
Temporal validity               ✓           Absent or partial
Multi-agent isolation           ✓           Absent
Local-first / self-hosted       ✓           Partial at best
Certifiable for regulated use   ✓           Absent
Zero-downtime compaction        ✓           Partial at best
MCP integration (drop-in)       ✓           Absent
Production-deployed             ✓           Mostly research prototypes

No existing system combines selective on-demand recall with the security, auditability, and temporal reasoning required for deployment in regulated environments — healthcare, legal, financial services, government.

6. Security and Compliance: The Enterprise Gap

Academic memory systems universally neglect security. For enterprises in regulated industries, this is a deployment blocker. UAML addresses this gap with:

  • Post-quantum cryptography (ML-KEM-768, NIST FIPS 203) for all stored memories — future-proof against quantum computing threats
  • Complete audit trail: every write, read, and recall is logged with actor, timestamp, and context
  • Provenance chain: each knowledge entry tracks its source, extraction confidence, and modification history
  • Temporal validity: memories carry valid-from and valid-until metadata, preventing stale information from entering active context
  • Client isolation: strict per-client, per-project memory boundaries prevent cross-contamination
  • Local-first deployment: all data stays on the user's infrastructure — no cloud dependency, no data exfiltration risk

These are not features for a roadmap. They are implemented, tested, and deployed in production.
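The temporal validity check reduces to a window filter at recall time. A sketch with assumed field names (not UAML's actual schema):

```typescript
// Temporal validity filter: memories outside their validity window never
// enter active context. Field names are assumptions for illustration.
interface TimedEntry {
  content: string;
  validFrom: string;        // ISO 8601 timestamp
  validUntil?: string;      // open-ended if absent
}

function activeAt(entries: TimedEntry[], now: Date): TimedEntry[] {
  return entries.filter((e) => {
    const from = new Date(e.validFrom);
    const until = e.validUntil ? new Date(e.validUntil) : undefined;
    return from <= now && (until === undefined || now <= until);
  });
}
```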

7. Toward an Open Standard: External Memory Provider API

We believe memory infrastructure should be standardized, not proprietary. We have proposed an External Memory Provider API (RFC #49233) [28] that addresses three interconnected problems: agent downtime during compaction, information loss after compaction, and the lack of a standard integration path for external memory systems.

7.1 The Problem: Compaction Blackouts

When an agent platform's context window fills up, the platform performs synchronous in-band compaction: the agent stops responding, the entire context is sent to an LLM for summarization, and the summary replaces the original context. In our measurements, this creates a 30–60 second blackout — the agent is completely unresponsive. For production use cases (customer support, financial services, healthcare), this is a deployment blocker.

7.2 Proposed Architecture: Hot-Swap Context

RFC #49233 proposes a hot-swap architecture with continuous background synchronization:

┌──────────────────────────────────────────┐
│           Agent Platform                 │
│  Context Slot A (active) ←→ Slot B       │
│              ↕               ↕           │
│  ┌────────────────────────────────┐      │
│  │   Memory Provider Interface    │      │
│  └──────────┬─────────────────────┘      │
└─────────────┼────────────────────────────┘
              ▼
┌─────────────────────────────┐
│  External Memory Provider   │
│  • Continuous async sync    │
│  • Background compression   │
│  • Pre-built context ready  │
│  • Full audit trail         │
└─────────────────────────────┘

The mechanism has four stages:

  1. Continuous sync — Every message is written to the memory provider asynchronously (fire-and-forget, local socket, ~1ms). The provider never blocks the agent.
  2. Background compression — The provider maintains a compressed context summary, updated asynchronously using a lightweight local model. The compressed context is always ready before the agent needs it.
  3. Atomic swap — When capacity threshold is reached, the platform reads the pre-built context from the provider (~50ms) and swaps it in during the inter-message gap. The agent never pauses.
  4. On-demand recall — When the agent needs details that were compressed away, it queries the provider via a tool call (~10–50ms), receiving targeted knowledge entries.

This is analogous to double-buffering in graphics: while the agent uses buffer A, the provider prepares buffer B in the background. When it's time to compact, the platform atomically swaps A for B.
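The double-buffering analogy can be sketched directly. This is a simplified model; the real swap also handles queued messages and failure recovery:

```typescript
// Double-buffered context slots: the agent reads the active slot while
// the provider prepares the standby slot in the background (sketch only).
class ContextSlots {
  private active = "initial context";
  private standby: string | null = null;

  current(): string {
    return this.active;
  }

  // Background compression keeps the standby slot ready before it's needed.
  prepare(compressed: string): void {
    this.standby = compressed;
  }

  // Atomic swap during the inter-message gap; the agent never pauses.
  swap(): boolean {
    if (this.standby === null) return false; // nothing prepared: keep active
    this.active = this.standby;
    this.standby = null;
    return true;
  }
}
```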

7.3 The API

The proposed interface is deliberately minimal — three core operations plus lifecycle hooks:

interface MemoryProvider {
  // Fire-and-forget after each message (async, not in critical path)
  onMessage(sessionId: string, message: Message): Promise<void>;

  // Returns pre-built compressed context
  getCompressedContext(sessionId: string, maxChars: number): Promise<CompressedContext>;

  // On-demand recall tool
  recall(sessionId: string, query: string, limit?: number): Promise<MemoryEntry[]>;

  ping(): Promise<{ ok: boolean; latencyMs: number }>;
}

interface SessionHooks {
  'pre-compaction': (session: Session) => Promise<void>;
  'post-compaction': (session: Session, newContext: CompressedContext) => Promise<void>;
  'session-start': (session: Session) => Promise<CompressedContext | null>;  // Cold-start recovery
  'session-end': (session: Session) => Promise<void>;
}
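
A minimal in-memory provider satisfying these operations might look like the following. This is an illustrative stand-in, not the UAML implementation; a real provider persists to disk, encrypts at rest, and runs compression in the background:

```typescript
// Minimal in-memory MemoryProvider sketch. Supporting types are assumed
// here since the RFC defines only the interface signatures.
interface Message { role: string; text: string; }
interface MemoryEntry { content: string; sessionId: string; }
interface CompressedContext { summary: string; chars: number; }

class InMemoryProvider {
  private log = new Map<string, Message[]>();

  // Fire-and-forget write; never blocks the agent's critical path.
  async onMessage(sessionId: string, message: Message): Promise<void> {
    const msgs = this.log.get(sessionId) ?? [];
    msgs.push(message);
    this.log.set(sessionId, msgs);
  }

  // Stand-in for background compression: keep the transcript tail.
  async getCompressedContext(sessionId: string, maxChars: number): Promise<CompressedContext> {
    const text = (this.log.get(sessionId) ?? []).map((m) => m.text).join(" | ");
    const summary = text.slice(-maxChars);
    return { summary, chars: summary.length };
  }

  // On-demand recall: naive substring match over the session log.
  async recall(sessionId: string, query: string, limit = 5): Promise<MemoryEntry[]> {
    const q = query.toLowerCase();
    return (this.log.get(sessionId) ?? [])
      .filter((m) => m.text.toLowerCase().includes(q))
      .slice(0, limit)
      .map((m) => ({ content: m.text, sessionId }));
  }

  async ping(): Promise<{ ok: boolean; latencyMs: number }> {
    return { ok: true, latencyMs: 0 };
  }
}
```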

Configuration is opt-in with full backward compatibility:

{
  "memory": {
    "provider": "uaml",
    "endpoint": "http://localhost:8770",
    "compaction": {
      "strategy": "hot-swap",
      "threshold": 0.85,
      "targetSize": 0.40
    }
  }
}

7.4 Performance Impact

Metric                 Current (builtin)      With Memory Provider
Compaction duration    30–60s                 <100ms
Agent downtime         30–60s                 0 (between messages)
Information loss       Significant (77%)      None (full DB)
Audit trail            None                   Complete
Cost per compaction    ~$0.10–0.50            ~$0.001 (async local)

7.5 Seamless Integration via MCP

A critical design decision is the choice of integration protocol for the recall tool. UAML implements the Model Context Protocol (MCP) connector, which means any existing LLM agent that supports MCP tool calls can connect to UAML's full memory capabilities — recall, indexing, knowledge extraction — without any code changes to the agent itself. The agent simply gains a new tool in its toolbox. No forking, no framework lock-in, no migration.

In our production deployment, two agents from different platforms were connected to UAML via MCP within minutes, immediately gaining access to structured memory recall while retaining all their existing functionality. The MCP approach turns memory from a platform feature into a universal service layer that any agent can consume.

7.6 Dual-Track Architecture: Never a Single Point of Failure

A critical design principle: the platform's builtin compaction pipeline always runs in parallel — it is never disabled, even when a memory provider is active.

Builtin Compaction  ─────────────────────► ALWAYS running (shadow/backup)
UAML Memory Provider ────────────────────► Enriches when available (overlay)

This guarantees 100% functionality at all times:

  • If the memory provider fails → agent continues with builtin context (no interruption)
  • If the memory provider succeeds → agent gets enriched context (better recall)
  • Switchover is automatic — no operator intervention required
  • The memory provider enhances the pipeline; it does not replace it
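The switchover logic is a plain try-with-fallback. A sketch (names are illustrative; the real platform decides per compaction event):

```typescript
// Dual-track failover: builtin compaction always yields a context; the
// provider's enriched context is used only when it arrives successfully.
type ContextSource = "builtin" | "provider";

function chooseContext(
  builtin: string,
  provider: () => string, // may throw if the provider is down
): { context: string; source: ContextSource } {
  try {
    return { context: provider(), source: "provider" };
  } catch {
    // Automatic switchover: the agent continues on builtin context.
    return { context: builtin, source: "builtin" };
  }
}
```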

7.7 Input Quality Classification

Not all information is equally important. The memory provider classifies every entry:

Level    Criteria                                     % of data   Examples
HIGH     Decisions, rules, architecture choices       0.5%        "Chose ML-KEM-768 for encryption"
MEDIUM   Entity mentions, config changes, results     7%          "VPS IP: 5.189.139.221"
LOW      Debug output, heartbeats, transient status   93%         Tool output, NO_REPLY messages

This filtering ensures the recall API returns signal, not noise. The getCompressedContext endpoint prioritizes HIGH and MEDIUM entries, keeping injected context compact and relevant.
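A keyword-based sketch of the triage; the real classifier's criteria are richer, and these particular patterns are assumptions:

```typescript
// Importance triage sketch: route entries to HIGH/MEDIUM/LOW by pattern.
// The keyword lists are illustrative, not UAML's actual criteria.
type Importance = "HIGH" | "MEDIUM" | "LOW";

function classify(text: string): Importance {
  const t = text.toLowerCase();
  // Decisions, rules, architecture choices
  if (/\b(decided|decision|rule|architecture|chose)\b/.test(t)) return "HIGH";
  // Entity mentions, config changes, results
  if (/\b(config|ip|result|changed|version)\b/.test(t)) return "MEDIUM";
  // Debug output, heartbeats, transient status
  return "LOW";
}
```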

7.8 Provenance Chain

Every knowledge entry maintains a verifiable provenance chain from recalled fact to original message:

UAML Entry #4521
├── content: "Decided on hot-swap compaction strategy"
├── importance: HIGH (score: 7)
├── source: session:a7e0260c:msg_hash_abc123
├── chat_history_id: 28451  ← links to SQL archive
└── created_at: 2026-03-18T14:23:00Z
        │
        ▼
SQL chat_messages #28451
├── text: [full original message, verbatim]
├── source_file: a7e0260c-...jsonl
└── source_line: 14832

This enables audit (trace any fact to its source), verification (compare summary against verbatim record), and compliance (demonstrate data provenance for regulated environments).
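The two-hop lookup from recalled fact to verbatim record can be sketched as follows; field names mirror the example above, but the code itself is illustrative:

```typescript
// Provenance chain sketch: a knowledge entry links to the verbatim SQL
// row via chat_history_id, enabling audit and verification.
interface KnowledgeEntry { id: number; content: string; chatHistoryId: number; }
interface ChatMessage { id: number; text: string; sourceFile: string; sourceLine: number; }

function traceToSource(
  entry: KnowledgeEntry,
  archive: Map<number, ChatMessage>, // stand-in for the SQL archive (L3)
): ChatMessage {
  const msg = archive.get(entry.chatHistoryId);
  if (!msg) throw new Error(`broken provenance chain for entry ${entry.id}`);
  return msg; // verbatim record, comparable against the summary
}
```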

7.9 Cross-Session Memory

In production, agents operate across multiple sessions (Discord channels, messaging platforms, scheduled tasks). The proposed API extension enables:

  • Global recall — find knowledge from any session, not just the current one
  • Session grouping — the broker identifies related sessions automatically
  • Our implementation already indexes 807+ session fragments with recall chain linking across session boundaries

7.10 Error Handling and Backward Compatibility

The design prioritizes graceful degradation:

  • Provider failure → automatic fallback to builtin compaction (never block the agent)
  • Cold start → session-start hook enables context reconstruction from historical data
  • Message during swap → swap occurs in inter-message gap; incoming messages queued during ~100ms swap
  • Default behavior unchanged → builtin compaction remains the default; memory provider is opt-in

7.11 Phased Roadmap

Phase   Capability                        Estimated Timeline
1       Pre/post-compaction hooks         1–2 weeks
2       Memory Provider Interface         2–3 weeks
3       Hot-swap compaction strategy      3–4 weeks
4       Background sync + config + docs   2–3 weeks

Phase 1 alone would already enable external memory integration and demonstrate value. The full RFC proposal is publicly available at github.com/openclaw/openclaw/issues/49233 and has received community attention and independent analysis, confirming demand for standardized memory infrastructure.

8. Practical Implications

For Agent Developers

  • Standard compaction is a silent quality killer — measure your entity recovery rate before and after compaction
  • On-demand recall is more cost-effective than larger context windows or more expensive models
  • Memory quality compounds: each correct recall enables better future decisions

For Platform Builders

  • Expose compaction hooks to external providers — one-size-fits-all compaction is insufficient
  • Compliance requirements will drive enterprise adoption of structured, auditable memory
  • The open standard approach (RFC #49233) reduces integration risk

For Researchers

  • Context quality vs. context quantity deserves more attention as an evaluation axis
  • Entity recovery rate is a practical, reproducible metric for comparing memory approaches
  • The gap between academic prototypes and production requirements (encryption, audit, compliance) represents an underexplored research opportunity

For the Industry

Research groups across East Asia are producing world-class work on context compression and memory architectures. Chinese institutions (BAAI, Shanghai Jiao Tong, Harbin Institute, Huawei, China Telecom, MemTensor), Korean groups (Seoul National University, Samsung, KAIST), and Japanese researchers are collectively advancing the field at a pace that exceeds Western output in volume and increasingly matches it in impact. Any serious memory infrastructure product must engage with this global research base.

9. Conclusion

The evidence converges from multiple independent sources: context quality determines output quality. A smaller context with the right information produces better results than a larger context with noise. Standard compaction loses 77% of named entities — a measurable degradation that directly affects agent decision quality.

On-demand recall from structured memory resolves this without the token cost of context stuffing. Our three-layer architecture (compaction + knowledge base + archive) achieves 100% entity recovery while maintaining minimal per-turn costs. The approach requires no model modifications, no cloud dependencies, and works as an overlay on existing agent platforms.

We are not arguing for replacing compaction — it serves a useful purpose in maintaining manageable context sizes. We are arguing that compaction alone is insufficient, and that a structured external memory layer transforms it from a lossy compression into a lossless one.

The organizations that understand this will build agents that actually work. The rest will keep buying bigger models and wondering why they still forget.

Memory quality is the new model quality.

References

[1] Liu, N.F. et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the ACL, 12, 157–173. arXiv:2307.03172

[2] Esmi, N. et al. (2026). "GPT-5 vs Other LLMs in Long Short-Context Performance." arXiv, February 2026.

[3] Salvatore, N. et al. (2025). "Lost in the Middle: An Emergent Property from Information Retrieval Demands in LLMs." arXiv, October 2025.

[4] Chroma Research (2026). "Context Rot: How Increasing Input Tokens Impacts LLM Performance." research.trychroma.com/context-rot

[5] Wang, Z. et al. (2026). "Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory." arXiv:2603.04257

[6] Tan, Y. et al. (2024). "When Retrieved Context Conflicts with Parametric Knowledge." ACL 2024, Harbin Institute of Technology.

[7] Mason, T. (2026). "The Missing Memory Hierarchy: Demand Paging for LLM Context Windows." arXiv:2603.09023

[8] Iratni, M. et al. (2025). "Dynamic Context Selection for Retrieval-Augmented Generation: Mitigating Distractors and Positional Bias." arXiv, December 2025.

[9] Zhang, P. et al. (2024). "Long Context Compression with Activation Beacon." BAAI/FlagOpen. arXiv:2401.03462

[10] Huang, C. et al. (2024). "Recurrent Context Compression: Efficiently Expanding the Context Window of LLM." arXiv:2406.06110

[11] Fei, W. et al. (2024). "Extending Context Window of Large Language Models via Semantic Compression." Huawei. Findings of ACL 2024, 5169–5181.

[12] Yu, H. et al. (2025). "MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent." arXiv:2507.02259

[13] Liu, J. et al. (2026). "SimpleMem: Efficient Lifelong Memory for LLM Agents." arXiv:2601.02553

[14] MemTensor et al. (2025). "MemOS: A Memory OS for AI System." Shanghai Jiao Tong University, Renmin University, China Telecom. arXiv:2507.03724

[15] M+ (2025). "Extending MemoryLLM with Scalable Long-Term Memory." ICML 2025. arXiv:2502.00592

[16] Kang, M. et al. (2025). "ACON: Optimizing Context Compression for Long-Horizon LLM Agents." arXiv:2510.00615

[17] Verma, N. (2026). "Active Context Compression: Autonomous Memory Management in LLM Agents." arXiv:2601.07190

[18] MemArt (2025). "KVCache-Centric Memory for LLM Agents." OpenReview.

[19] Packer, C. et al. (2023). "MemGPT: Towards LLMs as Operating Systems." arXiv:2310.08560

[20] Mem0 (2025). "Building Production-Ready AI Agents with Scalable Long-Term Memory." arXiv:2504.19413

[21] Shan, L. et al. (2025). "Cognitive Memory in Large Language Models." arXiv:2504.02441

[22] Liu, J. et al. (2025). "A Comprehensive Survey on Long Context Language Modeling." arXiv:2503.17407

[23] Lee, S. et al. (2024). "InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management." OSDI 2024, Seoul National University.

[24] Ong, D., Kim, H., Gwak, S. et al. (2025). "THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation." NAACL 2025.

[25] Jeon, H. et al. (2026). "LRAgent: Multi-LoRA Agents with KV Cache Sharing." arXiv:2602.01053

[26] Yamanaka, Y. et al. (2026). "Store then On-Demand Extract: A Memory Architecture for LLM Agents." arXiv:2602.16192

[27] Yoshizato, T., Shimizu, H. et al. (2026). "AIM-RM: Agent-based Inventory Management with Retrieval Memory." AAMAS 2026. arXiv:2602.05524

[28] Zamazal, L. / GLG, a.s. (2026). "RFC: External Memory Provider API for OpenClaw." GitHub Issue #49233. github.com/openclaw/openclaw/issues/49233

GLG, a.s. — UAML (Universal Agent Memory Layer) is available at uaml-memory.com. Technical documentation and API reference at smart-memory.ai.