2026-03-19 22:43:35
Claude Code's Agent Teams feature is genuinely impressive. You type a complex prompt, it spawns subagents, they work in parallel, each with its own context window. The architecture is there.
But there's a gap nobody talks about.
Agent Teams doesn't know which agents to use. It doesn't know that a game project needs a game designer, a physics engineer, and a QA specialist. It doesn't know that a SaaS dashboard needs a frontend dev, a backend architect, and someone who understands Stripe webhooks. It spawns blank subagents with no identity, no rules, no specialization.
You're supposed to define them yourself. Every time. With --agents JSON. For every project.
I built the missing piece.
npx agentcrow init
That's it. This installs 144 specialized agent definitions into your project — 9 hand-crafted builtin agents with strict MUST/MUST NOT rules, plus 135 community agents from agency-agents covering 15 divisions: engineering, game dev, design, marketing, testing, DevOps, and more.
When you run claude after init, it reads the agent roster from .claude/CLAUDE.md and automatically decomposes your prompt into tasks, matches the right agents, and dispatches them.
No API key needed. No separate server. No configuration. Just Claude Code doing what it already does — but now it knows who to call.
I typed this:
Build a SaaS dashboard with Stripe billing, user auth, and API docs
Claude decomposed it into 5 tasks and dispatched 5 specialized agents:
🐦 AgentCrow — 5 agents dispatched:
1. @ui_designer → dashboard layout, component hierarchy
2. @frontend_developer → React components, charts, responsive UI
3. @backend_architect → Auth system, REST API, Stripe webhooks
4. @qa_engineer → billing flow E2E tests, auth edge cases
5. @technical_writer → API reference, onboarding guide
Each agent has a defined personality, communication style, and critical rules. The QA engineer, for example, has rules like "MUST cover happy path, edge cases, and error paths" and "MUST NOT skip error handling tests." These aren't suggestions — they're baked into the agent's identity.
The key difference: without AgentCrow, Claude spawns generic subagents. With AgentCrow, each subagent knows its job.
| Capability | Agent Teams alone | + AgentCrow |
|---|---|---|
| Spawn subagents | ✅ | ✅ |
| Know which agents to use | ❌ | ✅ |
| 144 pre-built agent roles | ❌ | ✅ |
| Auto-decompose prompts | ❌ | ✅ |
| Agent identity & rules | ❌ | ✅ |
| Zero config | ❌ | ✅ |
Agent Teams is the engine. AgentCrow is the brain.
The architecture is deliberately simple. When you run npx agentcrow init, three things happen.
First, 9 builtin agents get copied into .agr/agents/builtin/. These are YAML files I wrote by hand. Each one defines a role, personality, MUST rules, MUST NOT rules, deliverables, and success metrics. The QA engineer agent, for example, has 5 MUST rules and 5 MUST NOT rules covering everything from test independence to mock usage.
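To give a feel for the format, here is a rough sketch of the QA engineer definition (field names are approximated and only a few of the real rules are shown):

```yaml
# Rough sketch of .agr/agents/builtin/qa_engineer.yaml (illustrative, abridged)
name: qa_engineer
role: QA engineer responsible for test strategy and coverage
personality: detail-oriented, systematic, allergic to flaky tests
must:
  - Cover happy path, edge cases, and error paths
  - Keep tests independent of each other
must_not:
  - Skip error handling tests
  - Use sleep for async waits
deliverables:
  - Executable test suite with coverage notes
success_metrics:
  - All critical flows covered by automated tests
```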
Second, 135 external agents get cloned from agency-agents into .agr/agents/external/. These cover game development (Unreal, Unity, Godot specialists), marketing, sales, spatial computing — domains I wouldn't have thought to include.
Third, a .claude/CLAUDE.md file gets generated. This is where the magic happens. CLAUDE.md is Claude Code's project-level instruction file — it reads this every session. The generated file contains the complete agent roster and dispatch rules: when to decompose, how to match agents, what format to use for subagent prompts.
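Abridged and paraphrased, the generated file looks something like this (illustrative, not the exact output):

```markdown
## Agent roster
- @qa_engineer: test strategy, E2E and edge-case coverage
- @frontend_developer: UI components, responsive layout
- @backend_architect: APIs, auth, third-party integrations
- ... (9 builtin + 135 external agents)

## Dispatch rules
1. If a prompt contains several distinct deliverables, decompose it into tasks.
2. Match each task to the most specific agent in the roster.
3. Spawn one subagent per task and include that agent's identity and rules in its prompt.
4. Single-task prompts run normally, with no decomposition.
```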
A SessionStart hook also gets installed. When you open claude, you see:
🐦 AgentCrow active
──────────────────────────────────────────────────
9 builtin agents:
· qa_engineer
· korean_tech_writer
· security_auditor_deep
...
135 external agents (15 divisions)
──────────────────────────────────────────────────
Complex prompts → auto agent dispatch
The entire system is a CLAUDE.md file and some YAML. No runtime dependencies, no background processes, no API keys. Turn it off with agentcrow off, turn it back on with agentcrow on.
The hardest part wasn't the agent matching or the decomposition logic. It was figuring out what shouldn't be automated.
My first attempt tried to intercept every prompt, decompose it with an LLM call, and auto-dispatch. The decomposition alone took 15 seconds. For a simple "fix this typo" prompt, that's absurd.
The current design is smarter: CLAUDE.md tells Claude to only decompose multi-task requests. "Fix this bug" runs normally. "Build a dashboard with auth, billing, tests, and docs" triggers the agent system. Claude makes that judgment call — and it's surprisingly good at it.
I also learned that agent identity matters more than I expected. A generic "write tests" subagent produces generic tests. A QA engineer agent with "MUST cover edge cases" and "MUST NOT use sleep for async waits" produces professional tests. The rules constrain the model in the right direction.
npx agentcrow init
claude
Then type something complex. Watch it decompose and dispatch.
The source is on GitHub. The npm package is agentcrow.
Agent Teams is powerful. It just needed someone to tell it who to call.
2026-03-19 22:35:27
A few days ago I shared the architecture flow diagram of my SaaS project, LeadIt.
After that, I also shared a post explaining the core backend components I built — things like the company search API, the analysis engine, lead scoring logic, and the AI outreach generator.
Since then I’ve been slowly turning those pieces into an actual working product.
Most of the backend core is now working, and I’ve also been building the landing page that explains the product and its workflow.
But while trying to implement the next feature — email automation using Gmail OAuth — I ran into a problem that ended up teaching me one of the most important lessons about building MVPs.
The Plan: Let Users Send Emails From the Platform
Once the backend intelligence layer was working, the next step felt obvious.
Users should be able to send outreach emails directly from LeadIt.
The flow I imagined was simple.
A user connects their Gmail account → LeadIt generates an AI outreach message → the user can send the email directly from the platform.
To make that work, I needed Google OAuth with Gmail API access.
The expected flow looked something like this: the user signs in with Google, the app requests the Gmail send scope, Google returns an access token and refresh token, LeadIt stores that provider token, and later uses it to send outreach emails through the Gmail API on the user's behalf.
On paper, this looked straightforward.
But implementing it was a completely different experience.
Setting Up Google Cloud
I started by setting up everything inside Google Cloud Console (GCP).
The usual steps: create a project, enable the Gmail API, configure the OAuth consent screen, create OAuth client credentials, and add the authorized redirect URI.
The redirect URI looked something like this:
http://localhost:3000/api/auth/callback/google
Then I added Gmail permissions using scopes like:
https://www.googleapis.com/auth/gmail.send
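For reference, with the /api/auth/callback/google pattern (NextAuth-style), the provider setup and the one place where the provider token can be captured look roughly like this. This is a simplified sketch, not my exact code, and persisting the tokens to the database is left out:

```typescript
// pages/api/auth/[...nextauth].ts (simplified sketch)
import NextAuth from "next-auth";
import GoogleProvider from "next-auth/providers/google";

export default NextAuth({
  providers: [
    GoogleProvider({
      clientId: process.env.GOOGLE_CLIENT_ID!,
      clientSecret: process.env.GOOGLE_CLIENT_SECRET!,
      authorization: {
        params: {
          // Gmail send access plus offline access so Google returns a refresh token
          scope: "openid email profile https://www.googleapis.com/auth/gmail.send",
          access_type: "offline",
          prompt: "consent",
        },
      },
    }),
  ],
  callbacks: {
    async jwt({ token, account }) {
      // `account` only exists on the initial sign-in callback.
      // That is the one moment the provider tokens are available to persist.
      if (account) {
        token.accessToken = account.access_token;
        token.refreshToken = account.refresh_token;
      }
      return token;
    },
  },
});
```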
At this point everything looked correct.
The configuration seemed fine.
So I moved forward with integrating authentication in the app.
Authentication Worked… But the Token Didn’t
Here’s the weird part.
The Google login itself worked perfectly.
Users could sign in successfully.
That means the OAuth client, the consent screen, and the redirect URI were all configured correctly.
But the Gmail provider token never got stored in my database.
And that token is the most important part.
Without it, LeadIt simply can't call the Gmail API to send emails on the user's behalf.
So even though authentication worked, the feature itself was useless.
The 4-Hour Debugging Spiral
This is where things started getting frustrating. I spent around 3–4 hours debugging the issue. And I checked almost everything I could think of.
Database Layer
First I checked Supabase.
Things I verified:
Everything looked correct.
Server-Side Logic
Then I checked the backend logic.
Things like:
Still nothing.
Client-Side Flow
Then I checked the frontend flow.
Things like:
Again… nothing.
Google Cloud Configuration
At this point I went back to Google Cloud.
Checked again: the OAuth client credentials, the authorized redirect URIs, the requested scopes, and whether the Gmail API was enabled.
Everything looked fine.
Yet somehow the provider token was never reaching my database.
The Realization
After spending hours debugging this, I finally asked myself a simple question.
Do I actually need this feature right now?
And the honest answer was: No
The real core of LeadIt is not Gmail automation.
The real core is the intelligence layer: finding companies, analyzing them, scoring leads, and generating AI outreach messages.
Sending emails automatically is definitely useful.
But it is not required to validate the product idea.
And that’s when I made a decision.
The Founder Decision: Cut the Feature
Instead of wasting more time on OAuth complexity, I decided to remove the feature from the MVP.
Gmail automation will move to Version 2.
For the MVP, the product will focus only on the core features: company search, analysis, lead scoring, and AI-generated outreach.
Instead of automatic email sending, users can simply copy the generated message and send it manually.
It’s simple.
But for an MVP, simple is actually good.
Simplifying Authentication
While thinking about this, I also made another decision.
Instead of implementing complex OAuth systems right now, I’ll use simple Next.js authentication for the MVP.
OAuth systems bring a lot of complexity: provider configuration, consent screens, token storage, token refresh, and scope management.
All of that takes time to build and even more time to debug.
Right now my focus is simple.
Ship the product.
Landing Page Progress
Alongside the backend work, I’ve also been building the LeadIt landing page.
It’s almost complete now.
Soon I’ll share a preview of the landing page and I’d genuinely love feedback from other builders here.
Things like whether the messaging is clear, whether the workflow makes sense, and what feels confusing or missing.
Getting early feedback is honestly one of the most useful things when building something from scratch.
What Today Actually Taught Me
At first, today felt like a wasted day.
I didn’t ship the feature I planned.
But in reality, I learned something much more important.
Startups don’t need perfect products.
They need working MVPs.
That means cutting features that aren't essential, shipping the core, and validating the idea before adding complexity.
Final Thought
Building your first SaaS is messy.
You’ll spend hours debugging things that might not even matter in the final product.
But every one of those moments teaches you something about building, prioritizing, and shipping faster.
LeadIt is still very early.
But every small step is slowly turning the idea into a real product.
And honestly, that’s the fun part of the journey.
If you're building something right now, I’m curious:
Have you ever spent hours debugging a feature… only to realize you didn’t actually need it?
2026-03-19 22:34:49
There are many code generators that can create CRUD resources from a config file.
On paper, that sounds great.
In practice, the real question is not:
“Can it generate code?”
The real question is:
“Can I actually trust the generated project enough to use it?”
That is where many generators start to fall apart.
Generating a few entities, repositories, and controllers is the easy part. The harder part is making the generated result feel reliable, understandable, and usable in a real workflow.
After working on my own Spring Boot CRUD generator, I realized that usability does not come from generation alone. It comes from all the small things around the generation that reduce friction and increase trust.
Here are a few lessons that stand out.
A generator can create a lot of files and still feel incomplete.
You can generate entities, services, controllers, mappers, tests, migrations, and configuration files, and the first impression may still be:
“Okay, but will this actually work in my project?”
That hesitation is normal.
Developers do not trust generated code just because it exists. They trust it when they can see that the spec is validated before anything is written, that missing prerequisites are caught early, and that the generated project actually compiles and runs.
A useful generator is not just a code emitter.
A spec-driven generator lives or dies by the quality of its validation.
If a spec file is invalid, users should know before generation.
If the spec is incomplete, they should know before generation.
If the target project setup is missing required dependencies, they should know before generation.
That is why dry-run validation is so valuable.
A dry run gives the user a chance to validate the spec and project setup without committing to a full generation step. It turns generation into a safer workflow: validate first, review what would be generated, and only then write files.
That one layer of feedback makes a generator feel much more professional.
This is one of those features that is not flashy, but it solves a real pain.
A generator may support optional features like GraphQL, caching, OpenAPI, Docker integration, or workflow generation. But if the target project is missing required dependencies, the user often discovers that only after compilation fails.
That is a bad experience.
A dependency check in the validation step gives immediate feedback: if you enable an optional feature but the target project is missing the dependency it needs, the tool tells you up front instead of letting the build fail later.
This is not a “marketing feature,” but it is exactly the kind of thing that makes a tool feel mature.
If a tool generates list endpoints, sorting quickly becomes one of those things users expect by default.
Basic CRUD is rarely enough on its own.
Even a very simple generated API becomes much more useful if it supports controlled sorting by fields like name, price, and releaseDate.
The important part is not just adding sorting, but adding it in a controlled way.
A good generator should not blindly allow sorting by anything. It should let the user explicitly define which fields are sortable. That gives a better balance between flexibility and predictability.
For example:
sort:
  allowedFields: [name, price, releaseDate]
  defaultDirection: ASC
This kind of design keeps the spec simple while still making the generated API more realistic.
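In practice, the generated list endpoint would then accept something like the following (the exact query parameter format depends on the generated controller, so treat this as illustrative):

```text
GET /products?sort=price,desc         accepted: price is in allowedFields
GET /products?sort=internalCost,asc   rejected: internalCost is not sortable
```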
Sorting may not sound like a major release feature, but it makes generated CRUD resources feel much closer to something you would actually expose and use.
Good documentation does more than explain features.
It answers silent questions like: How do I run it? What exactly will be generated? What does my project need before I start?
A README is not just a description. It is part of the product.
A clearer README reduces confusion, shortens onboarding time, and makes it easier for users to move from curiosity to actual usage.
That is especially important for generators, because they involve a bigger trust jump than a typical library.
With a normal library, a developer adds one dependency and tries one API.
With a generator, they are letting a tool create part of their codebase.
That requires more confidence.
This may be the most underrated part.
Developers believe much faster when they can see the flow.
A short demo video can answer in under a minute what a long README sometimes cannot fully communicate: how the tool is run, what it generates, and that the result actually works end to end.
That does two things at once: it lowers the barrier to trying the tool, and it builds trust that the output is real.
For generator-style tools, that visual proof matters a lot.
People do not just want to see source code. They want to see the end-to-end path from input to working result.
In many cases, a demo video is not “extra content.”
It is part of trust-building.
Not every useful release looks impressive in a headline.
Some improvements are exciting and obvious.
Others are quieter, but they make the product much better: stronger validation, dependency checks, sensible defaults like controlled sorting, a cleaner README, a demo video.
These things may not always look like major product announcements, but they are often what move a project from “interesting” to “actually usable.”
That distinction matters.
Because many tools can generate code.
Far fewer tools make the full workflow feel safe, clear, and dependable.
If you are building a generator, think beyond output quantity.
Do not focus only on how many files you can generate.
Focus on things like validation quality, early feedback before generation, sensible defaults, clear documentation, and visible proof that the output works.
That is where usability really comes from.
I recently pushed a new release of my own project, Spring CRUD Generator, with exactly that mindset in mind: improving validation, adding dependency checks, introducing entity-level sorting, cleaning up the README, and adding a demo video plus the demo CRUD spec used in that video.
Repository: https://github.com/mzivkovicdev/spring-crud-generator
It is not the flashiest kind of release, but it is the kind of release that makes a tool more usable in real life.
2026-03-19 22:33:36
Most DevOps guides say:
“Enable caching — it will speed up your CI pipelines.”
I’ve done that many times in my career. Here I’d like to share some thoughts on the topic, illustrated with a small experiment.
I built a small GitLab CI lab and added dependency caching. Faster runs, right?
The result might surprise you:
My pipeline didn’t get faster at all.
In fact, in some cases, it was slightly slower.
Before jumping to conclusions — this is not a post against caching.
Caching worked exactly as expected.
It just didn’t translate into faster pipeline duration in this particular setup.
And that’s the part worth understanding.
This article is not about how to enable caching.
It’s about what actually happens after you enable it — and why the outcome might not match expectations.
I wanted to validate a simple assumption: adding dependency caching makes the pipeline measurably faster.
So I built a small Python project with a multi-stage GitLab CI pipeline and measured the results.
The pipeline has three stages, and each job installs dependencies independently, just like in many real-world pipelines.
To make the effect visible, I used slightly heavier dependencies in requirements.txt.
Each job runs:
time pip install -r requirements.txt
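The jobs themselves are deliberately plain, roughly this shape (stage name and image are illustrative):

```yaml
# Baseline job, no caching
test:
  stage: test
  image: python:3.12
  script:
    - time pip install -r requirements.txt
    - pytest
```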
As expected:
| Run | Duration |
|---|---|
| #1 | ~38s |
| #2 | ~34s |
I introduced GitLab cache:
.cache:
  cache:
    key:
      files:
        - requirements.txt
    paths:
      - .cache/pip
    policy: pull-push
And configured pip:
PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
Now dependencies should be reused between jobs and runs.
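Each job then reuses that template, roughly like this (simplified):

```yaml
# Cached variant: reuse the hidden .cache template and point pip at the cached directory
test:
  stage: test
  image: python:3.12
  extends: .cache
  variables:
    PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
  script:
    - time pip install -r requirements.txt
    - pytest
```

With that in place, I re-ran the pipeline and compared the timings.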
| Mode | Run | Duration |
|---|---|---|
| No cache | 1 | ~38s |
| No cache | 2 | ~34s |
| With cache | 1 | ~40s |
| With cache | 2 | ~38s |
Almost no difference.
Why so little difference? If your runner uses a nearby mirror (for example, Hetzner), downloads are already fast. Modern Python packaging ships prebuilt wheels, so installs are quick. And the cache itself isn't free: the archive has to be downloaded, unpacked, and re-uploaded in every job, and that overhead can cancel out the benefit.
Dependency caching is not automatically a performance optimization.
Its impact depends on how heavy your dependencies are, how fast the network path to the package registry is, whether packages ship prebuilt wheels or need compilation, and how much the cache transfer itself costs.
Caching can still reduce traffic to external registries, stabilize job times when the network is slow, and pay off when dependencies are large or need to be compiled from source.
Next step:
You can find the full lab here:
👉 https://github.com/ic-devops-lab/devops-labs/tree/main/GitLabCIPipelinesWithDependencyCaching
Not every best practice gives a measurable improvement — but understanding why is where real DevOps begins.
2026-03-19 22:31:19
The tech industry thrives on innovation, yet behind every product and line of code lies a journey of growth, resilience, and learning. Over the past year, I’ve experienced firsthand the challenges, triumphs, and lessons that have shaped my journey as an aspiring software engineer.
Early in my journey, I often struggled to balance rapidly evolving technologies with academic and personal commitments. Adapting to languages and frameworks like Python, React, AI/ML tools, and SQL required persistence, patience, and continuous learning. Imposter syndrome occasionally crept in, making me question if I was keeping up with peers.
Through persistence, I achieved milestones that gave me both confidence and direction. Projects like a full-stack AI invoice automation system reduced manual effort by 75%, while my AI/ML voice and image automation assistant streamlined repetitive tasks by 40%. These achievements were more than technical—they taught me problem-solving, OOP principles, and collaborative project design.
The most valuable lesson I’ve learned is that technical skills alone are not enough. Growth comes from embracing challenges, asking questions, mentoring others, and building community. Sharing knowledge fosters inclusion and empowers peers from underrepresented backgrounds to thrive in tech.
To anyone beginning their journey: embrace challenges, seek guidance, and never underestimate your ability to learn and innovate. Diversity in tech is not just about representation—it is about sharing experiences, lifting each other, and building inclusive solutions together.
By sharing my journey, I hope to inspire underrepresented individuals and educate allies on the importance of mentorship, collaboration, and perseverance. The future of tech is stronger when every voice is heard, every story valued, and every talent empowered.
2026-03-19 22:29:33
Short title: How We Eliminated 77% Entity Loss and Agent Freeze with an Open Memory Standard
Author: L. Zamazal, GLG, a.s.
Date: March 2026
Keywords: LLM memory, context compaction, agent memory, information loss, on-demand recall, UAML, structured memory, MCP, zero-downtime, open standard
Large language models have no persistent memory. Every inference call receives the entire conversation context as input, and the model's response quality depends directly on the quality of that input. When context grows beyond the window limit, platforms perform compaction — summarizing the conversation to fit. We measured that standard compaction loses 77% of named entities (people, decisions, tools, dates) in production multi-agent deployments, directly degrading agent decision quality.
We present a three-layer recall architecture that achieves 100% entity recovery while keeping per-turn token costs minimal through on-demand retrieval. Deployed in production across two agent instances processing 16,000+ messages, our approach demonstrates that improving memory quality is more cost-effective than upgrading models. We survey 27 recent papers from Western, Chinese, Korean, and Japanese research communities and show that no existing system combines selective on-demand recall with post-quantum encryption, audit trails, and temporal validity — capabilities required for deployment in regulated industries.
Our central claim: you don't pay for a better model; you pay for better memory.
The dominant assumption in the LLM industry is that performance scales with model capability — larger models, longer context windows, better reasoning. This assumption drives a hardware arms race and an API pricing war. But it obscures a fundamental architectural problem: the context window is not memory. It is cache.
Every API call to a large language model sends the complete conversation context — system instructions, tool schemas, prior messages, injected documents — as input. There is no persistent state between calls. The model sees only what is explicitly present in that single request. When the accumulated context exceeds the model's window, frameworks apply compaction: lossy summarization that discards what doesn't fit.
An AI agent is only as good as what it remembers. Yet the dominant paradigm for handling context overflow is the computational equivalent of asking a colleague to shred their notes and work from a one-paragraph summary.
The consequences are measurable. In our production deployment, we observed that after standard compaction, agents could recall only 23% of named entities from prior conversations — losing decisions, attributions, tool references, and temporal context. When asked about a payment processor integration discussed 90 minutes earlier, the agent responded "I don't know," despite having processed that exact information before compaction occurred.
This paper presents evidence from academic research and production measurements that: (1) context quality, not raw model capability, is the binding constraint on agent performance; (2) standard compaction destroys most of the named entities agents later need; and (3) on-demand recall from a structured external memory recovers them at minimal per-turn token cost.
Liu et al. (2024) demonstrated that transformer-based models exhibit significant performance degradation when relevant information is positioned in the middle of the input context. In multi-document QA experiments, accuracy dropped by more than 20 percentage points when the answer-bearing document moved from the beginning or end to the center of the context [1]. This finding has been replicated across model families including GPT-4, Claude, and Llama.
Crucially, this is not a generational bug being patched away. Esmi et al. (2026) benchmarked GPT-5 and found that long-context performance still degrades compared to short-context baselines [2]. Salvatore et al. (2025) argue this is an emergent property of the attention mechanism itself [3]. The problem is architectural.
Chroma Research (2026) coined the term "Context Rot" to describe the non-uniform performance degradation that occurs as input length increases [4]. Testing 18 LLMs on controlled tasks where only input length varied (not task complexity), they found that standard benchmarks like Needle-in-a-Haystack give a false sense of security — they test simple lexical retrieval, not the semantic reasoning that real applications demand.
The implication: a 1M-token context window filled with marginally relevant information will produce worse results than a 50K-token window with precisely the right information.
When context exceeds the window limit, agent platforms perform compaction — typically by sending the entire context to an LLM with instructions to summarize. Wang et al. (2026) describe this as a "fundamentally lossy" operation [5]: truncation and summarization compress or discard evidence that may be critical for future decisions.
The problem compounds over time. Each compaction cycle loses detail from the previous cycle's summary. After several cycles, the agent retains only the broadest themes — a phenomenon we term progressive context amnesia.
Research from the Harbin Institute of Technology (Tan et al., ACL 2024) demonstrated that when retrieved context conflicts with the model's parametric knowledge, LLMs frequently ignore the provided context entirely [6]. Poorly curated context injection doesn't just waste tokens — it can actively mislead the model by triggering parametric override of correct but poorly positioned external evidence. This makes provenance and temporal validity of injected context a first-order concern.
We deployed UAML (Universal Agent Memory Layer) on two production agent instances processing real-world multi-agent team conversations over a period of several weeks. Over 16,000 chat messages were indexed, producing 6,400+ structured knowledge entries. We then measured entity recovery — the ability to recall named entities (people, tools, decisions, dates, configurations) after compaction events.
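Concretely, entity recovery is the fraction of reference entities that remain recallable after a compaction event:

$$\text{recovery} = \frac{\lvert E_{\text{recalled}} \cap E_{\text{reference}} \rvert}{\lvert E_{\text{reference}} \rvert}$$

where $E_{\text{reference}}$ is the set of named entities present in the original conversations and $E_{\text{recalled}}$ is the subset the agent can still surface after compaction.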
Our architecture provides three recovery layers:
| Layer | Source | Function |
|---|---|---|
| L1 | Platform native compaction | Summarized conversation context |
| L2 | UAML knowledge base | Structured entities extracted from conversations |
| L3 | SQL archive | Complete, unmodified message history |
Entity recovery was then measured for each configuration:

| Configuration | Entity Recovery | What Survives |
|---|---|---|
| L1 only (standard compaction) | 23% | Broad themes, recent topics |
| L1 + L2 (+ structured knowledge) | 50% | Named entities, key decisions, facts |
| L1 + L2 + L3 (full architecture) | 100% | Everything — zero data loss |
Standard compaction loses 77% of named entities. After compaction, the agent cannot reliably recall who made a decision, which tool or configuration was chosen, when it happened, or why.
These are precisely the facts that determine whether an agent's next response is helpful or hallucinated.
During production operation, an agent was asked about a prior decision regarding a payment processor integration — a topic discussed 90 minutes earlier but lost to compaction. Without structured memory, the agent could not answer. With on-demand recall (a single query taking <100ms), the full decision context was recovered and the agent responded correctly with complete attribution.
This is not an edge case. In multi-session, multi-day agent deployments, every compaction cycle creates potential blind spots that accumulate over time.
Mason (2026) independently confirmed the scale of the problem by analyzing 857 production LLM sessions comprising 4.45 million effective input tokens [7]. The finding: 21.8% of all context was structural waste — tool definitions that were never invoked, system prompt repetitions, stale tool results from completed subtasks. This waste is not merely an efficiency problem; it actively competes with relevant information for the model's limited attention capacity.
A naive solution would be to inject all available memory into every context. This fails for three reasons: token cost scales with every turn, the added volume accelerates the attention degradation described above, and most of the injected material is noise that competes with the relevant facts.
On-demand recall avoids all three problems:
| Approach | Tokens/turn | Cost Pattern | Accuracy |
|---|---|---|---|
| No compaction (full history) | 200K+ | Very high, every turn | Degrades with length |
| Standard compaction | 20–50K | Low | 23% entity recovery |
| Auto-inject all memory | 50–80K | High, every turn | High but with noise |
| On-demand recall | 20–50K base + 2–5K when needed | Low (recall only when needed) | 100% recoverable |
The mechanism works in two phases within a single turn: first, the agent recognizes that the compacted context lacks something it needs and issues a targeted recall query against the structured knowledge base; second, the returned entries (typically 2–5K tokens) are injected into the current turn and the agent answers with full context.
This mirrors how professionals work: you don't carry every document to every meeting, but you know where to find them when needed.
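As an illustrative sketch (not the production implementation), a single recall-augmented turn against a trimmed version of the MemoryProvider interface defined below might look like this; needsRecall, callModel, and currentCompactedContext stand in for platform logic, and in practice the model itself decides when to invoke the recall tool:

```typescript
// Illustrative sketch of one turn with on-demand recall.
type Message = { role: string; content: string };
type MemoryEntry = { content: string; importance: string };

interface MemoryProvider {
  onMessage(sessionId: string, message: Message): Promise<void>;
  recall(sessionId: string, query: string, limit?: number): Promise<MemoryEntry[]>;
}

// Stand-ins for the agent platform's own logic.
declare function needsRecall(message: string): boolean;
declare function currentCompactedContext(sessionId: string): string;
declare function callModel(input: {
  context: string;
  recalled: MemoryEntry[];
  message: string;
}): Promise<string>;

async function answerTurn(
  provider: MemoryProvider,
  sessionId: string,
  userMessage: string
): Promise<string> {
  // Phase 1: targeted recall, only when the compacted context is not enough.
  const recalled = needsRecall(userMessage)
    ? await provider.recall(sessionId, userMessage, 5) // adds only a few thousand tokens
    : [];

  // Phase 2: answer using the base (compacted) context plus the recalled entries.
  const answer = await callModel({
    context: currentCompactedContext(sessionId),
    recalled,
    message: userMessage,
  });

  // Index the new message asynchronously, not in the critical path.
  void provider.onMessage(sessionId, { role: "user", content: userMessage });
  return answer;
}
```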
The compression approach tackles the problem at the infrastructure level — making context smaller without (ideally) losing information.
Activation Beacon (Zhang, Liu, Xiao et al., BAAI/FlagOpen, 2024) compresses KV cache activations rather than text, achieving 8× KV cache reduction and 2× inference speedup on 128K+ contexts [9]. Recurrent Context Compression (Huang, Zhu, Wang et al., 2024) achieves 32× compression with BLEU4 near 0.95 [10]. Semantic Compression (Fei, Niu, Zhou et al., Huawei, ACL 2024) applies information-theoretic source coding to extend context 6–8× without fine-tuning [11].
These approaches address hardware efficiency but do not solve the semantic quality problem: compressed content still carries all information indiscriminately.
A more promising direction treats memory as a managed system rather than a compression target.
MemAgent (Yu, Chen, Feng et al., 2025) uses RL-trained agents that read text in segments and update memory via an overwrite strategy, extrapolating from 8K training context to 3.5M token QA with <5% loss [12]. With 81 citations, it represents the current state of the art in Chinese memory agent research.
SimpleMem (Liu, Su, Xia et al., 2026) implements a three-stage pipeline — semantic compression, online synthesis, intent-aware retrieval — achieving +26.4% F1 on LoCoMo while reducing token consumption by 30× [13]. Its architecture is conceptually closest to our on-demand recall approach.
MemOS (MemTensor, Shanghai Jiao Tong University, Renmin University, China Telecom, 2025) proposes a Memory Operating System with OS-inspired lifecycle management, achieving state-of-the-art across benchmarks [14]. It operates primarily in latent space — complementary to text-level structured approaches.
M+ (ICML 2025) extends MemoryLLM with scalable long-term memory, confirming that "retaining information from the distant past remains a challenge" [15] — exactly what SQL-backed archival solves without model modification.
ACON (Kang et al., Microsoft Research, 2025) optimizes context compression for long-horizon agents, reducing memory usage by 26–54% [16]. Its key finding validates our approach: "generic summarization easily loses critical details" — task-aware, selective retrieval is essential.
Pichay (Mason, 2026) takes the most radical approach, treating the context window as L1 cache and implementing demand paging with eviction and fault detection [7]. In production, it reduces context consumption by up to 93%. Mason's framing captures the field's core insight: "the problems — context limits, attention degradation, cost scaling, lost state across sessions — are virtual memory problems wearing different clothes."
Focus (Verma, 2026) implements autonomous context compression where agents decide when to consolidate learnings and prune history, achieving 22.7% token reduction without accuracy loss [17].
MemArt (2025) demonstrates that structured retrieval improves accuracy by 11.8–39.4% over plaintext memory methods with 91–135× reduction in prefill tokens [18] — direct validation of the principle that targeted recall outperforms brute-force injection.
InfiniGen (Lee et al., Seoul National University, OSDI 2024) addresses KV cache management for long-text generation, achieving up to 3× speedup over existing methods [23]. Funded by Samsung Advanced Institute of Technology and cited 253 times, it represents the hardware-level approach to context scaling that complements software-level memory management.
THEANINE (Ong, Kim, Gwak et al., NAACL 2025) introduces timeline-based memory management for lifelong dialogue agents [24]. Its key insight — don't delete old memories, but connect them temporally and causally — directly validates UAML's temporal validity mechanism. Memories form evolving timelines rather than static snapshots.
LRAgent (Jeon et al., Korea, 2026) tackles KV cache sharing for multi-LoRA agents [25], addressing the exact overhead problem that emerges when multiple agents share a backbone but maintain separate caches.
Most striking is "Store then On-Demand Extract" (Yamanaka et al., Japan, 2026), which argues against the dominant "extract then store" paradigm and advocates storing raw data with on-demand extraction at query time [26]. This is philosophically identical to UAML's three-layer approach: preserve everything (L3), extract structured knowledge (L2), and retrieve on demand. Yamanaka's framing — "uplifting the world with memory" — captures the same conviction that memory infrastructure is foundational, not auxiliary.
AIM-RM (Yoshizato, Shimizu et al., Japan, AAMAS 2026) demonstrates practical deployment of memory retrieval in industrial supply chain agents [27], confirming that memory-augmented agents are moving beyond chatbot applications into production enterprise systems.
MemGPT/Letta (Packer et al., 2023) pioneered virtual context management inspired by OS memory hierarchies [19]. Mem0 (2025) offers scalable memory for multi-session dialogues [20]. Memex(RL) (Wang et al., 2026) is the closest academic parallel — indexed memory with compact summaries plus full-fidelity external storage [5].
Two comprehensive surveys map the field: Cognitive Memory in LLMs (Shan, Luo, Zhu et al., 2025) provides the most complete taxonomy of memory mechanisms with 34 citations [21], and A Comprehensive Survey on Long Context Language Modeling (Liu, Zhu, Bai et al., 2025) covers the full spectrum with contributions from 35+ researchers and 88 citations [22].
Across all surveyed approaches, several critical capabilities are consistently absent:
| Capability | MemAgent | SimpleMem | MemOS | Pichay | THEANINE | Mem0 | MemGPT | UAML |
|---|---|---|---|---|---|---|---|---|
| On-demand selective recall | ✗ | ✓ | Partial | ✓ | ✗ | Partial | Partial | ✓ |
| Cross-session memory | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ |
| End-to-end encryption | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ (PQC) |
| Audit trail / provenance | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Temporal validity | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| Multi-agent isolation | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Local-first / self-hosted | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | Partial | ✓ |
| Certifiable for regulated use | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Zero-downtime compaction | ✗ | ✗ | ✗ | Partial | ✗ | ✗ | ✗ | ✓ |
| MCP integration (drop-in) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Production-deployed | Research | Research | Research | ✓ | Research | ✓ | Partial | ✓ |
No existing system combines selective on-demand recall with the security, auditability, and temporal reasoning required for deployment in regulated environments — healthcare, legal, financial services, government.
Academic memory systems universally neglect security. For enterprises in regulated industries, this is a deployment blocker. UAML addresses this gap with end-to-end post-quantum encryption, a complete audit trail with provenance for every entry, temporal validity, multi-agent isolation, and local-first self-hosted deployment.
These are not features for a roadmap. They are implemented, tested, and deployed in production.
We believe memory infrastructure should be standardized, not proprietary. We have proposed an External Memory Provider API (RFC #49233) [28] that addresses three interconnected problems: agent downtime during compaction, information loss after compaction, and the lack of a standard integration path for external memory systems.
When an agent platform's context window fills up, the platform performs synchronous in-band compaction: the agent stops responding, the entire context is sent to an LLM for summarization, and the summary replaces the original context. In our measurements, this creates a 30–60 second blackout — the agent is completely unresponsive. For production use cases (customer support, financial services, healthcare), this is a deployment blocker.
RFC #49233 proposes a hot-swap architecture with continuous background synchronization:
┌──────────────────────────────────────────┐
│ Agent Platform │
│ Context Slot A (active) ←→ Slot B │
│ ↕ ↕ │
│ ┌────────────────────────────────┐ │
│ │ Memory Provider Interface │ │
│ └──────────┬─────────────────────┘ │
└─────────────┼────────────────────────────┘
▼
┌─────────────────────────────┐
│ External Memory Provider │
│ • Continuous async sync │
│ • Background compression │
│ • Pre-built context ready │
│ • Full audit trail │
└─────────────────────────────┘
The mechanism has four stages: every message is synced asynchronously to the provider as it arrives; the provider compresses and pre-builds a replacement context in the background; when the active context slot approaches its threshold, the platform fetches the pre-built context; and the slots are swapped atomically between messages.
This is analogous to double-buffering in graphics: while the agent uses buffer A, the provider prepares buffer B in the background. When it's time to compact, the platform atomically swaps A for B.
The proposed interface is deliberately minimal — three core operations plus lifecycle hooks:
interface MemoryProvider {
  // Fire-and-forget after each message (async, not in critical path)
  onMessage(sessionId: string, message: Message): Promise<void>;

  // Returns pre-built compressed context
  getCompressedContext(sessionId: string, maxChars: number): Promise<CompressedContext>;

  // On-demand recall tool
  recall(sessionId: string, query: string, limit?: number): Promise<MemoryEntry[]>;

  ping(): Promise<{ ok: boolean; latencyMs: number }>;
}

interface SessionHooks {
  'pre-compaction': (session: Session) => Promise<void>;
  'post-compaction': (session: Session, newContext: CompressedContext) => Promise<void>;
  'session-start': (session: Session) => Promise<CompressedContext | null>; // Cold-start recovery
  'session-end': (session: Session) => Promise<void>;
}
Configuration is opt-in with full backward compatibility:
{
  "memory": {
    "provider": "uaml",
    "endpoint": "http://localhost:8770",
    "compaction": {
      "strategy": "hot-swap",
      "threshold": 0.85,
      "targetSize": 0.40
    }
  }
}
| Metric | Current (builtin) | With Memory Provider |
|---|---|---|
| Compaction duration | 30–60s | <100ms |
| Agent downtime | 30–60s | 0 (between messages) |
| Information loss | Significant (77%) | None (full DB) |
| Audit trail | None | Complete |
| Cost per compaction | ~$0.10–0.50 | ~$0.001 (async local) |
A critical design decision is the choice of integration protocol for the recall tool. UAML implements the Model Context Protocol (MCP) connector, which means any existing LLM agent that supports MCP tool calls can connect to UAML's full memory capabilities — recall, indexing, knowledge extraction — without any code changes to the agent itself. The agent simply gains a new tool in its toolbox. No forking, no framework lock-in, no migration.
In our production deployment, two agents from different platforms were connected to UAML via MCP within minutes, immediately gaining access to structured memory recall while retaining all their existing functionality. The MCP approach turns memory from a platform feature into a universal service layer that any agent can consume.
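For illustration, on the client side the connection is typically a single entry in the agent's MCP server configuration. The server name and launch command below are placeholders rather than UAML's published values; the endpoint matches the configuration example above:

```json
{
  "mcpServers": {
    "uaml": {
      "command": "npx",
      "args": ["-y", "uaml-mcp-server"],
      "env": { "UAML_ENDPOINT": "http://localhost:8770" }
    }
  }
}
```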
A critical design principle: the platform's builtin compaction pipeline always runs in parallel — it is never disabled, even when a memory provider is active.
Builtin Compaction ─────────────────────► ALWAYS running (shadow/backup)
UAML Memory Provider ────────────────────► Enriches when available (overlay)
This guarantees 100% functionality at all times: if the memory provider is slow, unreachable, or returns an error, the builtin pipeline compacts exactly as it does today, and the agent never depends on the external service to keep running.
Not all information is equally important. The memory provider classifies every entry:
| Level | Criteria | % of data | Examples |
|---|---|---|---|
| HIGH | Decisions, rules, architecture choices | 0.5% | "Chose ML-KEM-768 for encryption" |
| MEDIUM | Entity mentions, config changes, results | 7% | "VPS IP: 5.189.139.221" |
| LOW | Debug output, heartbeats, transient status | 93% | Tool output, NO_REPLY messages |
This filtering ensures the recall API returns signal, not noise. The getCompressedContext endpoint prioritizes HIGH and MEDIUM entries, keeping injected context compact and relevant.
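A simplified sketch of that prioritization (field names follow the provenance example below; the selection heuristic is illustrative, not the production implementation):

```typescript
// Illustrative: select HIGH/MEDIUM entries, newest first, within a character budget.
type Importance = "HIGH" | "MEDIUM" | "LOW";

interface KnowledgeEntry {
  content: string;
  importance: Importance;
  created_at: string; // ISO timestamp
}

const RANK: Record<Importance, number> = { HIGH: 0, MEDIUM: 1, LOW: 2 };

function buildCompressedContext(entries: KnowledgeEntry[], maxChars: number): string {
  // HIGH before MEDIUM before LOW; newer entries first within each level.
  const ordered = [...entries].sort(
    (a, b) =>
      RANK[a.importance] - RANK[b.importance] ||
      b.created_at.localeCompare(a.created_at)
  );

  const selected: string[] = [];
  let used = 0;
  for (const entry of ordered) {
    if (entry.importance === "LOW") continue; // debug output, heartbeats, etc.
    if (used + entry.content.length > maxChars) break;
    selected.push(entry.content);
    used += entry.content.length;
  }
  return selected.join("\n");
}
```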
Every knowledge entry maintains a verifiable provenance chain from recalled fact to original message:
UAML Entry #4521
├── content: "Decided on hot-swap compaction strategy"
├── importance: HIGH (score: 7)
├── source: session:a7e0260c:msg_hash_abc123
├── chat_history_id: 28451 ← links to SQL archive
└── created_at: 2026-03-18T14:23:00Z
│
▼
SQL chat_messages #28451
├── text: [full original message, verbatim]
├── source_file: a7e0260c-...jsonl
└── source_line: 14832
This enables audit (trace any fact to its source), verification (compare summary against verbatim record), and compliance (demonstrate data provenance for regulated environments).
In production, agents operate across multiple sessions (Discord channels, messaging platforms, scheduled tasks). The proposed API extension enables per-session memory scoping, recall across sessions where policy allows, and recovery of relevant context when a session restarts.
The design prioritizes graceful degradation: if the provider is unavailable, builtin compaction continues unaffected, and the session-start hook enables context reconstruction from historical data.

Proposed implementation phases:

| Phase | Capability | Estimated Timeline |
|---|---|---|
| 1 | Pre/post-compaction hooks | 1–2 weeks |
| 2 | Memory Provider Interface | 2–3 weeks |
| 3 | Hot-swap compaction strategy | 3–4 weeks |
| 4 | Background sync + config + docs | 2–3 weeks |
Phase 1 alone would already enable external memory integration and demonstrate value. The full RFC proposal is publicly available at github.com/openclaw/openclaw/issues/49233 and has received community attention and independent analysis, confirming demand for standardized memory infrastructure.
Research groups across East Asia are producing world-class work on context compression and memory architectures. Chinese institutions (BAAI, Shanghai Jiao Tong, Harbin Institute, Huawei, China Telecom, MemTensor), Korean groups (Seoul National University, Samsung, KAIST), and Japanese researchers are collectively advancing the field at a pace that exceeds Western output in volume and increasingly matches it in impact. Any serious memory infrastructure product must engage with this global research base.
The evidence converges from multiple independent sources: context quality determines output quality. A smaller context with the right information produces better results than a larger context with noise. Standard compaction loses 77% of named entities — a measurable degradation that directly affects agent decision quality.
On-demand recall from structured memory resolves this without the token cost of context stuffing. Our three-layer architecture (compaction + knowledge base + archive) achieves 100% entity recovery while maintaining minimal per-turn costs. The approach requires no model modifications, no cloud dependencies, and works as an overlay on existing agent platforms.
We are not arguing for replacing compaction — it serves a useful purpose in maintaining manageable context sizes. We are arguing that compaction alone is insufficient, and that a structured external memory layer transforms it from a lossy compression into a lossless one.
The organizations that understand this will build agents that actually work. The rest will keep buying bigger models and wondering why they still forget.
Memory quality is the new model quality.
[1] Liu, N.F. et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the ACL, 12, 157–173. arXiv:2307.03172
[2] Esmi, N. et al. (2026). "GPT-5 vs Other LLMs in Long Short-Context Performance." arXiv, February 2026.
[3] Salvatore, N. et al. (2025). "Lost in the Middle: An Emergent Property from Information Retrieval Demands in LLMs." arXiv, October 2025.
[4] Chroma Research (2026). "Context Rot: How Increasing Input Tokens Impacts LLM Performance." research.trychroma.com/context-rot
[5] Wang, Z. et al. (2026). "Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory." arXiv:2603.04257
[6] Tan, Y. et al. (2024). "When Retrieved Context Conflicts with Parametric Knowledge." ACL 2024, Harbin Institute of Technology.
[7] Mason, T. (2026). "The Missing Memory Hierarchy: Demand Paging for LLM Context Windows." arXiv:2603.09023
[8] Iratni, M. et al. (2025). "Dynamic Context Selection for Retrieval-Augmented Generation: Mitigating Distractors and Positional Bias." arXiv, December 2025.
[9] Zhang, P. et al. (2024). "Long Context Compression with Activation Beacon." BAAI/FlagOpen. arXiv:2401.03462
[10] Huang, C. et al. (2024). "Recurrent Context Compression: Efficiently Expanding the Context Window of LLM." arXiv:2406.06110
[11] Fei, W. et al. (2024). "Extending Context Window of Large Language Models via Semantic Compression." Huawei. Findings of ACL 2024, 5169–5181.
[12] Yu, H. et al. (2025). "MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent." arXiv:2507.02259
[13] Liu, J. et al. (2026). "SimpleMem: Efficient Lifelong Memory for LLM Agents." arXiv:2601.02553
[14] MemTensor et al. (2025). "MemOS: A Memory OS for AI System." Shanghai Jiao Tong University, Renmin University, China Telecom. arXiv:2507.03724
[15] M+ (2025). "Extending MemoryLLM with Scalable Long-Term Memory." ICML 2025. arXiv:2502.00592
[16] Kang, M. et al. (2025). "ACON: Optimizing Context Compression for Long-Horizon LLM Agents." arXiv:2510.00615
[17] Verma, N. (2026). "Active Context Compression: Autonomous Memory Management in LLM Agents." arXiv:2601.07190
[18] MemArt (2025). "KVCache-Centric Memory for LLM Agents." OpenReview.
[19] Packer, C. et al. (2023). "MemGPT: Towards LLMs as Operating Systems." arXiv:2310.08560
[20] Mem0 (2025). "Building Production-Ready AI Agents with Scalable Long-Term Memory." arXiv:2504.19413
[21] Shan, L. et al. (2025). "Cognitive Memory in Large Language Models." arXiv:2504.02441
[22] Liu, J. et al. (2025). "A Comprehensive Survey on Long Context Language Modeling." arXiv:2503.17407
[23] Lee, S. et al. (2024). "InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management." OSDI 2024, Seoul National University.
[24] Ong, D., Kim, H., Gwak, S. et al. (2025). "THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation." NAACL 2025.
[25] Jeon, H. et al. (2026). "LRAgent: Multi-LoRA Agents with KV Cache Sharing." arXiv:2602.01053
[26] Yamanaka, Y. et al. (2026). "Store then On-Demand Extract: A Memory Architecture for LLM Agents." arXiv:2602.16192
[27] Yoshizato, T., Shimizu, H. et al. (2026). "AIM-RM: Agent-based Inventory Management with Retrieval Memory." AAMAS 2026. arXiv:2602.05524
[28] Zamazal, L. / GLG, a.s. (2026). "RFC: External Memory Provider API for OpenClaw." GitHub Issue #49233. github.com/openclaw/openclaw/issues/49233
GLG, a.s. — UAML (Universal Agent Memory Layer) is available at uaml-memory.com. Technical documentation and API reference at smart-memory.ai.