Blog of Tim Kellogg

AI architect, software engineer, and tech enthusiast.

Layers of Memory, Layers of Compression

2025-06-15 08:00:00

Recently, Anthropic published a blog post detailing their multi-agent approach to building their Research agent. Also, Cognition wrote a post on why multi-agent systems don’t work today. The thing is, they’re both saying the same thing.

At the same time, I’ve been enthralled watching a new bot, Void, interact with users on Bluesky. Void is written in Letta, an AI framework oriented around memory. Void feels alive in a way no other AI bot I’ve encountered feels. Something about the memory gives it a certain magic.

I took some time to dive into Letta’s architecture and noticed a ton of parallels with what the Anthropic and Cognition posts were saying about context management, though Letta takes a different approach to solving it.

Below, I’ve had OpenAI Deep Research format our conversation into a blog post. I’ve done some light editing, adding visuals, etc., but generally it’s all AI. I appreciated it; I hope you do too.


When an AI agent “remembers,” it compresses. Finite context windows force hard choices about what to keep verbatim, what to summarize, and what to discard. Letta’s layered memory architecture embraces this reality by structuring an agent’s memory into tiers – each a lossy compression of the last. This design isn’t just a storage trick; it’s an information strategy.

Layered Memory as Lossy Compression

Letta (formerly MemGPT) splits an agent’s memory into four blocks: core, message buffer, archival, and recall. Think of these as concentric rings of context, from most essential to most expansive, similar to the L1, L2, and L3 caches on a CPU:

flowchart TD
  subgraph rec[Recall Memory]
    subgraph arch[Archival Memory]
      subgraph msg[Message Buffer]
        Core[Core Memory]
      end
    end
  end
  • Core memory holds the agent’s invariants – the system persona, key instructions, fundamental facts. It’s small but always in the prompt, like the kernel of identity and immediate purpose.
  • Message buffer is a rolling window of recent conversation. This is the agent’s short-term memory (recent user messages and responses) with a fixed capacity. As new messages come in, older ones eventually overflow.
  • Archival memory is a long-term store, often an external vector database or text log, where overflow messages and distilled knowledge go. It’s practically unbounded in size, but far from the model’s immediate gaze. This is highly compressed memory – not compressed in ZIP-file fashion, but in being irrelevant by default until needed.
  • Recall memory is the retrieval buffer. When the agent needs something from the archive, it issues a query; relevant snippets are loaded into this block for use. In effect, recall memory “rehydrates” compressed knowledge on demand.

How it works: On each turn, the agent assembles its context from core knowledge, the fresh message buffer, and any recall snippets. All three streams feed into the model’s input. Meanwhile, if the message buffer is full, the oldest interactions get archived out to long-term memory.

Later, if those details become relevant, the agent can query the archival store to retrieve them into the recall slot. What’s crucial is that each layer is a lossy filter: core memory is tiny but high-priority (no loss for the most vital data), the message buffer holds only recent events (older details dropped unless explicitly saved), and the archive contains everything in theory but only yields an approximate answer via search. The agent itself chooses what to promote to long-term storage (e.g. summarizing and saving a key decision) and what to fetch back.

It’s a cascade of compressions and selective decompressions.
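To make the tiering concrete, here is a minimal sketch of how a turn’s context might be assembled from the layers, with overflow evicted to an archival store. This is my own illustration, not Letta’s actual classes or API; the archive object and its add/search methods are assumed.

from collections import deque

class LayeredMemory:
    """Illustrative only: core invariants + short-term buffer + searchable archive."""

    def __init__(self, core: str, buffer_size: int, archive):
        self.core = core              # always-in-prompt invariants (persona, key instructions)
        self.buffer = deque()         # recent messages, short-term
        self.buffer_size = buffer_size
        self.archive = archive        # assumed to expose add(text) and search(query, k)

    def add_message(self, msg: str) -> None:
        self.buffer.append(msg)
        while len(self.buffer) > self.buffer_size:
            evicted = self.buffer.popleft()
            self.archive.add(evicted)  # lossy: now only reachable via search

    def build_context(self, query: str, k: int = 3) -> str:
        recall = self.archive.search(query, k=k)  # rehydrate a few relevant snippets
        return "\n\n".join([self.core, *recall, *self.buffer])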

Rate–distortion tradeoff: This hierarchy embodies a classic principle from information theory. With a fixed channel (context window) size, maximizing information fidelity means balancing rate (how many tokens we include) against distortion (how much detail we lose).

Letta’s memory blocks are essentially a rate–distortion ladder. Core memory has a tiny rate (few tokens) but zero distortion on the most critical facts. The message buffer has a larger rate (recent dialogue in full) but cannot hold everything – older context is distorted by omission or summary. Archival memory has effectively infinite capacity (high rate) but in practice high distortion: it’s all the minutiae and past conversations compressed into embeddings or summaries that the agent might never look at again.

The recall stage tries to recover (rehydrate) just enough of that detail when needed. Every step accepts some information loss to preserve what matters most. In other words, to remember usefully, the agent must forget judiciously.
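For reference, the rate–distortion function from information theory captures exactly this tradeoff: the minimum information rate needed to keep expected distortion under a budget D.

R(D) = \min_{p(\hat{x} \mid x)\,:\; \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X})

Each memory tier picks a different operating point on that curve: core memory keeps distortion near zero for a handful of critical facts, while the archive accepts high distortion on everything else so that its in-prompt footprint stays tiny.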

This layered approach turns memory management into an act of cognition.

Summarizing a chunk of conversation before archiving it forces the agent to decide what the gist is – a form of understanding. Searching the archive for relevant facts forces it to formulate good queries – effectively reasoning about what was important. In Letta’s design, compression is not just a storage optimization; it is part of the thinking process. The agent is continually compressing its history and decompressing relevant knowledge as needed, like a human mind generalizing past events but recalling a specific detail when prompted.

flowchart TD
  U[User Input] ---> LLM
  CI[Core Instructions] --> LLM
  RM["Recent Messages (Short-term Buffer)"] --> LLM
  RS["Retrieved Snippets (Recall)"] --> LLM
  LLM ----> AR[Agent Response]
  RM -- evict / summarize --> VS["Vector Store (Archival Memory)"]
  LLM -- summarize ---> VS
  VS -- retrieve --> RS

Caption: As new user input comes in, the agent’s core instructions and recent messages combine with any retrieved snippets from long-term memory, all funneling into the LLM. After responding, the agent may drop the oldest message from short-term memory into a vector store, and perhaps summarize it for posterity. The next query might hit that store and pull up the summary as needed. The memory “cache” is always in flux.

One Mind vs. Many Minds: Two Approaches to Compression

The above is a single-agent solution: one cognitive entity juggling compressed memories over time. An alternative approach has emerged that distributes cognition across multiple agents, each with its own context window – in effect, parallel minds that later merge their knowledge.

Anthropic’s recent multi-agent research system frames intelligence itself as an exercise in compression across agents. In their words, “The essence of search is compression: distilling insights from a vast corpus.” Subagents “facilitate compression by operating in parallel with their own context windows… condensing the most important tokens for the lead research agent”.

Instead of one agent with one context compressing over time, they spin up several agents that each compress different aspects of a problem in parallel. The lead agent acts like a coordinator, taking these condensed answers and integrating them.

This multi-agent strategy acknowledges the same limitation (finite context per agent) but tackles it by splitting the work. Each subagent effectively says, “I’ll compress this chunk of the task down to a summary for you,” and the lead agent aggregates those results.

It’s analogous to a team of researchers: divide the topic, each person reads a mountain of material and reports back with a summary so the leader can synthesize a conclusion. By partitioning the context across agents, the system can cover far more ground than a single context window would allow.

In fact, Anthropic found that a well-coordinated multi-agent setup outperformed a single-agent approach on broad queries that require exploring many sources. The subagents provided separation of concerns (each focused on one thread of the problem) and reduced the path-dependence of reasoning – because they explored independently, the final answer benefited from multiple compressions of evidence rather than one linear search.

However, this comes at a cost.

Coordination overhead and consistency become serious challenges. Cognition’s Walden Yan argues that multi-agent systems today are fragile chiefly due to context management failures. Each agent only sees a slice of the whole, so misunderstandings proliferate.

One subagent might interpret a task slightly differently than another, and without a shared memory of each other’s decisions, the final assembly can conflict or miss pieces. As Yan puts it, running multiple agents in collaboration in 2025 “only results in fragile systems. The decision-making ends up being too dispersed and context isn’t able to be shared thoroughly enough between the agents.” In other words, when each subagent compresses its piece of reality in isolation, the group may lack a common context to stay aligned.

In Anthropic’s terms, the “separation of concerns” cuts both ways: it reduces interference, but also means no single agent grasps the full picture. Humans solve this by constant communication (we compress our thoughts into language and share it), but current AI agents aren’t yet adept at the high-bandwidth, nuanced communication needed to truly stay in sync over long tasks.

Cognition’s solution? Don’t default to multi-agent. First try a simpler architecture: one agent, one continuous context. Ensure every decision that agent makes “sees” the trace of reasoning that led up to it – no hidden divergent contexts.

Of course, a single context will eventually overflow, but the answer isn’t to spawn independent agents; it’s to better compress the context. Yan suggests using an extra model whose sole job is to condense the conversation history into “key details, events, and decisions.”

This summarized memory can then persist as the backbone context for the main agent. In fact, Cognition has fine-tuned smaller models to perform this kind of compression reliably. The philosophy is that if you must lose information, lose it intentionally and in one place – via a trained compressor – rather than losing it implicitly across multiple agents’ blind spots.
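As a rough sketch of that pattern (the llm and count_tokens helpers here are hypothetical; any chat-completion API would slot in), the compressor is just a dedicated summarization pass that fires when the history outgrows its token budget:

COMPRESS_PROMPT = (
    "Condense the conversation below into key details, events, and decisions. "
    "Preserve anything a future step would need to stay consistent."
)

def compress_history(llm, history: list[str], max_tokens: int) -> list[str]:
    """Replace older turns with a single summary once the history exceeds the budget."""
    if count_tokens(history) <= max_tokens:     # count_tokens: assumed tokenizer helper
        return history
    old, recent = history[:-10], history[-10:]  # keep the last few turns verbatim
    summary = llm.complete(COMPRESS_PROMPT + "\n\n" + "\n".join(old))
    return ["[Summary of earlier work]\n" + summary, *recent]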

This approach echoes Letta’s layered memory idea: maintain one coherent thread of thought, pruning and abstracting it as needed, instead of forking into many threads that might diverge.

Conclusion: Compression is Cognition

In the end, these approaches converge on a theme: intelligence is limited by information bottlenecks, and overcoming those limits looks a lot like compression. Whether it’s a single agent summarizing its past and querying a knowledge base, or a swarm of subagents parceling out a huge problem and each reporting back a digest, the core challenge is the same.

An effective mind (machine or human) can’t and shouldn’t hold every detail in working memory – it must aggressively filter, abstract, and encode information, yet be ready to recover the right detail at the right time. This is the classic rate–distortion tradeoff of cognition: maximize useful signal, minimize wasted space.

Letta’s layered memory shows one way: a built-in hierarchy of memory caches, from the always-present essentials to the vast but faint echo of long-term archives. Anthropic’s multi-agent system shows another: multiple minds sharing the load, each mind a lossy compressor for a different subset of the task. And Cognition’s critique reminds us that compression without coordination can fail – the pieces have to ultimately fit together into a coherent whole.

Perhaps as AI agents evolve, we’ll see hybrid strategies. We might use multi-agent teams whose members share a common architectural memory (imagine subagents all plugged into a shared Letta-style archival memory, so they’re not flying blind with respect to each other). Or we might simply get better at single agents with enormous contexts and sophisticated internal compression mechanisms, making multi-agent orchestration unnecessary for most tasks. Either way, the direction is clear: to control and extend AI cognition, we are, in a very real sense, engineering the art of forgetting. By deciding what to forget and when to recall, an agent demonstrates what it truly understands. In artificial minds as in our own, memory is meaningful precisely because it isn’t perfect recording – it’s prioritized, lossy, and alive.

A2A Is For UI

2025-06-14 08:00:00

There’s a lot of skepticism around A2A, Google’s Agent-to-Agent protocol. A lot of that is well earned. I mean, they launched a protocol with zero implementations. But a lot’s changed, and it’s worth taking a look again.

I’d like to convince you that you should be thinking about A2A as a protocol for giving agents a UI. And that UI is a bridge into a more complex multi-agent world. Gotta start somewhere!

It’s Just HTTP

The protocol is just a single HTTP endpoint and an agent card (which can be statically served). Inside that single endpoint are JSON-RPC methods:

  • message/send & message/stream — Both send messages, one returns a stream of events (SSE). The first message implicitly creates a task.
  • tasks/resubscribe — For when you were doing message/stream but your connection broke.
  • tasks/get — If you want to poll. SSE isn’t for everyone, I guess. cURL works too.
  • tasks/pushNotifications/set & .../get — for webhooks, if that’s your thing

So basically, you create a task, and then you exchange messages with it. That’s it.
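To get a feel for the wire format, here is a rough client-side sketch of a message/send call. The JSON-RPC envelope is standard, but the param and part field names are paraphrased from memory; check the A2A schema before relying on them, and the endpoint URL is made up.

import httpx

payload = {
    "jsonrpc": "2.0",
    "id": "1",
    "method": "message/send",
    "params": {
        "message": {
            "role": "user",
            "parts": [{"kind": "text", "text": "Summarize yesterday's support tickets"}],
        }
    },
}
resp = httpx.post("https://agent.example.com/a2a", json=payload)  # hypothetical endpoint
print(resp.json())  # task state and/or the agent's reply messages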

Tasks are Actors

Uh, if you don’t know what actors are, this analogy might not help, but I’m going with it anyway.

Tasks are actors (think Erlang actors or Akka). The first time you send a message to an actor, a task (an actor) is implicitly created.

flowchart TD
  client((client))
  client--send msg-->box[implicit mailbox]
  box-->task--"also queued"-->client

Messages are processed one at a time, in the order they were received. Messages can mutate task state. But it doesn’t get crazy, because the interaction is very single-threaded (well, I guess you could process messages in parallel, but why?).
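If it helps, here is that mental model as a tiny Python sketch, not A2A library code: a task with a mailbox, handling one message at a time and mutating its own state.

import asyncio

class Task:
    """Illustrative actor-style task, not an A2A implementation."""

    def __init__(self):
        self.mailbox: asyncio.Queue = asyncio.Queue()   # the implicit mailbox
        self.state = {"status": "working", "history": []}

    async def run(self, reply) -> None:
        while True:
            msg = await self.mailbox.get()      # one message at a time, in arrival order
            self.state["history"].append(msg)   # messages can mutate task state
            await reply(f"got: {msg}")          # responses are queued back to the client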

UIs are Agents

I think of a UI as being an agent that happens to have a human behind it. Not an AI agent, but a human agent. The UI application code handles the computer part, the human handles the intelligence part.

Yes, A2A was designed for sending messages between AI agents, but we don’t currently live in a world where open-ended multi-agent systems are pervasive. We do live in a world where humans talk to agents. And that won’t ever really change, because agents aren’t super valuable if their work never makes it to a human.

A2A supports any data

Each message, in either direction, contains multiple parts, each of one of these types:

  • TextPart — plain text, think unprocessed LLM outputs
  • DataPart — think JSON or binary. The format is specified by the mime type
  • FilePart — like DataPart, but can be at a URL

So an agent can do things like mix plain LLM outputs with JSON outputs.
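So a single reply might carry all three, sketched below as the data a server would serialize. The exact field names (kind, data, file, etc.) are my approximation; consult the A2A type definitions for the real schema.

message = {
    "role": "agent",
    "parts": [
        {"kind": "text", "text": "Here are the top three anomalies I found:"},   # TextPart
        {"kind": "data", "data": {"anomalies": [101, 204, 309]}},                # DataPart
        {"kind": "file", "file": {                                               # FilePart
            "uri": "https://example.com/report.pdf",
            "mimeType": "application/pdf",
        }},
    ],
}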

Delegate state to Prefect or Temporal

One subtly difficult part of A2A is that it requires keeping state, potentially over long periods of time.

For example, an agent realizes the initiating user didn’t say enough, so it asks for clarification. People aren’t very good computers and while we sometimes respond quickly, sometimes we take minutes or hours, or even years. Or never.

How do you deal with that?

I’ve dealt with this by using systems like Temporal and Prefect. Both are sometimes called “workflow” systems, but can also be thought of as providing durable function execution.

Both are more interesting than most workflow systems because they also provide suspend & resume functionality. For example, in Prefect you can call await suspend_flow_run() and the flow will be completely shut down, occupying zero memory or CPU while the user is twiddling their thumbs.
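A sketch of what that looks like in Prefect (import path and resume mechanics vary by Prefect version; is_ambiguous and run_agent are hypothetical helpers):

from prefect import flow
from prefect.flow_runs import suspend_flow_run  # import path may differ across versions

@flow
async def handle_a2a_task(user_request: str):
    if is_ambiguous(user_request):
        # Send the clarification question back over A2A, then suspend.
        # The flow run shuts down completely: zero memory, zero CPU, until resumed.
        await suspend_flow_run()
    return await run_agent(user_request)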

The Shim

I pulled this diagram directly from FastA2A docs:

flowchart TB
  Server["HTTP Server"] <--> |Sends Requests / Receives Results| TM
  subgraph CC[Core Components]
    direction RL
    TM["TaskManager (coordinates)"] --> |Schedules Tasks| Broker
    TM <--> Storage
    Broker["Broker (queues & schedules)"] <--> Storage["Storage (persistence)"]
    Broker --> |Delegates Execution| Worker
  end
  Worker["Worker (implementation)"]

Note: I like FastA2A because it implements the HTTP endpoints as a Starlette app that you can easily mount right into your API alongside everything else. Also, it has basically nothing to do with Pydantic or Pydantic AI other than that it happens to be co-located in the same GitHub repository.

FastA2A clearly realizes there’s a state problem and so they created interfaces for dealing with it. Not only that, but these interfaces are a fairly standard architecture for workflow systems.

I’ve created simple shims for both Temporal and Prefect that use the workflow systems to implement the TaskManager, Storage and Broker. The idea being you could use either Prefect or Temporal, whichever you prefer, to quickly create a robust A2A-compatible agent.

They’re each ~100 lines of code, yet implement just about everything you’d want from a stateful system, from retries and deployments to observability and a management UI.

Where does this fit into your agent?

Say you’re following the Plan-Act-Verify flow that’s become popular:

flowchart TD
  client((client))
  client-->clarify[Clarify Question]-->Plan-->Act["Act (Query)"]-->Verify-->Plan
  Verify-->prepare[Prepare Report]-->client2((client))

All those boxes are things that need to happen once and only once (well, in a loop). Every agent has a slightly different take on this, but many boil down to some variant of this architecture. The workflows don’t have to be complicated (but by all means, they can be).

The point is, yes, A2A is stateful and statefulness can be hard. But it can be solved simply and cleanly by delegating to other hardened distributed systems that were designed to do this well.

A2A Versus MCP

Simply, MCP is for tools (function-like things with inputs and outputs). A2A is for when you need free-form communication. Hence why tasks look more like actors.

They also solve similar fan-out problems. MCP enables many tools to be used by few AI applications or agents. A2A enables many agents to be used by few user interfaces and other agents.

flowchart TD
  subgraph c[A2A Clients]
    teams[MS Teams]
    agentspace[Google AgentSpace]
    ServiceNow
  end
  subgraph m[MCP Servers]
    comp[Computer Use]
    search[Web Search]
    APIs
  end
  teams-->Agent[A2A-compatible Agent]
  agentspace-->Agent
  ServiceNow-->Agent
  Agent-->comp
  Agent-->search
  Agent-->APIs

Side note: AI engineering has become incredibly complex. You have to not only master AI tech, but also be a full-stack engineer and a data engineer. The emergence of A2A & MCP dramatically reduces the scope of an AI engineer, and that’s exciting on its own.

Implementation is picking up quickly

I’m going to finish this post by linking to a ton of products that are using A2A or soon will. My hope being that you’ll realize that now is a good time to get in on this.

A2A-compatible agents you can launch (server side)

Commercial / SaaS agents – live today

  • Google-built agents inside Vertex AI Agent Builder & Agentspace – e.g., Deep Research Agent, Idea Generation Agent; all expose an A2A JSON-RPC endpoint out of the box. (cloud.google.com, cloud.google.com, cloud.google.com)
  • SAP Joule Agents & Business Agent Foundation – Joule delegates work to SAP and non-SAP systems via A2A. (news.sap.com, architecture.learning.sap.com)
  • Box AI Agents – content-centric agents (contract analysis, form extraction) advertise themselves through A2A so external agents can call them. (developers.googleblog.com, blog.box.com)
  • Zoom AI Companion – meeting-scheduling and recap agents are now published as A2A servers on the Zoom developer platform. (instagram.com, uctoday.com)
  • UiPath Maestro agents – healthcare summarization, invoice triage, etc.; natively speak A2A for cross-platform automation. (uipath.com, itbrief.com.au)
  • Deloitte enterprise Gemini agents – 100+ production agents deployed for clients, exposed over A2A. (venturebeat.com)

Open-source agents & frameworks

  • LangGraph sample Currency-Agent, Travel-Agent, etc. (a2aprotocol.ai, github.com)
  • CrewAI – “crews” can publish themselves as remote A2A services (#2970). (github.com)
  • Semantic Kernel travel-planner & “Meeting Agent” demos. (devblogs.microsoft.com, linkedin.com)
  • FastA2A reference server (Starlette + Pydantic AI) – minimal A2A turnkey agent. (github.com)
  • Official a2a-samples repo – dozens of runnable Python & JS agents. (github.com)

Announced / on the roadmap

  • Salesforce Agentforce will “incorporate standard protocols like A2A” in upcoming releases. (medium.com, salesforce.com)
  • ServiceNow, Atlassian, Intuit, MongoDB, PayPal, Workday, Accenture and ~40 other partners listed by Google as “founding A2A agents.” (venturebeat.com)

Products that dispatch to A2A agents (client/orchestrator side)

Cloud platforms & orchestration layers

  • Azure AI Foundry – multi-agent pipelines can send tasks/send & tasks/stream RPCs to any A2A server. (microsoft.com, microsoft.com)
  • Microsoft Copilot Studio – low-code tool that now “securely invokes external agents” over A2A. (microsoft.com, microsoft.com)
  • Google Agentspace hub – lets knowledge workers discover, invoke, and chain A2A agents (including third-party ones). (cloud.google.com, cloud.google.com)
  • Vertex AI Agent Builder – generates dispatch stubs so your front-end or workflow engine can call remote A2A agents. (cloud.google.com)

Gateways & governance

  • MuleSoft Flex Gateway – Governance for Agent Interactions – policy enforcement, rate-limiting, and auth for outbound A2A calls. (blogs.mulesoft.com, docs.mulesoft.com)
  • Auth0 “Market0” demo – shows how to mint JWT-style tokens and forward them in authentication headers for A2A requests. (auth0.com)

Open-source dispatch tooling

  • Official A2A Python SDK (a2a-python) – full client API (tasks/send, SSE streaming, retries). (github.com)
  • a2a-js client library (part of the A2A GitHub org). (github.com)
  • n8n-nodes-agent2agent – drop-in nodes that let any n8n workflow call or await A2A agents. (github.com)

Coming soon

  • UiPath Maestro orchestration layer (already works internally, public A2A client API expanding). (linkedin.com)
  • Salesforce Agentforce Mobile SDK – upcoming SDK will be able to dispatch to external A2A agents from mobile apps. (salesforceben.com)
  • ServiceNow & UiPath cross-dispatch partnerships are in private preview. (venturebeat.com)

MCP Resources Are For Caching

2025-06-05 08:00:00

If your MCP client doesn’t support resources, it is not a good client.

There! I said it!

It’s because MCP resources are for improved prompt utilization, namely cache invalidation. Without resources, you eat through your context and token budget faster than Elon at a drug store. And so if your client doesn’t support it, you basically can’t do RAG with MCP. At least not in a way that anyone would consider production worthy.

RAG documents are BIG

You don’t want to duplicate files. See this here:

system prompt
user message with tool definitions
agent message with tool calls
user message with tool call results
giant file 1
giant file 2
another agent message with tool calls
user message with tool call results
giant file 2
giant file 3
...

That’s 2 tool calls. The second one contains a duplicate file.

Is this bad? If your answer is “no” then this blog post isn’t going to resonate with you.

Separate results from whole files

The core of it: A well-implemented app, MCP or not, will keep track of the documents returned from a RAG query and avoid duplicating them in the prompt. To do this, you keep a list of resource IDs that you’ve seen before (sure, call it a “cache”).

Format the RAG tool response in the prompt like so:

<result uri="rag://polar-bears/74.md" />
<result uri="rag://chickens/23.md" />

<full-text uri="rag://chickens/23.md">
Chickens are...
</full-text>

In other words:

  1. The return value of the function, to the LLM, is an array of resources
  2. The full text is included elsewhere, for reference

URIs are useful as a cache key.

btw I’m just spitballing what the prompt format should be for returning results. You can play around with it, you might already have strong opinions. The point is, mapping must be done.
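Here is a minimal sketch of that mapping in client code; the prompt format is just the one spitballed above, not anything mandated by MCP:

seen_uris: set[str] = set()

def render_rag_results(resources: list[dict]) -> str:
    """resources: items like {'uri': 'rag://chickens/23.md', 'text': 'Chickens are...'}."""
    lines = [f'<result uri="{r["uri"]}" />' for r in resources]
    for r in resources:
        if r["uri"] not in seen_uris:   # inline the full text only the first time it's seen
            seen_uris.add(r["uri"])
            lines.append(f'<full-text uri="{r["uri"]}">\n{r["text"]}\n</full-text>')
    return "\n".join(lines)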

MCP is not LLM-readable

There’s been a lot of discussion about whether LLMs can interpret OpenAPI just fine, and if so, why use MCP at all. That misses the entire point. MCP isn’t supposed to be interpreted directly by an LLM.

When you implement an MCP client, you should be mapping MCP concepts to whatever works for that LLM. This is called implementing the protocol. If you throw vanilla MCP objects into a prompt, it could actually work. But a good client is going to map the results to phrases & formats that particular LLM has gone through extraordinarily expensive training to understand.

MCP is a protocol

MCP standardizes how tools should return their results. MCP resources exist so that tools (e.g. RAG search) can return files, and clients can de-duplicate those files across many calls.

Yes, it’s cool that you can list a directory, but that’s not the primary purpose of resources. Without resources, your LLMs just eat more tokens unnecessarily.

(Side note: did you notice that neither Anthropic nor OpenAI supports resources in their APIs? It’s a conspiracy..)

Resources are table stakes MCP support

If a client doesn’t support MCP resources, it’s because they don’t care enough to implement a proper client. Period.

While I’m at it, prompts are just functions with special handling of the results. Might as well support those too.

UPDATE: tupac

I made a minimalist reference implementation of an MCP client. Feel free to check it out. Bare minimum, it’s extremely useful. It’s on GitHub and runs with a simple uvx command.

Discussion

I was wrong: AI Won't Overtake Software Engineering

2025-05-10 08:00:00

Back in January I wrote, Normware: The Decline of Software Engineering and, while I think it was generally well-reasoned, I was wrong. Or at least overly ambitious.

I predicted that software engineering as a profession is bound to decline and be replaced by less technical people with AI who are closer to the business problems. I no longer think that will happen, not for technical reasons but for social ones.

What Changed

I saw people code.

I wrote the initial piece after using Cursor’s agent a few times. Since then the tools have gotten even more powerful and I can reliably one-shot entire non-trivial apps. I told a PM buddy about how I was doing it and he wanted to try and… it didn’t work. Not at all.

What I learned:

  1. I’m using a lot of hidden technical skills
  2. Yes, anyone can do it, but few will

On the surface it was stuff like, I’m comfortable in the terminal, he was not. And I don’t freak out when I get a huge error. But also softer skills, like how I know what complex code looks like vs simple code (with AI coding, overly complex code will cause an agent to deadlock). Also, he tried including authentication in the earliest version (lol n00b).

For some people, those are merely roadblocks. I’ve talked to a few people with zero technical background who are absolutely crushing it with code right now. It’s hard, but they have the drive to push through the hard parts. Sure, they’ve got their fair share of total flops, but they have a strong will and push through.

Those are not common people. Most are weak, or just care about other things.

How It Will Happen

I suppose this scene hasn’t unfolded and maybe my first take was right after all. But I don’t think so.

It’s likely that AI improves dramatically and makes it seamless to generate any code at any time. That will certainly increase the pool of people willing to suffer through coding. But I don’t think it can shift enough such that the Normware vision pans out. Most people just aren’t interested.

Instead, I think we’ll see a steady decline of “boring code” jobs.

Someone at a very large tech company told me they worked on a (software engineering!!) team that did nothing but make configuration changes. That’s nuts. Over time, I think AI will chip away at these roles until they’re gone and replaced by code that engineers (say they) want to write. Early prototypes and demo-quality software is already being replaced by AI, and the trend will continue from that end as well.

Discussion

MCP is Unnecessary

2025-04-27 08:00:00

I can’t think of any strong technological reasons for MCP to exist. There’s a lot of weak technological reasons, and there’s strong sociological reasons. I still strongly feel that, ironically, it is necessary. I’m writing this post to force myself to clarify my own thoughts, and to get opinions from everyone else.

Misconception: MCP Doesn’t Go Into The Prompt

You absolutely can directly paste the JSON from an MCP tool declaration into a prompt. It’ll work, and it’s arguably better than doing the same with OpenAPI. But it’s JSON, extremely parseable, structured information, and most LLMs are trained to do function calling with some XML-like variant anyway.

An LLM tool declaration can look like:

  • Raw MCP/OpenAPI JSON
  • Formatted as XML
  • Use the tool calling APIs (e.g. OpenAI, Ollama)
  • Formatted as Python code (e.g. smolagents)

MCP is not concerned with what your prompt looks like. That is not a function of MCP.

Tool Libraries

MCP has two primary functions:

  1. Advertising tools
  2. Calling tools

It does a lot of other things (logging, sampling, etc.), but tool calling is the part that’s most frequently implemented and used.

You could accomplish the same thing with OpenAPI:

  1. Advertising tools: Always post the openapi.json file in the same place
  2. Calling tools: OpenAPI standardizes this part

This is even easier than you think. OpenAPI operations have an operationId that is usually set to the function name of the server API anyway.
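A simplified sketch of that mapping (real OpenAPI documents have more indirection around requestBody, refs, and security, and the spec URL here is made up):

import httpx

def tools_from_openapi(spec: dict) -> list[dict]:
    """Turn OpenAPI operations into generic tool declarations an LLM client could use."""
    tools = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            tools.append({
                "name": op.get("operationId", f"{method}_{path}"),  # usually the server-side function name
                "description": op.get("summary", ""),
                "parameters": op.get("parameters", []),             # requestBody omitted for brevity
                "method": method,
                "path": path,
            })
    return tools

spec = httpx.get("https://api.example.com/openapi.json").json()  # hypothetical URL
print([t["name"] for t in tools_from_openapi(spec)])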

Steelman: OpenAPI APIs Are Too Granular

This is a good argument, at least on the surface. Here’s an example of a typical API representing an async task:

graph TD
  c((client))-->start_job
  c-->poll_status
  c-->get_result

You can wrap all of that into one single MCP operation. One operation is better than three because it removes the possibility of the LLM behaving incorrectly.

graph TD
  subgraph MCP
    c
  end
  client((client))-->c
  c[do_job]-->start_job
  c-->poll_status
  c-->get_result
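Here’s a hedged sketch of that wrapping using the Python MCP SDK’s FastMCP helper (I’m assuming its @mcp.tool() decorator; start_job, poll_status, and get_result stand in for the underlying granular API):

import time
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("jobs")

@mcp.tool()
def do_job(params: str) -> str:
    """Run the whole job and return its result, hiding the start/poll/get dance from the LLM."""
    job_id = start_job(params)                 # hypothetical client for the underlying API
    while poll_status(job_id) != "done":
        time.sleep(1)
    return get_result(job_id)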

Okay, but why does this have to be MCP? Why can’t you do the same thing with OpenAPI?

Steelman: MCP Is Optimized For LLMs

Yes, most APIs don’t work well directly in LLM prompts because they’re not designed or documented well.

There’s great tooling in the MCP ecosystem for composing servers and operations, enhancing documentation, etc. So on the surface, it seems like MCP is an advancement in API design and documentation.

But again, why can’t OpenAPI also be that advancement? There’s no technological reason.

Steelman: MCP Is A Sociological Advancement

Here’s the thing. Everything you can do with MCP you can do with OpenAPI. But..

  1. It’s not being done
  2. There’s too many ways to do it

Why isn’t it being done? In the example of the async API, the operation might take a very long time, hence why it’s an async API. There’s no technical reason why APIs can’t take a long time. In fact, MCP implements tool calls via Server Sent Events (SSE). OpenAPI can represent SSE.

The reason we don’t do OpenAPI that way is because engineering teams have been conditioned to keep close watch on operation latency. If an API operation takes longer than a few hundred milliseconds, someone should be spotting that on a graph and diagnosing the cause. There’s a lot of reasons for this, but it’s fundamentally sociological.

SSE is a newer technology. When we measure latency with SSE operations, we measure time-to-first-byte. So it’s 100% solvable, but async APIs are more familiar, so we just do that.

Steelman: One Way To Do Things

The absolute strongest argument for MCP is that there’s mostly only a single way to do things.

If you want to waste an entire day of an engineering team’s time, go find an arbitrary API POST operation and ask, “but shouldn’t this be a PUT?” You’ll quickly discover that HTTP has a lot of ambiguity. Even when things are clear, they don’t always map well to how we normally think, so it gets implemented inconsistently.

MCP                   OpenAPI
function call         resources, PUT/POST/DELETE
function parameters   query args, path args, body, headers
return value          SSE, JSON, web sockets, etc.

Conclusion: Standardization Is Valuable

Standards are mostly sociological advancements. Yes, they concern technology, but they govern how society interacts with them. The biggest reason for MCP is simply that everyone else is doing it. Sure, you can be a purist and demand that OpenAPI is adequate, but how many clients support it?

The reason everyone is agreeing on MCP is because it’s far smaller than OpenAPI. Everything in the tools part of an MCP server is directly isomorphic to something else in OpenAPI. In fact, I can easily generate an MCP server from an openapi.json file, and vice versa. But MCP is far smaller and purpose-focused than OpenAPI is.

Discussion

Inner Loop Agents

2025-04-19 08:00:00

What if an LLM could use tools directly? As in, what if LLMs executed tool calls without going back to the client? That’s the idea behind inner loop agents. It’s a conceptual shift. Instead of thinking of agents as a system involving client & server, you just have a single entity, the LLM. I hope it will help clarify how o3 and o4-mini work.

(note: this post isn’t as long as it looks, there’s a lot of diagrams and examples)

To illustrate, regular LLMs rely on the client to parse and execute tools, like this:

graph TD
  subgraph inn["LLM (Inner Loop)"]
    Tokenizer-->nn[Neural Net]-->samp[Select Next Token]-->Tokenizer
  end
  text((Input))-->Tokenizer
  parse--->out((Output))
  samp-->parse[Parse Tool Calls]-->exec[Run Tools]-->parse
  parse--"tool result"-->Tokenizer

On the other hand, with inner loop agents, the LLM can parse and execute tools on its own, like this:

graph TD
  subgraph inn["Inner Loop Agent"]
    direction TB
    Tokenizer
    nn[Neural Net]
    samp[Select Next Token]
    parse[Parse Tool Calls]
    exec[Run Tools]
  end
  text((Input)) --> Tokenizer
  Tokenizer --> nn --> samp --> parse
  parse --> exec -->parse
  parse -----> Tokenizer
  parse ---> out((Output))

The LLM Operating Software (Ollama, vLLM, etc)

In these diagrams, the LLM is emitting text that looks like this:

System: You are an agent with access to the following tools:

<tool name="google_maps" description="Look up directions between two places on Google Maps">
    <param name="begin" description="The starting point of the trip"/>
    <param name="end" description="The ending point of the trip"/>
</tool>


User: How do you drive from Raleigh, NC to Greene, NY?


Assistant: To do this, I will use my Google Maps tool.

<tool name="google_maps">
    <param name="begin">Raleigh, NC</param>
    <param name="end">Greene, NY</param>
</tool>
<|eot|>

The LLM only generates the text after "Assistant:".

That <|eot|> is a special token that the LLM is trained to emit as a way to signal that it’s done.

The software you’re using to run your LLM, e.g. Ollama, vLLM, OpenAI, Anthropic, etc., is responsible for running this loop. It parses the LLM output and stops the loop when it runs into a <|eot|> token.

If you use the tool calling APIs (Ollama, OpenAI), Ollama will parse out the tool call and return it as JSON in the API response.

Ollama and vLLM are special in that they support a lot of different models. Some models are trained to represent tool calls with XML, others with JSON, others with something else entirely. Ollama and vLLM abstract that away by allowing the model to configure how it wants to represent tool calls. It doesn’t much matter what the format is, only that you’re consistent with how the model was trained.
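For a regular (non-inner-loop) model, the loop the client runs looks roughly like this; llm.chat and run_tool are hypothetical wrappers around whatever API and tool dispatcher you use:

def agent_loop(llm, messages: list[dict], tools: list[dict]) -> str:
    while True:
        reply = llm.chat(messages, tools=tools)          # one round trip to the LLM
        if not reply.tool_calls:                         # model stopped at <|eot|> with no tool call
            return reply.text
        messages.append({"role": "assistant", "content": reply.text,
                         "tool_calls": reply.tool_calls})
        for call in reply.tool_calls:                    # the client parses and runs each tool...
            result = run_tool(call.name, call.args)
            messages.append({"role": "tool", "name": call.name, "content": result})
        # ...then loops back, making another API request with the tool results appended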

Why Are Inner Loop Agents Good?

Okay, so inner loop agents still do all that parsing. The only difference is that they handle the tool calling themselves instead of letting the client handle the tool call and make another API request.

But why?

The most compelling reason to do this is so that the LLM can call tools concurrently with its thinking process.

If you’ve gotten a chance to use an agent, like Deep Research or o3, you’ll notice that its thought process isn’t just inner dialog, it’s also tool calls like web searches. That’s the future of agents.

Trained With Tools

o3 and o4-mini are special because they’re trained to be agentic models.

In reinforcement learning, the model is given a problem to solve and rewarded for good behavior, like getting the right answer or at least getting the format right. For example, the R1 paper discussed rewarding the model for staying in English if the question was given in English.

Here’s a diagram of reinforcement learning:

graph TD
  input((Problem))
  subgraph LLM
    tok[Tokenizer]-->nn[Neural Net]-->samp[Select Next Token]-->tok
  end
  input-->tok
  samp-->out[Output]-->reward[Calculate Reward]-->update[Update Model Weights]-->next((Next Problem))
  update----->nn

With inner loop agents, you would change the above diagram to include tools in the yellow box, in the inner loop. The model is still rewarded for the same things, like getting the right result, but since tools are included you’re simultaneously reinforcing the model’s ability to use its tools well.

It’s clear to me that o3 was trained to use its web search tool. I believe they even said that, although I might be remembering wrong. It’s certainly the generally accepted view.

Today, LLMs can do all this, if they’re trained for tool use. What changes is that the model becomes good at using the tools. Tool use isn’t just possible; tools are used at the optimal time in order to solve the problem in the best possible way.

Optimal tool use. Hmm… Almost sounds like art.

Emergent Tool Use

The agentic models today (o3, o4-mini, Claude Sonnet) are only trained to use a small set of specific tools.

Web search & bash usage are cool and all, but what would be truly powerful is if one of these inner loop agents were trained to use tools that regular people use. Like, what if it could submit a purchase order, or analyze a contract to understand if I can make the supplier cover the tariffs? Or to use a tool to navigate an org chart and guess who I need to talk to.

Model Context Protocol (MCP) was designed to support diverse tool use. All you have to do to get an LLM to use your API is build an MCP server. Anyone can then use your API from their own AI apps. Cool.

But the LLM wasn’t trained to use your tool. It was only trained to use tools, generically. It just follows the tool call format, but it hasn’t been optimized for using those tools to solve a problem.

Emergent tool use would mean that an LLM could pick up any MCP description and use the tool effectively to solve a problem.

This isn’t planning.

Let’s say you’re doing woodworking and you get a new chisel. You can read all you want on when and how you’re supposed to use the chisel, but ultimately it takes experience to know what kind of results you can expect from it. And once you fully understand the tool, then you can include it in your planning.

Emergent tool use hasn’t happened yet, as far as I know. I hope it’ll happen, but it seems unlikely that an LLM can discover the finer points of how to use a tool just from reading the manual, without any training.

Trained Tool Use

Until emergent tool use happens, we have two options:

  1. Use MCP description fields to carefully explain how the tool is used and hope for the best.
  2. Inner loop agents. Train a model with your tool.

Right now, those options are our future.

If you want an agent, you can prototype by prompting it to use tools well. But ultimately, to build a high-quality agent as a product, you’ll likely need to train a model to use your tools effectively.

Agent To Agent

Google recently released Agent2Agent (A2A), a protocol that facilitates interactions between agents.

My hunch is that this level of protocol will become critical. If people take inner loop agents seriously, it’ll be difficult to always use the state-of-the-art models. Instead, each agent will be using its own LLM, because training is expensive and slow.

A protocol like A2A allows each of these fine tuned LLM agents to communicate without forcing yourself into LLM dependency hell.

Conclusion

That’s inner loop agents.

One big note is that even if you’re training an LLM with tools, the tools don’t actually have to be executed on the same host that’s running the LLM. In fact, that’s unlikely to be the case. So, inner loop vs. not inner loop is not really the part that matters. It’s all about whether or not the LLM was trained to use tools.

Discussion