Rss preview of Blog of Tim Kellogg

GPT-5 failed the wrong test

2025-08-08 08:00:00

This post isn’t really about GPT-5. Sure, it launched and people are somewhat disappointed. It’s the why that bugs me.

They expected AGI, the AI god, but instead got merely the best model in the world. v disapointng

A few days before the GPT-5 launch I read this paper, Agentic Web: Weaving the Next Web with AI Agents. It’s not my normal kind of paper, it’s not very academic. There’s no math in it, no architecture. It just paints a picture of the future.

And that’s the lens I saw GPT-5 through.

The paper describes three eras of the internet:

PC Era — Wikipedia, Craig’s List, etc.; users actively seek information
Mobile/Social Era — Tik Tok, Insta, etc.; content is pushed via recommendation algorithms
Agentic Web — user merely expresses intent

When I weigh the strengths of GPT-5, it feels poised and ready for the agentic web.

How do I vibe test an LLM?

I use it. If it changes how I work or think, then it’s a good LLM.

o3 dramatically changed how I work. GPT-4 did as well. GPT-5 didn’t, because it’s the end of the line. You can’t really make a compelling LLM anymore, they’re all so good most people can’t tell them apart. Even the tiny ones.

I talked to a marketing person this week. I showed them Claude Code. They don’t even write code, but they insisted it was 10x better than any model they’d used before, even Claude. I’d echo the same thing, there’s something about those subagents, they zoom.

Claude Code is software.

Sure, there’s a solid model behind it. But there’s a few features that make it really tick. Replicate those and you’re well on you’re way.

GPT-5 is for the agentic web

The first time I heard agentic web I almost vomited in my mouth. It sounds like the kind of VC-induced buzzword cess that I keep my distance from.

But this paper..

I want AI to do all the boring work in life. Surfing sites, research, filling out forms, etc.

Models like GPT-5 and gpt-oss are highly agentic. All the top models are going in that direction. They put them in a software harness and apply RL and update their weights accordingly if they used their tools well. They’re trained to be agents.

I hear a lot of criticism of GPT-5, but none from the same people who recognize that it can go 2-4 hours between human contact while working on agentic tasks. Whoah.

GPT-5 is for the agentic web.

yeah but i hate ads

Well okay, me too. Not sure where that came from but I don’t think that’s where this is going. Well, it’s exactly where it’s going, but not in the way you’re thinking.

The paper talks about this. People need to sell stuff, that won’t change. They want you to buy their stuff. All that is the same.

The difference is agents. In the agentic web, everything is mediated by agents.

You don’t search for a carbon monoxide monitor, you ask your agent to buy you one. You don’t even do that, your agent senses it’s about to die and suggests that you buy one, before it wakes you up in the middle of the night (eh, yeah, sore topic for me).

You’re a seller and you’re trying to game the system? Ads manipulate consumers, but consumers aren’t buying anymore. Who do you manipulate? Well, agents. They’re the ones making the decisions in the agentic web.

The paper calls this the Agent Attention Economy, and it operates under the same constraints. Attention is still limited, even agent attention, but you need them to buy your thing.

The paper makes some predictions, they think there will be brokers (like ad brokers) that advertise agents & resources to be used. So I guess you’d game the system by making your product seem more useful or better than it is, so it looks appealing to agents and more agents use it.

I’m not sure what that kind of advertising would look like. Probably like today’s advertising, just more invisible.

Benchmarks

The only benchmark that matters is how much it changes life.

At this point, I don’t think 10T parameters is really going to bump that benchmark any. I don’t think post-training on 100T tokens of math is going to change much.

I get excited about software. We’re at a point where software is so extremely far behind the LLMs. Even the slightest improvements in an agent harness design yield outsized rewards, like how Claude Code is still better than OpenAI codex-cli with GPT-5, a better coding model.

My suspicion is that none of the AI models are going to seem terribly appealing going forward without massive leaps in the software harness around the LLM. The only way to really perceive the difference is how it changes your life, and we’re long past where a pure model can do that.

Not just software, but also IT infrastructure. Even small questions like, “when will AI get advertising?” If an AI model literally got advertising baked straight into the heart of the model, that would make me sad. It means the creator’s aren’t seeing the same vision.

We’ve talked a lot about the balance between pre-training and post-training, but nobody seems to be talking about the balance between LLMs and their harnesses.

Areas for growth

Before we see significant improvement in models, we’re going to need a lot more in:

Memory — stateful agents that don’t forget you
Harnesses — the software around the LLM inside the agent
Networking & infra — getting agents to discover and leverage each other

Probably several other low-hanging areas.

Discussion

Bluesky

Explainer: K2 & Math Olympiad Golds

2025-07-19 08:00:00

Feeling behind? Makes sense, AI moves fast. This post will catch you up.

The year of agents

First of all, yes, ‘25 is the year of agents. Not because we’ve achieved agents, but because we haven’t. It wouldn’t be worth talking about if we were already there. But there’s been a ton of measurable progress toward agents.

Timeline

The last 6 months:

Jan 20: DeepSeek R1 launched — Open source thinking model, performing near SOTA at the time
Feb 2: Deep Research launched — An agent that uses tools
Feb 19: Grok 3 — a huge 2T+ model, the first
March 26: OpenAI adopts MCP — MCP starts to become mainstream
April 16: o3 & o4-mini — First notable “agentic” models available in an API
April 29: The sycophancy epidemic in GPT-4o
April 30: DeepSeek Prover — Trained to use an automated proof assistant, Lean, to do math
May 22: Claude-4 — huge 2T+ thinking models that only think when necessary
June 10: o3 prices cut by 80% — Which makes us wonder how small these models can be?
June 13: Cognition vs Anthropic: Don’t Build Multi-Agents/How to Build Multi-Agents — ”context engineering” emerges as a term
July 9: Grok 4 — huge 2T+ thinking multi-agent that’s still has the top HLE score
July 12: K2 — Huge 1T open weights agentic model that isn’t a thinking model
July 17: OpenAI Agent — agentic o3 variant (maybe o4??) that spans computer use, code & MCP
July 19: International Math Olympiad Gold — Best math model but doesn’t use tools

Is ‘thinking’ necessary?

Obviously it is, right?

Back in January, we noticed that when a model does Chain of Thought (CoT) “thinking”, it elicits these behaviors:

Self-verification
Sub-goal setting
Backtracking (undoing an unfruitful path)
Backward chaining (working backwards)

All year, every person I talked to assumed thinking is non-negotiable for agents. Until K2.

K2 is an agentic model, meaning it was trained to solve problems using tools. It performs very well on agentic benchmarks, but it doesn’t have a long thought trace. It was so surprising that I thought I heard wrong and it took a few hours to figure out what the real story was.

For agents, this is attractive because thinking costs tokens (which cost dollars). If you can accomplish a task in fewer tokens, that’s good.

What to watch

More models trained like K2

Tool usage connects the world

R1 and o1 were trained to think, but o3 was trained to use tools while it’s thinking. That’s truly changed everything, and o3 is by far my favorite model of the year. You can just do things.

MCP was a huge jump toward agents. It’s a dumb protocol, leading a lot of people to misunderstand what the point is. It’s just a standard protocol for letting LLMs interact with the world. Emphasis on standard.

The more people who use it, the more useful it becomes. When OpenAI announced MCP support, that established full credibility for the protocol.

K2 tackled the main problem with MCP. Since it’s standard, that means anyone can make an MCP server, and that means a lot of them suck. K2 used a special system during training that generated MCP tools of all kinds. Thus, K2 learned how to learn how to use tools.

That pretty much covers our current agent challenges.

What to watch

More models trained like K2
MCP adoption

Are tools necessary?

In math, we made a lot of progress this year in using a tool like a proof assistant. e.g. DeepSeek-Prover v2 was trained to write Lean code and incrementally fix the errors & output. That seemed (and still does) like a solid path toward complex reasoning.

But today, some OpenAI researchers informally announced on X that their private model won gold in the International Math Olympiad. This is a huge achievement.

But what makes it surprising is that it didn’t use tools. It relied on only a monstrous amount of run-time “thinking” compute, that’s it.

Clearly stated: Next token prediction (what LLMs do) produced genuinely creative solutions requiring high levels of expertise.

If LLMs can be truly creative, that opens a lot of possibilities for agents. Especially around scientific discovery.

What to watch

This math olympiad model. The implications are still unclear. It seems it’s more general than math.

Huge vs Tiny

Which is better?

On the one hand, Opus-4, Grok 4 & K2 are all huge models that have a depth that screams “intelligence”. On the other hand, agentic workloads are 24/7 and so the cheaper they are, the better.

Furthermore, there’s a privacy angle. A model that runs locally is inherently more private, since the traffic never leaves your computer.

What to watch

Mixture of Experts (MoE). e.g. K2 is huge, but only uses a very small portion (32B), which means it uses less compute than a lot of local models. This might be the secret behind o3’s 80% price drop.
OpenAI open weights model is expected to land in a couple weeks. It likely will run on a laptop and match at least o3-mini (Jan 31).
GPT-5, expected this fall, is described to be a mix huge & tiny, applying the right strength at the right time

Context engineering & Sycophancy

The biggest shifts this year have arguably been not in the model but in engineering. The flagship change is the emergence of the term context engineering as replacement for prompt engineering.

It’s an acknowledgement that “prompt” isn’t just a block of text. It also comes from tool documentation, RAG databases & other agents. The June multi-agent debate was about how managing context between agents is really hard.

Also, while some are saying, “don’t build multi-agents”, Claude Code launches subagents all the time for any kind of research or investigation task, and is the top coding agent right now.

Similarly, sycophancy causes instability in agents. Many are considering it a top problem, on par with hallucination.

What to watch

Memory — stateful agents (e.g. those built on Letta) are phenonomally interesting but are difficult to build. If done well, it solves a lot of context engineering.
Engineering blogs. As we gain more experience with these things, it’ll become apparent how to do it well.

Going forward…

And all that is seriously skipping over a lot. Generally, ‘25 has shifted more time into engineering (instead of research). Alternately, model development is starting to become product development instead of just research.

What will happen in the second half of ‘25? Not sure, but I can’t wait to find out.

Discussion

Do LLMs understand?

2025-07-18 08:00:00

I’ve avoided this question because I’m not sure we understand what “understanding” is. Today I spent a bit of time, and I think I have a fairly succinct definition:

An entity can understand if it builds a latent model of reality. And:

Can Learn: When presented with new information, the latent model grows more than the information presented, because it’s able to make connections with parts of it’s existing model.

Can Deviate: When things don’t go according to plan, it can use it’s model to find an innovative solution that it didn’t already know, based on it’s latent model.

Further, the quality of the latent model can be measured by how coherent it is. Meaning that, if you probe it in two mostly unrelated areas, it’ll give answers that are logically consistent with the latent model.

I think there’s plenty of evidence that LLMs are currently doing all of this.

But first..

Latent Model

Mental model. That’s all I mean. Just trying to avoid anthropomorphizing more than necessary.

This is the most widely accepted part of this. Latent just means that you can’t directly observe it. Model just means that it’s a system of approximating the real world.

For example, if you saw this:

You probably identify it immediately as a sphere even though it’s just a bunch of dots.

A latent model is the same thing, just less observable. Like you might hold a “map” of your city in your head. So if you’re driving around and a street gets shut down, you’re not lost, you just refer to your latent model of your city and plan a detour. But it’s not exactly a literal image like Google maps. It’s just a mental model, a latent model.

Sycophancy causes incoherence

From 1979 to 2003, Saddam Hussein surrounded himself with hand‑picked yes‑men who, under fear of death, fed him only flattering propaganda and concealed dire military or economic realities. This closed echo chamber drove disastrous miscalculations—most notably the 1990 invasion of Kuwait and his 2003 standoff with the U.S.—that ended in his regime’s collapse and his own execution.

Just like with Saddam, sycophancy causes the LLM to diverge from it’s true latent model, which causes incoherence. And so, the amount of understanding decreases.

Embedding models demonstrate latent models

Otherwise they wouldn’t work.

The word2vec paper famously showed that the embedding of “king - man + woman” is close to the embedding for “queen” (in embedding space). In other words, embeddings model the meaning of the text.

That was in 2015, before LLMs. It wasn’t even that good then, and the fidelity of that latent model has dramatically increased with the scale of the model.

In-context learning (ICL) demonstrates they can learn

ICL is when you can teach a model new tricks at runtime simply by offering examples in the prompt, or by telling it new information.

In the GPT-3 paper they showed that ICL improved as they scaled the model up from 125M to 175B. When the LLM size increases, it can hold a larger and more complex latent model of the world. When presented with new information (ICL), the larger model is more capable of acting correctly on it.

Makes sense. The smarter you get, the easier it is to get smarter.

Reasoning guides deviation

When models do Chain of Thought (CoT), they second guess themselves, which probes it’s own internal latent model more deeply. In (2), we said that true understanding requires that the LLM can use it’s own latent model of the world to find innovative solutions to unplanned circumstances.

A recent Jan-2025 paper shows that this is the case.

Misdirection: Performance != Competance

A large segment of the AI-critical use this argument as evidence. Paraphrasing:

Today’s image-recognition networks can label a photo as “a baby with a stuffed toy,” but the algorithm has no concept of a baby as a living being – it doesn’t truly know the baby’s shape, or how that baby interacts with the world.

This was in 2015 so the example seems basic, but the principle is still being applied in 2025.

The example is used to argue that AI isn’t understanding, but it merely cherry-picks a single place where the AI’s latent model of the world is inconsistent with reality.

I can cherry pick examples all day long of human’s mental model diverging from reality. Like you take the wrong turn down a street and it takes you across town. Or you thought the charasmatic candidate would do good things for you. On and on.

Go the other way, prove that there are areas where AI’s latent model matches reality.

But that’s dissatisfying, because dolphins have a mental model of the sea floor, and tiny ML models have areas where they do well, and generally most animals have some aspect of the world that they understand.

Conclusion

Why are we arguing this? I’m not sure, it comes up a lot. I think a large part of it is human exceptionalism. We’re really smart, so there must be something different about us. We’re not just animals.

But more generally, AI really is getting smart, to a point that starts to feel more uncomfortable as it intensifies. We have to do something with that.

Layers of Memory, Layers of Compression

2025-06-15 08:00:00

Recently, Anthropic published a blog post detailing their multi-agent approach to building their Research agent. Also, Cognition wrote a post on why multi-agent systems don’t work today. The thing is, they’re both saying the same thing.

At the same time, I’ve been enthralled watching a new bot, Void, interact with users on Bluesky. Void is written in Letta, an AI framework oriented around memory. Void feels alive in a way no other AI bot I’ve encountered feels. Something about the memory gives it a certain magic.

I took some time to dive into Letta’s architecture and noticed a ton of parallels with what the Anthropic and Cognition posts were saying, around context management. Letta takes a different approach.

Below, I’ve had OpenAI Deep Research format our conversation into a blog post. I’ve done some light editing, adding visuals etc., but generally it’s all AI. I appreciated this, I hope you do too.

When an AI agent “remembers,” it compresses. Finite context windows force hard choices about what to keep verbatim, what to summarize, and what to discard. Letta’s layered memory architecture embraces this reality by structuring an agent’s memory into tiers – each a lossy compression of the last. This design isn’t just a storage trick; it’s an information strategy.

Layered Memory as Lossy Compression

Letta (formerly MemGPT) splits memory into four memory blocks: core, message buffer, archival, and recall. Think of these as concentric rings of context, from most essential to most expansive, similar to L1, L2, L3 cache on a CPU:

flowchart TD subgraph rec[Recall Memory] subgraph arch[Archival Memory] subgraph msg[Message Buffer] Core[Core Memory] end end end

Core memory holds the agent’s invariants – the system persona, key instructions, fundamental facts. It’s small but always in the prompt, like the kernel of identity and immediate purpose.
Message buffer is a rolling window of recent conversation. This is the agent’s short-term memory (recent user messages and responses) with a fixed capacity. As new messages come in, older ones eventually overflow.
Archival memory is a long-term store, often an external vector database or text log, where overflow messages and distilled knowledge go. It’s practically unbounded in size, but far from the model’s immediate gaze. This is highly compressed memory – not compressed in ZIP-file fashion, but in being irrelevant by default until needed.
Recall memory is the retrieval buffer. When the agent needs something from the archive, it issues a query; relevant snippets are loaded into this block for use. In effect, recall memory “rehydrates” compressed knowledge on demand.

How it works: On each turn, the agent assembles its context from core knowledge, the fresh message buffer, and any recall snippets. All three streams feed into the model’s input. Meanwhile, if the message buffer is full, the oldest interactions get archived out to long-term memory.

Later, if those details become relevant, the agent can query the archival store to retrieve them into the recall slot. What’s crucial is that each layer is a lossy filter: core memory is tiny but high-priority (no loss for the most vital data), the message buffer holds only recent events (older details dropped unless explicitly saved), and the archive contains everything in theory but only yields an approximate answer via search. The agent itself chooses what to promote to long-term storage (e.g. summarizing and saving a key decision) and what to fetch back.

It’s a cascade of compressions and selective decompressions.

Rate–distortion tradeoff: This hierarchy embodies a classic principle from information theory. With a fixed channel (context window) size, maximizing information fidelity means balancing rate (how many tokens we include) against distortion (how much detail we lose).

Letta’s memory blocks are essentially a rate–distortion ladder. Core memory has a tiny rate (few tokens) but zero distortion on the most critical facts. The message buffer has a larger rate (recent dialogue in full) but cannot hold everything – older context is distorted by omission or summary. Archival memory has effectively infinite capacity (high rate) but in practice high distortion: it’s all the minutiae and past conversations compressed into embeddings or summaries that the agent might never look at again.

The recall stage tries to recover (rehydrate) just enough of that detail when needed. Every step accepts some information loss to preserve what matters most. In other words, to remember usefully, the agent must forget judiciously.

This layered approach turns memory management into an act of cognition.

Summarizing a chunk of conversation before archiving it forces the agent to decide what the gist is – a form of understanding. Searching the archive for relevant facts forces it to formulate good queries – effectively reasoning about what was important. In Letta’s design, compression is not just a storage optimization; it is part of the thinking process. The agent is continually compressing its history and decompressing relevant knowledge as needed, like a human mind generalizing past events but recalling a specific detail when prompted.

flowchart TD U[User Input] ---> LLM CI[Core Instructions] --> LLM RM["Recent Messages
(Short-term Buffer)"] --> LLM RS["Retrieved Snippets
(Recall)"] --> LLM LLM ----> AR[Agent Response] RM -- evict / summarize --> VS["Vector Store
(Archival Memory)"] LLM -- summarize ---> VS VS -- retrieve --> RS

Caption: As new user input comes in, the agent’s core instructions and recent messages combine with any retrieved snippets from long-term memory, all funneling into the LLM. After responding, the agent may drop the oldest message from short-term memory into a vector store, and perhaps summarize it for posterity. The next query might hit that store and pull up the summary as needed. The memory “cache” is always in flux.

One Mind vs. Many Minds: Two Approaches to Compression

The above is a single-agent solution: one cognitive entity juggling compressed memories over time. An alternative approach has emerged that distributes cognition across multiple agents, each with its own context window – in effect, parallel minds that later merge their knowledge.

Anthropic’s recent multi-agent research system frames intelligence itself as an exercise in compression across agents. In their words, “The essence of search is compression: distilling insights from a vast corpus.” Subagents “facilitate compression by operating in parallel with their own context windows… condensing the most important tokens for the lead research agent”.

Instead of one agent with one context compressing over time, they spin up several agents that each compress different aspects of a problem in parallel. The lead agent acts like a coordinator, taking these condensed answers and integrating them.

This multi-agent strategy acknowledges the same limitation (finite context per agent) but tackles it by splitting the work. Each subagent effectively says, “I’ll compress this chunk of the task down to a summary for you,” and the lead agent aggregates those results.

It’s analogous to a team of researchers: divide the topic, each person reads a mountain of material and reports back with a summary so the leader can synthesize a conclusion. By partitioning the context across agents, the system can cover far more ground than a single context window would allow.

In fact, Anthropic found that a well-coordinated multi-agent setup outperformed a single-agent approach on broad queries that require exploring many sources. The subagents provided separation of concerns (each focused on one thread of the problem) and reduced the path-dependence of reasoning – because they explored independently, the final answer benefited from multiple compressions of evidence rather than one linear search.

However, this comes at a cost.

Coordination overhead and consistency become serious challenges. Cognition’s Walden Yan argues that multi-agent systems today are fragile chiefly due to context management failures. Each agent only sees a slice of the whole, so misunderstandings proliferate.

One subagent might interpret a task slightly differently than another, and without a shared memory of each other’s decisions, the final assembly can conflict or miss pieces. As Yan puts it, running multiple agents in collaboration in 2025 “only results in fragile systems. The decision-making ends up being too dispersed and context isn’t able to be shared thoroughly enough between the agents.” In other words, when each subagent compresses its piece of reality in isolation, the group may lack a common context to stay aligned.

In Anthropic’s terms, the “separation of concerns” cuts both ways: it reduces interference, but also means no single agent grasps the full picture. Humans solve this by constant communication (we compress our thoughts into language and share it), but current AI agents aren’t yet adept at the high-bandwidth, nuanced communication needed to truly stay in sync over long tasks.

Cognition’s solution? Don’t default to multi-agent. First try a simpler architecture: one agent, one continuous context. Ensure every decision that agent makes “sees” the trace of reasoning that led up to it – no hidden divergent contexts.

Of course, a single context will eventually overflow, but the answer isn’t to spawn independent agents; it’s to better compress the context. Yan suggests using an extra model whose sole job is to condense the conversation history into “key details, events, and decisions.”

This summarized memory can then persist as the backbone context for the main agent. In fact, Cognition has fine-tuned smaller models to perform this kind of compression reliably. The philosophy is that if you must lose information, lose it intentionally and in one place – via a trained compressor – rather than losing it implicitly across multiple agents’ blind spots.

This approach echoes Letta’s layered memory idea: maintain one coherent thread of thought, pruning and abstracting it as needed, instead of forking into many threads that might diverge.

Conclusion: Compression is Cognition

In the end, these approaches converge on a theme: intelligence is limited by information bottlenecks, and overcoming those limits looks a lot like compression. Whether it’s a single agent summarizing its past and querying a knowledge base, or a swarm of subagents parceling out a huge problem and each reporting back a digest, the core challenge is the same.

An effective mind (machine or human) can’t and shouldn’t hold every detail in working memory – it must aggressively filter, abstract, and encode information, yet be ready to recover the right detail at the right time. This is the classic rate–distortion tradeoff of cognition: maximize useful signal, minimize wasted space.

Letta’s layered memory shows one way: a built-in hierarchy of memory caches, from the always-present essentials to the vast but faint echo of long-term archives. Anthropic’s multi-agent system shows another: multiple minds sharing the load, each mind a lossy compressor for a different subset of the task. And Cognition’s critique reminds us that compression without coordination can fail – the pieces have to ultimately fit together into a coherent whole.

Perhaps as AI agents evolve, we’ll see hybrid strategies. We might use multi-agent teams whose members share a common architectural memory (imagine subagents all plugged into a shared Letta-style archival memory, so they’re not flying blind with respect to each other). Or we might simply get better at single agents with enormous contexts and sophisticated internal compression mechanisms, making multi-agent orchestration unnecessary for most tasks. Either way, the direction is clear: to control and extend AI cognition, we are, in a very real sense, engineering the art of forgetting. By deciding what to forget and when to recall, an agent demonstrates what it truly understands. In artificial minds as in our own, memory is meaningful precisely because it isn’t perfect recording – it’s prioritized, lossy, and alive.

A2A Is For UI

2025-06-14 08:00:00

There’s a lot of skepticism around A2A, Google’s Agent-to-Agent protocol. A lot of that is well earned. I mean, they launched a protocol with zero implementations. But a lot’s changed, and it’s worth taking a look again.

I’d like to convince you that you should be thinking about A2A as a protocol for giving agents a UI. And that UI is a bridge into a more complex multi-agent world. Gotta start somewhere!

It’s Just HTTP

The protocol is just a single HTTP endpoint and an agent card (can be statically served). Inside that single endpoint are JSON RPC methods:

message/send & message/stream — Both send messages, one returns a stream of events (SSE). The first message implicitly creates a task.
tasks/resubscribe — For when you were doing message/stream but your connection broke.
tasks/get — If you want to poll. SSE isn’t for everyone, I guess. cURL works too.
tasks/pushNotifications/set & .../get — for webhooks, if that’s your thing

So basically, you create a task, and then you exchange messages with it. That’s it.

Tasks are Actors

Uh, if you don’t know what actors are, this analogy might not help, but I’m going with it anyway.

Tasks are actors (think Erlang actors or Akka). The first time you send a message to an actor, a task (an actor) is implicitly created.

flowchart TD client((client)) client--send msg-->box[implicit mailbox] box-->task--"also
queued"-->client

Messages are processed one-at-a-time, in the order they were received. Messages can mutate task state. But it doesn’t get crazy because the interaction is very single threaded (well, I guess you could process messages in parallel, but why?)

UIs are Agents

I think of a UI as being an agent that happens to have a human behind it. Not an AI agent, but a human agent. The UI application code handles the computer part, the human handles the intelligence part.

Yes, A2A was designed for sending messages between AI agents, but we don’t currently live in a world where open-ended multi-agent systems are pervasive. We do live in a world where humans talk to agents. And that won’t ever really change, because agents aren’t super valuable if their work never makes it to a human.

A2A supports any data

Each message, in either direction, contains multiple parts, each of one of these types:

TextPart — plain text, think unprocessed LLM outputs
DataPart — think JSON or binary. The format is specified by the mime type
FilePart — like DataPart, but can be at a URL

So an agent can do things like mix plain LLM outputs with JSON outputs.

Delegate state to Prefect or Temporal

One subtly difficult part of A2A is that it requires keeping state, potentially over long periods of time.

For example, an agent realizes the initiating user didn’t say enough, so it asks for clarification. People aren’t very good computers and while we sometimes respond quickly, sometimes we take minutes or hours, or even years. Or never.

How do you deal with that?

I’ve dealt with this by using systems like Temporal and Prefect. Both are sometimes called “workflow” systems, but can also be thought of as providing durable function execution.

Both are more interesting than most workflow systems because they also provide suspend & resume functionality. For example, in prefect you can call await suspend_flow_run() and the flow will be completely shut down and occupy zero memory or CPU while the user is twiddling their thumbs.

The Shim

I pulled this diagram directly from FastA2A docs:

flowchart TB Server["HTTP Server"] <--> |Sends Requests/
Receives Results| TM subgraph CC[Core Components] direction RL TM["TaskManager
(coordinates)"] --> |Schedules Tasks| Broker TM <--> Storage Broker["Broker
(queues & schedules)"] <--> Storage["Storage
(persistence)"] Broker --> |Delegates Execution| Worker end Worker["Worker
(implementation)"]

Note: I like FastA2A because it implements the HTTP endpoints as a Starlette app that you can easily mount right into your API alongside everything else. Also, it has basically nothing to do with Pydantic or Pydantic AI other than it happens to be collocated inside the same Github repository.

FastA2A clearly realizes there’s a state problem and so they created interfaces for dealing with it. Not only that, but these interfaces are a fairly standard architecture for workflow systems.

I’ve created simple shims for both Temporal and Prefect that use the workflow systems to implement the TaskManager, Storage and Broker. The idea being you could use either Prefect or Temporal, whichever you prefer, to quickly create a robust A2A-compatible agent.

They’re each ~100 lines of code, yet implement just about everything you’d want from a stateful system, from retries and deployments to observability and a management UI.

Where does this fit into your agent?

Say you’re following the Plan-Act-Verify flow that’s become popular:

flowchart TD client((client)) client-->clarify[Clarify Question]-->Plan-->Act["Act (Query)"]-->Verify-->Plan Verify-->prepare[Prepare Report]-->client2((client))

All those boxes are things that need to happen once and only once (well, in a loop). Every agent has a slightly different take on this, but many boil down to some variant of this architecture. The workflows don’t have to be complicated (but by all means, they can be).

The point is, yes, A2A is stateful and statefulness can be hard. But it can be solved simply and cleanly by delegating to other hardened distributed systems that were designed to do this well.

A2A Versus MCP

Simply, MCP is for tools (function-like things with inputs and outputs). A2A is for when you need free-form communication. Hence why tasks look more like actors.

They also solve similar fan-out problems. MCP enables many tools to be used by few AI applications or agents. A2A enables many agents to be used by few user interfaces and other agents.

flowchart TD subgraph c[A2A Clients] teams[MS Teams] agentspace[Google
AgentSpace] ServiceNow end subgraph m[MCP Servers] comp[Computer
Use] search[Web
Search] APIs end teams-->Agent[A2A-compatible
Agent] agentspace-->Agent ServiceNow-->Agent Agent-->comp Agent-->search Agent-->APIs

Side note: AI Engineering has become incredibly complex. You have to master not just AI tech, but also be a full-stack engineer and a data engineer. The emergence of A2A & MCP dramatically reduces the scope of an AI engineer, and that’s exciting on it’s own.

Implementation is picking up quickly

I’m going to finish this post by linking to a ton of products that are using A2A or soon will. My hope being that you’ll realize that now is a good time to get in on this.

A2A-compatible agents you can launch (server side)

Commercial / SaaS agents – live today

Google-built agents inside Vertex AI Agent Builder & Agentspace – e.g., Deep Research Agent, Idea Generation Agent; all expose an A2A JSON-RPC endpoint out of the box. (cloud.google.com, cloud.google.com, cloud.google.com)
SAP Joule Agents & Business Agent Foundation – Joule delegates work to SAP and non-SAP systems via A2A. (news.sap.com, architecture.learning.sap.com)
Box AI Agents – content-centric agents (contract analysis, form extraction) advertise themselves through A2A so external agents can call them. (developers.googleblog.com, blog.box.com)
Zoom AI Companion – meeting-scheduling and recap agents are now published as A2A servers on the Zoom developer platform. (instagram.com, uctoday.com)
UiPath Maestro agents – healthcare summarization, invoice triage, etc.; natively speak A2A for cross-platform automation. (uipath.com, itbrief.com.au)
Deloitte enterprise Gemini agents – 100 + production agents deployed for clients, exposed over A2A. (venturebeat.com)

Open-source agents & frameworks

LangGraph sample Currency-Agent, Travel-Agent, etc. (a2aprotocol.ai, github.com)
CrewAI – “crews” can publish themselves as remote A2A services (#2970). (github.com)
Semantic Kernel travel-planner & “Meeting Agent” demos. (devblogs.microsoft.com, linkedin.com)
FastA2A reference server (Starlette + Pydantic AI) – minimal A2A turnkey agent. (github.com)
Official a2a-samples repo – dozens of runnable Python & JS agents. (github.com)

Announced / on the roadmap

Salesforce Agentforce will “incorporate standard protocols like A2A” in upcoming releases. (medium.com, salesforce.com)
ServiceNow, Atlassian, Intuit, MongoDB, PayPal, Workday, Accenture and ~40 other partners listed by Google as “founding A2A agents.” (venturebeat.com)

Products that dispatch to A2A agents (client/orchestrator side)

Cloud platforms & orchestration layers

Azure AI Foundry – multi-agent pipelines can send tasks/send & tasks/stream RPCs to any A2A server. (microsoft.com, microsoft.com)
Microsoft Copilot Studio – low-code tool that now “securely invokes external agents” over A2A. (microsoft.com, microsoft.com)
Google Agentspace hub – lets knowledge workers discover, invoke, and chain A2A agents (including third-party ones). (cloud.google.com, cloud.google.com)
Vertex AI Agent Builder – generates dispatch stubs so your front-end or workflow engine can call remote A2A agents. (cloud.google.com)

Gateways & governance

MuleSoft Flex Gateway – Governance for Agent Interactions – policy enforcement, rate-limiting, and auth for outbound A2A calls. (blogs.mulesoft.com, docs.mulesoft.com)
Auth0 “Market0” demo – shows how to mint JWT-style tokens and forward them in authentication headers for A2A requests. (auth0.com)

Open-source dispatch tooling

Official A2A Python SDK (a2a-python) – full client API (tasks/send, SSE streaming, retries). (github.com)
a2a-js client library (part of the A2A GitHub org). (github.com)
n8n-nodes-agent2agent – drop-in nodes that let any n8n workflow call or await A2A agents. (github.com)

Coming soon

UiPath Maestro orchestration layer (already works internally, public A2A client API expanding). (linkedin.com)
Salesforce Agentforce Mobile SDK – upcoming SDK will be able to dispatch to external A2A agents from mobile apps. (salesforceben.com)
ServiceNow & UiPath cross-dispatch partnerships are in private preview. (venturebeat.com)

MCP Resources Are For Caching

2025-06-05 08:00:00

If your MCP client doesn’t support resources, it is not a good client.

There! I said it!

It’s because MCP resources are for improved prompt utilization, namely cache invalidation. Without resources, you eat through your context and token budget faster than Elon at a drug store. And so if your client doesn’t support it, you basically can’t do RAG with MCP. At least not in a way that anyone would consider production worthy.

RAG documents are BIG

You don’t want to duplicate files. See this here:

system prompt

user message with tool definitions

agent message with tool calls

user message with tool call results

giant file 1

giant file 2

another agent message with tool calls

user message with tool call results

giant file 2

giant file 3

...

That’s 2 tool calls. The second one contains a duplicate file.

Is this bad? If your answer is “no” then this blog post isn’t going to resonate with you.

Separate results from whole files

The core of it: A well-implemented app, MCP or not, will keep track of the documents returned from a RAG query and avoid duplicating them in the prompt. To do this, you keep a list of resource IDs that you’ve seen before (sure, call it a “cache”).

Format the RAG tool response in the prompt like so:

<result uri="rag://polar-bears/74.md" />
<result uri="rag://chickens/23.md" />

<full-text uri="rag://chickens/23">
Chickens are...
</full-text>

In other words:

The return value of the function, to the LLM, is an array of resources
The full text is included elsewhere, for reference

URIs are useful as a cache key.

btw I’m just spitballing what the prompt format should be for returning results. You can play around with it, you might already have strong opinions. The point is, mapping must be done.

MCP is not LLM-readable

There’s been a lot of discussion about if LLMs can interpret OpenAPI fine, and if so, why use MCP. That misses the entire point. MCP isn’t supposed to be interpreted directly by an LLM.

When you implement an MCP client, you should be mapping MCP concepts to whatever works for that LLM. This is called implementing the protocol. If you throw vanilla MCP objects into a prompt, it could actually work. But a good client is going to map the results to phrases & formats that particular LLM has gone through extraordinarily expensive training to understand.

MCP is a protocol

MCP standardizes how tools should return their results. MCP resources exist so that tools (e.g. RAG search) can return files, and client can de-duplicate those files across many calls.

Yes, it’s cool that you can list a directory, but that’s not the primary purpose of resources. Without resources, your LLMs just eat more tokens unnecessarily.

(Side note: did you notice that neither Anthropic nor OpenAI supports resources in their APIs? It’s a conspiracy..)

Resources are table stakes MCP support

If a client doesn’t support MCP resources, it’s because they don’t care enough to implement a proper client. Period.

While I’m at it, prompts are just functions with special handling of the results. Might as well support those too.

UPDATE: tupac

I made a minimalist reference implementation of an MCP client. Feel free to check it out. Bare minimum, it’s extremely useful. It’s on Github and runs with a simple uvx command.

Tim KelloggModify

Rss preview of Blog of Tim Kellogg

How do I vibe test an LLM?

GPT-5 is for the agentic web

yeah but i hate ads

Benchmarks

Areas for growth

Discussion

The year of agents

Timeline

Is ‘thinking’ necessary?

What to watch

Tool usage connects the world

What to watch

Are tools necessary?

What to watch

Huge vs Tiny

What to watch

Context engineering & Sycophancy

What to watch

Going forward…

Discussion

Latent Model

Sycophancy causes incoherence

Embedding models demonstrate latent models

In-context learning (ICL) demonstrates they can learn

Reasoning guides deviation

Misdirection: Performance != Competance

Conclusion

Layered Memory as Lossy Compression

One Mind vs. Many Minds: Two Approaches to Compression

Conclusion: Compression is Cognition

It’s Just HTTP

Tasks are Actors

UIs are Agents

A2A supports any data

Delegate state to Prefect or Temporal

The Shim

A2A Versus MCP

Implementation is picking up quickly

A2A-compatible agents you can launch (server side)

Products that dispatch to A2A agents (client/orchestrator side)

RAG documents are BIG

Separate results from whole files

MCP is not LLM-readable

MCP is a protocol

Resources are table stakes MCP support

UPDATE: tupac

Discussion

The author's social media

Tim Kellogg Modify