Blog of Tim Kellogg

AI architect, software engineer, and tech enthusiast.

Explainer: K2 & Math Olympiad Golds

2025-07-19 08:00:00

Feeling behind? Makes sense, AI moves fast. This post will catch you up.

The year of agents

First of all, yes, ‘25 is the year of agents. Not because we’ve achieved agents, but because we haven’t. It wouldn’t be worth talking about if we were already there. But there’s been a ton of measurable progress toward agents.

Timeline

The last 6 months:

Is ‘thinking’ necessary?

Obviously it is, right?

Back in January, we noticed that when a model does Chain of Thought (CoT) “thinking”, it elicits these behaviors:

  • Self-verification
  • Sub-goal setting
  • Backtracking (undoing an unfruitful path)
  • Backward chaining (working backwards)

All year, every person I talked to assumed thinking is non-negotiable for agents. Until K2.

K2 is an agentic model, meaning it was trained to solve problems using tools. It performs very well on agentic benchmarks, but it doesn’t have a long thought trace. It was so surprising that I thought I heard wrong and it took a few hours to figure out what the real story was.

For agents, this is attractive because thinking costs tokens (which cost dollars). If you can accomplish a task in fewer tokens, that’s good.

What to watch

  • More models trained like K2

Tool usage connects the world

R1 and o1 were trained to think, but o3 was trained to use tools while it’s thinking. That’s truly changed everything, and o3 is by far my favorite model of the year. You can just do things.

MCP was a huge jump toward agents. It’s a dumb protocol, leading a lot of people to misunderstand what the point is. It’s just a standard protocol for letting LLMs interact with the world. Emphasis on standard.

The more people who use it, the more useful it becomes. When OpenAI announced MCP support, that established full credibility for the protocol.

K2 tackled the main problem with MCP. Since it’s standard, that means anyone can make an MCP server, and that means a lot of them suck. K2 used a special system during training that generated MCP tools of all kinds. Thus, K2 learned how to learn how to use tools.

That pretty much covers our current agent challenges.

What to watch

  • More models trained like K2
  • MCP adoption

Are tools necessary?

In math, we made a lot of progress this year using tools like proof assistants. e.g. DeepSeek-Prover V2 was trained to write Lean code and incrementally fix its errors based on the proof assistant’s output. That seemed (and still does) like a solid path toward complex reasoning.

But today, some OpenAI researchers informally announced on X that their private model won gold in the International Math Olympiad. This is a huge achievement.

But what makes it surprising is that it didn’t use tools. It relied only on a monstrous amount of run-time “thinking” compute. That’s it.

Clearly stated: Next token prediction (what LLMs do) produced genuinely creative solutions requiring high levels of expertise.

If LLMs can be truly creative, that opens a lot of possibilities for agents. Especially around scientific discovery.

What to watch

  • This math olympiad model. The implications are still unclear. It seems it’s more general than math.

Huge vs Tiny

Which is better?

On the one hand, Opus-4, Grok 4 & K2 are all huge models that have a depth that screams “intelligence”. On the other hand, agentic workloads are 24/7 and so the cheaper they are, the better.

Furthermore, there’s a privacy angle. A model that runs locally is inherently more private, since the traffic never leaves your computer.

What to watch

  • Mixture of Experts (MoE). e.g. K2 is huge, but only activates a small portion of its weights (32B) per token, which means it uses less compute than a lot of local models. This might be the secret behind o3’s 80% price drop.
  • OpenAI’s open weights model is expected to land in a couple of weeks. It will likely run on a laptop and match at least o3-mini (released Jan 31).
  • GPT-5, expected this fall, is described as a mix of huge & tiny models, applying the right strength at the right time

Context engineering & Sycophancy

The biggest shifts this year have arguably been not in the models but in engineering. The flagship change is the emergence of the term context engineering as a replacement for prompt engineering.

It’s an acknowledgement that “prompt” isn’t just a block of text. It also comes from tool documentation, RAG databases & other agents. The June multi-agent debate was about how managing context between agents is really hard.

Also, while some are saying, “don’t build multi-agents”, Claude Code launches subagents all the time for any kind of research or investigation task, and is the top coding agent right now.

Similarly, sycophancy causes instability in agents. Many are considering it a top problem, on par with hallucination.

What to watch

  • Memory — stateful agents (e.g. those built on Letta) are phenomenally interesting but difficult to build. If done well, memory solves a lot of context engineering problems.
  • Engineering blogs. As we gain more experience with these things, it’ll become apparent how to do it well.

Going forward…

And all that is seriously skipping over a lot. Generally, ‘25 has shifted more time into engineering (instead of research). Put another way, model development is starting to become product development instead of just research.

What will happen in the second half of ‘25? Not sure, but I can’t wait to find out.

Discussion

Do LLMs understand?

2025-07-18 08:00:00

I’ve avoided this question because I’m not sure we understand what “understanding” is. Today I spent a bit of time, and I think I have a fairly succinct definition:

An entity can understand if it builds a latent model of reality. And:

  1. Can Learn: When presented with new information, the latent model grows by more than the information presented, because it’s able to make connections with parts of its existing model.
  2. Can Deviate: When things don’t go according to plan, it can use its latent model to find an innovative solution that it didn’t already know.

Further, the quality of the latent model can be measured by how coherent it is. Meaning that, if you probe it in two mostly unrelated areas, it’ll give answers that are logically consistent with the latent model.

I think there’s plenty of evidence that LLMs are currently doing all of this.

But first..

Latent Model

Mental model. That’s all I mean. Just trying to avoid anthropomorphizing more than necessary.

This is the most widely accepted part of this. Latent just means that you can’t directly observe it. Model just means that it’s a system of approximating the real world.

For example, if you saw this:

[Image: a dotted 3-D sphere, discrete points arranged so they read unmistakably as a ball while keeping an airy, voxel-like feel]

You probably identify it immediately as a sphere even though it’s just a bunch of dots.

A latent model is the same thing, just less observable. Like you might hold a “map” of your city in your head. So if you’re driving around and a street gets shut down, you’re not lost, you just refer to your latent model of your city and plan a detour. But it’s not exactly a literal image like Google Maps. It’s just a mental model, a latent model.

Sycophancy causes incoherence

From 1979 to 2003, Saddam Hussein surrounded himself with hand‑picked yes‑men who, under fear of death, fed him only flattering propaganda and concealed dire military or economic realities. This closed echo chamber drove disastrous miscalculations—most notably the 1990 invasion of Kuwait and his 2003 standoff with the U.S.—that ended in his regime’s collapse and his own execution.

Just like with Saddam, sycophancy causes the LLM to diverge from its true latent model, which causes incoherence. And so, the amount of understanding decreases.

Embedding models demonstrate latent models

Otherwise they wouldn’t work.

The word2vec paper famously showed that the embedding of “king - man + woman” is close to the embedding for “queen” (in embedding space). In other words, embeddings model the meaning of the text.

That was in 2013, before LLMs. It wasn’t even that good then, and the fidelity of that latent model has increased dramatically with model scale.
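
As a concrete illustration, here’s a minimal sketch of that arithmetic using gensim and the pretrained Google News word2vec vectors (the file path is a placeholder; you’d download the vectors separately):

from gensim.models import KeyedVectors

# Load pretrained word2vec vectors (path is a placeholder; download separately)
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# "king - man + woman" lands near "queen" in embedding space
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.71...)]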

In-context learning (ICL) demonstrates they can learn

ICL is when you can teach a model new tricks at runtime simply by offering examples in the prompt, or by telling it new information.
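
Here’s a toy example of what that looks like (the task and format are made up for illustration, not pulled from any particular paper):

# A toy few-shot prompt: the model infers the task (antonyms) purely from the
# examples in context, with no weight updates.
prompt = """\
hot -> cold
tall -> short
fast -> slow
bright ->"""

# Sent to a completion-style LLM, the expected continuation is something like
# " dim" or " dark", even though the model was never trained on an explicit
# "produce antonyms" task.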

In the GPT-3 paper they showed that ICL improved as they scaled the model up from 125M to 175B parameters. When the LLM size increases, it can hold a larger and more complex latent model of the world. When presented with new information (ICL), the larger model is more capable of acting correctly on it.

Makes sense. The smarter you get, the easier it is to get smarter.

Reasoning guides deviation

When models do Chain of Thought (CoT), they second-guess themselves, which probes their own internal latent model more deeply. In (2), we said that true understanding requires that the LLM can use its own latent model of the world to find innovative solutions to unplanned circumstances.

A recent Jan-2025 paper shows that this is the case.

Misdirection: Performance != Competence

A large segment of AI critics use this argument as evidence. Paraphrasing:

Today’s image-recognition networks can label a photo as “a baby with a stuffed toy,” but the algorithm has no concept of a baby as a living being – it doesn’t truly know the baby’s shape, or how that baby interacts with the world.

This was in 2015 so the example seems basic, but the principle is still being applied in 2025.

The example is used to argue that AI isn’t understanding, but it merely cherry-picks a single place where the AI’s latent model of the world is inconsistent with reality.

I can cherry-pick examples all day long of humans’ mental models diverging from reality. Like when you take a wrong turn down a street and it takes you across town. Or when you thought the charismatic candidate would do good things for you. On and on.

Go the other way: prove that there are areas where AI’s latent model matches reality.

But that’s dissatisfying, because dolphins have a mental model of the sea floor, and tiny ML models have areas where they do well, and generally most animals have some aspect of the world that they understand.

Conclusion

Why are we arguing this? I’m not sure, it comes up a lot. I think a large part of it is human exceptionalism. We’re really smart, so there must be something different about us. We’re not just animals.

But more generally, AI really is getting smart, to a point that starts to feel more uncomfortable as it intensifies. We have to do something with that.

Layers of Memory, Layers of Compression

2025-06-15 08:00:00

Recently, Anthropic published a blog post detailing their multi-agent approach to building their Research agent. Also, Cognition wrote a post on why multi-agent systems don’t work today. The thing is, they’re both saying the same thing.

At the same time, I’ve been enthralled watching a new bot, Void, interact with users on Bluesky. Void is written in Letta, an AI framework oriented around memory. Void feels alive in a way no other AI bot I’ve encountered feels. Something about the memory gives it a certain magic.

I took some time to dive into Letta’s architecture and noticed a ton of parallels with what the Anthropic and Cognition posts were saying, around context management. Letta takes a different approach.

Below, I’ve had OpenAI Deep Research format our conversation into a blog post. I’ve done some light editing, adding visuals etc., but generally it’s all AI. I appreciated this, I hope you do too.


When an AI agent “remembers,” it compresses. Finite context windows force hard choices about what to keep verbatim, what to summarize, and what to discard. Letta’s layered memory architecture embraces this reality by structuring an agent’s memory into tiers – each a lossy compression of the last. This design isn’t just a storage trick; it’s an information strategy.

Layered Memory as Lossy Compression

Letta (formerly MemGPT) splits memory into four memory blocks: core, message buffer, archival, and recall. Think of these as concentric rings of context, from most essential to most expansive, similar to L1, L2, L3 cache on a CPU:

flowchart TD
  subgraph rec[Recall Memory]
    subgraph arch[Archival Memory]
      subgraph msg[Message Buffer]
        Core[Core Memory]
      end
    end
  end
  • Core memory holds the agent’s invariants – the system persona, key instructions, fundamental facts. It’s small but always in the prompt, like the kernel of identity and immediate purpose.
  • Message buffer is a rolling window of recent conversation. This is the agent’s short-term memory (recent user messages and responses) with a fixed capacity. As new messages come in, older ones eventually overflow.
  • Archival memory is a long-term store, often an external vector database or text log, where overflow messages and distilled knowledge go. It’s practically unbounded in size, but far from the model’s immediate gaze. This is highly compressed memory – not compressed in ZIP-file fashion, but in being irrelevant by default until needed.
  • Recall memory is the retrieval buffer. When the agent needs something from the archive, it issues a query; relevant snippets are loaded into this block for use. In effect, recall memory “rehydrates” compressed knowledge on demand.

How it works: On each turn, the agent assembles its context from core knowledge, the fresh message buffer, and any recall snippets. All three streams feed into the model’s input. Meanwhile, if the message buffer is full, the oldest interactions get archived out to long-term memory.

Later, if those details become relevant, the agent can query the archival store to retrieve them into the recall slot. What’s crucial is that each layer is a lossy filter: core memory is tiny but high-priority (no loss for the most vital data), the message buffer holds only recent events (older details dropped unless explicitly saved), and the archive contains everything in theory but only yields an approximate answer via search. The agent itself chooses what to promote to long-term storage (e.g. summarizing and saving a key decision) and what to fetch back.

It’s a cascade of compressions and selective decompressions.
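
Here’s a rough sketch of that per-turn loop in Python. The names (archive.search, archive.store, summarize) are invented for illustration; this is not Letta’s actual API:

# Sketch of the turn loop described above. Invented names; not Letta's API.
def run_turn(core, buffer, archive, user_msg, llm, max_buffer=20):
    buffer.append({"role": "user", "content": user_msg})

    # Recall: pull archived snippets relevant to the new message
    snippets = archive.search(user_msg, top_k=3)

    # Assemble context: core memory + recall snippets + recent messages
    context = core + [{"role": "system", "content": s} for s in snippets] + buffer

    reply = llm(context)
    buffer.append({"role": "assistant", "content": reply})

    # Evict: when the buffer overflows, summarize the oldest turn and archive it
    while len(buffer) > max_buffer:
        oldest = buffer.pop(0)
        archive.store(summarize(oldest, llm))  # lossy: keep the gist, drop the rest

    return reply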

Rate–distortion tradeoff: This hierarchy embodies a classic principle from information theory. With a fixed channel (context window) size, maximizing information fidelity means balancing rate (how many tokens we include) against distortion (how much detail we lose).

Letta’s memory blocks are essentially a rate–distortion ladder. Core memory has a tiny rate (few tokens) but zero distortion on the most critical facts. The message buffer has a larger rate (recent dialogue in full) but cannot hold everything – older context is distorted by omission or summary. Archival memory has effectively infinite capacity (high rate) but in practice high distortion: it’s all the minutiae and past conversations compressed into embeddings or summaries that the agent might never look at again.

The recall stage tries to recover (rehydrate) just enough of that detail when needed. Every step accepts some information loss to preserve what matters most. In other words, to remember usefully, the agent must forget judiciously.
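
For reference, the textbook rate–distortion function makes this precise: it’s the minimum information rate achievable while keeping expected distortion under a budget D.

R(D) = \min_{p(\hat{x} \mid x) \,:\, \mathbb{E}[d(X,\hat{X})] \le D} I(X; \hat{X})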

This layered approach turns memory management into an act of cognition.

Summarizing a chunk of conversation before archiving it forces the agent to decide what the gist is – a form of understanding. Searching the archive for relevant facts forces it to formulate good queries – effectively reasoning about what was important. In Letta’s design, compression is not just a storage optimization; it is part of the thinking process. The agent is continually compressing its history and decompressing relevant knowledge as needed, like a human mind generalizing past events but recalling a specific detail when prompted.

flowchart TD
  U[User Input] --> LLM
  CI[Core Instructions] --> LLM
  RM["Recent Messages (Short-term Buffer)"] --> LLM
  RS["Retrieved Snippets (Recall)"] --> LLM
  LLM --> AR[Agent Response]
  RM -- evict / summarize --> VS["Vector Store (Archival Memory)"]
  LLM -- summarize --> VS
  VS -- retrieve --> RS

Caption: As new user input comes in, the agent’s core instructions and recent messages combine with any retrieved snippets from long-term memory, all funneling into the LLM. After responding, the agent may drop the oldest message from short-term memory into a vector store, and perhaps summarize it for posterity. The next query might hit that store and pull up the summary as needed. The memory “cache” is always in flux.

One Mind vs. Many Minds: Two Approaches to Compression

The above is a single-agent solution: one cognitive entity juggling compressed memories over time. An alternative approach has emerged that distributes cognition across multiple agents, each with its own context window – in effect, parallel minds that later merge their knowledge.

Anthropic’s recent multi-agent research system frames intelligence itself as an exercise in compression across agents. In their words, “The essence of search is compression: distilling insights from a vast corpus.” Subagents “facilitate compression by operating in parallel with their own context windows… condensing the most important tokens for the lead research agent”.

Instead of one agent with one context compressing over time, they spin up several agents that each compress different aspects of a problem in parallel. The lead agent acts like a coordinator, taking these condensed answers and integrating them.

This multi-agent strategy acknowledges the same limitation (finite context per agent) but tackles it by splitting the work. Each subagent effectively says, “I’ll compress this chunk of the task down to a summary for you,” and the lead agent aggregates those results.

It’s analogous to a team of researchers: divide the topic, each person reads a mountain of material and reports back with a summary so the leader can synthesize a conclusion. By partitioning the context across agents, the system can cover far more ground than a single context window would allow.

In fact, Anthropic found that a well-coordinated multi-agent setup outperformed a single-agent approach on broad queries that require exploring many sources. The subagents provided separation of concerns (each focused on one thread of the problem) and reduced the path-dependence of reasoning – because they explored independently, the final answer benefited from multiple compressions of evidence rather than one linear search.

However, this comes at a cost.

Coordination overhead and consistency become serious challenges. Cognition’s Walden Yan argues that multi-agent systems today are fragile chiefly due to context management failures. Each agent only sees a slice of the whole, so misunderstandings proliferate.

One subagent might interpret a task slightly differently than another, and without a shared memory of each other’s decisions, the final assembly can conflict or miss pieces. As Yan puts it, running multiple agents in collaboration in 2025 “only results in fragile systems. The decision-making ends up being too dispersed and context isn’t able to be shared thoroughly enough between the agents.” In other words, when each subagent compresses its piece of reality in isolation, the group may lack a common context to stay aligned.

In Anthropic’s terms, the “separation of concerns” cuts both ways: it reduces interference, but also means no single agent grasps the full picture. Humans solve this by constant communication (we compress our thoughts into language and share it), but current AI agents aren’t yet adept at the high-bandwidth, nuanced communication needed to truly stay in sync over long tasks.

Cognition’s solution? Don’t default to multi-agent. First try a simpler architecture: one agent, one continuous context. Ensure every decision that agent makes “sees” the trace of reasoning that led up to it – no hidden divergent contexts.

Of course, a single context will eventually overflow, but the answer isn’t to spawn independent agents; it’s to better compress the context. Yan suggests using an extra model whose sole job is to condense the conversation history into “key details, events, and decisions.”

This summarized memory can then persist as the backbone context for the main agent. In fact, Cognition has fine-tuned smaller models to perform this kind of compression reliably. The philosophy is that if you must lose information, lose it intentionally and in one place – via a trained compressor – rather than losing it implicitly across multiple agents’ blind spots.
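
A bare-bones version of that kind of compressor, using the OpenAI Python client as a stand-in (the model name and prompt are placeholders, not Cognition’s actual setup):

from openai import OpenAI

client = OpenAI()

def compress_history(messages: list[str]) -> str:
    """Condense a long agent trace into key details, events, and decisions."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; not Cognition's actual compressor
        messages=[
            {
                "role": "system",
                "content": "Summarize this agent trace into the key details, "
                           "events, and decisions. Be terse, but don't drop "
                           "anything a later step might depend on.",
            },
            {"role": "user", "content": "\n".join(messages)},
        ],
    )
    return resp.choices[0].message.content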

This approach echoes Letta’s layered memory idea: maintain one coherent thread of thought, pruning and abstracting it as needed, instead of forking into many threads that might diverge.

Conclusion: Compression is Cognition

In the end, these approaches converge on a theme: intelligence is limited by information bottlenecks, and overcoming those limits looks a lot like compression. Whether it’s a single agent summarizing its past and querying a knowledge base, or a swarm of subagents parceling out a huge problem and each reporting back a digest, the core challenge is the same.

An effective mind (machine or human) can’t and shouldn’t hold every detail in working memory – it must aggressively filter, abstract, and encode information, yet be ready to recover the right detail at the right time. This is the classic rate–distortion tradeoff of cognition: maximize useful signal, minimize wasted space.

Letta’s layered memory shows one way: a built-in hierarchy of memory caches, from the always-present essentials to the vast but faint echo of long-term archives. Anthropic’s multi-agent system shows another: multiple minds sharing the load, each mind a lossy compressor for a different subset of the task. And Cognition’s critique reminds us that compression without coordination can fail – the pieces have to ultimately fit together into a coherent whole.

Perhaps as AI agents evolve, we’ll see hybrid strategies. We might use multi-agent teams whose members share a common architectural memory (imagine subagents all plugged into a shared Letta-style archival memory, so they’re not flying blind with respect to each other). Or we might simply get better at single agents with enormous contexts and sophisticated internal compression mechanisms, making multi-agent orchestration unnecessary for most tasks. Either way, the direction is clear: to control and extend AI cognition, we are, in a very real sense, engineering the art of forgetting. By deciding what to forget and when to recall, an agent demonstrates what it truly understands. In artificial minds as in our own, memory is meaningful precisely because it isn’t perfect recording – it’s prioritized, lossy, and alive.

A2A Is For UI

2025-06-14 08:00:00

There’s a lot of skepticism around A2A, Google’s Agent-to-Agent protocol. A lot of that is well earned. I mean, they launched a protocol with zero implementations. But a lot’s changed, and it’s worth taking a look again.

I’d like to convince you that you should be thinking about A2A as a protocol for giving agents a UI. And that UI is a bridge into a more complex multi-agent world. Gotta start somewhere!

It’s Just HTTP

The protocol is just a single HTTP endpoint and an agent card (can be statically served). Inside that single endpoint are JSON RPC methods:

  • message/send & message/stream — Both send messages, one returns a stream of events (SSE). The first message implicitly creates a task.
  • tasks/resubscribe — For when you were doing message/stream but your connection broke.
  • tasks/get — If you want to poll. SSE isn’t for everyone, I guess. cURL works too.
  • tasks/pushNotifications/set & .../get — for webhooks, if that’s your thing

So basically, you create a task, and then you exchange messages with it. That’s it.
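
To give a feel for how small the surface area is, here’s a sketch of a message/send call with httpx. The endpoint URL is hypothetical, and the exact field names should be double-checked against the current A2A spec:

import httpx

# Hypothetical A2A endpoint; field names are my reading of the spec, so
# verify them against the current A2A schema before relying on this.
payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "message/send",
    "params": {
        "message": {
            "role": "user",
            "messageId": "msg-001",
            "parts": [{"kind": "text", "text": "Summarize yesterday's meeting"}],
        }
    },
}

resp = httpx.post("https://agent.example.com/a2a", json=payload)
print(resp.json()["result"])  # the implicitly created task (or a direct reply)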

Tasks are Actors

Uh, if you don’t know what actors are, this analogy might not help, but I’m going with it anyway.

Tasks are actors (think Erlang actors or Akka). The first time you send a message to an actor, a task (an actor) is implicitly created.

flowchart TD
  client((client))
  client -- send msg --> box[implicit mailbox]
  box --> task -- "also queued" --> client

Messages are processed one-at-a-time, in the order they were received. Messages can mutate task state. But it doesn’t get crazy because the interaction is very single threaded (well, I guess you could process messages in parallel, but why?)
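
If it helps, the actor idea reduces to a toy asyncio sketch: one queue per task, drained by a single loop, so messages are handled strictly in order:

import asyncio

async def task_actor(mailbox: asyncio.Queue, outbox: asyncio.Queue):
    """Toy actor: one task, one mailbox, messages processed one at a time."""
    state = {"history": []}          # messages can mutate task state
    while True:
        msg = await mailbox.get()    # FIFO, single-threaded by construction
        state["history"].append(msg)
        await outbox.put(f"processed: {msg}")  # replies queue back to the client

async def main():
    mailbox, outbox = asyncio.Queue(), asyncio.Queue()
    asyncio.create_task(task_actor(mailbox, outbox))
    await mailbox.put("hello")       # first message implicitly "creates" the task
    print(await outbox.get())

asyncio.run(main())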

UIs are Agents

I think of a UI as being an agent that happens to have a human behind it. Not an AI agent, but a human agent. The UI application code handles the computer part, the human handles the intelligence part.

Yes, A2A was designed for sending messages between AI agents, but we don’t currently live in a world where open-ended multi-agent systems are pervasive. We do live in a world where humans talk to agents. And that won’t ever really change, because agents aren’t super valuable if their work never makes it to a human.

A2A supports any data

Each message, in either direction, contains multiple parts, each of one of these types:

  • TextPart — plain text, think unprocessed LLM outputs
  • DataPart — think JSON or binary. The format is specified by the mime type
  • FilePart — like DataPart, but can be at a URL

So an agent can do things like mix plain LLM outputs with JSON outputs.

Delegate state to Prefect or Temporal

One subtly difficult part of A2A is that it requires keeping state, potentially over long periods of time.

For example, an agent realizes the initiating user didn’t say enough, so it asks for clarification. People aren’t very good computers and while we sometimes respond quickly, sometimes we take minutes or hours, or even years. Or never.

How do you deal with that?

I’ve dealt with this by using systems like Temporal and Prefect. Both are sometimes called “workflow” systems, but can also be thought of as providing durable function execution.

Both are more interesting than most workflow systems because they also provide suspend & resume functionality. For example, in prefect you can call await suspend_flow_run() and the flow will be completely shut down and occupy zero memory or CPU while the user is twiddling their thumbs.
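
Here’s roughly what that looks like with Prefect. Import paths vary between Prefect versions and the user-facing helpers are hypothetical stubs, so treat this as a sketch:

from prefect import flow
from prefect.flow_runs import suspend_flow_run  # import path varies by version

def ask_user(question: str) -> None:
    ...  # hypothetical: deliver the question to the user somehow

def get_user_answer() -> str:
    ...  # hypothetical: fetch whatever the user eventually answered

@flow
async def clarify_and_continue(question: str) -> str:
    ask_user(question)

    # The flow run is fully shut down here -- zero memory or CPU -- while the
    # user twiddles their thumbs. Resume semantics differ between pause and
    # suspend; check the Prefect docs for your version.
    await suspend_flow_run()

    return get_user_answer()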

The Shim

I pulled this diagram directly from FastA2A docs:

flowchart TB
  Server["HTTP Server"] <--> |Sends Requests / Receives Results| TM
  subgraph CC[Core Components]
    direction RL
    TM["TaskManager (coordinates)"] --> |Schedules Tasks| Broker
    TM <--> Storage
    Broker["Broker (queues & schedules)"] <--> Storage["Storage (persistence)"]
    Broker --> |Delegates Execution| Worker
  end
  Worker["Worker (implementation)"]

Note: I like FastA2A because it implements the HTTP endpoints as a Starlette app that you can easily mount right into your API alongside everything else. Also, it has basically nothing to do with Pydantic or Pydantic AI other than that it happens to be co-located in the same GitHub repository.

FastA2A clearly realizes there’s a state problem and so they created interfaces for dealing with it. Not only that, but these interfaces are a fairly standard architecture for workflow systems.

I’ve created simple shims for both Temporal and Prefect that use the workflow systems to implement the TaskManager, Storage and Broker. The idea being you could use either Prefect or Temporal, whichever you prefer, to quickly create a robust A2A-compatible agent.

They’re each ~100 lines of code, yet implement just about everything you’d want from a stateful system, from retries and deployments to observability and a management UI.

Where does this fit into your agent?

Say you’re following the Plan-Act-Verify flow that’s become popular:

flowchart TD
  client((client))
  client --> clarify[Clarify Question] --> Plan --> Act["Act (Query)"] --> Verify --> Plan
  Verify --> prepare[Prepare Report] --> client2((client))

All those boxes are things that need to happen once and only once (well, in a loop). Every agent has a slightly different take on this, but many boil down to some variant of this architecture. The workflows don’t have to be complicated (but by all means, they can be).

The point is, yes, A2A is stateful and statefulness can be hard. But it can be solved simply and cleanly by delegating to other hardened distributed systems that were designed to do this well.

A2A Versus MCP

Simply, MCP is for tools (function-like things with inputs and outputs). A2A is for when you need free-form communication. Hence why tasks look more like actors.

They also solve similar fan-out problems. MCP enables many tools to be used by few AI applications or agents. A2A enables many agents to be used by few user interfaces and other agents.

flowchart TD
  subgraph c[A2A Clients]
    teams[MS Teams]
    agentspace[Google AgentSpace]
    ServiceNow
  end
  subgraph m[MCP Servers]
    comp[Computer Use]
    search[Web Search]
    APIs
  end
  teams --> Agent[A2A-compatible Agent]
  agentspace --> Agent
  ServiceNow --> Agent
  Agent --> comp
  Agent --> search
  Agent --> APIs

Side note: AI Engineering has become incredibly complex. You have to master not just AI tech, but also be a full-stack engineer and a data engineer. The emergence of A2A & MCP dramatically reduces the scope of an AI engineer, and that’s exciting on its own.

Implementation is picking up quickly

I’m going to finish this post by linking to a ton of products that are using A2A or soon will. My hope being that you’ll realize that now is a good time to get in on this.

A2A-compatible agents you can launch (server side)

Commercial / SaaS agents – live today

  • Google-built agents inside Vertex AI Agent Builder & Agentspace – e.g., Deep Research Agent, Idea Generation Agent; all expose an A2A JSON-RPC endpoint out of the box. (cloud.google.com, cloud.google.com, cloud.google.com)
  • SAP Joule Agents & Business Agent Foundation – Joule delegates work to SAP and non-SAP systems via A2A. (news.sap.com, architecture.learning.sap.com)
  • Box AI Agents – content-centric agents (contract analysis, form extraction) advertise themselves through A2A so external agents can call them. (developers.googleblog.com, blog.box.com)
  • Zoom AI Companion – meeting-scheduling and recap agents are now published as A2A servers on the Zoom developer platform. (instagram.com, uctoday.com)
  • UiPath Maestro agents – healthcare summarization, invoice triage, etc.; natively speak A2A for cross-platform automation. (uipath.com, itbrief.com.au)
  • Deloitte enterprise Gemini agents – 100 + production agents deployed for clients, exposed over A2A. (venturebeat.com)

Open-source agents & frameworks

  • LangGraph sample Currency-Agent, Travel-Agent, etc. (a2aprotocol.ai, github.com)
  • CrewAI – “crews” can publish themselves as remote A2A services (#2970). (github.com)
  • Semantic Kernel travel-planner & “Meeting Agent” demos. (devblogs.microsoft.com, linkedin.com)
  • FastA2A reference server (Starlette + Pydantic AI) – minimal A2A turnkey agent. (github.com)
  • Official a2a-samples repo – dozens of runnable Python & JS agents. (github.com)

Announced / on the roadmap

  • Salesforce Agentforce will “incorporate standard protocols like A2A” in upcoming releases. (medium.com, salesforce.com)
  • ServiceNow, Atlassian, Intuit, MongoDB, PayPal, Workday, Accenture and ~40 other partners listed by Google as “founding A2A agents.” (venturebeat.com)

Products that dispatch to A2A agents (client/orchestrator side)

Cloud platforms & orchestration layers

  • Azure AI Foundry – multi-agent pipelines can send tasks/send & tasks/stream RPCs to any A2A server. (microsoft.com, microsoft.com)
  • Microsoft Copilot Studio – low-code tool that now “securely invokes external agents” over A2A. (microsoft.com, microsoft.com)
  • Google Agentspace hub – lets knowledge workers discover, invoke, and chain A2A agents (including third-party ones). (cloud.google.com, cloud.google.com)
  • Vertex AI Agent Builder – generates dispatch stubs so your front-end or workflow engine can call remote A2A agents. (cloud.google.com)

Gateways & governance

  • MuleSoft Flex Gateway – Governance for Agent Interactions – policy enforcement, rate-limiting, and auth for outbound A2A calls. (blogs.mulesoft.com, docs.mulesoft.com)
  • Auth0 “Market0” demo – shows how to mint JWT-style tokens and forward them in authentication headers for A2A requests. (auth0.com)

Open-source dispatch tooling

  • Official A2A Python SDK (a2a-python) – full client API (tasks/send, SSE streaming, retries). (github.com)
  • a2a-js client library (part of the A2A GitHub org). (github.com)
  • n8n-nodes-agent2agent – drop-in nodes that let any n8n workflow call or await A2A agents. (github.com)

Coming soon

  • UiPath Maestro orchestration layer (already works internally, public A2A client API expanding). (linkedin.com)
  • Salesforce Agentforce Mobile SDK – upcoming SDK will be able to dispatch to external A2A agents from mobile apps. (salesforceben.com)
  • ServiceNow & UiPath cross-dispatch partnerships are in private preview. (venturebeat.com)

MCP Resources Are For Caching

2025-06-05 08:00:00

If your MCP client doesn’t support resources, it is not a good client.

There! I said it!

It’s because MCP resources are for improved prompt utilization, namely cache invalidation. Without resources, you eat through your context and token budget faster than Elon at a drug store. And so if your client doesn’t support it, you basically can’t do RAG with MCP. At least not in a way that anyone would consider production worthy.

RAG documents are BIG

You don’t want to duplicate files. See this here:

system prompt
user message with tool definitions
agent message with tool calls
user message with tool call results
giant file 1
giant file 2
another agent message with tool calls
user message with tool call results
giant file 2
giant file 3
...

That’s 2 tool calls. The second one contains a duplicate file.

Is this bad? If your answer is “no” then this blog post isn’t going to resonate with you.

Separate results from whole files

The core of it: A well-implemented app, MCP or not, will keep track of the documents returned from a RAG query and avoid duplicating them in the prompt. To do this, you keep a list of resource IDs that you’ve seen before (sure, call it a “cache”).

Format the RAG tool response in the prompt like so:

<result uri="rag://polar-bears/74.md" />
<result uri="rag://chickens/23.md" />

<full-text uri="rag://chickens/23.md">
Chickens are...
</full-text>

In other words:

  1. The return value of the function, to the LLM, is an array of resources
  2. The full text is included elsewhere, for reference

URIs are useful as a cache key.

btw I’m just spitballing what the prompt format should be for returning results. You can play around with it, you might already have strong opinions. The point is, mapping must be done.
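
Whatever format you land on, the client-side bookkeeping is small. Here’s a minimal sketch (names made up; the point is the seen-set keyed by URI):

seen_uris: set[str] = set()

def render_rag_results(resources: list[dict]) -> str:
    """Render a RAG tool response, inlining full text only for unseen URIs."""
    lines = [f'<result uri="{r["uri"]}" />' for r in resources]
    for r in resources:
        if r["uri"] in seen_uris:
            continue  # already in the prompt somewhere; don't pay for it twice
        seen_uris.add(r["uri"])
        lines.append(f'\n<full-text uri="{r["uri"]}">\n{r["text"]}\n</full-text>')
    return "\n".join(lines)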

MCP is not LLM-readable

There’s been a lot of discussion about whether LLMs can interpret OpenAPI just fine, and if so, why use MCP at all. That misses the entire point. MCP isn’t supposed to be interpreted directly by an LLM.

When you implement an MCP client, you should be mapping MCP concepts to whatever works for that LLM. This is called implementing the protocol. If you throw vanilla MCP objects into a prompt, it could actually work. But a good client is going to map the results to phrases & formats that particular LLM has gone through extraordinarily expensive training to understand.

MCP is a protocol

MCP standardizes how tools should return their results. MCP resources exist so that tools (e.g. RAG search) can return files, and clients can de-duplicate those files across many calls.

Yes, it’s cool that you can list a directory, but that’s not the primary purpose of resources. Without resources, your LLMs just eat more tokens unnecessarily.

(Side note: did you notice that neither Anthropic nor OpenAI supports resources in their APIs? It’s a conspiracy..)

Resources are table stakes MCP support

If a client doesn’t support MCP resources, it’s because they don’t care enough to implement a proper client. Period.

While I’m at it, prompts are just functions with special handling of the results. Might as well support those too.

UPDATE: tupac

I made a minimalist reference implementation of an MCP client. Feel free to check it out. Bare minimum, it’s extremely useful. It’s on GitHub and runs with a simple uvx command.

Discussion

I was wrong: AI Won't Overtake Software Engineering

2025-05-10 08:00:00

Back in January I wrote, Normware: The Decline of Software Engineering and, while I think it was generally well-reasoned, I was wrong. Or at least overly ambitious.

I predicted that software engineering as a profession is bound to decline and be replaced by less technical people with AI who are closer to the business problems. I no longer think that will happen, not for technical reasons but for social ones.

What Changed

I saw people code.

I wrote the initial piece after using Cursor’s agent a few times. Since then the tools have gotten even more powerful and I can reliably one-shot entire non-trivial apps. I told a PM buddy about how I was doing it and he wanted to try and… it didn’t work. Not at all.

What I learned:

  1. I’m using a lot of hidden technical skills
  2. Yes, anyone can do it, but few will

On the surface it was stuff like, I’m comfortable in the terminal, he was not. And I don’t freak out when I get a huge error. But also softer skills, like how I know what complex code looks like vs simple code (with AI coding, overly complex code will cause an agent to deadlock). Also, he tried including authentication in the earliest version (lol n00b).

For some people, those are merely roadblocks. I’ve talked to a few people with zero technical background who are absolutely crushing it with code right now. It’s hard, but they have the drive to push through the hard parts. Sure, they’ve got their fair share of total flops, but they have a strong will and push through.

Those are not common people. Most are weak, or just care about other things.

How It Will Happen

I suppose this scene hasn’t unfolded and maybe my first take was right after all. But I don’t think so.

It’s likely that AI improves dramatically and makes it seamless to generate any code at any time. That will certainly increase the pool of people willing to suffer through coding. But I don’t think it can shift enough such that the Normware vision pans out. Most people just aren’t interested.

Instead, I think we’ll see a steady decline of “boring code” jobs.

Someone at a very large tech company told me they worked on a (software engineering!!) team that did nothing but make configuration changes. That’s nuts. Over time, I think AI will chip away at these roles until they’re gone and replaced by code that engineers (say they) want to write. Early prototypes and demo-quality software is already being replaced by AI, and the trend will continue from that end as well.

Discussion