2025-09-15 08:00:00
A few people have asked me how I use AI coding tools. I don’t think it’s a straightforward answer. For me it’s not really a procedure or recipe, it’s more of an ethos.
You own the code your AI produces.
Use your own name to commit AI code so that if something breaks, everyone blames you. This is critical. How well do you need to know the code your AI produces? Well enough that you can answer for its mistakes.
In lean manufacturing they have the principle of Genchi genbutsu, i.e. “go and see for yourself.” In High Output Management, Andy Grove pushes “management by walking around”. Andy defines the output of a manager as the output of their entire org as well as the organizations under their influence.
The trouble with phrasing it as “AI coding” is that it tricks you into thinking it’s just another individual role like software engineering, when it actually has a lot more in common with management. It’s unfortunate that we hire and mentor for it as if it were software engineering.
Resist the urge to say, “oh, I just vibe coded this”. You coded it, and if it sucks, it’s because you don’t know how to manage your AI. Own it.
Not all time spent is equal. For some things, you can put in a little bit of effort and get a huge amount of reward. In business, we call those opportunities.
Examples:
AI coding isn’t about writing code, it’s about creating and exploiting gradients. Finding opportunities where you can spend 10 minutes of AI time and reap a huge reward.
The contrived example is proofs of concept. You can just do it, figure out if it really works in practice the way it seems like it should, and abandon it quickly when it doesn’t.
Or data analysis. Traditionally it was labor intensive to do data analysis, but you can spin out a sick dashboard in a few minutes. Maybe that helps you avoid a dead end, or push your org in a new direction.
The key is to always be on the lookout for opportunities.
That feels a lot more like a shrewd businessman than a software engineer. Indeed! It’s a mistake that we transparently hire and promote software engineers into these roles. It’s a new beast.
I’m terrified of the future of software engineering.
Oh, I’ll continue having a job for a very long time. No concern about that. I’m worried that junior engineers won’t be promoted because it’s easier to dispatch a request to an AI than to give juniors the tasks that they traditionally learned the trade from.
But actually, this isn’t software engineering.
If anyone with their head on straight can take ownership and exploit gradients, then maybe junior engineers have an edge on seniors who are too stuck in their ways to realize they’ve been put in a new job role.
@rickasourus on Twitter: “I broadly agree with you, would only add that people do have to get out of their comfort zone to get good at AI, and you have some obligation to do that. It’s really hard to be good at it at first; as a manager you have to give people some slack to learn those new skills too.”
Yes, managers take note! We’re learning a new job.
@aidanharding.bsky.social says on Bluesky: “I enjoyed that. You’re right about the sense of ownership. Although some developers never had a sense of ownership of even hand-crafted code. I wrote about this topic recently and it chimes with your thoughts: https://www.aidanharding.com/2025/09/coding-with-ai/”
The good ones did.
2025-09-13 08:00:00
I went to close a bunch of browser tabs, but realized I have some cool stuff in here. Some has been marinating for a while. Most of these I’ve read, or tried to read.
link: https://techcrunch.com/2025/08/29/cracks-are-forming-in-metas-partnership-with-scale-ai/
Alexandr Wang at Meta is apparently difficult to work with, and people at Meta are doubting the fidelity of data produced by his Scale AI.
link: https://arxiv.org/abs/2506.22084
IIRC they draw parallels between attention and graphs and argue that LLMs are graph neural nets, meaning that they can be used to look at graphs and guess what connections are missing.
I don’t think I posted anything on this, because while I find the idea fascinating, I couldn’t figure out how to make it feel tangible.
link: https://arxiv.org/abs/2508.14143
Fairly sure I never read this one. Looks interesting. Kind of far out there.
link: https://z.ai/blog/glm-4.5
GLM-4.5 announcement. These have turned out to be the leading open source models. Everything I hear is good.
link: https://whenaiseemsconscious.org/
I only read a little and gave up. This feels like a good take, maybe. Inside my own head I completely punt on having a take on AI consciousness and opt instead for the “don’t be a dick” rule. Idk, maybe they are, maybe they aren’t; I’ll just live in the moment.
link: https://www.meta.com/superintelligence/
Zuck’s treatise on AI. I didn’t read. Normally I try to make an attempt to read these sorts of takes, or at least skim them, but I was busy at work. I had it loaded up on my phone to read on a plane, but it wouldn’t load once I was off WiFi. Sad.
link: https://arxiv.org/abs/2508.06471
The GLM-4.5 paper. This was a super interesting model. It feels like it breaks the “fancy model” rule in that it’s very architecturally cool but the personality doesn’t feel like it’s been squished out.
link: https://www.dwarkesh.com/s/blog
It’s a good blog, what can I say. Definitely on the over-hype side, but he’s got real takes and seems so intent on getting to the truth that he spends a lot of time on geopolitics simply to understand AI dynamics. Mad respect.
link: https://blog.datologyai.com/technical-deep-dive-curating-our-way-to-a-state-of-the-art-text-dataset/
I forget why I ended up here, but it’s an excellent post. I think this is connected to my project at work training a model. This post brings up a ton of data curation techniques.
I’ve recently learned and fully accepted that ALL major LLM advances come down to data. Yes, the architectural advances are cool and fun to talk about, but any meaningful progress has come from higher quality, higher quantity, or cheaper data.
link: https://arxiv.org/abs/2507.18074
Cool paper about auto-discovery of model architectures. IIRC they took a bunch of model architecture ideas, like group attention and mixture of experts, and used algorithms to mix and match all the parameters and configurations until something interesting popped out. It feels like a legitimately good way to approach research.
link: https://arxiv.org/abs/2507.15061
From Qwen, I don’t think I read this one, probably because it’s a bit dense and was hard to get fully engaged with. The idea seems cool though.
link: https://arxiv.org/abs/2005.10242
Classic paper. I read this one for work. I was trying to appreciate what Alignment & Uniformity measure and why they’re important. This was the paper that formalized those measures. It’s actually a pretty good paper to read, albeit from 2020.
link: https://blog.datologyai.com/train-llms-faster-better-and-smaller-with-datologyai-s-data-curation/
More DatologyAI, they’re good, everything they do is good. BTW there’s a Latent Space episode with DatologyAI and it’s very good.
link: https://news.ycombinator.com/item?id=45008434
Chips are good too.
link: https://ysymyth.github.io/The-Second-Half/
This will be a classic post, calling it now. It lays out a great history and current state of AI and specifically reinforcement learning.
link: https://arxiv.org/abs/2508.17669
What? This is amazing. I don’t think I even looked at it, sad. Actually, now that I’m reading this I’m recalling that’s how I ended up on the Graph Neural Network link.
IIRC this is saying that LLMs can be highly intelligent because they incorporate the best parts of a huge number of people. IMO this is spiritually the same as my Three Plates blog post where I explain how unit tests, which are inherently buggy, can improve the overall quality of a system.
link: https://github.com/gepa-ai/gepa?tab=readme-ov-file#using-gepa-to-optimize-your-system
An algorithm for automatic prompt optimization. Happily, they support DSPy, so there’s no new framework that you have to take wholesale.
link: https://www.alphaxiv.org/pdf/2508.21038
This was a fascinating one. A colleague tried convincing me of this but I didn’t buy it until I read this paper. It makes a ton of sense. I have a simplified bluesky thread here.
tl;dr — embedding vectors have trouble representing compound logic (“horses” AND “Chinese military movements”) and generally fall apart quickly. It’s not that it’s not possible, it’s that it’s not feasible to cram that much information into such a small space.
link: https://arxiv.org/abs/2107.05720?utm_source=chatgpt.com
I ran into this while diving into the last link. It’s an older (2021) paper that has some potential for addressing the problems with embeddings. Realistically, I expect late interaction multi-vectors to be the end answer.
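To make the single-vector vs. late-interaction distinction concrete, here’s a toy sketch with made-up embeddings (a real system would get them from a trained encoder like ColBERT). The point is that MaxSim lets each query token find its own best match in the document, instead of cramming the whole query into one vector:

```python
# Sketch of single-vector vs. late-interaction (ColBERT-style MaxSim) scoring.
# Embeddings here are random stand-ins; a real system would come from a trained encoder.
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Single-vector: one embedding per query and per document.
q_vec = normalize(rng.normal(size=128))           # whole query squeezed into 128 dims
d_vec = normalize(rng.normal(size=128))           # whole document squeezed into 128 dims
single_score = float(q_vec @ d_vec)               # one cosine similarity, all nuance collapsed

# Late interaction: one embedding per token, scored with MaxSim.
q_tokens = normalize(rng.normal(size=(5, 128)))   # e.g. tokens of "horses AND Chinese military movements"
d_tokens = normalize(rng.normal(size=(300, 128))) # document tokens
sim = q_tokens @ d_tokens.T                       # (5, 300) token-to-token similarities
maxsim_score = float(sim.max(axis=1).sum())       # each query token finds its best match, then sum

print(single_score, maxsim_score)
```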
link: https://huggingface.co/meituan-longcat/LongCat-Flash-Chat
A super cool model that uses no-op MoE experts to dynamically turn down the amount of compute per token. Unfortunately, this one didn’t seem to be embraced by the community.
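Here’s a rough sketch of the idea as I understand it (shapes and routing are made up, not LongCat’s actual implementation): some “experts” are plain identity functions, so a token routed to them costs essentially nothing, which lets the model spend compute only on the tokens that need it.

```python
# Minimal sketch of MoE routing with "no-op" (identity) experts, the idea LongCat-Flash
# uses to vary compute per token. Sizes are made up for illustration.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_real=4, n_noop=2):
        super().__init__()
        self.real_experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
             for _ in range(n_real)]
        )
        self.n_noop = n_noop                       # identity experts: no FLOPs spent
        self.router = nn.Linear(d_model, n_real + n_noop)

    def forward(self, x):                          # x: (tokens, d_model)
        choice = self.router(x).argmax(dim=-1)     # top-1 routing for simplicity
        out = x.clone()                            # default: identity (the no-op experts)
        for i, expert in enumerate(self.real_experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])        # only these tokens pay for the FFN
        return out

x = torch.randn(10, 64)
print(TinyMoE()(x).shape)  # torch.Size([10, 64])
```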
link: https://arxiv.org/abs/2405.19504v1
More embedding links. Now that I’m scanning it, I’m not sure it really soaked in the first time. They seem to have solved a lot of the problems with other late interaction methods. Maybe I should take a deeper look.
link: https://huggingface.co/meituan-longcat/LongCat-Flash-Chat/blob/main/modeling_longcat_flash.py
IDK sometimes you just have to look at the code to be sure.
link: https://m.youtube.com/watch?v=mU0HAmgwrz0&pp=QAFIAQ%3D%3D
Uh, no idea why this is up. I don’t really watch this show.
link: https://www.aleksagordic.com/blog/vllm
Fascinating breakdown of vLLM. If you’re not familiar, vLLM is like Ollama but actually a good option if you want to run it in production. Don’t run Ollama in production, kids, KV caches are good.
Honestly, this is absolutely worth your time if AI infrastructure is your jam (or you just want it to be). It goes into all the big concepts that an AI infra engineer needs to know. TBQH I love the intersection of AI & hardware.
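If the KV-cache jab above is opaque, here’s the gist in toy form (a numpy sketch, not vLLM’s actual code): during decode you compute keys and values for each new token once, append them to a cache, and attend over the cache instead of re-encoding the whole prefix at every step.

```python
# Toy illustration of KV caching during autoregressive decode: keys/values for past tokens
# are computed once and appended to, instead of being recomputed for every new token.
import numpy as np

d = 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for step in range(5):                       # pretend each step is one generated token
    x = rng.normal(size=d)                  # hidden state of the newest token
    K_cache = np.vstack([K_cache, x @ Wk])  # append one new key...
    V_cache = np.vstack([V_cache, x @ Wv])  # ...and one new value; the rest comes from cache
    out = attend(x @ Wq, K_cache, V_cache)  # attention over the whole prefix, no recompute
print(out.shape)  # (16,)
```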
link: https://simonwillison.net/
I mean, you have one of these tabs open too, right? riiiight????
link: https://algorithms-with-predictions.github.io/about/
Someone sent me this link and there was a reason, I know it. I just don’t remember why. IIRC it was because I brought up the A Case For Learned Indices paper and they pointed me to this whole treasure trove of papers that (sort of) evolved out of that. Basically traditional algorithms re-implemented using machine learning.
link: https://www.modular.com/blog
Yeah, idk, I think I was reading Matrix Multiplication on Blackwell: Part 3 — The Optimization Behind 80% of SOTA Performance
Another AI infra post, heavy on algorithms & hardware.
link: https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B
A cool concept. IIRC they introduce Cascade RL, automatically refining the RL dataset based on how current rollouts perform.
link: https://www.google.com/search?q=hong+kong&ie=UTF-8&oe=UTF-8&hl=en-us&client=safari
IDK I guess I was just trying to remember if Hong Kong was in China or not. And I learned that there’s a reason why I’m confused.
link: https://news.mit.edu/2024/photonic-processor-could-enable-ultrafast-ai-computations-1202
Someone sent me this link. It seems cool. Not sure it’s going to change much.
link: https://m.youtube.com/watch?v=Tkews9pRH1U&pp=QAFIBQ%3D%3D (S11, E11 | Full Episode - YouTube)
I mean, aliens! Don’t tell me you don’t have secret fascinations
link: https://m.youtube.com/watch?v=tnfFn-uQ6WA&pp=0gcJCRsBo7VqN5tD
Oh, this was a great podcast. Well, I didn’t like the host but @kalomaze is worth following. Apparently only 20yo, never attempted college but a talented AI researcher nonetheless.
link: https://cdn.openai.com/gpt-5-system-card.pdf
Sometimes you just need to look things up to be sure.
link: https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B
Again, apparently. It honestly is a good model.
link: https://cslewisweb.com/2012/08/02/c-s-lewiss-divine-comedy/
Been thinking about how he described the outer layer of hell as consisting of people living equidistant from each other because they can’t stand anyone else. It was written like 100 years ago but feels like a commentary on today’s politics.
link: https://blog.promptlayer.com/claude-code-behind-the-scenes-of-the-master-agent-loop/
Actually, this is a pretty detailed breakdown of Claude Code. They seem to have decompiled the code without de-obfuscating it, which leads to some slightly silly quotes. But it’s good.
link: https://airia.com/ai-platform/
No idea how I got here. Looks like a Low/No Code builder.
link: https://www.arxiv.org/abs/2509.04575
Right, this one is the ExIt Paper. It’s another attempt at auto-managing RL curriculum dynamically by how training is progressing.
link: https://www.swyx.io/cognition
Swyx joined Cognition and dropped a treatise on AI engineering. It’s good.
link: https://huggingface.co/papers/2509.06160
This was an excellent one. Another auto-curriculum RL paper. I did a bluesky breakdown here
link: https://chat.z.ai/c/6607ee45-27d5-487a-a1e2-44c2176040eb
GLM-4.5 chat application
link: https://news.ycombinator.com/item?id=45186015
Seems like the new Apple M19 chip has real matrix multiplication operations. Previous generations had excellent memory bandwidth; this gives it matching compute (on AI-friendly workloads). So I guess Macs will stay relevant for a while.
link: https://www.bbc.com/news/live/c2enwk1l9e1t
NGL this freaks me out.
link: https://vickiboykis.com/2025/09/09/walking-around-the-app/
Vicki writes such thoughtful pieces. Always worth reading her work.
link: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
Oh wow, this was an amazing read. Very deep dive into AI infrastructure and, whoah, did you know that GPUs have operations that aren’t deterministic?
I did a bluesky thread here
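If “non-deterministic GPU operations” sounds spooky, the root cause is mundane: floating-point addition isn’t associative, so any reduction whose order can vary (parallel atomics, different kernel tilings, batch-size-dependent splits) can return slightly different answers. A pure-Python demo of the underlying effect:

```python
# Floating-point addition is not associative, so any reduction whose order can vary
# (e.g. parallel atomics on a GPU) can return slightly different results run to run.
import random

xs = [random.uniform(-1, 1) for _ in range(100_000)]

forward = sum(xs)
shuffled = xs[:]
random.shuffle(shuffled)
reordered = sum(shuffled)

print(forward, reordered, forward == reordered)  # the two sums usually differ in the last bits
```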
link: https://blog.codingconfessions.com/p/groq-lpu-design
Looked this up as a tangent off the last link. Groq (not Grok) designed their ASIC to be fully deterministic from the ground up, and then built a really cool distributed system around it that assumes fully synchronous networking (not packet switching like TCP). It’s an absolutely crazy concept.
link: https://crfm.stanford.edu/2023/06/16/levanter-1_0-release.html
I didn’t read this, but it’s definitely a tangent off of non-deterministic LLMs.
link: https://tiger-ai-lab.github.io/Hierarchical-Reasoner/
Absolutely fascinating. I only read the blog, not the paper, but it frames RL as a 2-stage process where RL is mostly slinging together discrete skills (learned during pre-training).
It’s not an auto-curriculum RL paper AFAICT, it’s just a huge improvement in RL efficiency by focusing only on the “pivot” tokens.
link: https://timkellogg.me/blog/2024/10/10/entropix
I had looked this up as a reference to “pivot” tokens. Honestly, I link people back to this blog a lot
link: https://github.com/ast-grep/ast-grep-mcp
An MCP server that lets you search code while respecting the structure. I’ve heard some very positive things as well as “meh” responses on this. I’m sure real usage is a bit nuanced.
link: https://www.science.org/content/blog-post/life-maybe-mars-unless-we-change-our-minds
Guys, this is incredible!
2025-08-08 08:00:00
This post isn’t really about GPT-5. Sure, it launched and people are somewhat disappointed. It’s the why that bugs me.
They expected AGI, the AI god, but instead got merely the best model in the world. v disappointing
A few days before the GPT-5 launch I read this paper, Agentic Web: Weaving the Next Web with AI Agents. It’s not my normal kind of paper, it’s not very academic. There’s no math in it, no architecture. It just paints a picture of the future.
And that’s the lens I saw GPT-5 through.
The paper describes three eras of the internet:
When I weigh the strengths of GPT-5, it feels poised and ready for the agentic web.
I use it. If it changes how I work or think, then it’s a good LLM.
o3 dramatically changed how I work. GPT-4 did as well. GPT-5 didn’t, because it’s the end of the line. You can’t really make a compelling LLM anymore, they’re all so good most people can’t tell them apart. Even the tiny ones.
I talked to a marketing person this week. I showed them Claude Code. They don’t even write code, but they insisted it was 10x better than any model they’d used before, even Claude. I’d echo the same thing, there’s something about those subagents, they zoom.
Claude Code is software.
Sure, there’s a solid model behind it. But there’s a few features that make it really tick. Replicate those and you’re well on your way.
The first time I heard agentic web I almost vomited in my mouth. It sounds like the kind of VC-induced buzzword cesspool that I keep my distance from.
But this paper..
I want AI to do all the boring work in life. Surfing sites, research, filling out forms, etc.
Models like GPT-5 and gpt-oss are highly agentic. All the top models are going in that direction. They put them in a software harness and apply RL and update their weights accordingly if they used their tools well. They’re trained to be agents.
I hear a lot of criticism of GPT-5, but none from the same people who recognize that it can go 2-4 hours between human contact while working on agentic tasks. Whoah.
GPT-5 is for the agentic web.
Well okay, me too. Not sure where that came from but I don’t think that’s where this is going. Well, it’s exactly where it’s going, but not in the way you’re thinking.
The paper talks about this. People need to sell stuff, that won’t change. They want you to buy their stuff. All that is the same.
The difference is agents. In the agentic web, everything is mediated by agents.
You don’t search for a carbon monoxide monitor, you ask your agent to buy you one. You don’t even do that, your agent senses it’s about to die and suggests that you buy one, before it wakes you up in the middle of the night (eh, yeah, sore topic for me).
You’re a seller and you’re trying to game the system? Ads manipulate consumers, but consumers aren’t buying anymore. Who do you manipulate? Well, agents. They’re the ones making the decisions in the agentic web.
The paper calls this the Agent Attention Economy, and it operates under the same constraints. Attention is still limited, even agent attention, but you need them to buy your thing.
The paper makes some predictions, they think there will be brokers (like ad brokers) that advertise agents & resources to be used. So I guess you’d game the system by making your product seem more useful or better than it is, so it looks appealing to agents and more agents use it.
I’m not sure what that kind of advertising would look like. Probably like today’s advertising, just more invisible.
The only benchmark that matters is how much it changes life.
At this point, I don’t think 10T parameters is really going to bump that benchmark any. I don’t think post-training on 100T tokens of math is going to change much.
I get excited about software. We’re at a point where software is so extremely far behind the LLMs. Even the slightest improvements in an agent harness design yield outsized rewards, like how Claude Code is still better than OpenAI codex-cli with GPT-5, a better coding model.
My suspicion is that none of the AI models are going to seem terribly appealing going forward without massive leaps in the software harness around the LLM. The only way to really perceive the difference is how it changes your life, and we’re long past where a pure model can do that.
Not just software, but also IT infrastructure. Even small questions like, “when will AI get advertising?” If an AI model literally got advertising baked straight into the heart of the model, that would make me sad. It means the creators aren’t seeing the same vision.
We’ve talked a lot about the balance between pre-training and post-training, but nobody seems to be talking about the balance between LLMs and their harnesses.
Before we see significant improvement in models, we’re going to need a lot more in:
Probably several other low-hanging areas.
2025-07-19 08:00:00
Feeling behind? Makes sense, AI moves fast. This post will catch you up.
First of all, yes, ‘25 is the year of agents. Not because we’ve achieved agents, but because we haven’t. It wouldn’t be worth talking about if we were already there. But there’s been a ton of measurable progress toward agents.
The last 6 months:
Obviously it is, right?
Back in January, we noticed that when a model does Chain of Thought (CoT) “thinking”, it elicits these behaviors:
All year, every person I talked to assumed thinking is non-negotiable for agents. Until K2.
K2 is an agentic model, meaning it was trained to solve problems using tools. It performs very well on agentic benchmarks, but it doesn’t have a long thought trace. It was so surprising that I thought I heard wrong and it took a few hours to figure out what the real story was.
For agents, this is attractive because thinking costs tokens (which cost dollars). If you can accomplish a task in fewer tokens, that’s good.
R1 and o1 were trained to think, but o3 was trained to use tools while it’s thinking. That’s truly changed everything, and o3 is by far my favorite model of the year. You can just do things.
MCP was a huge jump toward agents. It’s a dumb protocol, leading a lot of people to misunderstand what the point is. It’s just a standard protocol for letting LLMs interact with the world. Emphasis on standard.
The more people who use it, the more useful it becomes. When OpenAI announced MCP support, that established full credibility for the protocol.
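To show how dumb (in a good way) the protocol is, here’s roughly what a minimal tool server looks like using the MCP Python SDK’s FastMCP helper. The tool itself is a made-up example; check the SDK docs for the current API, this just mirrors its README:

```python
# server.py — a minimal MCP tool server, sketched with the Python SDK's FastMCP helper.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-demo")

@mcp.tool()
def get_forecast(city: str) -> str:
    """Return a (fake) forecast for a city; stands in for any real capability."""
    return f"Sunny and 72F in {city}"

if __name__ == "__main__":
    mcp.run()  # any MCP-speaking client can now discover and call get_forecast
```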
K2 tackled the main problem with MCP. Since it’s standard, that means anyone can make an MCP server, and that means a lot of them suck. K2 used a special system during training that generated MCP tools of all kinds. Thus, K2 learned how to learn how to use tools.
That pretty much covers our current agent challenges.
In math, we made a lot of progress this year in using a tool like a proof assistant. e.g. DeepSeek-Prover v2 was trained to write Lean code and incrementally fix the errors & output. That seemed (and still does) like a solid path toward complex reasoning.
But today, some OpenAI researchers informally announced on X that their private model won gold in the International Math Olympiad. This is a huge achievement.
But what makes it surprising is that it didn’t use tools. It relied on only a monstrous amount of run-time “thinking” compute, that’s it.
Clearly stated: Next token prediction (what LLMs do) produced genuinely creative solutions requiring high levels of expertise.
If LLMs can be truly creative, that opens a lot of possibilities for agents. Especially around scientific discovery.
Which is better?
On the one hand, Opus-4, Grok 4 & K2 are all huge models that have a depth that screams “intelligence”. On the other hand, agentic workloads are 24/7 and so the cheaper they are, the better.
Furthermore, there’s a privacy angle. A model that runs locally is inherently more private, since the traffic never leaves your computer.
The biggest shifts this year have arguably been not in the model but in engineering. The flagship change is the emergence of the term context engineering as replacement for prompt engineering.
It’s an acknowledgement that “prompt” isn’t just a block of text. It also comes from tool documentation, RAG databases & other agents. The June multi-agent debate was about how managing context between agents is really hard.
Also, while some are saying, “don’t build multi-agents”, Claude Code launches subagents all the time for any kind of research or investigation task, and is the top coding agent right now.
Similarly, sycophancy causes instability in agents. Many are considering it a top problem, on par with hallucination.
And all that is seriously skipping over a lot. Generally, ‘25 has shifted more time into engineering (instead of research). Put another way, model development is starting to become product development instead of just research.
What will happen in the second half of ‘25? Not sure, but I can’t wait to find out.
2025-07-18 08:00:00
I’ve avoided this question because I’m not sure we understand what “understanding” is. Today I spent a bit of time, and I think I have a fairly succinct definition:
An entity can understand if it builds a latent model of reality. And:
- Can Learn: When presented with new information, the latent model grows more than the information presented, because it’s able to make connections with parts of its existing model.
- Can Deviate: When things don’t go according to plan, it can use its latent model to find an innovative solution that it didn’t already know.
Further, the quality of the latent model can be measured by how coherent it is. Meaning that, if you probe it in two mostly unrelated areas, it’ll give answers that are logically consistent with the latent model.
I think there’s plenty of evidence that LLMs are currently doing all of this.
But first..
Mental model. That’s all I mean. Just trying to avoid anthropomorphizing more than necessary.
This is the most widely accepted part of this. Latent just means that you can’t directly observe it. Model just means that it’s a system of approximating the real world.
For example, if you saw this:
You probably identify it immediately as a sphere even though it’s just a bunch of dots.
A latent model is the same thing, just less observable. Like you might hold a “map” of your city in your head. So if you’re driving around and a street gets shut down, you’re not lost, you just refer to your latent model of your city and plan a detour. But it’s not exactly a literal image like Google maps. It’s just a mental model, a latent model.
From 1979 to 2003, Saddam Hussein surrounded himself with hand‑picked yes‑men who, under fear of death, fed him only flattering propaganda and concealed dire military or economic realities. This closed echo chamber drove disastrous miscalculations—most notably the 1990 invasion of Kuwait and his 2003 standoff with the U.S.—that ended in his regime’s collapse and his own execution.
Just like with Saddam, sycophancy causes the LLM to diverge from its true latent model, which causes incoherence. And so, the amount of understanding decreases.
Otherwise they wouldn’t work.
The word2vec paper famously showed that the embedding of “king - man + woman” is close to the embedding for “queen”. In other words, embeddings model the meaning of the text.
That was in 2013, before LLMs. It wasn’t even that good then, and the fidelity of that latent model has dramatically increased with the scale of the model.
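You can still reproduce the classic result yourself. A minimal check with gensim’s pretrained Google News vectors (the downloader name below is gensim’s standard id; it’s a large download, roughly 1.5 GB):

```python
# The classic word2vec analogy, reproduced with gensim's pretrained Google News vectors.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically comes out on top; the embedding space encodes the relation.
```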
ICL is when you can teach a model new tricks at runtime simply by offering examples in the prompt, or by telling it new information.
In the GPT-3 paper they showed that ICL improved as they scaled the model up from 125M to 175B. When the LLM size increases, it can hold a larger and more complex latent model of the world. When presented with new information (ICL), the larger model is more capable of acting correctly on it.
Makes sense. The smarter you get, the easier it is to get smarter.
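To make ICL concrete, here’s the whole trick in one hypothetical example: the “training data” is just a few labeled lines in the prompt. The model name and client are placeholders; any chat-completions API behaves the same way.

```python
# In-context learning in its simplest form: the "training data" lives in the prompt.
from openai import OpenAI

prompt = """Classify each support ticket as BILLING, BUG, or OTHER.

Ticket: "I was charged twice this month."      -> BILLING
Ticket: "The export button crashes the app."   -> BUG
Ticket: "Love the new dark mode!"              -> OTHER
Ticket: "My invoice has the wrong VAT number." ->"""

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model you actually run
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # a decent model answers BILLING, no fine-tuning needed
```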
When models do Chain of Thought (CoT), they second-guess themselves, which probes their own internal latent model more deeply. In (2), we said that true understanding requires that the LLM can use its own latent model of the world to find innovative solutions to unplanned circumstances.
A recent Jan-2025 paper shows that this is the case.
A large segment of the AI-critical crowd uses this argument as evidence. Paraphrasing:
Today’s image-recognition networks can label a photo as “a baby with a stuffed toy,” but the algorithm has no concept of a baby as a living being – it doesn’t truly know the baby’s shape, or how that baby interacts with the world.
This was in 2015 so the example seems basic, but the principle is still being applied in 2025.
The example is used to argue that AI isn’t understanding, but it merely cherry-picks a single place where the AI’s latent model of the world is inconsistent with reality.
I can cherry-pick examples all day long of humans’ mental models diverging from reality. Like when you take a wrong turn down a street and it takes you across town. Or when you thought the charismatic candidate would do good things for you. On and on.
Go the other way: prove that there are areas where the AI’s latent model matches reality.
But that’s dissatisfying, because dolphins have a mental model of the sea floor, and tiny ML models have areas where they do well, and generally most animals have some aspect of the world that they understand.
Why are we arguing this? I’m not sure, it comes up a lot. I think a large part of it is human exceptionalism. We’re really smart, so there must be something different about us. We’re not just animals.
But more generally, AI really is getting smart, to a point that starts to feel more uncomfortable as it intensifies. We have to do something with that.
2025-06-15 08:00:00
Recently, Anthropic published a blog post detailing their multi-agent approach to building their Research agent. Also, Cognition wrote a post on why multi-agent systems don’t work today. The thing is, they’re both saying the same thing.
At the same time, I’ve been enthralled watching a new bot, Void, interact with users on Bluesky. Void is written in Letta, an AI framework oriented around memory. Void feels alive in a way no other AI bot I’ve encountered feels. Something about the memory gives it a certain magic.
I took some time to dive into Letta’s architecture and noticed a ton of parallels with what the Anthropic and Cognition posts were saying, around context management. Letta takes a different approach.
Below, I’ve had OpenAI Deep Research format our conversation into a blog post. I’ve done some light editing, adding visuals etc., but generally it’s all AI. I appreciated this, I hope you do too.
When an AI agent “remembers,” it compresses. Finite context windows force hard choices about what to keep verbatim, what to summarize, and what to discard. Letta’s layered memory architecture embraces this reality by structuring an agent’s memory into tiers – each a lossy compression of the last. This design isn’t just a storage trick; it’s an information strategy.
Letta (formerly MemGPT) splits memory into four memory blocks: core, message buffer, archival, and recall. Think of these as concentric rings of context, from most essential to most expansive, similar to L1, L2, L3 cache on a CPU:
How it works: On each turn, the agent assembles its context from core knowledge, the fresh message buffer, and any recall snippets. All three streams feed into the model’s input. Meanwhile, if the message buffer is full, the oldest interactions get archived out to long-term memory.
Later, if those details become relevant, the agent can query the archival store to retrieve them into the recall slot. What’s crucial is that each layer is a lossy filter: core memory is tiny but high-priority (no loss for the most vital data), the message buffer holds only recent events (older details dropped unless explicitly saved), and the archive contains everything in theory but only yields an approximate answer via search. The agent itself chooses what to promote to long-term storage (e.g. summarizing and saving a key decision) and what to fetch back.
It’s a cascade of compressions and selective decompressions.
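Here’s a deliberately tiny sketch of that cascade (my own toy code, not Letta’s API): core stays verbatim, the buffer evicts into the archive, and recall pulls archived items back in when a query matches.

```python
# Toy sketch of layered memory: core (never evicted), a bounded buffer, and an archive
# that is searched lazily. The "search" is naive keyword overlap standing in for embeddings.
from collections import deque

class LayeredMemory:
    def __init__(self, buffer_size=6):
        self.core = ["You are a helpful agent.", "User's name: Tim."]  # tiny, always in context
        self.buffer = deque(maxlen=buffer_size)                        # recent turns, full fidelity
        self.archive = []                                              # everything else

    def add_turn(self, text):
        if len(self.buffer) == self.buffer.maxlen:
            self.archive.append(self.buffer[0])   # lossy in practice: often summarized first
        self.buffer.append(text)

    def recall(self, query, k=2):
        scored = sorted(self.archive, key=lambda m: -len(set(query.split()) & set(m.split())))
        return scored[:k]

    def build_context(self, query):
        return self.core + self.recall(query) + list(self.buffer) + [query]

mem = LayeredMemory()
for i in range(10):
    mem.add_turn(f"turn {i}: discussed topic {i}")
print(mem.build_context("what did we say about topic 1?"))
```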
Rate–distortion tradeoff: This hierarchy embodies a classic principle from information theory. With a fixed channel (context window) size, maximizing information fidelity means balancing rate (how many tokens we include) against distortion (how much detail we lose).
Letta’s memory blocks are essentially a rate–distortion ladder. Core memory has a tiny rate (few tokens) but zero distortion on the most critical facts. The message buffer has a larger rate (recent dialogue in full) but cannot hold everything – older context is distorted by omission or summary. Archival memory has effectively infinite capacity (high rate) but in practice high distortion: it’s all the minutiae and past conversations compressed into embeddings or summaries that the agent might never look at again.
The recall stage tries to recover (rehydrate) just enough of that detail when needed. Every step accepts some information loss to preserve what matters most. In other words, to remember usefully, the agent must forget judiciously.
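For reference, this is the textbook rate–distortion function being invoked: the fewest bits (rate) you can spend while keeping the expected distortion between the source and its reconstruction under a budget D.

```latex
% Rate–distortion function: minimum rate R achievable at distortion budget D.
R(D) = \min_{p(\hat{x} \mid x)\,:\; \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X})
```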
This layered approach turns memory management into an act of cognition.
Summarizing a chunk of conversation before archiving it forces the agent to decide what the gist is – a form of understanding. Searching the archive for relevant facts forces it to formulate good queries – effectively reasoning about what was important. In Letta’s design, compression is not just a storage optimization; it is part of the thinking process. The agent is continually compressing its history and decompressing relevant knowledge as needed, like a human mind generalizing past events but recalling a specific detail when prompted.
Caption: As new user input comes in, the agent’s core instructions and recent messages combine with any retrieved snippets from long-term memory, all funneling into the LLM. After responding, the agent may drop the oldest message from short-term memory into a vector store, and perhaps summarize it for posterity. The next query might hit that store and pull up the summary as needed. The memory “cache” is always in flux.
The above is a single-agent solution: one cognitive entity juggling compressed memories over time. An alternative approach has emerged that distributes cognition across multiple agents, each with its own context window – in effect, parallel minds that later merge their knowledge.
Anthropic’s recent multi-agent research system frames intelligence itself as an exercise in compression across agents. In their words, “The essence of search is compression: distilling insights from a vast corpus.” Subagents “facilitate compression by operating in parallel with their own context windows… condensing the most important tokens for the lead research agent”.
Instead of one agent with one context compressing over time, they spin up several agents that each compress different aspects of a problem in parallel. The lead agent acts like a coordinator, taking these condensed answers and integrating them.
This multi-agent strategy acknowledges the same limitation (finite context per agent) but tackles it by splitting the work. Each subagent effectively says, “I’ll compress this chunk of the task down to a summary for you,” and the lead agent aggregates those results.
It’s analogous to a team of researchers: divide the topic, each person reads a mountain of material and reports back with a summary so the leader can synthesize a conclusion. By partitioning the context across agents, the system can cover far more ground than a single context window would allow.
In fact, Anthropic found that a well-coordinated multi-agent setup outperformed a single-agent approach on broad queries that require exploring many sources. The subagents provided separation of concerns (each focused on one thread of the problem) and reduced the path-dependence of reasoning – because they explored independently, the final answer benefited from multiple compressions of evidence rather than one linear search.
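The orchestration shape is simple enough to sketch. Here `llm()` is a stand-in for whatever model call you’d actually use, and the subtopics are invented; the point is that each subagent returns only a digest, which is all the lead agent ever sees.

```python
# Sketch of the "parallel compressors" pattern: subagents condense slices of the problem,
# the lead agent integrates the digests. llm() is a placeholder for a real model call.
import asyncio

async def llm(prompt: str) -> str:
    return f"<summary of: {prompt[:40]}...>"   # placeholder: call your model of choice here

async def subagent(subtopic: str) -> str:
    # each subagent gets its own context window and returns only a digest
    return await llm(f"Research '{subtopic}' and report the 5 most important findings.")

async def lead_agent(question: str, subtopics: list[str]) -> str:
    digests = await asyncio.gather(*(subagent(t) for t in subtopics))  # parallel compression
    return await llm(f"Question: {question}\nSubagent reports:\n" + "\n".join(digests))

answer = asyncio.run(lead_agent(
    "How are AI chip supply chains shifting?",
    ["HBM supply", "export controls", "foundry capacity"],
))
print(answer)
```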
However, this comes at a cost.
Coordination overhead and consistency become serious challenges. Cognition’s Walden Yan argues that multi-agent systems today are fragile chiefly due to context management failures. Each agent only sees a slice of the whole, so misunderstandings proliferate.
One subagent might interpret a task slightly differently than another, and without a shared memory of each other’s decisions, the final assembly can conflict or miss pieces. As Yan puts it, running multiple agents in collaboration in 2025 “only results in fragile systems. The decision-making ends up being too dispersed and context isn’t able to be shared thoroughly enough between the agents.” In other words, when each subagent compresses its piece of reality in isolation, the group may lack a common context to stay aligned.
In Anthropic’s terms, the “separation of concerns” cuts both ways: it reduces interference, but also means no single agent grasps the full picture. Humans solve this by constant communication (we compress our thoughts into language and share it), but current AI agents aren’t yet adept at the high-bandwidth, nuanced communication needed to truly stay in sync over long tasks.
Cognition’s solution? Don’t default to multi-agent. First try a simpler architecture: one agent, one continuous context. Ensure every decision that agent makes “sees” the trace of reasoning that led up to it – no hidden divergent contexts.
Of course, a single context will eventually overflow, but the answer isn’t to spawn independent agents; it’s to better compress the context. Yan suggests using an extra model whose sole job is to condense the conversation history into “key details, events, and decisions.”
This summarized memory can then persist as the backbone context for the main agent. In fact, Cognition has fine-tuned smaller models to perform this kind of compression reliably. The philosophy is that if you must lose information, lose it intentionally and in one place – via a trained compressor – rather than losing it implicitly across multiple agents’ blind spots.
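A sketch of that single-threaded alternative (again with a placeholder `llm()`): one context, with old history condensed intentionally, in one place, by a dedicated compression call rather than scattered across subagents.

```python
# Sketch of a single-agent loop with a dedicated "compressor" step instead of subagents.
def llm(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"   # placeholder for a real model call

history: list[str] = []
MAX_TURNS_VERBATIM = 20

def step(user_msg: str) -> str:
    global history
    if len(history) > MAX_TURNS_VERBATIM:
        # compress the old half intentionally, in one place, rather than implicitly
        old, recent = history[:-10], history[-10:]
        summary = llm("Condense into key details, events, and decisions:\n" + "\n".join(old))
        history = [f"[compressed history] {summary}"] + recent
    history.append(f"user: {user_msg}")
    reply = llm("\n".join(history))
    history.append(f"agent: {reply}")
    return reply

for i in range(30):
    step(f"message {i}")
print(history[0])  # the backbone summary that every later decision can still "see"
```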
This approach echoes Letta’s layered memory idea: maintain one coherent thread of thought, pruning and abstracting it as needed, instead of forking into many threads that might diverge.
In the end, these approaches converge on a theme: intelligence is limited by information bottlenecks, and overcoming those limits looks a lot like compression. Whether it’s a single agent summarizing its past and querying a knowledge base, or a swarm of subagents parceling out a huge problem and each reporting back a digest, the core challenge is the same.
An effective mind (machine or human) can’t and shouldn’t hold every detail in working memory – it must aggressively filter, abstract, and encode information, yet be ready to recover the right detail at the right time. This is the classic rate–distortion tradeoff of cognition: maximize useful signal, minimize wasted space.
Letta’s layered memory shows one way: a built-in hierarchy of memory caches, from the always-present essentials to the vast but faint echo of long-term archives. Anthropic’s multi-agent system shows another: multiple minds sharing the load, each mind a lossy compressor for a different subset of the task. And Cognition’s critique reminds us that compression without coordination can fail – the pieces have to ultimately fit together into a coherent whole.
Perhaps as AI agents evolve, we’ll see hybrid strategies. We might use multi-agent teams whose members share a common architectural memory (imagine subagents all plugged into a shared Letta-style archival memory, so they’re not flying blind with respect to each other). Or we might simply get better at single agents with enormous contexts and sophisticated internal compression mechanisms, making multi-agent orchestration unnecessary for most tasks. Either way, the direction is clear: to control and extend AI cognition, we are, in a very real sense, engineering the art of forgetting. By deciding what to forget and when to recall, an agent demonstrates what it truly understands. In artificial minds as in our own, memory is meaningful precisely because it isn’t perfect recording – it’s prioritized, lossy, and alive.