Blog of Tim Kellogg

AI architect, software engineer, and tech enthusiast.

How I Use AI

2025-09-15 08:00:00

A few people have asked me how I use AI coding tools. I don’t think it’s a straightforward answer. For me it’s not really a procedure or recipe, it’s more of an ethos.

Principle: Ownership

You own the code your AI produces.

Use your own name to commit AI code so that if something breaks, everyone blames you. This is critical. How well do you need to know the code your AI produces? Well enough that you can answer for its mistakes.

In lean manufacturing they have the principle of Genchi genbutsu, i.e. “go and see for yourself.” In High Output Management, Andy Grove pushes “management by walking around”. Andy defines the output of a manager as the output of their entire org as well as the organizations under their influence.

The trouble with phrasing it as “AI coding” is that it tricks you into thinking it’s just another individual role like software engineering, whereas it actually has a lot more in common with management. It’s unfortunate we hire and mentor for it as if it were software engineering. You should be able to answer questions like:

  • What does the algorithm actually do?
  • Did it find all of the places to refactor?

Resist the urge to say, “oh, I just vibe coded this”. You coded it, and if it sucks, it’s because you don’t know how to manage your AI. Own it.

Principle: Exploit Gradients

Not all time spent is equal. For some things, you can put in a little bit of effort and get a huge amount of reward. In business, we call those opportunities.

[Figure: a graph with x-axis labeled "effort", y-axis labeled "good stuff", and a curve whose steep part is labeled "the gradient"]

Examples:

  • Biology: A tiger migrates to where there’s more food. Less effort for more food.
  • Arbitrage: Buy cheap, send to another country and sell expensive. Less effort for more money.

AI coding isn’t about writing code, it’s about creating and exploiting gradients. Finding opportunities where you can spend 10 minutes of AI time and reap a huge reward.

The obvious example is proofs of concept. You can just do it, figure out if it really works in practice the way it seems like it should, and abandon it quickly when it doesn’t.

Or data analysis. Traditionally it was labor intensive to do data analysis, but you can spin out a sick dashboard in a few minutes. Maybe that helps you avoid a dead end, or push your org in a new direction.

The key is to always be on the lookout for opportunities.

That feels a lot more like a shrewd businessman than a software engineer. Indeed! It’s a mistake that we transparently hire and promote software engineers into these roles. It’s a new beast.

How to become an AI Coder

I’m terrified of the future of software engineering.

Oh, I’ll continue having a job for a very long time. No concern about that. I’m worried that junior engineers won’t be promoted because it’s easier to dispatch a request to an AI than to give juniors the tasks that they traditionally learned the trade from.

But actually, this isn’t software engineering.

If anyone with their head on straight can take ownership and exploit gradients, then maybe junior engineers have an edge on seniors who are too stuck in their ways to realize they’ve been put in a new job role.

Discussion

Get out of your comfort zone

I broadly agree with you, would only add that people do have to get out of their comfort zone to get good at AI, and you have some obligation to do that

It’s really hard to be good at it at first, as a manager you have to give people some slack to learn those new skills too. (from @rickasourus on Twitter)

Yes, managers take note! We’re learning a new job.

Sense of ownership

I enjoyed that. You’re right about the sense of ownership. Although some developers never had a sense of ownership of even hand-crafted code. I wrote about this topic recently and it chimes with your thoughts: https://www.aidanharding.com/2025/09/coding-with-ai/ (from @aidanharding.bsky.social on Bluesky)

The good ones did.

Link Graveyard: A snapshot of my abandoned browser tabs

2025-09-13 08:00:00

I went to close a bunch of browser tabs, but realized I have some cool stuff in here. Some has been marinating for a while. Most of these I’ve read, or tried to read.

Cracks are forming in Meta’s partnership with Scale AI | TechCrunch

link: https://techcrunch.com/2025/08/29/cracks-are-forming-in-metas-partnership-with-scale-ai/

Alexandr Wang at Meta is apparently difficult to work with, and people at Meta are doubting the fidelity of data produced by his Scale AI.

[2506.22084] Transformers are Graph Neural Networks

link: https://arxiv.org/abs/2506.22084

IIRC they draw parallels between attention and graphs and argue that LLMs are graph neural nets, meaning that they can be used to look at graphs and guess what connections are missing.

I don’t think I posted anything on this, because while I find the idea fascinating, I couldn’t figure out how to make it feel tangible.

Beyond Turing: Memory-Amortized Inference as a Foundation for Cognitive Computation

link: https://arxiv.org/abs/2508.14143

Fairly sure I never read this one. Looks interesting. Kind of far out there.

GLM-4.5: Reasoning, Coding, and Agentic Abilities

link: https://z.ai/blog/glm-4.5

GLM-4.5 announcement. These have turned out to be the leading open source models. Everything I hear is good.

When an AI Seems Conscious

link: https://whenaiseemsconscious.org/

I only read a little and gave up. This feels like a good take, maybe. Inside my own head I completely punt on having a take on AI consciousness and opt instead for the “don’t be a dick” rule. Idk, maybe they are, maybe they aren’t, I’ll just live in the moment.

Personal Superintelligence

link: https://www.meta.com/superintelligence/

Zuck’s treatise on AI. I didn’t read. Normally I try to make an attempt to read these sorts of takes, or at least skim them, but I was busy at work. I had it loaded up on my phone to read on a plane, but it wouldn’t load once I was off WiFi. Sad.

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

link: https://arxiv.org/abs/2508.06471

The GLM-4.5 paper. This was a super interesting model. It feels like it breaks the “fancy model” rule in that it’s very architecturally cool but the personality doesn’t feel like it’s been squished out.

Blog | Dwarkesh Podcast | Dwarkesh Patel | Substack

link: https://www.dwarkesh.com/s/blog

It’s a good blog, what can I say. Definitely on the over-hype side, but he’s got real takes and seems so intent on getting to the truth that he spends a lot of time on geopolitics just simply to understand AI dynamics. Mad respect.

Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset

link: https://blog.datologyai.com/technical-deep-dive-curating-our-way-to-a-state-of-the-art-text-dataset/

I forget why I ended up here, but it’s an excellent post. I think this is connected to my project at work training a model. This post brings up a ton of data curation techniques.

I’ve recently learned and fully accepted that ALL major LLM advances come down to data. Yes, the architectural advances are cool and fun to talk about, but any meaningful progress has come from higher quality, higher quantity, or cheaper data.

AlphaGo Moment for Model Architecture Discovery

link: https://arxiv.org/abs/2507.18074

Cool paper about auto-discovery of model architectures. IIRC they took a bunch of model architecture ideas, like group attention and mixture of experts, and used algorithms to mix and match all the parameters and configurations until something interesting popped out. It feels like a legitimately good way to approach research.

WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

link: https://arxiv.org/abs/2507.15061

From Qwen, I don’t think I read this one, probably because it’s a bit dense and was hard to get fully engaged on. The idea seems cool though.

Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere

link: https://arxiv.org/abs/2005.10242

Classic paper. I read this one for work. I was trying to appreciate what Alignment & Uniformity measure and why they’re important. This was the paper that formalized those measures. It’s actually a pretty good paper to read, albeit from 2020.

Train LLMs Faster, Better, and Smaller with DatologyAI’s Data Curation

link: https://blog.datologyai.com/train-llms-faster-better-and-smaller-with-datologyai-s-data-curation/

More DatologyAI; they’re good, everything they do is good. BTW there’s a Latent Space episode with DatologyAI and it’s very good.

Nvidia DGX Spark | Hacker News

link: https://news.ycombinator.com/item?id=45008434

Chips are good too.

The Second Half – Shunyu Yao – 姚顺雨

link: https://ysymyth.github.io/The-Second-Half/

This will be a classic post, calling it now. It lays out a great history and current state of AI and specifically reinforcement learning.

A Taxonomy of Transcendence

link: https://arxiv.org/abs/2508.17669

What? This is amazing. I don’t think I even looked at it, sad. Actually, now that I’m reading this I’m recalling that’s how I ended up on the Graph Neural Network link.

IIRC this is saying that LLMs can be highly intelligent because they incorporate the best parts of a huge number of people. IMO this is spiritually the same as my Three Plates blog post where I explain how unit tests, which are inherently buggy, can improve the overall quality of a system.

GitHub - gepa-ai/gepa: Optimize prompts, code, and more with AI-powered Reflective Text Evolution

link: https://github.com/gepa-ai/gepa?tab=readme-ov-file#using-gepa-to-optimize-your-system

An algorithm for automatic prompt optimization. Happily, they support DSPy, so there’s no new framework that you have to take wholesale.

On the Theoretical Limitations of Embedding-Based Retrieval | alphaXiv

link: https://www.alphaxiv.org/pdf/2508.21038

This was a fascinating one. A colleague tried convincing me of this but I didn’t buy it until I read this paper. It makes a ton of sense. I have a simplified bluesky thread here.

tl;dr — embedding vectors have trouble representing compound logic (“horses” AND “Chinese military movements”) and generally fall apart quickly. It’s not that it’s not possible, it’s that it’s not feasible to cram that much information into such a small space.
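
To make that concrete, here’s a toy numpy sketch (my own illustration, not the paper’s construction): documents are embedded as normalized bags of topic vectors, and a single query vector for “horses AND Chinese military movements” gets diluted, so a document about only one topic can outrank one that actually covers both.

import numpy as np

# Toy setup: 10 orthogonal "topic" axes stand in for embedding dimensions.
topics = np.eye(10)
horses, china_mil = topics[0], topics[1]

def embed(*topic_vecs):
    """Embed a text as the normalized sum of its topic vectors."""
    v = np.sum(topic_vecs, axis=0)
    return v / np.linalg.norm(v)

# Doc A: only about horses. Doc B: covers horses AND Chinese military
# movements, but also eight other topics (a long, wide-ranging article).
doc_a = embed(horses)
doc_b = embed(horses, china_mil, *topics[2:])

# Compound query: "horses AND Chinese military movements"
query = embed(horses, china_mil)

print("score(A, query) =", round(float(query @ doc_a), 3))  # ~0.707
print("score(B, query) =", round(float(query @ doc_b), 3))  # ~0.447
# The doc that satisfies only half the query wins: a single vector can't
# express a hard AND constraint, it can only average the topics together.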

[2107.05720] SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

link: https://arxiv.org/abs/2107.05720?utm_source=chatgpt.com

I ran into this while diving into the last link. It’s an older (2021) paper that has some potential for addressing the problems with embeddings. Realistically, I expect late interaction multi-vectors to be the end answer.

meituan-longcat/LongCat-Flash-Chat · Hugging Face

link: https://huggingface.co/meituan-longcat/LongCat-Flash-Chat

A super cool model that uses no-op MoE experts to dynamically turn down the amount of compute per token. Unfortunately, this one didn’t seem to be embraced by the community.

MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings

link: https://arxiv.org/abs/2405.19504v1

More embedding links. Now that I’m scanning it, I’m not sure it really soaked in the first time. They seem to have solved a lot of the problems with other late interaction methods. Maybe I should take a deeper look.

modeling_longcat_flash.py · meituan-longcat/LongCat-Flash-Chat at main

link: https://huggingface.co/meituan-longcat/LongCat-Flash-Chat/blob/main/modeling_longcat_flash.py

IDK sometimes you just have to look at the code to be sure.

The Rachel Maddow Show - Aug. 25 | Audio Only - YouTube

link: https://m.youtube.com/watch?v=mU0HAmgwrz0&pp=QAFIAQ%3D%3D

Uh, no idea why this is up. I don’t really watch this show.

Inside vLLM: Anatomy of a High-Throughput LLM Inference System - Aleksa Gordić

link: https://www.aleksagordic.com/blog/vllm

Fascinating breakdown of vLLM. If you’re not familiar, vLLM is like Ollama but actually a good option if you want to run it in production. Don’t run Ollama in production, kids, KV caches are good.

Honestly, this is absolutely worth your time if AI infrastructure is your jam (or you just want it to be). It goes into all the big concepts that an AI infra engineer needs to know. TBQH I love the intersection of AI & hardware.

Simon Willison’s Weblog

link: https://simonwillison.net/

I mean, you have one of these tabs open too, right? riiiight????

ALPS - About

link: https://algorithms-with-predictions.github.io/about/

Someone sent me this link and there was a reason, I know it. I just don’t remember why. IIRC it was because I brought up the A Case For Learned Indices paper and they pointed me to this whole treasure trove of papers that (sort of) evolved out of that. Basically traditional algorithms re-implemented using machine learning.

Modular: Blog

link: https://www.modular.com/blog

Yeah, idk, I think I was reading Matrix Multiplication on Blackwell: Part 3 — The Optimization Behind 80% of SOTA Performance

Another AI infra post, heavy on algorithms & hardware.

OpenGVLab/InternVL3_5-241B-A28B · Hugging Face

link: https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B

A cool concept. IIRC they introduce Cascade RL, automatically refining the RL dataset based on how current rollouts perform.

hong kong - Google Search

link: https://www.google.com/search?q=hong+kong&ie=UTF-8&oe=UTF-8&hl=en-us&client=safari

IDK I guess I was just trying to remember if Hong Kong was in China or not. And I learned that there’s a reason why I’m confused.

Photonic processor could enable ultrafast AI computations with extreme energy efficiency | MIT News | Massachusetts Institute of Technology

link: https://news.mit.edu/2024/photonic-processor-could-enable-ultrafast-ai-computations-1202

Someone sent me this link. It seems cool. Not sure it’s going to change much.

Ancient Aliens: Are There Extraterrestrial Structures On The Moon? (S11, E11) | Full Episode - YouTube

link: https://m.youtube.com/watch?v=Tkews9pRH1U&pp=QAFIBQ%3D%3D

I mean, aliens! Don’t tell me you don’t have secret fascinations.

The Lore of 20yo ML Researcher at Prime Intellect | RL, Agents and Intelligence - YouTube

link: https://m.youtube.com/watch?v=tnfFn-uQ6WA&pp=0gcJCRsBo7VqN5tD

Oh, this was a great podcast. Well, I didn’t like the host but @kalomaze is worth following. Apparently only 20yo, never attempted college but a talented AI researcher nonetheless.

GPT-5 System Card | OpenAI

link: https://cdn.openai.com/gpt-5-system-card.pdf

Sometimes you just need to look things up to be sure.

OpenGVLab/InternVL3_5-241B-A28B · Hugging Face

link: https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B

Again, apparently. It honestly is a good model.

C.S. Lewis’s Divine Comedy | C.S. Lewis Web

link: https://cslewisweb.com/2012/08/02/c-s-lewiss-divine-comedy/

Been thinking about how he described the outer layer of hell as consisting of people living equidistant from each other because they can’t stand anyone else. It was written like 100 years ago but feels like a commentary on today’s politics.

Claude Code: Behind-the-scenes of the master agent loop

link: https://blog.promptlayer.com/claude-code-behind-the-scenes-of-the-master-agent-loop/

Actually, this is a pretty detailed breakdown of Claude Code. They seem to have decompiled the code without de-obfuscating it, which leads to some kind of silly quotes. But it’s good.

Airia AI Platform | Build, Deploy & Scale Enterprise AI

link: https://airia.com/ai-platform/

No idea how I got here. Looks like a Low/No Code builder.

[2509.04575] Bootstrapping Task Spaces for Self-Improvement

link: https://www.arxiv.org/abs/2509.04575

Right, this one is the ExIt Paper. It’s another attempt at auto-managing RL curriculum dynamically by how training is progressing.

Cognition: The Devin is in the Details

link: https://www.swyx.io/cognition

Swyx joined Cognition and dropped a treatise on AI engineering. It’s good.

Paper page - Reverse-Engineered Reasoning for Open-Ended Generation

link: https://huggingface.co/papers/2509.06160

This was an excellent one. Another auto-curriculum RL paper. I did a bluesky breakdown here.

New Chat | Chat with Z.ai - Free AI Chatbot powered by GLM-4.5

link: https://chat.z.ai/c/6607ee45-27d5-487a-a1e2-44c2176040eb

GLM-4.5 chat application

iPhone Air | Hacker News

link: https://news.ycombinator.com/item?id=45186015

Seems like the new Apple A19 chip has real matrix multiplication operations. Previous generations had excellent memory bandwidth; this gives it matching compute (on AI-friendly workloads). So I guess Macs will stay relevant for a while.

Poland closest to open conflict since World War Two, PM says after Russian drones shot down - live updates - BBC News

link: https://www.bbc.com/news/live/c2enwk1l9e1t

NGL this freaks me out.

Walking around the app | ★❤✰ Vicki Boykis ★❤✰

link: https://vickiboykis.com/2025/09/09/walking-around-the-app/

Vicki writes such thoughtful pieces. Always worth reading her work.

Defeating Nondeterminism in LLM Inference - Thinking Machines Lab

link: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Oh wow, this was an amazing read. Very deep dive into AI infrastructure and, whoah, did you know that GPUs have operations that aren’t deterministic?

I did a bluesky thread here.

The Architecture of Groq’s LPU - by Abhinav Upadhyay

link: https://blog.codingconfessions.com/p/groq-lpu-design

Looked this up as a tangent off the last link. Groq (not Grok) designed their ASIC to be fully deterministic from the ground up, and then built a really cool distributed system around it that assumes fully synchronous networking (not packet switching like TCP). It’s an absolutely crazy concept.

Levanter — Legible, Scalable, Reproducible Foundation Models with JAX

link: https://crfm.stanford.edu/2023/06/16/levanter-1_0-release.html

I didn’t read this, but it’s definitely a tangent off of non-deterministic LLMs.

Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

link: https://tiger-ai-lab.github.io/Hierarchical-Reasoner/

Absolutely fascinating. I only read the blog, not the paper, but it frames RL as a 2-stage process where RL is mostly slinging together discrete skills (learned during pre-training).

It’s not an auto-curriculum RL paper AFAICT, it’s just a huge improvement in RL efficiency by focusing only on the “pivot” tokens.

What is entropix doing? - Tim Kellogg

link: https://timkellogg.me/blog/2024/10/10/entropix

I had looked this up as a reference to “pivot” tokens. Honestly, I link people back to this blog a lot.

GitHub - ast-grep/ast-grep-mcp

link: https://github.com/ast-grep/ast-grep-mcp

An MCP server that lets you search code while respecting the structure. I’ve heard some very positive things as well as “meh” responses on this. I’m sure real usage is a bit nuanced.

Life, Maybe, On Mars, Unless We Change Our Minds | Science | AAAS

link: https://www.science.org/content/blog-post/life-maybe-mars-unless-we-change-our-minds

Guys, this is incredible!

GPT-5 failed the wrong test

2025-08-08 08:00:00

This post isn’t really about GPT-5. Sure, it launched and people are somewhat disappointed. It’s the why that bugs me.

They expected AGI, the AI god, but instead got merely the best model in the world. v disapointng

A few days before the GPT-5 launch I read this paper, Agentic Web: Weaving the Next Web with AI Agents. It’s not my normal kind of paper, it’s not very academic. There’s no math in it, no architecture. It just paints a picture of the future.

And that’s the lens I saw GPT-5 through.

The paper describes three eras of the internet:

  • PC Era — Wikipedia, Craigslist, etc.; users actively seek information
  • Mobile/Social Era — TikTok, Insta, etc.; content is pushed via recommendation algorithms
  • Agentic Web — user merely expresses intent

[Image: the three eras of the internet, explained below]

When I weigh the strengths of GPT-5, it feels poised and ready for the agentic web.

How do I vibe test an LLM?

I use it. If it changes how I work or think, then it’s a good LLM.

o3 dramatically changed how I work. GPT-4 did as well. GPT-5 didn’t, because it’s the end of the line. You can’t really make a compelling LLM anymore, they’re all so good most people can’t tell them apart. Even the tiny ones.

I talked to a marketing person this week. I showed them Claude Code. They don’t even write code, but they insisted it was 10x better than any model they’d used before, even Claude. I’d echo the same thing, there’s something about those subagents, they zoom.

Claude Code is software.

Sure, there’s a solid model behind it. But there’s a few features that make it really tick. Replicate those and you’re well on your way.

GPT-5 is for the agentic web

The first time I heard agentic web I almost vomited in my mouth. It sounds like the kind of VC-induced buzzword cesspool that I keep my distance from.

But this paper..

I want AI to do all the boring work in life. Surfing sites, research, filling out forms, etc.

Models like GPT-5 and gpt-oss are highly agentic. All the top models are going in that direction. They’re put in a software harness, RL is applied, and their weights are updated according to how well they use their tools. They’re trained to be agents.

I hear a lot of criticism of GPT-5, but none of it comes from people who recognize that it can go 2-4 hours between human contact while working on agentic tasks. Whoah.

GPT-5 is for the agentic web.

yeah but i hate ads

Well okay, me too. Not sure where that came from but I don’t think that’s where this is going. Well, it’s exactly where it’s going, but not in the way you’re thinking.

The paper talks about this. People need to sell stuff, that won’t change. They want you to buy their stuff. All that is the same.

The difference is agents. In the agentic web, everything is mediated by agents.

You don’t search for a carbon monoxide monitor, you ask your agent to buy you one. You don’t even do that, your agent senses it’s about to die and suggests that you buy one, before it wakes you up in the middle of the night (eh, yeah, sore topic for me).

You’re a seller and you’re trying to game the system? Ads manipulate consumers, but consumers aren’t buying anymore. Who do you manipulate? Well, agents. They’re the ones making the decisions in the agentic web.

The paper calls this the Agent Attention Economy, and it operates under the same constraints. Attention is still limited, even agent attention, but you need them to buy your thing.

The paper makes some predictions, they think there will be brokers (like ad brokers) that advertise agents & resources to be used. So I guess you’d game the system by making your product seem more useful or better than it is, so it looks appealing to agents and more agents use it.

I’m not sure what that kind of advertising would look like. Probably like today’s advertising, just more invisible.

Benchmarks

The only benchmark that matters is how much it changes life.

At this point, I don’t think 10T parameters is really going to bump that benchmark any. I don’t think post-training on 100T tokens of math is going to change much.

I get excited about software. We’re at a point where software is so extremely far behind the LLMs. Even the slightest improvements in an agent harness design yield outsized rewards, like how Claude Code is still better than OpenAI codex-cli with GPT-5, a better coding model.

My suspicion is that none of the AI models are going to seem terribly appealing going forward without massive leaps in the software harness around the LLM. The only way to really perceive the difference is how it changes your life, and we’re long past where a pure model can do that.

Not just software, but also IT infrastructure. Even small questions like, “when will AI get advertising?” If an AI model literally got advertising baked straight into the heart of the model, that would make me sad. It means the creators aren’t seeing the same vision.

We’ve talked a lot about the balance between pre-training and post-training, but nobody seems to be talking about the balance between LLMs and their harnesses.

Areas for growth

Before we see significant improvement in models, we’re going to need a lot more in:

  • Memory — stateful agents that don’t forget you
  • Harnesses — the software around the LLM inside the agent
  • Networking & infra — getting agents to discover and leverage each other

Probably several other low-hanging areas.

Discussion

Explainer: K2 & Math Olympiad Golds

2025-07-19 08:00:00

Feeling behind? Makes sense, AI moves fast. This post will catch you up.

The year of agents

First of all, yes, ‘25 is the year of agents. Not because we’ve achieved agents, but because we haven’t. It wouldn’t be worth talking about if we were already there. But there’s been a ton of measurable progress toward agents.

Timeline

The last 6 months:

Is ‘thinking’ necessary?

Obviously it is, right?

Back in January, we noticed that when a model does Chain of Thought (CoT) “thinking”, it elicits these behaviors:

  • Self-verification
  • Sub-goal setting
  • Backtracking (undoing an unfruitful path)
  • Backward chaining (working backwards)

All year, every person I talked to assumed thinking is non-negotiable for agents. Until K2.

K2 is an agentic model, meaning it was trained to solve problems using tools. It performs very well on agentic benchmarks, but it doesn’t have a long thought trace. It was so surprising that I thought I heard wrong and it took a few hours to figure out what the real story was.

For agents, this is attractive because thinking costs tokens (which cost dollars). If you can accomplish a task in fewer tokens, that’s good.

What to watch

  • More models trained like K2

Tool usage connects the world

R1 and o1 were trained to think, but o3 was trained to use tools while it’s thinking. That’s truly changed everything, and o3 is by far my favorite model of the year. You can just do things.

MCP was a huge jump toward agents. It’s a dumb protocol, leading a lot of people to misunderstand what the point is. It’s just a standard protocol for letting LLMs interact with the world. Emphasis on standard.

The more people who use it, the more useful it becomes. When OpenAI announced MCP support, that established full credibility for the protocol.

K2 tackled the main problem with MCP. Since it’s standard, that means anyone can make an MCP server, and that means a lot of them suck. K2 used a special system during training that generated MCP tools of all kinds. Thus, K2 learned how to learn how to use tools.

That pretty much covers our current agent challenges.

What to watch

  • More models trained like K2
  • MCP adoption

Are tools necessary?

In math, we made a lot of progress this year in using a tool like a proof assistant. e.g. DeepSeek-Prover v2 was trained to write Lean code and incrementally fix the errors & output. That seemed (and still does) like a solid path toward complex reasoning.

But today, some OpenAI researchers informally announced on X that their private model won gold in the International Math Olympiad. This is a huge achievement.

But what makes it surprising is that it didn’t use tools. It relied on only a monstrous amount of run-time “thinking” compute, that’s it.

Clearly stated: Next token prediction (what LLMs do) produced genuinely creative solutions requiring high levels of expertise.

If LLMs can be truly creative, that opens a lot of possibilities for agents. Especially around scientific discovery.

What to watch

  • This math olympiad model. The implications are still unclear. It seems it’s more general than math.

Huge vs Tiny

Which is better?

On the one hand, Opus-4, Grok 4 & K2 are all huge models that have a depth that screams “intelligence”. On the other hand, agentic workloads are 24/7 and so the cheaper they are, the better.

Furthermore, there’s a privacy angle. A model that runs locally is inherently more private, since the traffic never leaves your computer.

What to watch

  • Mixture of Experts (MoE). e.g. K2 is huge, but only uses a very small portion (32B), which means it uses less compute than a lot of local models. This might be the secret behind o3’s 80% price drop.
  • OpenAI open weights model is expected to land in a couple weeks. It likely will run on a laptop and match at least o3-mini (Jan 31).
  • GPT-5, expected this fall, is described as a mix of huge & tiny, applying the right strength at the right time

Context engineering & Sycophancy

The biggest shifts this year have arguably been not in the model but in engineering. The flagship change is the emergence of the term context engineering as a replacement for prompt engineering.

It’s an acknowledgement that “prompt” isn’t just a block of text. It also comes from tool documentation, RAG databases & other agents. The June multi-agent debate was about how managing context between agents is really hard.

Also, while some are saying, “don’t build multi-agents”, Claude Code launches subagents all the time for any kind of research or investigation task, and is the top coding agent right now.

Similarly, sycophancy causes instability in agents. Many are considering it a top problem, on par with hallucination.

What to watch

  • Memory — stateful agents (e.g. those built on Letta) are phenomenally interesting but difficult to build. If done well, memory solves a lot of context engineering problems.
  • Engineering blogs. As we gain more experience with these things, it’ll become apparent how to do it well.

Going forward…

And all that is seriously skipping over a lot. Generally, ‘25 has shifted more time into engineering (instead of research). Put another way, model development is starting to become product development instead of just research.

What will happen in the second half of ‘25? Not sure, but I can’t wait to find out.

Discussion

Do LLMs understand?

2025-07-18 08:00:00

I’ve avoided this question because I’m not sure we understand what “understanding” is. Today I spent a bit of time, and I think I have a fairly succinct definition:

An entity can understand if it builds a latent model of reality. And:

  1. Can Learn: When presented with new information, the latent model grows more than the information presented, because it’s able to make connections with parts of its existing model.
  2. Can Deviate: When things don’t go according to plan, it can use its model to find an innovative solution that it didn’t already know, based on its latent model.

Further, the quality of the latent model can be measured by how coherent it is. Meaning that, if you probe it in two mostly unrelated areas, it’ll give answers that are logically consistent with the latent model.

I think there’s plenty of evidence that LLMs are currently doing all of this.

But first..

Latent Model

Mental model. That’s all I mean. Just trying to avoid anthropomorphizing more than necessary.

This is the most widely accepted part of this. Latent just means that you can’t directly observe it. Model just means that it’s a system of approximating the real world.

For example, if you saw this:

[Image: a dotted 3-D sphere — the discrete points line up to read unmistakably as a ball while keeping an airy, voxel-like feel]

You probably identify it immediately as a sphere even though it’s just a bunch of dots.

A latent model is the same thing, just less observable. Like you might hold a “map” of your city in your head. So if you’re driving around and a street gets shut down, you’re not lost, you just refer to your latent model of your city and plan a detour. But it’s not exactly a literal image like Google maps. It’s just a mental model, a latent model.

Sycophancy causes incoherence

From 1979 to 2003, Saddam Hussein surrounded himself with hand‑picked yes‑men who, under fear of death, fed him only flattering propaganda and concealed dire military or economic realities. This closed echo chamber drove disastrous miscalculations—most notably the 1990 invasion of Kuwait and his 2003 standoff with the U.S.—that ended in his regime’s collapse and his own execution.

Just like with Saddam, sycophancy causes the LLM to diverge from its true latent model, which causes incoherence. And so, the amount of understanding decreases.

Embedding models demonstrate latent models

Otherwise they wouldn’t work.

The word2vec paper famously showed that the embedding of “king - man + woman” is close to the embedding for “queen” (in embedding space). In other words, embeddings model the meaning of the text.

That was in 2013, before LLMs. It wasn’t even that good then, and the fidelity of that latent model has dramatically increased with the scale of the model.
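
If you want to poke at this yourself, the classic analogy is a few lines with gensim and the pretrained Google News word2vec vectors (assuming you’re willing to download the ~1.6 GB model):

import gensim.downloader as api

# Downloads the pretrained Google News word2vec vectors on first use.
wv = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically comes out on top, which is the latent-model claim in action.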

In-context learning (ICL) demonstrates they can learn

ICL is when you can teach a model new tricks at runtime simply by offering examples in the prompt, or by telling it new information.
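
As a concrete (made-up) example, ICL is as simple as packing labeled examples into the prompt and letting the model continue the pattern; no weights change:

# A minimal few-shot prompt. Nothing is fine-tuned; the "learning" happens
# entirely in-context, at inference time.
examples = [
    ("I loved this movie, watched it twice", "positive"),
    ("The plot made no sense and I left early", "negative"),
    ("Best concert I've been to all year", "positive"),
]

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += "Review: The food was cold and the service was worse\nSentiment:"

print(prompt)  # Send this to any LLM; a capable model continues with "negative".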

In the GPT-3 paper they showed that ICL improved as they scaled the model up from 125M to 175B. When the LLM size increases, it can hold a larger and more complex latent model of the world. When presented with new information (ICL), the larger model is more capable of acting correctly on it.

Makes sense. The smarter you get, the easier it is to get smarter.

Reasoning guides deviation

When models do Chain of Thought (CoT), they second-guess themselves, which probes their own internal latent model more deeply. In (2), we said that true understanding requires that the LLM can use its own latent model of the world to find innovative solutions to unplanned circumstances.

A recent Jan-2025 paper shows that this is the case.

Misdirection: Performance != Competence

A large segment of the AI-critical crowd uses this argument as evidence. Paraphrasing:

Today’s image-recognition networks can label a photo as “a baby with a stuffed toy,” but the algorithm has no concept of a baby as a living being – it doesn’t truly know the baby’s shape, or how that baby interacts with the world.

This was in 2015 so the example seems basic, but the principle is still being applied in 2025.

The example is used to argue that AI isn’t understanding, but it merely cherry-picks a single place where the AI’s latent model of the world is inconsistent with reality.

I can cherry-pick examples all day long of humans’ mental models diverging from reality. Like you take a wrong turn down a street and it takes you across town. Or you thought the charismatic candidate would do good things for you. On and on.

Go the other way, prove that there are areas where AI’s latent model matches reality.

But that’s dissatisfying, because dolphins have a mental model of the sea floor, and tiny ML models have areas where they do well, and generally most animals have some aspect of the world that they understand.

Conclusion

Why are we arguing this? I’m not sure, it comes up a lot. I think a large part of it is human exceptionalism. We’re really smart, so there must be something different about us. We’re not just animals.

But more generally, AI really is getting smart, to a point that starts to feel more uncomfortable as it intensifies. We have to do something with that.

Layers of Memory, Layers of Compression

2025-06-15 08:00:00

Recently, Anthropic published a blog post detailing their multi-agent approach to building their Research agent. Also, Cognition wrote a post on why multi-agent systems don’t work today. The thing is, they’re both saying the same thing.

At the same time, I’ve been enthralled watching a new bot, Void, interact with users on Bluesky. Void is written in Letta, an AI framework oriented around memory. Void feels alive in a way no other AI bot I’ve encountered feels. Something about the memory gives it a certain magic.

I took some time to dive into Letta’s architecture and noticed a ton of parallels with what the Anthropic and Cognition posts were saying, around context management. Letta takes a different approach.

Below, I’ve had OpenAI Deep Research format our conversation into a blog post. I’ve done some light editing, adding visuals etc., but generally it’s all AI. I appreciated this, I hope you do too.


When an AI agent “remembers,” it compresses. Finite context windows force hard choices about what to keep verbatim, what to summarize, and what to discard. Letta’s layered memory architecture embraces this reality by structuring an agent’s memory into tiers – each a lossy compression of the last. This design isn’t just a storage trick; it’s an information strategy.

Layered Memory as Lossy Compression

Letta (formerly MemGPT) splits memory into four memory blocks: core, message buffer, archival, and recall. Think of these as concentric rings of context, from most essential to most expansive, similar to L1, L2, L3 cache on a CPU:

flowchart TD
  subgraph rec[Recall Memory]
    subgraph arch[Archival Memory]
      subgraph msg[Message Buffer]
        Core[Core Memory]
      end
    end
  end
  • Core memory holds the agent’s invariants – the system persona, key instructions, fundamental facts. It’s small but always in the prompt, like the kernel of identity and immediate purpose.
  • Message buffer is a rolling window of recent conversation. This is the agent’s short-term memory (recent user messages and responses) with a fixed capacity. As new messages come in, older ones eventually overflow.
  • Archival memory is a long-term store, often an external vector database or text log, where overflow messages and distilled knowledge go. It’s practically unbounded in size, but far from the model’s immediate gaze. This is highly compressed memory – not compressed in ZIP-file fashion, but in being irrelevant by default until needed.
  • Recall memory is the retrieval buffer. When the agent needs something from the archive, it issues a query; relevant snippets are loaded into this block for use. In effect, recall memory “rehydrates” compressed knowledge on demand.

How it works: On each turn, the agent assembles its context from core knowledge, the fresh message buffer, and any recall snippets. All three streams feed into the model’s input. Meanwhile, if the message buffer is full, the oldest interactions get archived out to long-term memory.

Later, if those details become relevant, the agent can query the archival store to retrieve them into the recall slot. What’s crucial is that each layer is a lossy filter: core memory is tiny but high-priority (no loss for the most vital data), the message buffer holds only recent events (older details dropped unless explicitly saved), and the archive contains everything in theory but only yields an approximate answer via search. The agent itself chooses what to promote to long-term storage (e.g. summarizing and saving a key decision) and what to fetch back.

It’s a cascade of compressions and selective decompressions.
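
Here’s a rough Python sketch of that loop (my own simplification, not Letta’s actual API): a core block that’s always in the prompt, a bounded message buffer that evicts into an archive, and a recall step that searches the archive on demand.

from collections import deque

class LayeredMemory:
    """Toy version of core / message buffer / archival / recall memory."""

    def __init__(self, core, buffer_size=6):
        self.core = core            # always in the prompt
        self.buffer = deque()       # recent messages (short-term)
        self.buffer_size = buffer_size
        self.archive = []           # long-term store (stand-in for a vector DB)

    def add_message(self, msg):
        self.buffer.append(msg)
        while len(self.buffer) > self.buffer_size:
            # Oldest messages overflow into the archive: lossy by neglect,
            # out of the prompt unless explicitly recalled later.
            self.archive.append(self.buffer.popleft())

    def recall(self, query, k=2):
        # Stand-in for vector search: crude keyword-overlap scoring.
        q_words = set(query.lower().split())
        scored = sorted(
            self.archive,
            key=lambda m: len(q_words & set(m.lower().split())),
            reverse=True,
        )
        return scored[:k]

    def build_context(self, query):
        # Core + recalled snippets + recent buffer -> the LLM's input.
        recalled = self.recall(query)
        return "\n".join([self.core, *recalled, *self.buffer, query])

mem = LayeredMemory(core="You are a helpful agent named Void.")
for i in range(10):
    mem.add_message(f"user message {i} about project alpha")
print(mem.build_context("what did we decide about project alpha?"))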

Rate–distortion tradeoff: This hierarchy embodies a classic principle from information theory. With a fixed channel (context window) size, maximizing information fidelity means balancing rate (how many tokens we include) against distortion (how much detail we lose).

Letta’s memory blocks are essentially a rate–distortion ladder. Core memory has a tiny rate (few tokens) but zero distortion on the most critical facts. The message buffer has a larger rate (recent dialogue in full) but cannot hold everything – older context is distorted by omission or summary. Archival memory has effectively infinite capacity (high rate) but in practice high distortion: it’s all the minutiae and past conversations compressed into embeddings or summaries that the agent might never look at again.

The recall stage tries to recover (rehydrate) just enough of that detail when needed. Every step accepts some information loss to preserve what matters most. In other words, to remember usefully, the agent must forget judiciously.
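
For reference, this ladder echoes the classical rate-distortion function from information theory (standard textbook material, nothing Letta-specific): the minimum rate needed to reconstruct a source while keeping expected distortion below D.

R(D) = \min_{p(\hat{x} \mid x) \,:\, \mathbb{E}[d(X,\hat{X})] \le D} I(X; \hat{X})

Each memory tier picks a different operating point on this curve: core memory pays full rate for zero distortion on a handful of facts, while the archive accepts high distortion in exchange for a prompt-time rate near zero.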

This layered approach turns memory management into an act of cognition.

Summarizing a chunk of conversation before archiving it forces the agent to decide what the gist is – a form of understanding. Searching the archive for relevant facts forces it to formulate good queries – effectively reasoning about what was important. In Letta’s design, compression is not just a storage optimization; it is part of the thinking process. The agent is continually compressing its history and decompressing relevant knowledge as needed, like a human mind generalizing past events but recalling a specific detail when prompted.

flowchart TD
  U[User Input] ---> LLM
  CI[Core Instructions] --> LLM
  RM["Recent Messages (Short-term Buffer)"] --> LLM
  RS["Retrieved Snippets (Recall)"] --> LLM
  LLM ----> AR[Agent Response]
  RM -- evict / summarize --> VS["Vector Store (Archival Memory)"]
  LLM -- summarize ---> VS
  VS -- retrieve --> RS

Caption: As new user input comes in, the agent’s core instructions and recent messages combine with any retrieved snippets from long-term memory, all funneling into the LLM. After responding, the agent may drop the oldest message from short-term memory into a vector store, and perhaps summarize it for posterity. The next query might hit that store and pull up the summary as needed. The memory “cache” is always in flux.

One Mind vs. Many Minds: Two Approaches to Compression

The above is a single-agent solution: one cognitive entity juggling compressed memories over time. An alternative approach has emerged that distributes cognition across multiple agents, each with its own context window – in effect, parallel minds that later merge their knowledge.

Anthropic’s recent multi-agent research system frames intelligence itself as an exercise in compression across agents. In their words, “The essence of search is compression: distilling insights from a vast corpus.” Subagents “facilitate compression by operating in parallel with their own context windows… condensing the most important tokens for the lead research agent”.

Instead of one agent with one context compressing over time, they spin up several agents that each compress different aspects of a problem in parallel. The lead agent acts like a coordinator, taking these condensed answers and integrating them.

This multi-agent strategy acknowledges the same limitation (finite context per agent) but tackles it by splitting the work. Each subagent effectively says, “I’ll compress this chunk of the task down to a summary for you,” and the lead agent aggregates those results.

It’s analogous to a team of researchers: divide the topic, each person reads a mountain of material and reports back with a summary so the leader can synthesize a conclusion. By partitioning the context across agents, the system can cover far more ground than a single context window would allow.
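
A toy sketch of that orchestration pattern (my own simplification, not Anthropic’s system): each subagent compresses one slice of the problem into a digest, and the lead agent only ever sees the digests, never the raw corpus.

from concurrent.futures import ThreadPoolExecutor

def subagent(subtopic, corpus):
    """Stand-in for a subagent with its own context window: it reads a slice
    of the corpus and returns a condensed digest (here, a crude truncation)."""
    relevant = [doc for doc in corpus if subtopic in doc]
    return f"{subtopic}: {len(relevant)} sources; key points: " + " | ".join(
        doc[:40] for doc in relevant[:3]
    )

def lead_agent(question, subtopics, corpus):
    # Fan out: each subagent compresses its slice in parallel.
    with ThreadPoolExecutor() as pool:
        digests = list(pool.map(lambda t: subagent(t, corpus), subtopics))
    # Fan in: the lead agent synthesizes from digests only.
    return f"Answer to '{question}', synthesized from:\n" + "\n".join(digests)

corpus = [
    "quantum error correction basics and recent results ...",
    "quantum hardware vendors and roadmaps ...",
    "error correction benchmarks across platforms ...",
]
print(lead_agent("state of quantum computing", ["quantum", "error correction"], corpus))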

In fact, Anthropic found that a well-coordinated multi-agent setup outperformed a single-agent approach on broad queries that require exploring many sources. The subagents provided separation of concerns (each focused on one thread of the problem) and reduced the path-dependence of reasoning – because they explored independently, the final answer benefited from multiple compressions of evidence rather than one linear search.

However, this comes at a cost.

Coordination overhead and consistency become serious challenges. Cognition’s Walden Yan argues that multi-agent systems today are fragile chiefly due to context management failures. Each agent only sees a slice of the whole, so misunderstandings proliferate.

One subagent might interpret a task slightly differently than another, and without a shared memory of each other’s decisions, the final assembly can conflict or miss pieces. As Yan puts it, running multiple agents in collaboration in 2025 “only results in fragile systems. The decision-making ends up being too dispersed and context isn’t able to be shared thoroughly enough between the agents.” In other words, when each subagent compresses its piece of reality in isolation, the group may lack a common context to stay aligned.

In Anthropic’s terms, the “separation of concerns” cuts both ways: it reduces interference, but also means no single agent grasps the full picture. Humans solve this by constant communication (we compress our thoughts into language and share it), but current AI agents aren’t yet adept at the high-bandwidth, nuanced communication needed to truly stay in sync over long tasks.

Cognition’s solution? Don’t default to multi-agent. First try a simpler architecture: one agent, one continuous context. Ensure every decision that agent makes “sees” the trace of reasoning that led up to it – no hidden divergent contexts.

Of course, a single context will eventually overflow, but the answer isn’t to spawn independent agents; it’s to better compress the context. Yan suggests using an extra model whose sole job is to condense the conversation history into “key details, events, and decisions.”

This summarized memory can then persist as the backbone context for the main agent. In fact, Cognition has fine-tuned smaller models to perform this kind of compression reliably. The philosophy is that if you must lose information, lose it intentionally and in one place – via a trained compressor – rather than losing it implicitly across multiple agents’ blind spots.
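
A minimal sketch of that compressor step, assuming an OpenAI-style chat API (the model name and prompt here are illustrative, not what Cognition actually uses):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

COMPRESS_PROMPT = (
    "Condense the conversation below into the key details, events, and "
    "decisions. Preserve anything a future agent would need to stay consistent."
)

def compress_history(history):
    """Summarize older turns so the main agent keeps one coherent context."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; Cognition fine-tunes a smaller model for this
        messages=[
            {"role": "system", "content": COMPRESS_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content

# Usage idea: replace everything but the last few turns with one summary message.
# compressed = [{"role": "system", "content": compress_history(history[:-4])}, *history[-4:]]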

This approach echoes Letta’s layered memory idea: maintain one coherent thread of thought, pruning and abstracting it as needed, instead of forking into many threads that might diverge.

Conclusion: Compression is Cognition

In the end, these approaches converge on a theme: intelligence is limited by information bottlenecks, and overcoming those limits looks a lot like compression. Whether it’s a single agent summarizing its past and querying a knowledge base, or a swarm of subagents parceling out a huge problem and each reporting back a digest, the core challenge is the same.

An effective mind (machine or human) can’t and shouldn’t hold every detail in working memory – it must aggressively filter, abstract, and encode information, yet be ready to recover the right detail at the right time. This is the classic rate–distortion tradeoff of cognition: maximize useful signal, minimize wasted space.

Letta’s layered memory shows one way: a built-in hierarchy of memory caches, from the always-present essentials to the vast but faint echo of long-term archives. Anthropic’s multi-agent system shows another: multiple minds sharing the load, each mind a lossy compressor for a different subset of the task. And Cognition’s critique reminds us that compression without coordination can fail – the pieces have to ultimately fit together into a coherent whole.

Perhaps as AI agents evolve, we’ll see hybrid strategies. We might use multi-agent teams whose members share a common architectural memory (imagine subagents all plugged into a shared Letta-style archival memory, so they’re not flying blind with respect to each other). Or we might simply get better at single agents with enormous contexts and sophisticated internal compression mechanisms, making multi-agent orchestration unnecessary for most tasks.

Either way, the direction is clear: to control and extend AI cognition, we are, in a very real sense, engineering the art of forgetting. By deciding what to forget and when to recall, an agent demonstrates what it truly understands. In artificial minds as in our own, memory is meaningful precisely because it isn’t perfect recording – it’s prioritized, lossy, and alive.