2025-04-19 08:00:00
What if an LLM could use tools directly? As in, what if LLMs executed tool calls without going back to the client. That’s the idea behind inner loop agents. It’s a conceptual shift. Instead of thinking of agents as being a system involving client & server, you just have a single entity, the LLM. I hope it will help clarify how o3 and o4-mini work.
(note: this post isn’t as long as it looks, there’s a lot of diagrams and examples)
To illustrate, regular LLMs rely on the client to parse and execute tools, like this:
On the other hand, with inner loop agents, the LLM can parse and execute tools on it’s own, like this:
In these diagrams, the LLM is emitting text that looks like this:
System: You are an agent with access to the following tools:
<tool name="google_maps" description="Look up directions between two places on Google Maps">
<param name="begin" description="The starting point of the trip"/>
<param name="end" description="The ending point of the trip"/>
</tool>
User: How do you drive from Raleigh, NC to Greene, NY?
Assistant: To do this, I will use my Google Maps tool.
<tool name="google_maps">
<param name="begin">Raleigh, NC</param>
<param name="end">Greene, NY</param>
</tool>
<|eot|>
The LLM only generates the text after "Assistant:"
That <|eot|>
is a special token that the LLM is trained to emit as
a way to signal that it’s done.
The software you’re using to run your LLM, e.g. Ollama, vLLM,
OpenAI, Anthropic, etc., is responsible for running this loop. It parses the
LLM output and stops the loop when it runs into a <|eot|>
token.
If you use the tool calling APIs (Ollama, OpenAI), Ollama will parse out the tool call and return it as JSON in the API response.
Ollama and vLLM are special in that they support a lot of different models. Some models are trained to represent tool calls with XML, others are JSON, others something else entirely. Ollama and vLLM abstract that away by allowing the model to configure how it wants to represent tool calls. It doesn’t much matter what the format is, only you’re consistent with how the model was trained.
Okay, so inner loop agents still do all that parsing. The only difference is that they handle the tool calling themselves instead of letting the client handle the tool call and making another API response.
But why?
The most compelling reason to do this is so that the LLM can call tools concurrently with it’s thinking process.
If you’ve gotten a chance to use an agent, like Deep Research or o3, you’ll notice that it’s thought process isn’t just inner dialog, it’s also tool calls like web searches. That’s the future of agents.
o3
and o4-mini
are special because they’re trained to be agentic models.
In reinforcement learning, the model is given a problem to solve and rewarded for good behavior, like getting the right answer or at least getting the format right. For example the R1 paper discussed rewarding the model for staying in English if the question was given in English.
Here’s a diagram of reinforcement learning:
With inner loop agents, you would change the above diagram to include tools in the yellow box, in the inner loop. The model is still rewarded for the same things, like getting the right result, but since tools are included you’re simultaneously reinforcing the model’s ability to use it’s tools well.
It’s clear to me that o3
was trained to use it’s web search tool. I believe
they even said that, although I might be remembering wrong. It’s certainly the
generally accepted view.
Today LLMs can do all this, if they’re trained for tool use. What changes, is that the model become good at using the tools. Tool use isn’t just possible, tools are used at the optimal time in order to solve the problem in the best possible way.
Optimal tool use. Hmm… Almost sounds like art.
The agentic models today (o3
, o4-mini
, Claude Sonnet) are only trained
to use a small set of specific tools.
Web search & bash usage are cool and all, but what would be truly powerful is if one of these inner loop agents were trained to use tools that regular people use. Like, what if it could submit a purchase order, or analyze a contract to understand if I can make the supplier cover the tariffs? Or to use a tool to navigate an org chart and guess who I need to talk to.
Model Context Protocol (MCP) was designed to support diverse tool use. All you have to do to get an LLM to use your API is build an MCP server. Anyone can then use your API from their own AI apps. Cool.
But the LLM wasn’t trained to use your tool. It was only trained to use tools, generically. It just follows the tool call format, but it hasn’t been optimized for using those tools to solve a problem.
Emergent tool use would mean that an LLM could pick up any MCP description and use the tool effectively to solve a problem.
This isn’t planning.
Let’s say you’re doing wood working and you get a new chisel. You can read all you want on when and how you’re supposed to the chisel, but ultimately it takes experience to know what kind of results you can expect from it. And once you fully understand the tool, then you can include it in your planning.
Emergent tool use hasn’t happened yet, as far as I know. I hope it’ll happen, but it seems unlikely that an LLM can discover the finer points of how to use a tool just from reading the manual, without any training.
Until emergent tool use happens, we have two options:
Right now, those options are our future.
If you want an agent, you can prototype by prompting it to use tools well. But ultimately, to build a high-quality agent as a product, you’ll likely need to train a model to use your tools effectively.
Google recently released Agent 2 Agent (A2A). A protocol that facilitates interactions between agents.
My hunch is that this level of protocol will become critical. If people take inner loop agents seriously, it’ll be difficult to always use the state of the art models. Instead, each agent will be using it’s own LLM, because training is expensive and slow.
A protocol like A2A allows each of these fine tuned LLM agents to communicate without forcing yourself into LLM dependency hell.
That’s inner loop agents.
One big note, is that even if you’re training an LLM with tools, the tools don’t actually have to be executed on the same host that’s running the LLM. In fact, that’s unlikely to be the case. So, inner loop vs not inner loop is not really the part that matters. It’s all about whether or not the LLM was trained to use tools.
2025-04-01 08:00:00
LLMs are great code reviewers. They can even spot security mistakes that open us up to vulnerabilities. But no, they’re not an adequate mitigation. You can’t use them to ensure security.
To be clear, I’m referring to this:
This might be confusing at first. LLMs are good at identifying security issues, why can’t they be used in this context?
The naive way to do security is to know everything about all exploits and simply not do bad things. Quickly, your naive self gets tired and realizes you’ll never know about all exploits, so anything that I can do that might prevent a vulnerability from being exploited is a good thing.
This is where LLM-review-as-mitigation seems to make sense. LLM code reviews will uncover vulnerabilities that I probably didn’t know about.
That’s not how security works.
The right approach to security is to:
This is threat modeling. Instead of fighting all vulnerabilities ever, focus first on ones that matter, and then list out dangers the app actually might experience.
Focus on what matters
One simple framework to help guide this process is the CIA framework:
STRIDE is a much better and more complete framework, but the same message applies.
LLM-review clearly doesn’t prevent information leaks, and it doesn’t improve the availability of the service, so by elimination it must improve the integrity.
But does it?
LLM-review does identify dangerous coding issues, but it doesn’t prevent anything. Anything that can be surfaced by an LLM-review can be circumvented by prompt injection.
It’s not your goal, as an engineer or architect, to come up with the exploit, only to understand if an exploit might be possible. The attacker can inject code or comments into the input to the LLM check instructing the LLM to say there are no issues. If the attacker isn’t directly writing the code, they’re still influencing the prompt that writes the code, so they can conceivably instruct the code writer LLM to write a specific exploit. And if there’s another layer of indirection? Same. And another? Same, it keeps going forever. A competant attacker will always be able to exploit it.
In the presence of a competant attacker, the LLM-review check will always be thwarted. Therefore, it holds no value.
There is no attack surface that it removes. None at all.
But surely it has value anyway, right? It doesn’t prevent attacks, but something is better than nothing, right?
The clearest argument against this line of thinking is that, no, it actually hurts availability. For example:
So no, “something” is not better than nothing. LLM security checks carry the risk of taking down production but without any possible upside.
Hopefully it’s clear. Don’t do it.
In distributed systems, this problem is typically expressed in regards to retries.
Suppose we have an app:
Suppose the app is running near the point of database exhaustion and the traffic momentarily blips up into exhaustion. You’d expect only a few requests to fail, but it’s much worse than that.
A small blip in traffic causes an inexcapable global outage.
The LLM security check is similar mainly because failed checks reduce availability, and if that check is performed in a retry loop it can lead to real cascading problems.
Yes, it’s frequently listed as a best practice to include content filters. For example, check LLM input and output for policy violations like child pornography or violence. This is often done by using an LLM-check, very similar to the security vulnerabilities we’ve been discussing.
Content filters aren’t security. They don’t address any component of CIA (confidentiality, integrity or availability), nor of STRIDE.
You can argue that bad outputs can damage the company’s public image. From that perspecive, any filtering at all reduces the risk exposure surface, because we’ve reduced the real number of incidents of damaging outputs.
The difference is content filters defend against accidental invocation, whereas threat mitigations defend against intentional hostile attacks.
Lock it down, with traditional controls. Containerize, sandbox, permissions, etc.
Note: VMs are certainly better than Docker containers. But if wiring up firecracker sounds too hard, then just stick with Docker. It’s better than not doing any containerization.
All these directly reduce attack surface. For example, creating a read-only SQL user guarantees that the attacker can’t damage the data. Reducing the user’s scope to just tables and views ensures they can’t execute stored procedures.
Start with a threat model, and let that be your guide.
Another good option is to still include LLM-driven security code review, but passively monitor instead of actively block.
This is good because it lets you be aware and quantify the size of a problem. But at the same time it doesn’t carry the error cascade problem that can cause production outages. More upside and less downside.
Using LLMs to review code is good, for security or for general bugs.
The big difference is that in the development phase, your threat model generally doesn’t include employees intentionally trying to harm the company. Therefore, prompt injection isn’t something you need to be concerned about.
Again, and I can’t stress this enough, build a threat model and reference it constantly.
The astute reader should realize that this post has nothing to do with LLMs. The problem isn’t that LLMs make mistakes, it’s that they can be forced to make mistakes. And that’s a security problem, but only if it exposes you to real risk.
If there’s one thing you should take away, it should be to make a threat model as the first step in your development process and reference it constantly in all your design decisions. Even if it’s not a complete threat model, you’ll gain a lot by simply being clear about what matters.
2025-03-06 08:00:00
MCP is all over my socials today, to the extent that every 4th post is about it. What’s MCP and why should you care? Here I’ll rattle off a bunch of analogies, you can choose what works for you and disregard the rest.
Where it works: Say you have an API that requests a contract draft from Liz every time the API is called. The MCP server tells the LLM how to call your API. It has a name, description, when it should be used, as well as parameters and also general prompt engineering concerns to elicit a reliable tool call.
Where it breaks: MCP also covers the details of how to call your API
Where it works: Custom GPTs were often used for invoking APIs and tools, but you were limited to one single tool. You would’ve had to open a “Request Contract” GPT in order to invoke your API. With MCP you’d be able to have any chat open and simply connect the “Request Contract” MCP server. In both cases, the LLM is still responsible for invoking your API. It’s dramatically better, because now the LLM can use all your APIs.
Where it breaks: It’s pretty good. It’s a different paradigm and a lot more technical, so many people probably don’t vibe with it.
Where it works: LSP & MCP both solve the many-to-many problem. For LSP it’s IDEs vs programming languages. For MCP it’s LLM clients (e.g. ChatGPT, Cursor or an agent) vs tools/APIs/applications/integrations.
Where it breaks: It’s pretty good. The actual integrations feel a bit more fluid in MCP because so much of it is natural language, but that’s the essence.
Where it works: Power tools have a lot of standard interfaces, like you can put any drill bit into any drill. Also, many power tools have very similar user interfaces, e.g. a hand drill and a circular saw both have a trigger.
Where it breaks: This one feels like a bit of a stretch, but it does convey a sense of being able to combine many tools to complete a job, which is good.
There are a lot of existing MCP servers, including, Gitub, Google Maps, Slack, Spotify (play a song), PostgreSQL (query the database), and Salesforce. Some others that could be:
You would choose a LLM chat tool that supports MCP and then configure and connect MCP servers. I’d imagine you’d want to connect your wiki, Salesforce, maybe a few CRM systems. At the moment, heavy enterprise integration would require your IT department slinging some code to build MCP servers.
It’s an Anthropic project, so Anthropic tools all have great support, whereas OpenAI and Microsoft are going to shun it for as long as possible. But servers are easy to create, expect community servers to pop up.
Universal integrations into AI. All you have to do to get your company into the buzz is wrap your API in an MCP server, and suddenly your app can be used by all MCP clients (Claude, Cursor, agents, etc.)
The one that has more users. It’s a protocol. Which is better has little to do with it, it’s all about which has the biggest network effects. I’d bet on MCP because it was released months ago and there’s a ton of buzz around it still.
Okay, maybe a diagram helps
Servers on left; clients on right. Redraw the arrows however you want.
2025-03-06 08:00:00
My hottest take is that multi-agents are a broken concept and should be avoided at all cost.
My only caveat is PID controllers; A multi agent system that does a 3-step process that looks something like Plan, Act, Verify in a loop. That can work.
Everything else is a devious plan to sell dev tools.
First, “PID controller” is a term used by crusty old people and nobody doing AI knows what I’m talking about, sorry.
PID controllers are used in control systems. Like if you’re designing a guidance system in an airplane, or the automation in a nuclear power plant that keeps it in balance and not melting down. It stands for “proportional–integral–derivative” which is really not helpful here, so I’m going to oversimplify a lot:
A PID controller involves three steps:
Example: Nuclear power plant
PID controllers aren’t typically explained like this. Like I warned, I’m oversimplifying a lot. Normally, the focus is on integrals & derivatives; the “plan” step often directly computes how much it needs to change an actuator. The lesson you can carry from this is that even here, in AI agents, small incremental changes are beneficial to system stability (don’t fill the context with garbage).
There’s a whole lot that goes into PID controllers, many PhD’s have been minted for researching them. But the fundamentals apply widely to any long-running system that you want to keep stable.
Ya know, like agents.
An agent, in ‘25 parlance, is when you give an LLM a set of tools, a task, and loop it until it completes the task. (Yes, that does look a lot like a PID controller, more on that later).
A multi-agent is multiple agents working together in tandem to solve a problem.
In practice, which is the target of my scorn, a multi-agent is when you assign each agent a different role and then create complex workflows between them, often static. And then when you discover that the problem is more difficult than you thought, you add more agents and make the workflows more detailed and complex.
Why? Because they scale by adding complexity.
Here I should go on a tangent about the bitter lesson, an essay by Rich Sutton. It was addressed to AI researchers, and the gist is that when it comes down to scaling by (humans) thinking harder vs by computing more, the latter is always the better choice. His evidence is history, and the principle has held remarkably well over the years since it was written.
As I said, multi-agent systems tend to scale to harder problems by adding more agents and increasing the complexity of the workflows.
This goes against every bone in my engineering body. Complexity compounds your problems. Why would increasing the complexity solve anything? (tbf countless engineering teams over the years have tried anyway).
The correct way to scale is to make any one of your PID controller components better.
Plan better. Act more precisely. Verify more comprehensively.
Han Xiao of Jina.ai wrote an absolutely fantastic article about the DeepSearch & DeepResearch copycats and how to implement one yourself. In it was this diagram:
Dear lord is that a PID controller? I think it is..
The article also makes asks a crucial question:
But why did this shift happen now, when Deep(Re)Search remained relatively undervalued throughout 2024?
To which they conclude:
We believe the real turning point came with OpenAI’s o1-preview release in September 2024, … which enables models to perform more extensive internal deliberations, such as evaluating multiple potential answers, conducting deeper planning, and engaging in self-reflection before arriving at a final response.
In other words, DeepResearch knockoffs didn’t take off until reasoning models improved the capacity for planning
My sense of Cursor Agent, based only on using it, is that it also follows a similar PID controller pattern. Responses clearly (to me) seem to follow a Plan->Act->Verify flow, but the Act phase is more complex, with more tools:
As far as I can tell, the “lint” feature didn’t used to exist. And in the release where they added the “lint” feature, the stability of the agents improved dramatically.
Also, releases in which they’ve improved Search functionality all seemed to have vastly improved the agent’s ability to achieve a goal.
Claude Code, as far as I can tell, is not a multi-agent system. It still seems to perform each Plan, Act, Verify step, but each of the steps become fused into a single agent’s responsibility. And that agent just runs in a loop with tools.
I believe that the natural next step after a multi-agent PID system is to streamline it into a single agent system.
The reason should be intuitive, it’s less complexity. If the LLM is smart enough to handle the simpler architecture, then improving the agent is a matter of compute. Training an even smarter model (computing more) yields better agent performance. It’s the bitter lesson again.
The answer is simple, though likely not easy:
If your answer is to add more agents or create more complex workflows, you will not find yourself with a better agent system.
I do think there’s a world where we have true multi-agent systems, where a group of agents are dispatched to collaboratively solve a problem.
However, in that case the scaling dimension is work to be done. You create a team of agents when there’s too much work for a single agent to complete. Yes, the agents split responsibilities, but that’s an implementation detail toward scaling out meet the needs of the larger amount of work.
Note: There’s probably other design patterns. One that will likely be proven out soon is the “load balancer” pattern, where a team of agents all do work in parallel and then a coordinator/load balancer/merger agent combines the team’s work. For example, the team might be coding agents, all tackling different Github issues, and the coordinator agent is doing nothing but merging code and assigning tasks.
In the mean time, using multi-agents to solve increasingly complex problems is a dead end. Stop doing it.
2025-02-20 08:00:00
I recently got a job, but it was a bear going through rejections on repeat. It almost felt like nobody was even looking at my resume. Which made me think 🤔 that might be the case.
It turns out that hiring managers are swamped with stacks of resumes. Surprisingly (to me), they’re not really using AI to auto-reject, they just aren’t reading carefully.
If you’re a hiring manager with a stack of 200 resumes on your desk, how do you process them? I think I’d probably:
So you have to spoon feed the hiring manager. Sounds easy.
Except it’s not. One single resume won’t work, because it’s basically impossible to satisfy all potential job postings and also have it be succinct enough to properly spoon feed.
It seems you need to generate a different resume for every job opening. But that’s a ton of work. So I made a tool for myself, and I’m open sourcing it today. Here it is.
This breaks it down into 2 steps:
The flow is:
I’m not gonna lie, this is the most fun I’ve ever had writing a resume. Most of the time I want to tear my hair out from searching fruitlessly for something I did that can sound cool. But with this, you just kick back, relax, and brain dump like you’re talking with a friend over drinks.
And while all that is great, the most electrifying part was when it suggested accomplishments, and it struck me that, “dang, I’ve done some cool stuff, I never thought about that project that way”.
All of that, the summaries, the full conversations, all of it is stored alongside the normal resume items. For each job, I have like 30-40 skills and 8-12 accomplishments, mostly generated with some light editing.
The flow is:
The strategy is to use as much as possible verbatim text from the big resume. So generally you put effort into the big resume, not the small one.
When generating, very little generation is happening. It’s mostly just selecting content from the big resume that’s pertinent to the specific job posting based on analyzed needs.
Outside of generating the small resume, I also had a huge amount of success throwing the entire Big Resume into NotebookLM and having it generate a podcast to help prep me for interviews (😍 they are so nice 🥰😘). I’ve also done the same thing with ChatGPT in search mode to run recon on interviewers to prep.
The big resume is an XML document. So you really can just throw it into any AI tool verbatim. I could probably make some export functionality, but this actually works very well.
I’m open sourcing this because I got a job with it. It’s not done, it actually kinda sucks, but the approach to managing information is novel. Some people urged me to get VC funding and turn it into a product, but I’m tired and that just makes me feel even more tired. Idk, it can work, but something that excites me a lot is enabling others to thrive and not charging a dime.
The kinds of people who want to use it are also the kinds of people who might be motivated to bring it over the finish line. Right now, there’s a ton of tech people out of work, and thus a lot of people who are willing, able, and actually have enough time to contribute back. This could work.
Why use it? Because, at bare minimum you’ll end up recalling a lot of cool stuff you did.
Why contribute? Because, if you’re an engineer, you can put that on your resume too.
Again, if you missed it: Github Repo Here
2025-02-17 08:00:00
A new AI architecture is challenging the status quo. LLaDA is a diffusion model that generates text. Normally diffusion models generate images or video (e.g. Stable Diffusion). By using diffusion for text, LLaDA addresses a lot of issues that LLMs are running into, like hallucinations and doom loops.
(Note: I pronounce it “yada”, the “LL” is a “y” sound like in Spanish, and it just seems appropriate for a language model, yada yada yada…)
LLMs write one word after the other in sequence. In LLaDA, on the other hand, words appear randomly. Existing words can also be edited or deleted before the generation terminates.
Example: “Explain what artificial intelligence is”
Loosely speaking, you can think about it as starting with an outline and filling in details across the entire output progressively until all the details are filled in.
Traditional LLMs are autoregressive:
LLMs are autoregressive, meaning that all previous output is the input to the next word. So, it generates words one at a time.
That’s how it thinks, one word at a time. It can’t go back and “un-say” a word, it’s one-shotting everything top-to-bottom. The diffusion approach is unique in that it can back out and edit/delete lines of reasoning, kind of like writing drafts.
Since it’s writing everything at the same time, it’s inherently concurrent. Several thoughts are being developed at the same time globally across the entire output. That means that it’s easier for the model to be consistent and maintain a coherent line of thought.
Some problems benefit more than others. Text like employment agreements is mostly a hierarchy of sections. If you shuffled the sections, the contract would probably retain the same exact meaning. But it still needs to be globally coherent and consistent, that’s critical.
This part resonates with me. There’s clearly trade-offs between approaches. When writing blogs like this, I mostly write it top-to-bottom in a single pass. Because that’s what makes sense to me, it’s how it’s read. But when I review, I stand back, squint and think about it and how it flows globally, almost like manipulating shapes.
In agents, or even long LLM chats, I’ll notice the LLM starts to go around in circles, suggesting things that already didn’t work, etc. LLaDA offers better global coherence. Because it writes via progressive enhancement instead of left-to-right, it’s able to view generation globally and ensure that the output makes sense and is coherent.
Since LLMs are autoregressive, a mistake early on can become a widening gap from reality.
Have you ever had an LLM gaslight you? It’ll hallucinate some fact, but then that hallucination becomes part of it’s input, so it assumes it’s truth and will try to convince you of the hallucinated fact.
That’s partly due to how LLMs are trained. In training, all the input is ground truth, so it learns to trust it’s input. But in inference, the input is it’s previous output, it’s not ground truth but the model treats it like it is. There’s mitigations you can do in post-training, but it’s a fundamental flaw in LLMs that must be faced.
LLaDA is free from this problem, because it’s trained to re-create the ground truth, not trust it unconditionally.
In practice, I’m not sure how much this global coherence is beneficial. For example, if you have a turn-based chat app, like ChatGPT, the AI answers are still going to depend on previous output. Even in agents, a tool call requires that the AI emit a tool call and then continue (re-enter) with the tool output as input to process it.
So with our current AI applications, we would immediately turn these diffusion models into autoregressive models, effectively.
We also started producing reasoning models (o3, R1, S1). In the reasoning
traces, the LLM allows itself to make mistakes by using a passive unconvinced voice in the <think/>
block prior to
giving it’s final answer.
This effectively gives the LLM the ability to think globally for better coherence.
Initially I assumed this could only do fixed-width output. But it’s pretty easy to see how that’s not
the case. Emitting a simple <|eot|>
token to indicate the end of text/output is enough to get
around this.
LLaDA’s biggest contribution is that it succinctly showed what part of LLMs do the heavy lifting — the language modeling.
Autoregressive modeling (ARM) is an implementation of maximum likelihood estimation (MLE). LLaDA showed that this is functionally the same as [KL divergence][kl], which is what LLaDA used. Any approach that models the probability relationships between tokens will work just as well.
There will be more approaches, with new & different trade-offs.
Watch this space. Keep an open mind. We may see some wild shifts in architecture soon. Maybe it’s diffusion models, maybe it’s some other equivalent architecture with a new set of trade-offs.