2025-05-10 08:00:00
Back in January I wrote, Normware: The Decline of Software Engineering and, while I think it was generally well-reasoned, I was wrong. Or at least overly ambitious.
I predicted that software engineering as a profession is bound to decline and be replaced by less technical people with AI that are closer to the business problems. I no longer think that will happen, but not for technical reasons, but for social reasons.
I saw people code.
I wrote the initial piece after using Cursor’s agent a few times. Since then the tools have gotten even more powerful and I can reliably one-shot entire non-trivial apps. I told a PM buddy about how I was doing it and he wanted to try and… it didn’t work. Not at all.
What I learned:
On the surface it was stuff like, I’m comfortable in the terminal, he was not. And I don’t freak out when I get a huge error. But also softer skills, like how I know what complex code looks like vs simple code (with AI coding, overly complex code will cause an agent to deadlock). Also, he tried including authentication in the earliest version (lol n00b).
For some people, those are merely road blocks. I’ve talked to a few people with zero technical background that are absolutely crushing it with code right now. It’s hard, but they have the drive to push through the hard parts. Sure, they’ve got their fair share of total flops, but they a strong will and push through.
Those are not common people. Most are weak, or just care about other things.
I suppose this scene hasn’t unfolded and maybe my first take was right after all. But I don’t think so.
It’s likely that AI improves dramatically and makes it seamless to generate any code at any time. That will certainly increase the pool of people willing to suffer through coding. But I don’t think it can shift enough such that the Normware vision pans out. Most people just aren’t interested.
Instead, I think we’ll see a steady decline of “boring code” jobs.
Someone at a very large tech company told me they worked on a (software engineering!!) team that did nothing but make configuration changes. That’s nuts. Over time, I think AI will chip away at these roles until they’re gone and replaced by code that engineers (say they) want to write. Early prototypes and demo-quality software is already being replaced by AI, and the trend will continue from that end as well.
2025-04-27 08:00:00
I can’t think of any strong technological reasons for MCP to exist. There’s a lot of weak technological reasons, and there’s strong sociological reasons. I still strongly feel that, ironically, it is necessary. I’m writing this post to force myself to clarify my own thoughts, and to get opinions from everyone else.
You absolutely can directly paste the JSON from an MCP tool declaration into a prompt. It’ll work, and it’s arguably better than doing the same with OpenAPI. But it’s JSON, extremely parseable, structured information, and most LLMs are trained to do function calling with some XML-like variant anyway.
An LLM tool declaration can look like:
MCP is not concerned with what your prompt looks like. That is not a function of MCP.
MCP has two primary functions:
It does a lot of other things (logging, sampling, etc.), but tool calling is the part that’s most frequently implemented and used.
You could accomplish the same thing with OpenAPI:
openapi.json
file in the same placeThis is even easier than you think. OpenAPI operations have an operationId
that is usually set to the function name of the server API anyway.
This is a good argument, at least on the surface. Here’s an example of a typical API representing an async task:
You can wrap all that into one single MCP operation. One operation is better than 3 because it removes the possibility that the LLM can behave wrong.
Okay, but why does this have to be MCP? Why can’t you do the same thing with OpenAPI?
Yes, most APIs don’t work well directly in LLM prompts because they’re not designed or documented well.
There’s great tooling in the MCP ecosystem for composing servers and operations, enhancing documentation, etc. So on the surface, it seems like MCP is an advancement in API design and documentation.
But again, why can’t OpenAPI also be that advancement? There’s no technological reason.
Here’s the thing. Everything you can do with MCP you can do with OpenAPI. But..
Why isn’t it being done? In the example of the async API, the operation might take a very long time, hence why it’s an async API. There’s no technical reason why APIs can’t take a long time. In fact, MCP implements tool calls via Server Sent Events (SSE). OpenAPI can represent SSE.
The reason we don’t do OpenAPI that way is because engineering teams have been conditioned to keep close watch on operation latency. If an API operation takes longer than a few hundred milliseconds, someone should be spotting that on a graph and diagnosing the cause. There’s a lot of reasons for this, but it’s fundamentally sociological.
SSE is a newer technology. When we measure latency with SSE operations, we measure time-to-first-byte. So it’s 100% solveable, but async APIs are more familiar so we just do that.
The absolute strongest argument for MCP is that there’s mostly only a single way to do things.
If you want to waste an entire day of an engineering team’s time, go find an arbitrary API POST
operation and ask, “but shouldn’t this be a PUT
?” You’ll quickly discover that HTTP has a
lot of ambiguity. Even when things are clear, they don’t always map well to how we normally
think, so it gets implemented inconsistently.
MCP | OpenAPI |
---|---|
function call | resources, PUT /POST /DELETE
|
function parameters | query args, path args, body, headers |
return value | SSE, JSON, web sockets, etc. |
Standards are mostly sociological advancements. Yes, they concern technology, but they govern how society interacts with them. The biggest reason for MCP is simply that everyone else is doing it. Sure, you can be a purist and demand that OpenAPI is adequate, but how many clients support it?
The reason everyone is agreeing on MCP is because it’s far smaller than OpenAPI. Everything in the tools
part of an MCP server is directly isomorphic to something else in OpenAPI. In fact, I can easily
generate an MCP server from an openapi.json
file, and vice versa. But MCP is far smaller
and purpose-focused than OpenAPI is.
2025-04-19 08:00:00
What if an LLM could use tools directly? As in, what if LLMs executed tool calls without going back to the client. That’s the idea behind inner loop agents. It’s a conceptual shift. Instead of thinking of agents as being a system involving client & server, you just have a single entity, the LLM. I hope it will help clarify how o3 and o4-mini work.
(note: this post isn’t as long as it looks, there’s a lot of diagrams and examples)
To illustrate, regular LLMs rely on the client to parse and execute tools, like this:
On the other hand, with inner loop agents, the LLM can parse and execute tools on it’s own, like this:
In these diagrams, the LLM is emitting text that looks like this:
System: You are an agent with access to the following tools:
<tool name="google_maps" description="Look up directions between two places on Google Maps">
<param name="begin" description="The starting point of the trip"/>
<param name="end" description="The ending point of the trip"/>
</tool>
User: How do you drive from Raleigh, NC to Greene, NY?
Assistant: To do this, I will use my Google Maps tool.
<tool name="google_maps">
<param name="begin">Raleigh, NC</param>
<param name="end">Greene, NY</param>
</tool>
<|eot|>
The LLM only generates the text after "Assistant:"
That <|eot|>
is a special token that the LLM is trained to emit as
a way to signal that it’s done.
The software you’re using to run your LLM, e.g. Ollama, vLLM,
OpenAI, Anthropic, etc., is responsible for running this loop. It parses the
LLM output and stops the loop when it runs into a <|eot|>
token.
If you use the tool calling APIs (Ollama, OpenAI), Ollama will parse out the tool call and return it as JSON in the API response.
Ollama and vLLM are special in that they support a lot of different models. Some models are trained to represent tool calls with XML, others are JSON, others something else entirely. Ollama and vLLM abstract that away by allowing the model to configure how it wants to represent tool calls. It doesn’t much matter what the format is, only you’re consistent with how the model was trained.
Okay, so inner loop agents still do all that parsing. The only difference is that they handle the tool calling themselves instead of letting the client handle the tool call and making another API response.
But why?
The most compelling reason to do this is so that the LLM can call tools concurrently with it’s thinking process.
If you’ve gotten a chance to use an agent, like Deep Research or o3, you’ll notice that it’s thought process isn’t just inner dialog, it’s also tool calls like web searches. That’s the future of agents.
o3
and o4-mini
are special because they’re trained to be agentic models.
In reinforcement learning, the model is given a problem to solve and rewarded for good behavior, like getting the right answer or at least getting the format right. For example the R1 paper discussed rewarding the model for staying in English if the question was given in English.
Here’s a diagram of reinforcement learning:
With inner loop agents, you would change the above diagram to include tools in the yellow box, in the inner loop. The model is still rewarded for the same things, like getting the right result, but since tools are included you’re simultaneously reinforcing the model’s ability to use it’s tools well.
It’s clear to me that o3
was trained to use it’s web search tool. I believe
they even said that, although I might be remembering wrong. It’s certainly the
generally accepted view.
Today LLMs can do all this, if they’re trained for tool use. What changes, is that the model become good at using the tools. Tool use isn’t just possible, tools are used at the optimal time in order to solve the problem in the best possible way.
Optimal tool use. Hmm… Almost sounds like art.
The agentic models today (o3
, o4-mini
, Claude Sonnet) are only trained
to use a small set of specific tools.
Web search & bash usage are cool and all, but what would be truly powerful is if one of these inner loop agents were trained to use tools that regular people use. Like, what if it could submit a purchase order, or analyze a contract to understand if I can make the supplier cover the tariffs? Or to use a tool to navigate an org chart and guess who I need to talk to.
Model Context Protocol (MCP) was designed to support diverse tool use. All you have to do to get an LLM to use your API is build an MCP server. Anyone can then use your API from their own AI apps. Cool.
But the LLM wasn’t trained to use your tool. It was only trained to use tools, generically. It just follows the tool call format, but it hasn’t been optimized for using those tools to solve a problem.
Emergent tool use would mean that an LLM could pick up any MCP description and use the tool effectively to solve a problem.
This isn’t planning.
Let’s say you’re doing wood working and you get a new chisel. You can read all you want on when and how you’re supposed to the chisel, but ultimately it takes experience to know what kind of results you can expect from it. And once you fully understand the tool, then you can include it in your planning.
Emergent tool use hasn’t happened yet, as far as I know. I hope it’ll happen, but it seems unlikely that an LLM can discover the finer points of how to use a tool just from reading the manual, without any training.
Until emergent tool use happens, we have two options:
Right now, those options are our future.
If you want an agent, you can prototype by prompting it to use tools well. But ultimately, to build a high-quality agent as a product, you’ll likely need to train a model to use your tools effectively.
Google recently released Agent 2 Agent (A2A). A protocol that facilitates interactions between agents.
My hunch is that this level of protocol will become critical. If people take inner loop agents seriously, it’ll be difficult to always use the state of the art models. Instead, each agent will be using it’s own LLM, because training is expensive and slow.
A protocol like A2A allows each of these fine tuned LLM agents to communicate without forcing yourself into LLM dependency hell.
That’s inner loop agents.
One big note, is that even if you’re training an LLM with tools, the tools don’t actually have to be executed on the same host that’s running the LLM. In fact, that’s unlikely to be the case. So, inner loop vs not inner loop is not really the part that matters. It’s all about whether or not the LLM was trained to use tools.
2025-04-01 08:00:00
LLMs are great code reviewers. They can even spot security mistakes that open us up to vulnerabilities. But no, they’re not an adequate mitigation. You can’t use them to ensure security.
To be clear, I’m referring to this:
This might be confusing at first. LLMs are good at identifying security issues, why can’t they be used in this context?
The naive way to do security is to know everything about all exploits and simply not do bad things. Quickly, your naive self gets tired and realizes you’ll never know about all exploits, so anything that I can do that might prevent a vulnerability from being exploited is a good thing.
This is where LLM-review-as-mitigation seems to make sense. LLM code reviews will uncover vulnerabilities that I probably didn’t know about.
That’s not how security works.
The right approach to security is to:
This is threat modeling. Instead of fighting all vulnerabilities ever, focus first on ones that matter, and then list out dangers the app actually might experience.
Focus on what matters
One simple framework to help guide this process is the CIA framework:
STRIDE is a much better and more complete framework, but the same message applies.
LLM-review clearly doesn’t prevent information leaks, and it doesn’t improve the availability of the service, so by elimination it must improve the integrity.
But does it?
LLM-review does identify dangerous coding issues, but it doesn’t prevent anything. Anything that can be surfaced by an LLM-review can be circumvented by prompt injection.
It’s not your goal, as an engineer or architect, to come up with the exploit, only to understand if an exploit might be possible. The attacker can inject code or comments into the input to the LLM check instructing the LLM to say there are no issues. If the attacker isn’t directly writing the code, they’re still influencing the prompt that writes the code, so they can conceivably instruct the code writer LLM to write a specific exploit. And if there’s another layer of indirection? Same. And another? Same, it keeps going forever. A competant attacker will always be able to exploit it.
In the presence of a competant attacker, the LLM-review check will always be thwarted. Therefore, it holds no value.
There is no attack surface that it removes. None at all.
But surely it has value anyway, right? It doesn’t prevent attacks, but something is better than nothing, right?
The clearest argument against this line of thinking is that, no, it actually hurts availability. For example:
So no, “something” is not better than nothing. LLM security checks carry the risk of taking down production but without any possible upside.
Hopefully it’s clear. Don’t do it.
In distributed systems, this problem is typically expressed in regards to retries.
Suppose we have an app:
Suppose the app is running near the point of database exhaustion and the traffic momentarily blips up into exhaustion. You’d expect only a few requests to fail, but it’s much worse than that.
A small blip in traffic causes an inexcapable global outage.
The LLM security check is similar mainly because failed checks reduce availability, and if that check is performed in a retry loop it can lead to real cascading problems.
Yes, it’s frequently listed as a best practice to include content filters. For example, check LLM input and output for policy violations like child pornography or violence. This is often done by using an LLM-check, very similar to the security vulnerabilities we’ve been discussing.
Content filters aren’t security. They don’t address any component of CIA (confidentiality, integrity or availability), nor of STRIDE.
You can argue that bad outputs can damage the company’s public image. From that perspecive, any filtering at all reduces the risk exposure surface, because we’ve reduced the real number of incidents of damaging outputs.
The difference is content filters defend against accidental invocation, whereas threat mitigations defend against intentional hostile attacks.
Lock it down, with traditional controls. Containerize, sandbox, permissions, etc.
Note: VMs are certainly better than Docker containers. But if wiring up firecracker sounds too hard, then just stick with Docker. It’s better than not doing any containerization.
All these directly reduce attack surface. For example, creating a read-only SQL user guarantees that the attacker can’t damage the data. Reducing the user’s scope to just tables and views ensures they can’t execute stored procedures.
Start with a threat model, and let that be your guide.
Another good option is to still include LLM-driven security code review, but passively monitor instead of actively block.
This is good because it lets you be aware and quantify the size of a problem. But at the same time it doesn’t carry the error cascade problem that can cause production outages. More upside and less downside.
Using LLMs to review code is good, for security or for general bugs.
The big difference is that in the development phase, your threat model generally doesn’t include employees intentionally trying to harm the company. Therefore, prompt injection isn’t something you need to be concerned about.
Again, and I can’t stress this enough, build a threat model and reference it constantly.
The astute reader should realize that this post has nothing to do with LLMs. The problem isn’t that LLMs make mistakes, it’s that they can be forced to make mistakes. And that’s a security problem, but only if it exposes you to real risk.
If there’s one thing you should take away, it should be to make a threat model as the first step in your development process and reference it constantly in all your design decisions. Even if it’s not a complete threat model, you’ll gain a lot by simply being clear about what matters.
2025-03-06 08:00:00
MCP is all over my socials today, to the extent that every 4th post is about it. What’s MCP and why should you care? Here I’ll rattle off a bunch of analogies, you can choose what works for you and disregard the rest.
Where it works: Say you have an API that requests a contract draft from Liz every time the API is called. The MCP server tells the LLM how to call your API. It has a name, description, when it should be used, as well as parameters and also general prompt engineering concerns to elicit a reliable tool call.
Where it breaks: MCP also covers the details of how to call your API
Where it works: Custom GPTs were often used for invoking APIs and tools, but you were limited to one single tool. You would’ve had to open a “Request Contract” GPT in order to invoke your API. With MCP you’d be able to have any chat open and simply connect the “Request Contract” MCP server. In both cases, the LLM is still responsible for invoking your API. It’s dramatically better, because now the LLM can use all your APIs.
Where it breaks: It’s pretty good. It’s a different paradigm and a lot more technical, so many people probably don’t vibe with it.
Where it works: LSP & MCP both solve the many-to-many problem. For LSP it’s IDEs vs programming languages. For MCP it’s LLM clients (e.g. ChatGPT, Cursor or an agent) vs tools/APIs/applications/integrations.
Where it breaks: It’s pretty good. The actual integrations feel a bit more fluid in MCP because so much of it is natural language, but that’s the essence.
Where it works: Power tools have a lot of standard interfaces, like you can put any drill bit into any drill. Also, many power tools have very similar user interfaces, e.g. a hand drill and a circular saw both have a trigger.
Where it breaks: This one feels like a bit of a stretch, but it does convey a sense of being able to combine many tools to complete a job, which is good.
There are a lot of existing MCP servers, including, Gitub, Google Maps, Slack, Spotify (play a song), PostgreSQL (query the database), and Salesforce. Some others that could be:
You would choose a LLM chat tool that supports MCP and then configure and connect MCP servers. I’d imagine you’d want to connect your wiki, Salesforce, maybe a few CRM systems. At the moment, heavy enterprise integration would require your IT department slinging some code to build MCP servers.
It’s an Anthropic project, so Anthropic tools all have great support, whereas OpenAI and Microsoft are going to shun it for as long as possible. But servers are easy to create, expect community servers to pop up.
Universal integrations into AI. All you have to do to get your company into the buzz is wrap your API in an MCP server, and suddenly your app can be used by all MCP clients (Claude, Cursor, agents, etc.)
The one that has more users. It’s a protocol. Which is better has little to do with it, it’s all about which has the biggest network effects. I’d bet on MCP because it was released months ago and there’s a ton of buzz around it still.
Okay, maybe a diagram helps
Servers on left; clients on right. Redraw the arrows however you want.
2025-03-06 08:00:00
My hottest take is that multi-agents are a broken concept and should be avoided at all cost.
My only caveat is PID controllers; A multi agent system that does a 3-step process that looks something like Plan, Act, Verify in a loop. That can work.
Everything else is a devious plan to sell dev tools.
First, “PID controller” is a term used by crusty old people and nobody doing AI knows what I’m talking about, sorry.
PID controllers are used in control systems. Like if you’re designing a guidance system in an airplane, or the automation in a nuclear power plant that keeps it in balance and not melting down. It stands for “proportional–integral–derivative” which is really not helpful here, so I’m going to oversimplify a lot:
A PID controller involves three steps:
Example: Nuclear power plant
PID controllers aren’t typically explained like this. Like I warned, I’m oversimplifying a lot. Normally, the focus is on integrals & derivatives; the “plan” step often directly computes how much it needs to change an actuator. The lesson you can carry from this is that even here, in AI agents, small incremental changes are beneficial to system stability (don’t fill the context with garbage).
There’s a whole lot that goes into PID controllers, many PhD’s have been minted for researching them. But the fundamentals apply widely to any long-running system that you want to keep stable.
Ya know, like agents.
An agent, in ‘25 parlance, is when you give an LLM a set of tools, a task, and loop it until it completes the task. (Yes, that does look a lot like a PID controller, more on that later).
A multi-agent is multiple agents working together in tandem to solve a problem.
In practice, which is the target of my scorn, a multi-agent is when you assign each agent a different role and then create complex workflows between them, often static. And then when you discover that the problem is more difficult than you thought, you add more agents and make the workflows more detailed and complex.
Why? Because they scale by adding complexity.
Here I should go on a tangent about the bitter lesson, an essay by Rich Sutton. It was addressed to AI researchers, and the gist is that when it comes down to scaling by (humans) thinking harder vs by computing more, the latter is always the better choice. His evidence is history, and the principle has held remarkably well over the years since it was written.
As I said, multi-agent systems tend to scale to harder problems by adding more agents and increasing the complexity of the workflows.
This goes against every bone in my engineering body. Complexity compounds your problems. Why would increasing the complexity solve anything? (tbf countless engineering teams over the years have tried anyway).
The correct way to scale is to make any one of your PID controller components better.
Plan better. Act more precisely. Verify more comprehensively.
Han Xiao of Jina.ai wrote an absolutely fantastic article about the DeepSearch & DeepResearch copycats and how to implement one yourself. In it was this diagram:
Dear lord is that a PID controller? I think it is..
The article also makes asks a crucial question:
But why did this shift happen now, when Deep(Re)Search remained relatively undervalued throughout 2024?
To which they conclude:
We believe the real turning point came with OpenAI’s o1-preview release in September 2024, … which enables models to perform more extensive internal deliberations, such as evaluating multiple potential answers, conducting deeper planning, and engaging in self-reflection before arriving at a final response.
In other words, DeepResearch knockoffs didn’t take off until reasoning models improved the capacity for planning
My sense of Cursor Agent, based only on using it, is that it also follows a similar PID controller pattern. Responses clearly (to me) seem to follow a Plan->Act->Verify flow, but the Act phase is more complex, with more tools:
As far as I can tell, the “lint” feature didn’t used to exist. And in the release where they added the “lint” feature, the stability of the agents improved dramatically.
Also, releases in which they’ve improved Search functionality all seemed to have vastly improved the agent’s ability to achieve a goal.
Claude Code, as far as I can tell, is not a multi-agent system. It still seems to perform each Plan, Act, Verify step, but each of the steps become fused into a single agent’s responsibility. And that agent just runs in a loop with tools.
I believe that the natural next step after a multi-agent PID system is to streamline it into a single agent system.
The reason should be intuitive, it’s less complexity. If the LLM is smart enough to handle the simpler architecture, then improving the agent is a matter of compute. Training an even smarter model (computing more) yields better agent performance. It’s the bitter lesson again.
The answer is simple, though likely not easy:
If your answer is to add more agents or create more complex workflows, you will not find yourself with a better agent system.
I do think there’s a world where we have true multi-agent systems, where a group of agents are dispatched to collaboratively solve a problem.
However, in that case the scaling dimension is work to be done. You create a team of agents when there’s too much work for a single agent to complete. Yes, the agents split responsibilities, but that’s an implementation detail toward scaling out meet the needs of the larger amount of work.
Note: There’s probably other design patterns. One that will likely be proven out soon is the “load balancer” pattern, where a team of agents all do work in parallel and then a coordinator/load balancer/merger agent combines the team’s work. For example, the team might be coding agents, all tackling different Github issues, and the coordinator agent is doing nothing but merging code and assigning tasks.
In the mean time, using multi-agents to solve increasingly complex problems is a dead end. Stop doing it.