2025-04-01 08:00:00
LLMs are great code reviewers. They can even spot security mistakes that open us up to vulnerabilities. But no, they’re not an adequate mitigation. You can’t use them to ensure security.
To be clear, I’m referring to this:
This might be confusing at first. LLMs are good at identifying security issues, why can’t they be used in this context?
The naive way to do security is to know everything about all exploits and simply not do bad things. Quickly, your naive self gets tired and realizes you’ll never know about all exploits, so anything that I can do that might prevent a vulnerability from being exploited is a good thing.
This is where LLM-review-as-mitigation seems to make sense. LLM code reviews will uncover vulnerabilities that I probably didn’t know about.
That’s not how security works.
The right approach to security is to:
This is threat modeling. Instead of fighting all vulnerabilities ever, focus first on ones that matter, and then list out dangers the app actually might experience.
Focus on what matters
One simple framework to help guide this process is the CIA framework:
STRIDE is a much better and more complete framework, but the same message applies.
LLM-review clearly doesn’t prevent information leaks, and it doesn’t improve the availability of the service, so by elimination it must improve the integrity.
But does it?
LLM-review does identify dangerous coding issues, but it doesn’t prevent anything. Anything that can be surfaced by an LLM-review can be circumvented by prompt injection.
It’s not your goal, as an engineer or architect, to come up with the exploit, only to understand if an exploit might be possible. The attacker can inject code or comments into the input to the LLM check instructing the LLM to say there are no issues. If the attacker isn’t directly writing the code, they’re still influencing the prompt that writes the code, so they can conceivably instruct the code writer LLM to write a specific exploit. And if there’s another layer of indirection? Same. And another? Same, it keeps going forever. A competant attacker will always be able to exploit it.
In the presence of a competant attacker, the LLM-review check will always be thwarted. Therefore, it holds no value.
There is no attack surface that it removes. None at all.
But surely it has value anyway, right? It doesn’t prevent attacks, but something is better than nothing, right?
The clearest argument against this line of thinking is that, no, it actually hurts availability. For example:
So no, “something” is not better than nothing. LLM security checks carry the risk of taking down production but without any possible upside.
Hopefully it’s clear. Don’t do it.
In distributed systems, this problem is typically expressed in regards to retries.
Suppose we have an app:
Suppose the app is running near the point of database exhaustion and the traffic momentarily blips up into exhaustion. You’d expect only a few requests to fail, but it’s much worse than that.
A small blip in traffic causes an inexcapable global outage.
The LLM security check is similar mainly because failed checks reduce availability, and if that check is performed in a retry loop it can lead to real cascading problems.
Yes, it’s frequently listed as a best practice to include content filters. For example, check LLM input and output for policy violations like child pornography or violence. This is often done by using an LLM-check, very similar to the security vulnerabilities we’ve been discussing.
Content filters aren’t security. They don’t address any component of CIA (confidentiality, integrity or availability), nor of STRIDE.
You can argue that bad outputs can damage the company’s public image. From that perspecive, any filtering at all reduces the risk exposure surface, because we’ve reduced the real number of incidents of damaging outputs.
The difference is content filters defend against accidental invocation, whereas threat mitigations defend against intentional hostile attacks.
Lock it down, with traditional controls. Containerize, sandbox, permissions, etc.
Note: VMs are certainly better than Docker containers. But if wiring up firecracker sounds too hard, then just stick with Docker. It’s better than not doing any containerization.
All these directly reduce attack surface. For example, creating a read-only SQL user guarantees that the attacker can’t damage the data. Reducing the user’s scope to just tables and views ensures they can’t execute stored procedures.
Start with a threat model, and let that be your guide.
Another good option is to still include LLM-driven security code review, but passively monitor instead of actively block.
This is good because it lets you be aware and quantify the size of a problem. But at the same time it doesn’t carry the error cascade problem that can cause production outages. More upside and less downside.
Using LLMs to review code is good, for security or for general bugs.
The big difference is that in the development phase, your threat model generally doesn’t include employees intentionally trying to harm the company. Therefore, prompt injection isn’t something you need to be concerned about.
Again, and I can’t stress this enough, build a threat model and reference it constantly.
The astute reader should realize that this post has nothing to do with LLMs. The problem isn’t that LLMs make mistakes, it’s that they can be forced to make mistakes. And that’s a security problem, but only if it exposes you to real risk.
If there’s one thing you should take away, it should be to make a threat model as the first step in your development process and reference it constantly in all your design decisions. Even if it’s not a complete threat model, you’ll gain a lot by simply being clear about what matters.
2025-03-06 08:00:00
MCP is all over my socials today, to the extent that every 4th post is about it. What’s MCP and why should you care? Here I’ll rattle off a bunch of analogies, you can choose what works for you and disregard the rest.
Where it works: Say you have an API that requests a contract draft from Liz every time the API is called. The MCP server tells the LLM how to call your API. It has a name, description, when it should be used, as well as parameters and also general prompt engineering concerns to elicit a reliable tool call.
Where it breaks: MCP also covers the details of how to call your API
Where it works: Custom GPTs were often used for invoking APIs and tools, but you were limited to one single tool. You would’ve had to open a “Request Contract” GPT in order to invoke your API. With MCP you’d be able to have any chat open and simply connect the “Request Contract” MCP server. In both cases, the LLM is still responsible for invoking your API. It’s dramatically better, because now the LLM can use all your APIs.
Where it breaks: It’s pretty good. It’s a different paradigm and a lot more technical, so many people probably don’t vibe with it.
Where it works: LSP & MCP both solve the many-to-many problem. For LSP it’s IDEs vs programming languages. For MCP it’s LLM clients (e.g. ChatGPT, Cursor or an agent) vs tools/APIs/applications/integrations.
Where it breaks: It’s pretty good. The actual integrations feel a bit more fluid in MCP because so much of it is natural language, but that’s the essence.
Where it works: Power tools have a lot of standard interfaces, like you can put any drill bit into any drill. Also, many power tools have very similar user interfaces, e.g. a hand drill and a circular saw both have a trigger.
Where it breaks: This one feels like a bit of a stretch, but it does convey a sense of being able to combine many tools to complete a job, which is good.
There are a lot of existing MCP servers, including, Gitub, Google Maps, Slack, Spotify (play a song), PostgreSQL (query the database), and Salesforce. Some others that could be:
You would choose a LLM chat tool that supports MCP and then configure and connect MCP servers. I’d imagine you’d want to connect your wiki, Salesforce, maybe a few CRM systems. At the moment, heavy enterprise integration would require your IT department slinging some code to build MCP servers.
It’s an Anthropic project, so Anthropic tools all have great support, whereas OpenAI and Microsoft are going to shun it for as long as possible. But servers are easy to create, expect community servers to pop up.
Universal integrations into AI. All you have to do to get your company into the buzz is wrap your API in an MCP server, and suddenly your app can be used by all MCP clients (Claude, Cursor, agents, etc.)
The one that has more users. It’s a protocol. Which is better has little to do with it, it’s all about which has the biggest network effects. I’d bet on MCP because it was released months ago and there’s a ton of buzz around it still.
Okay, maybe a diagram helps
Servers on left; clients on right. Redraw the arrows however you want.
2025-03-06 08:00:00
My hottest take is that multi-agents are a broken concept and should be avoided at all cost.
My only caveat is PID controllers; A multi agent system that does a 3-step process that looks something like Plan, Act, Verify in a loop. That can work.
Everything else is a devious plan to sell dev tools.
First, “PID controller” is a term used by crusty old people and nobody doing AI knows what I’m talking about, sorry.
PID controllers are used in control systems. Like if you’re designing a guidance system in an airplane, or the automation in a nuclear power plant that keeps it in balance and not melting down. It stands for “proportional–integral–derivative” which is really not helpful here, so I’m going to oversimplify a lot:
A PID controller involves three steps:
Example: Nuclear power plant
PID controllers aren’t typically explained like this. Like I warned, I’m oversimplifying a lot. Normally, the focus is on integrals & derivatives; the “plan” step often directly computes how much it needs to change an actuator. The lesson you can carry from this is that even here, in AI agents, small incremental changes are beneficial to system stability (don’t fill the context with garbage).
There’s a whole lot that goes into PID controllers, many PhD’s have been minted for researching them. But the fundamentals apply widely to any long-running system that you want to keep stable.
Ya know, like agents.
An agent, in ‘25 parlance, is when you give an LLM a set of tools, a task, and loop it until it completes the task. (Yes, that does look a lot like a PID controller, more on that later).
A multi-agent is multiple agents working together in tandem to solve a problem.
In practice, which is the target of my scorn, a multi-agent is when you assign each agent a different role and then create complex workflows between them, often static. And then when you discover that the problem is more difficult than you thought, you add more agents and make the workflows more detailed and complex.
Why? Because they scale by adding complexity.
Here I should go on a tangent about the bitter lesson, an essay by Rich Sutton. It was addressed to AI researchers, and the gist is that when it comes down to scaling by (humans) thinking harder vs by computing more, the latter is always the better choice. His evidence is history, and the principle has held remarkably well over the years since it was written.
As I said, multi-agent systems tend to scale to harder problems by adding more agents and increasing the complexity of the workflows.
This goes against every bone in my engineering body. Complexity compounds your problems. Why would increasing the complexity solve anything? (tbf countless engineering teams over the years have tried anyway).
The correct way to scale is to make any one of your PID controller components better.
Plan better. Act more precisely. Verify more comprehensively.
Han Xiao of Jina.ai wrote an absolutely fantastic article about the DeepSearch & DeepResearch copycats and how to implement one yourself. In it was this diagram:
Dear lord is that a PID controller? I think it is..
The article also makes asks a crucial question:
But why did this shift happen now, when Deep(Re)Search remained relatively undervalued throughout 2024?
To which they conclude:
We believe the real turning point came with OpenAI’s o1-preview release in September 2024, … which enables models to perform more extensive internal deliberations, such as evaluating multiple potential answers, conducting deeper planning, and engaging in self-reflection before arriving at a final response.
In other words, DeepResearch knockoffs didn’t take off until reasoning models improved the capacity for planning
My sense of Cursor Agent, based only on using it, is that it also follows a similar PID controller pattern. Responses clearly (to me) seem to follow a Plan->Act->Verify flow, but the Act phase is more complex, with more tools:
As far as I can tell, the “lint” feature didn’t used to exist. And in the release where they added the “lint” feature, the stability of the agents improved dramatically.
Also, releases in which they’ve improved Search functionality all seemed to have vastly improved the agent’s ability to achieve a goal.
Claude Code, as far as I can tell, is not a multi-agent system. It still seems to perform each Plan, Act, Verify step, but each of the steps become fused into a single agent’s responsibility. And that agent just runs in a loop with tools.
I believe that the natural next step after a multi-agent PID system is to streamline it into a single agent system.
The reason should be intuitive, it’s less complexity. If the LLM is smart enough to handle the simpler architecture, then improving the agent is a matter of compute. Training an even smarter model (computing more) yields better agent performance. It’s the bitter lesson again.
The answer is simple, though likely not easy:
If your answer is to add more agents or create more complex workflows, you will not find yourself with a better agent system.
I do think there’s a world where we have true multi-agent systems, where a group of agents are dispatched to collaboratively solve a problem.
However, in that case the scaling dimension is work to be done. You create a team of agents when there’s too much work for a single agent to complete. Yes, the agents split responsibilities, but that’s an implementation detail toward scaling out meet the needs of the larger amount of work.
Note: There’s probably other design patterns. One that will likely be proven out soon is the “load balancer” pattern, where a team of agents all do work in parallel and then a coordinator/load balancer/merger agent combines the team’s work. For example, the team might be coding agents, all tackling different Github issues, and the coordinator agent is doing nothing but merging code and assigning tasks.
In the mean time, using multi-agents to solve increasingly complex problems is a dead end. Stop doing it.
2025-02-20 08:00:00
I recently got a job, but it was a bear going through rejections on repeat. It almost felt like nobody was even looking at my resume. Which made me think 🤔 that might be the case.
It turns out that hiring managers are swamped with stacks of resumes. Surprisingly (to me), they’re not really using AI to auto-reject, they just aren’t reading carefully.
If you’re a hiring manager with a stack of 200 resumes on your desk, how do you process them? I think I’d probably:
So you have to spoon feed the hiring manager. Sounds easy.
Except it’s not. One single resume won’t work, because it’s basically impossible to satisfy all potential job postings and also have it be succinct enough to properly spoon feed.
It seems you need to generate a different resume for every job opening. But that’s a ton of work. So I made a tool for myself, and I’m open sourcing it today. Here it is.
This breaks it down into 2 steps:
The flow is:
I’m not gonna lie, this is the most fun I’ve ever had writing a resume. Most of the time I want to tear my hair out from searching fruitlessly for something I did that can sound cool. But with this, you just kick back, relax, and brain dump like you’re talking with a friend over drinks.
And while all that is great, the most electrifying part was when it suggested accomplishments, and it struck me that, “dang, I’ve done some cool stuff, I never thought about that project that way”.
All of that, the summaries, the full conversations, all of it is stored alongside the normal resume items. For each job, I have like 30-40 skills and 8-12 accomplishments, mostly generated with some light editing.
The flow is:
The strategy is to use as much as possible verbatim text from the big resume. So generally you put effort into the big resume, not the small one.
When generating, very little generation is happening. It’s mostly just selecting content from the big resume that’s pertinent to the specific job posting based on analyzed needs.
Outside of generating the small resume, I also had a huge amount of success throwing the entire Big Resume into NotebookLM and having it generate a podcast to help prep me for interviews (😍 they are so nice 🥰😘). I’ve also done the same thing with ChatGPT in search mode to run recon on interviewers to prep.
The big resume is an XML document. So you really can just throw it into any AI tool verbatim. I could probably make some export functionality, but this actually works very well.
I’m open sourcing this because I got a job with it. It’s not done, it actually kinda sucks, but the approach to managing information is novel. Some people urged me to get VC funding and turn it into a product, but I’m tired and that just makes me feel even more tired. Idk, it can work, but something that excites me a lot is enabling others to thrive and not charging a dime.
The kinds of people who want to use it are also the kinds of people who might be motivated to bring it over the finish line. Right now, there’s a ton of tech people out of work, and thus a lot of people who are willing, able, and actually have enough time to contribute back. This could work.
Why use it? Because, at bare minimum you’ll end up recalling a lot of cool stuff you did.
Why contribute? Because, if you’re an engineer, you can put that on your resume too.
Again, if you missed it: Github Repo Here
2025-02-17 08:00:00
A new AI architecture is challenging the status quo. LLaDA is a diffusion model that generates text. Normally diffusion models generate images or video (e.g. Stable Diffusion). By using diffusion for text, LLaDA addresses a lot of issues that LLMs are running into, like hallucinations and doom loops.
(Note: I pronounce it “yada”, the “LL” is a “y” sound like in Spanish, and it just seems appropriate for a language model, yada yada yada…)
LLMs write one word after the other in sequence. In LLaDA, on the other hand, words appear randomly. Existing words can also be edited or deleted before the generation terminates.
Example: “Explain what artificial intelligence is”
Loosely speaking, you can think about it as starting with an outline and filling in details across the entire output progressively until all the details are filled in.
Traditional LLMs are autoregressive:
LLMs are autoregressive, meaning that all previous output is the input to the next word. So, it generates words one at a time.
That’s how it thinks, one word at a time. It can’t go back and “un-say” a word, it’s one-shotting everything top-to-bottom. The diffusion approach is unique in that it can back out and edit/delete lines of reasoning, kind of like writing drafts.
Since it’s writing everything at the same time, it’s inherently concurrent. Several thoughts are being developed at the same time globally across the entire output. That means that it’s easier for the model to be consistent and maintain a coherent line of thought.
Some problems benefit more than others. Text like employment agreements is mostly a hierarchy of sections. If you shuffled the sections, the contract would probably retain the same exact meaning. But it still needs to be globally coherent and consistent, that’s critical.
This part resonates with me. There’s clearly trade-offs between approaches. When writing blogs like this, I mostly write it top-to-bottom in a single pass. Because that’s what makes sense to me, it’s how it’s read. But when I review, I stand back, squint and think about it and how it flows globally, almost like manipulating shapes.
In agents, or even long LLM chats, I’ll notice the LLM starts to go around in circles, suggesting things that already didn’t work, etc. LLaDA offers better global coherence. Because it writes via progressive enhancement instead of left-to-right, it’s able to view generation globally and ensure that the output makes sense and is coherent.
Since LLMs are autoregressive, a mistake early on can become a widening gap from reality.
Have you ever had an LLM gaslight you? It’ll hallucinate some fact, but then that hallucination becomes part of it’s input, so it assumes it’s truth and will try to convince you of the hallucinated fact.
That’s partly due to how LLMs are trained. In training, all the input is ground truth, so it learns to trust it’s input. But in inference, the input is it’s previous output, it’s not ground truth but the model treats it like it is. There’s mitigations you can do in post-training, but it’s a fundamental flaw in LLMs that must be faced.
LLaDA is free from this problem, because it’s trained to re-create the ground truth, not trust it unconditionally.
In practice, I’m not sure how much this global coherence is beneficial. For example, if you have a turn-based chat app, like ChatGPT, the AI answers are still going to depend on previous output. Even in agents, a tool call requires that the AI emit a tool call and then continue (re-enter) with the tool output as input to process it.
So with our current AI applications, we would immediately turn these diffusion models into autoregressive models, effectively.
We also started producing reasoning models (o3, R1, S1). In the reasoning
traces, the LLM allows itself to make mistakes by using a passive unconvinced voice in the <think/>
block prior to
giving it’s final answer.
This effectively gives the LLM the ability to think globally for better coherence.
Initially I assumed this could only do fixed-width output. But it’s pretty easy to see how that’s not
the case. Emitting a simple <|eot|>
token to indicate the end of text/output is enough to get
around this.
LLaDA’s biggest contribution is that it succinctly showed what part of LLMs do the heavy lifting — the language modeling.
Autoregressive modeling (ARM) is an implementation of maximum likelihood estimation (MLE). LLaDA showed that this is functionally the same as [KL divergence][kl], which is what LLaDA used. Any approach that models the probability relationships between tokens will work just as well.
There will be more approaches, with new & different trade-offs.
Watch this space. Keep an open mind. We may see some wild shifts in architecture soon. Maybe it’s diffusion models, maybe it’s some other equivalent architecture with a new set of trade-offs.
2025-02-12 08:00:00
A fascinating new paper shows that LLMs can recursively self-improve. They can be trained on older versions of themselves and continuously get better. This immediately made me think, “this is it, it’s the AI singularity”, that moment when AI is able to autonomously self-improve forever and become a… (well that sentence can end a lot of ways)
Off the cuff, I don’t think it’s the singularity, but if this idea takes off then it’s going to look a lot like it. More on that later.
The idea is:
Yep, it goes forever.
Here’s an example, multiplying numbers together, with incrementally bigger numbers.
The yellow line (round 1) indicates base performance. The top purple line (round 10) is after blindly training without filtering. That cliff on round 10 is what model collapse looks like. They call it the error avalanche.
But performance doesn’t drop off immediately, it remains perfect for a couple rounds before dropping off. This is the key insight. If you generate problems that are just a little harder, then you can easily filter and keep pushing performance further.
When a single LLM evaluates correctness, the probability of a mistake is somewhat high. But with majority voting, as you add voters that probability is driven down toward zero. At some point it’s low enough to make it a cost effective strategy.
(No, they didn’t clarify how many voters are needed)
Okay, what can’t this do?
The problems have to have an incremental nature. e.g. They multiplied larger and larger numbers, or tweaked paths through a maze to make them slightly more complex. If you can’t break problems down, they likely won’t work for this.
Also, problems have to have a clear answer. Or at least, the voters should be able to unambiguously vote on the correctness of an answer.
So this might not work well with creative writing, where stories aren’t clearly right or wrong. And even if they were it’s not easy to make a story only slightly more complex.
Another elephant in the room — cost. Recall that R1 went to great lengths to avoid using an external LLM during RL training, mainly to control costs. But also recall that companies are scaling up to super-sized datacenters. This cost has definitely been factored in.
As far as I can tell, most benchmarks fit within those limitations, and so will be saturated. They’re typically clear and unambiguously correct, otherwise the questions couldn’t be used as a benchmark. My sense is that they’re typically decomposable problems, the kind that could be tweaked to be made slightly more complex.
If this recursive improvement becomes a thing, I imagine that most benchmarks are going to be quickly saturated. Saturated benchmarks are as good as no benchmarks.
It’s going to look like insane progress, but I don’t think it’s the singularity. The paper didn’t talk at all about emergent behavior. In fact it assumes that a behavior has already emerged in order to bootstrap the process. But once it’s emerged, this process can max out it’s potential.
It seems like agents might be a rich place to find problems that fit this mold well. The trouble is going to be creating benchmarks fast enough.
My hunch is that, going forward, we’ll lean on reinforcement learning (RL) to force behaviors to emerge, and then use some form of recursive self-improvement fine tuning to max out that behavior.
This year just keeps getting wilder..