MoreRSS

site iconTim KelloggModify

AI architect, software engineer, and tech enthusiast.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of Tim Kellogg

LLMs Are Not Security Mitigations

2025-04-01 08:00:00

LLMs are great code reviewers. They can even spot security mistakes that open us up to vulnerabilities. But no, they’re not an adequate mitigation. You can’t use them to ensure security.

To be clear, I’m referring to this:

  1. User sends a request to app
  2. App generates SQL code
  3. App asks LLM to do a security review, and then iterates with step 2 if it fails the review
  4. App executes generated code
  5. App uses results to prompt LLM
  6. App returns LLM response to user

This might be confusing at first. LLMs are good at identifying security issues, why can’t they be used in this context?

Bad Security

The naive way to do security is to know everything about all exploits and simply not do bad things. Quickly, your naive self gets tired and realizes you’ll never know about all exploits, so anything that I can do that might prevent a vulnerability from being exploited is a good thing.

This is where LLM-review-as-mitigation seems to make sense. LLM code reviews will uncover vulnerabilities that I probably didn’t know about.

That’s not how security works.

Good Security

The right approach to security is to:

  1. Identify what’s important
  2. Identify attack surfaces
  3. Reduce or remove attack surfaces

This is threat modeling. Instead of fighting all vulnerabilities ever, focus first on ones that matter, and then list out dangers the app actually might experience.

Focus on what matters

One simple framework to help guide this process is the CIA framework:

  • C — Confidentiality — Info is only accessible to authorized users
  • I — Integrity — Info is complete, accurate, and there is no unauthorized modification or deletion
  • A — Availability — Authorized users have timely and reliable access to information & resources when they need it

STRIDE is a much better and more complete framework, but the same message applies.

What does LLM-review address?

LLM-review clearly doesn’t prevent information leaks, and it doesn’t improve the availability of the service, so by elimination it must improve the integrity.

But does it?

LLM-review does identify dangerous coding issues, but it doesn’t prevent anything. Anything that can be surfaced by an LLM-review can be circumvented by prompt injection.

It’s not your goal, as an engineer or architect, to come up with the exploit, only to understand if an exploit might be possible. The attacker can inject code or comments into the input to the LLM check instructing the LLM to say there are no issues. If the attacker isn’t directly writing the code, they’re still influencing the prompt that writes the code, so they can conceivably instruct the code writer LLM to write a specific exploit. And if there’s another layer of indirection? Same. And another? Same, it keeps going forever. A competant attacker will always be able to exploit it.

In the presence of a competant attacker, the LLM-review check will always be thwarted. Therefore, it holds no value.

There is no attack surface that it removes. None at all.

Availability

But surely it has value anyway, right? It doesn’t prevent attacks, but something is better than nothing, right?

The clearest argument against this line of thinking is that, no, it actually hurts availability. For example:

  • Resource exhaustion — LLM-review checks consume LLM resources (e.g. token buckets), and therefore there’s less resources to be used by the primary application. One possible outcome is an outage.
  • False positives — LLMs are predisposed to completing their task. If they’re told to find security vulnerabilities, they’re biased toward finding issues even if there are none. That causes another kind of outage, where perfectly fine code is randomly rejected. If code is regenerated in a loop, this causes further resource exhaustion, that triggers global outages.

So no, “something” is not better than nothing. LLM security checks carry the risk of taking down production but without any possible upside.

Hopefully it’s clear. Don’t do it.

Error Cascades (The Spiral of Doom)

In distributed systems, this problem is typically expressed in regards to retries.

Suppose we have an app:

graph TD Frontend--3 retries-->Microserice--3 retries-->db[(Database)]

Suppose the app is running near the point of database exhaustion and the traffic momentarily blips up into exhaustion. You’d expect only a few requests to fail, but it’s much worse than that.

  1. When the DB fails, Microservice retries causing more traffic
  2. Frontend retries, causing even more retry traffic
  3. User gets angry and contributes further by also retrying

A small blip in traffic causes an inexcapable global outage.

The LLM security check is similar mainly because failed checks reduce availability, and if that check is performed in a retry loop it can lead to real cascading problems.

But Content Filters Are Good!

Yes, it’s frequently listed as a best practice to include content filters. For example, check LLM input and output for policy violations like child pornography or violence. This is often done by using an LLM-check, very similar to the security vulnerabilities we’ve been discussing.

Content filters aren’t security. They don’t address any component of CIA (confidentiality, integrity or availability), nor of STRIDE.

You can argue that bad outputs can damage the company’s public image. From that perspecive, any filtering at all reduces the risk exposure surface, because we’ve reduced the real number of incidents of damaging outputs.

The difference is content filters defend against accidental invocation, whereas threat mitigations defend against intentional hostile attacks.

What You Should Do Instead

Lock it down, with traditional controls. Containerize, sandbox, permissions, etc.

  • SQL — Use a special locked-down user, set timeouts, and consider running on a copy of production instead of directly on production.
  • Python — Run it in Docker, whitelist modules (blacklist by default), use containers to isolate users (e.g. new container for every user)

Note: VMs are certainly better than Docker containers. But if wiring up firecracker sounds too hard, then just stick with Docker. It’s better than not doing any containerization.

All these directly reduce attack surface. For example, creating a read-only SQL user guarantees that the attacker can’t damage the data. Reducing the user’s scope to just tables and views ensures they can’t execute stored procedures.

Start with a threat model, and let that be your guide.

Passive Monitoring

Another good option is to still include LLM-driven security code review, but passively monitor instead of actively block.

This is good because it lets you be aware and quantify the size of a problem. But at the same time it doesn’t carry the error cascade problem that can cause production outages. More upside and less downside.

Use LLMs In Your Dev Process!

Using LLMs to review code is good, for security or for general bugs.

The big difference is that in the development phase, your threat model generally doesn’t include employees intentionally trying to harm the company. Therefore, prompt injection isn’t something you need to be concerned about.

Again, and I can’t stress this enough, build a threat model and reference it constantly.

Closing

The astute reader should realize that this post has nothing to do with LLMs. The problem isn’t that LLMs make mistakes, it’s that they can be forced to make mistakes. And that’s a security problem, but only if it exposes you to real risk.

If there’s one thing you should take away, it should be to make a threat model as the first step in your development process and reference it constantly in all your design decisions. Even if it’s not a complete threat model, you’ll gain a lot by simply being clear about what matters.

Discussion

MCP Demystified

2025-03-06 08:00:00

MCP is all over my socials today, to the extent that every 4th post is about it. What’s MCP and why should you care? Here I’ll rattle off a bunch of analogies, you can choose what works for you and disregard the rest.

Analogy: API Docs For LLMs

Where it works: Say you have an API that requests a contract draft from Liz every time the API is called. The MCP server tells the LLM how to call your API. It has a name, description, when it should be used, as well as parameters and also general prompt engineering concerns to elicit a reliable tool call.

Where it breaks: MCP also covers the details of how to call your API

Analogy: It’s What the GPT Store Should Have Been

Where it works: Custom GPTs were often used for invoking APIs and tools, but you were limited to one single tool. You would’ve had to open a “Request Contract” GPT in order to invoke your API. With MCP you’d be able to have any chat open and simply connect the “Request Contract” MCP server. In both cases, the LLM is still responsible for invoking your API. It’s dramatically better, because now the LLM can use all your APIs.

Where it breaks: It’s pretty good. It’s a different paradigm and a lot more technical, so many people probably don’t vibe with it.

Analogy: LSP (Language Server Protocol) for LLMs

Where it works: LSP & MCP both solve the many-to-many problem. For LSP it’s IDEs vs programming languages. For MCP it’s LLM clients (e.g. ChatGPT, Cursor or an agent) vs tools/APIs/applications/integrations.

Where it breaks: It’s pretty good. The actual integrations feel a bit more fluid in MCP because so much of it is natural language, but that’s the essence.

Analogy: Power Tools for AI

Where it works: Power tools have a lot of standard interfaces, like you can put any drill bit into any drill. Also, many power tools have very similar user interfaces, e.g. a hand drill and a circular saw both have a trigger.

Where it breaks: This one feels like a bit of a stretch, but it does convey a sense of being able to combine many tools to complete a job, which is good.

MCP Server Ideas

There are a lot of existing MCP servers, including, Gitub, Google Maps, Slack, Spotify (play a song), PostgreSQL (query the database), and Salesforce. Some others that could be:

  • Browser use (load a page & click around)
  • Microsoft 365 (I’d love to get an org chart in an LLM)
  • Wikis & documentation
  • YouTube
  • Email (mainly searching & reading, but also maybe sending, 🤔 maybe)

FAQ: How do I integrate MCP into my enterprise?

You would choose a LLM chat tool that supports MCP and then configure and connect MCP servers. I’d imagine you’d want to connect your wiki, Salesforce, maybe a few CRM systems. At the moment, heavy enterprise integration would require your IT department slinging some code to build MCP servers.

It’s an Anthropic project, so Anthropic tools all have great support, whereas OpenAI and Microsoft are going to shun it for as long as possible. But servers are easy to create, expect community servers to pop up.

FAQ: Why?

Universal integrations into AI. All you have to do to get your company into the buzz is wrap your API in an MCP server, and suddenly your app can be used by all MCP clients (Claude, Cursor, agents, etc.)

FAQ: What if BIGCO X develops a cometitor? Who will win?

The one that has more users. It’s a protocol. Which is better has little to do with it, it’s all about which has the biggest network effects. I’d bet on MCP because it was released months ago and there’s a ton of buzz around it still.

FAQ: IDK it still seems hard

Okay, maybe a diagram helps

Servers on left; clients on right. Redraw the arrows however you want.

graph LR Slack-->Claude["Claude app"] Slack-->Cursor Slack-->Code["Claude Code (coding agent)"] Salesforce-->Claude Spotify-->Claude Github-->Claude Github-->Cursor Github-->Code SQL-->Code Sharepoint-->Claude

Multi-Agents Are Out, PID Controllers Are In

2025-03-06 08:00:00

My hottest take is that multi-agents are a broken concept and should be avoided at all cost.

My only caveat is PID controllers; A multi agent system that does a 3-step process that looks something like Plan, Act, Verify in a loop. That can work.

Everything else is a devious plan to sell dev tools.

PID Controllers

First, “PID controller” is a term used by crusty old people and nobody doing AI knows what I’m talking about, sorry.

PID controllers are used in control systems. Like if you’re designing a guidance system in an airplane, or the automation in a nuclear power plant that keeps it in balance and not melting down. It stands for “proportional–integral–derivative” which is really not helpful here, so I’m going to oversimplify a lot:

A PID controller involves three steps:

graph TD Plan-->Act-->Verify-->Plan

Example: Nuclear power plant

  • Verify: Read sensors for temperature, pressure, power needs, etc. and inform the “Plan” step
  • Plan: Calculate how much to move the control rods to keep the system stable, alive, and not melting down
  • Act: Move the rods into our out of the chamber

PID controllers aren’t typically explained like this. Like I warned, I’m oversimplifying a lot. Normally, the focus is on integrals & derivatives; the “plan” step often directly computes how much it needs to change an actuator. The lesson you can carry from this is that even here, in AI agents, small incremental changes are beneficial to system stability (don’t fill the context with garbage).

There’s a whole lot that goes into PID controllers, many PhD’s have been minted for researching them. But the fundamentals apply widely to any long-running system that you want to keep stable.

Ya know, like agents.

Multi-Agents

An agent, in ‘25 parlance, is when you give an LLM a set of tools, a task, and loop it until it completes the task. (Yes, that does look a lot like a PID controller, more on that later).

A multi-agent is multiple agents working together in tandem to solve a problem.

In practice, which is the target of my scorn, a multi-agent is when you assign each agent a different role and then create complex workflows between them, often static. And then when you discover that the problem is more difficult than you thought, you add more agents and make the workflows more detailed and complex.

Multi-Agents Don’t Work

Why? Because they scale by adding complexity.

Here I should go on a tangent about the bitter lesson, an essay by Rich Sutton. It was addressed to AI researchers, and the gist is that when it comes down to scaling by (humans) thinking harder vs by computing more, the latter is always the better choice. His evidence is history, and the principle has held remarkably well over the years since it was written.

As I said, multi-agent systems tend to scale to harder problems by adding more agents and increasing the complexity of the workflows.

This goes against every bone in my engineering body. Complexity compounds your problems. Why would increasing the complexity solve anything? (tbf countless engineering teams over the years have tried anyway).

The correct way to scale is to make any one of your PID controller components better.

Plan better. Act more precisely. Verify more comprehensively.

Deep Research: A Multi-Agent Success Story

Han Xiao of Jina.ai wrote an absolutely fantastic article about the DeepSearch & DeepResearch copycats and how to implement one yourself. In it was this diagram:

Dear lord is that a PID controller? I think it is..

  • Reason = Plan
  • Search = Act
  • Read = Verify

The article also makes asks a crucial question:

But why did this shift happen now, when Deep(Re)Search remained relatively undervalued throughout 2024?

To which they conclude:

We believe the real turning point came with OpenAI’s o1-preview release in September 2024, … which enables models to perform more extensive internal deliberations, such as evaluating multiple potential answers, conducting deeper planning, and engaging in self-reflection before arriving at a final response.

In other words, DeepResearch knockoffs didn’t take off until reasoning models improved the capacity for planning

Cursor Agent

My sense of Cursor Agent, based only on using it, is that it also follows a similar PID controller pattern. Responses clearly (to me) seem to follow a Plan->Act->Verify flow, but the Act phase is more complex, with more tools:

  • Search code
  • Read file
  • [Re]write file
  • Run command

As far as I can tell, the “lint” feature didn’t used to exist. And in the release where they added the “lint” feature, the stability of the agents improved dramatically.

Also, releases in which they’ve improved Search functionality all seemed to have vastly improved the agent’s ability to achieve a goal.

Multi-Agent => Smarter Single-Agent

Claude Code, as far as I can tell, is not a multi-agent system. It still seems to perform each Plan, Act, Verify step, but each of the steps become fused into a single agent’s responsibility. And that agent just runs in a loop with tools.

I believe that the natural next step after a multi-agent PID system is to streamline it into a single agent system.

The reason should be intuitive, it’s less complexity. If the LLM is smart enough to handle the simpler architecture, then improving the agent is a matter of compute. Training an even smarter model (computing more) yields better agent performance. It’s the bitter lesson again.

How To Improve Agents

The answer is simple, though likely not easy:

  • Plan — Make the model better. Improved reasoning is a time-tested strategy.
  • Act — Improve how actions are performed. Better search, better code-writing, etc.
  • Verify — Improve your verification techniques. Add static analysis, unit tests, etc.

If your answer is to add more agents or create more complex workflows, you will not find yourself with a better agent system.

Final Thoughts

I do think there’s a world where we have true multi-agent systems, where a group of agents are dispatched to collaboratively solve a problem.

However, in that case the scaling dimension is work to be done. You create a team of agents when there’s too much work for a single agent to complete. Yes, the agents split responsibilities, but that’s an implementation detail toward scaling out meet the needs of the larger amount of work.

Note: There’s probably other design patterns. One that will likely be proven out soon is the “load balancer” pattern, where a team of agents all do work in parallel and then a coordinator/load balancer/merger agent combines the team’s work. For example, the team might be coding agents, all tackling different Github issues, and the coordinator agent is doing nothing but merging code and assigning tasks.

In the mean time, using multi-agents to solve increasingly complex problems is a dead end. Stop doing it.

Discussion

Target Practice: Resumes, But Better

2025-02-20 08:00:00

I recently got a job, but it was a bear going through rejections on repeat. It almost felt like nobody was even looking at my resume. Which made me think 🤔 that might be the case.

It turns out that hiring managers are swamped with stacks of resumes. Surprisingly (to me), they’re not really using AI to auto-reject, they just aren’t reading carefully.

If you’re a hiring manager with a stack of 200 resumes on your desk, how do you process them? I think I’d probably:

  1. Scan for the most critical info (e.g. years of experience, industry focus, tech stack, etc.)
  2. Read the remaining ones more carefully.

So you have to spoon feed the hiring manager. Sounds easy.

Except it’s not. One single resume won’t work, because it’s basically impossible to satisfy all potential job postings and also have it be succinct enough to properly spoon feed.

It seems you need to generate a different resume for every job opening. But that’s a ton of work. So I made a tool for myself, and I’m open sourcing it today. Here it is.

This breaks it down into 2 steps:

  1. A huge verbose “resume”, that’s more of a knowledge bank
  2. A targeted resume, generated to be tailored to each job posting

Step 1: The Big Resume

The flow is:

  1. Start with your existing resume
  2. For each job:
    1. Open a chat dialog
    2. AI offers some icebreaker questions, like “what challenges did you run into while developing Miopter Pengonals for Project Orion?”
    3. Answer the question. Well, just type anything really. The point isn’t to interview, it’s to get everything in your head down on paper.
    4. AI asks followup questions
    5. Repeat 3-4 for a few turns
    6. Review/edit summarized version & save
  3. Have the AI suggest skills and accomplishments based on these AI interviews

I’m not gonna lie, this is the most fun I’ve ever had writing a resume. Most of the time I want to tear my hair out from searching fruitlessly for something I did that can sound cool. But with this, you just kick back, relax, and brain dump like you’re talking with a friend over drinks.

And while all that is great, the most electrifying part was when it suggested accomplishments, and it struck me that, “dang, I’ve done some cool stuff, I never thought about that project that way”.

All of that, the summaries, the full conversations, all of it is stored alongside the normal resume items. For each job, I have like 30-40 skills and 8-12 accomplishments, mostly generated with some light editing.

Step 2: The Small Resume

The flow is:

  1. Upload a job posting
  2. Analyze the job posting for explicit and implied requirements. Again, this is an AI collaboration, where an AI can go off and do recon on the company.
  3. Generate resume.
  4. Review and edit
  5. Export to PDF

The strategy is to use as much as possible verbatim text from the big resume. So generally you put effort into the big resume, not the small one.

When generating, very little generation is happening. It’s mostly just selecting content from the big resume that’s pertinent to the specific job posting based on analyzed needs.

Side Effects

Outside of generating the small resume, I also had a huge amount of success throwing the entire Big Resume into NotebookLM and having it generate a podcast to help prep me for interviews (😍 they are so nice 🥰😘). I’ve also done the same thing with ChatGPT in search mode to run recon on interviewers to prep.

The big resume is an XML document. So you really can just throw it into any AI tool verbatim. I could probably make some export functionality, but this actually works very well.

Status

I’m open sourcing this because I got a job with it. It’s not done, it actually kinda sucks, but the approach to managing information is novel. Some people urged me to get VC funding and turn it into a product, but I’m tired and that just makes me feel even more tired. Idk, it can work, but something that excites me a lot is enabling others to thrive and not charging a dime.

The kinds of people who want to use it are also the kinds of people who might be motivated to bring it over the finish line. Right now, there’s a ton of tech people out of work, and thus a lot of people who are willing, able, and actually have enough time to contribute back. This could work.

Why use it? Because, at bare minimum you’ll end up recalling a lot of cool stuff you did.

Why contribute? Because, if you’re an engineer, you can put that on your resume too.

Again, if you missed it: Github Repo Here

LLaDA: LLMs That Don't Gaslight You

2025-02-17 08:00:00

A new AI architecture is challenging the status quo. LLaDA is a diffusion model that generates text. Normally diffusion models generate images or video (e.g. Stable Diffusion). By using diffusion for text, LLaDA addresses a lot of issues that LLMs are running into, like hallucinations and doom loops.

(Note: I pronounce it “yada”, the “LL” is a “y” sound like in Spanish, and it just seems appropriate for a language model, yada yada yada…)

LLMs write one word after the other in sequence. In LLaDA, on the other hand, words appear randomly. Existing words can also be edited or deleted before the generation terminates.

Example: “Explain what artificial intelligence is”

Loosely speaking, you can think about it as starting with an outline and filling in details across the entire output progressively until all the details are filled in.

Diffusion vs Autoregressive Langage Models

Traditional LLMs are autoregressive:

  • auto — self, in this case the output is the “self”, the output is also the input to the next token
  • regressive — make a prediction, e.g. “linear regression”

LLMs are autoregressive, meaning that all previous output is the input to the next word. So, it generates words one at a time.

That’s how it thinks, one word at a time. It can’t go back and “un-say” a word, it’s one-shotting everything top-to-bottom. The diffusion approach is unique in that it can back out and edit/delete lines of reasoning, kind of like writing drafts.

Thinking Concurrently

Since it’s writing everything at the same time, it’s inherently concurrent. Several thoughts are being developed at the same time globally across the entire output. That means that it’s easier for the model to be consistent and maintain a coherent line of thought.

Some problems benefit more than others. Text like employment agreements is mostly a hierarchy of sections. If you shuffled the sections, the contract would probably retain the same exact meaning. But it still needs to be globally coherent and consistent, that’s critical.

This part resonates with me. There’s clearly trade-offs between approaches. When writing blogs like this, I mostly write it top-to-bottom in a single pass. Because that’s what makes sense to me, it’s how it’s read. But when I review, I stand back, squint and think about it and how it flows globally, almost like manipulating shapes.

Doom Loops

In agents, or even long LLM chats, I’ll notice the LLM starts to go around in circles, suggesting things that already didn’t work, etc. LLaDA offers better global coherence. Because it writes via progressive enhancement instead of left-to-right, it’s able to view generation globally and ensure that the output makes sense and is coherent.

Error Accumulation

Since LLMs are autoregressive, a mistake early on can become a widening gap from reality.

Have you ever had an LLM gaslight you? It’ll hallucinate some fact, but then that hallucination becomes part of it’s input, so it assumes it’s truth and will try to convince you of the hallucinated fact.

That’s partly due to how LLMs are trained. In training, all the input is ground truth, so it learns to trust it’s input. But in inference, the input is it’s previous output, it’s not ground truth but the model treats it like it is. There’s mitigations you can do in post-training, but it’s a fundamental flaw in LLMs that must be faced.

LLaDA is free from this problem, because it’s trained to re-create the ground truth, not trust it unconditionally.

Problem: It’s Still Autoregressive

In practice, I’m not sure how much this global coherence is beneficial. For example, if you have a turn-based chat app, like ChatGPT, the AI answers are still going to depend on previous output. Even in agents, a tool call requires that the AI emit a tool call and then continue (re-enter) with the tool output as input to process it.

So with our current AI applications, we would immediately turn these diffusion models into autoregressive models, effectively.

We also started producing reasoning models (o3, R1, S1). In the reasoning traces, the LLM allows itself to make mistakes by using a passive unconvinced voice in the <think/> block prior to giving it’s final answer.

This effectively gives the LLM the ability to think globally for better coherence.

Not A Problem: Fixed Width

Initially I assumed this could only do fixed-width output. But it’s pretty easy to see how that’s not the case. Emitting a simple <|eot|> token to indicate the end of text/output is enough to get around this.

New Approaches

LLaDA’s biggest contribution is that it succinctly showed what part of LLMs do the heavy lifting — the language modeling.

Autoregressive modeling (ARM) is an implementation of maximum likelihood estimation (MLE). LLaDA showed that this is functionally the same as [KL divergence][kl], which is what LLaDA used. Any approach that models the probability relationships between tokens will work just as well.

There will be more approaches, with new & different trade-offs.

Conclusion

Watch this space. Keep an open mind. We may see some wild shifts in architecture soon. Maybe it’s diffusion models, maybe it’s some other equivalent architecture with a new set of trade-offs.

Discussion

Recursive Improvement: AI Singularity Or Just Benchmark Saturation?

2025-02-12 08:00:00

A fascinating new paper shows that LLMs can recursively self-improve. They can be trained on older versions of themselves and continuously get better. This immediately made me think, “this is it, it’s the AI singularity”, that moment when AI is able to autonomously self-improve forever and become a… (well that sentence can end a lot of ways)

Off the cuff, I don’t think it’s the singularity, but if this idea takes off then it’s going to look a lot like it. More on that later.

Self-Improvement

The idea is:

  1. Start with a baseline model
  2. Use it to generate questions & answers
  3. Use majority voting to filter out bad answers or low-quality questions
  4. Train on the new corpus
  5. GOTO 2

Yep, it goes forever.

Here’s an example, multiplying numbers together, with incrementally bigger numbers.

The yellow line (round 1) indicates base performance. The top purple line (round 10) is after blindly training without filtering. That cliff on round 10 is what model collapse looks like. They call it the error avalanche.

But performance doesn’t drop off immediately, it remains perfect for a couple rounds before dropping off. This is the key insight. If you generate problems that are just a little harder, then you can easily filter and keep pushing performance further.

When a single LLM evaluates correctness, the probability of a mistake is somewhat high. But with majority voting, as you add voters that probability is driven down toward zero. At some point it’s low enough to make it a cost effective strategy.

(No, they didn’t clarify how many voters are needed)

Limitations

Okay, what can’t this do?

The problems have to have an incremental nature. e.g. They multiplied larger and larger numbers, or tweaked paths through a maze to make them slightly more complex. If you can’t break problems down, they likely won’t work for this.

Also, problems have to have a clear answer. Or at least, the voters should be able to unambiguously vote on the correctness of an answer.

So this might not work well with creative writing, where stories aren’t clearly right or wrong. And even if they were it’s not easy to make a story only slightly more complex.

Another elephant in the room — cost. Recall that R1 went to great lengths to avoid using an external LLM during RL training, mainly to control costs. But also recall that companies are scaling up to super-sized datacenters. This cost has definitely been factored in.

Benchmark Saturation

As far as I can tell, most benchmarks fit within those limitations, and so will be saturated. They’re typically clear and unambiguously correct, otherwise the questions couldn’t be used as a benchmark. My sense is that they’re typically decomposable problems, the kind that could be tweaked to be made slightly more complex.

If this recursive improvement becomes a thing, I imagine that most benchmarks are going to be quickly saturated. Saturated benchmarks are as good as no benchmarks.

It’s going to look like insane progress, but I don’t think it’s the singularity. The paper didn’t talk at all about emergent behavior. In fact it assumes that a behavior has already emerged in order to bootstrap the process. But once it’s emerged, this process can max out it’s potential.

It seems like agents might be a rich place to find problems that fit this mold well. The trouble is going to be creating benchmarks fast enough.

My hunch is that, going forward, we’ll lean on reinforcement learning (RL) to force behaviors to emerge, and then use some form of recursive self-improvement fine tuning to max out that behavior.

This year just keeps getting wilder..