Blog of Armin Ronacher

I'm currently located in Austria and working as a Director of Engineering for Sentry. Aside from that I do open source development.

LLM APIs are a Synchronization Problem

2025-11-22 08:00:00

The more I work with large language models through provider-exposed APIs, the more I feel like we have built ourselves into quite an unfortunate API surface area. It might not actually be the right abstraction for what’s happening under the hood. The way I like to think about this problem now is that it’s actually a distributed state synchronization problem.

At its core, a large language model takes text, tokenizes it into numbers, and feeds those tokens through a stack of matrix multiplications and attention layers on the GPU. Using a large set of fixed weights, it produces activations and predicts the next token. If it weren’t for temperature (randomization), you could think of it as having the potential to be a much more deterministic system, at least in principle.

As far as the core model is concerned, there’s no magical distinction between “user text” and “assistant text”—everything is just tokens. The only difference comes from special tokens and formatting that encode roles (system, user, assistant, tool), injected into the stream via the prompt template. You can look at the system prompt templates on Ollama for the different models to get an idea.
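
To make that concrete, here is a rough sketch of how a JSON message list might get flattened into a single stream before tokenization. It loosely imitates a ChatML-style template; the actual markers and formatting differ per model, so treat it as illustrative only.

// Illustrative only: flatten chat messages into one prompt string the way a
// ChatML-style template might. Real templates and special tokens vary per model.
function renderPrompt(messages) {
  const turns = messages
    .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>`)
    .join("\n");
  // leave the assistant turn open so the model continues from here
  return `${turns}\n<|im_start|>assistant\n`;
}

renderPrompt([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Hi!" },
]);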

The Basic Agent State

Let’s ignore for a second which APIs already exist and just think about what usually happens in an agentic system. If I were to have my LLM run locally on the same machine, there is still state to be maintained, but that state is very local to me. You’d maintain the conversation history as tokens in RAM, and the model would keep a derived “working state” on the GPU — mainly the attention key/value cache built from those tokens. The weights themselves stay fixed; what changes per step are the activations and the KV cache.

One further clarification: when I talk about state I don’t just mean the visible token history because the model also carries an internal working state that isn’t captured by simply re-sending tokens. In other words: you can replay the tokens and regain the text content, but you won’t restore the exact derived state the model had built.

From a mental-model perspective, caching means “remember the computation you already did for a given prefix so you don’t have to redo it.” Internally, that usually means storing the attention KV cache for those prefix tokens on the server and letting you reuse it, not literally handing you raw GPU state.
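
As a mental model only (this is not how GPU memory actually works), you can picture the provider keeping a lookup from token prefixes to the work already done for them, so each new turn only pays for the suffix:

// Conceptual sketch: prefix caching as "remember the work done for a known prefix".
// The real mechanism stores attention KV state server-side, not JavaScript objects.
const prefixCache = new Map();

// stand-in for the actual forward pass over tokens that still need computing
const computeState = (previous, newTokens) => ({
  tokensSeen: (previous ? previous.tokensSeen : 0) + newTokens.length,
});

function step(tokens) {
  // find the longest prefix we have already computed
  let reuse = 0;
  for (let i = tokens.length; i > 0; i--) {
    if (prefixCache.has(tokens.slice(0, i).join(" "))) { reuse = i; break; }
  }
  const cached = prefixCache.get(tokens.slice(0, reuse).join(" "));
  const state = computeState(cached, tokens.slice(reuse)); // only the suffix is new work
  prefixCache.set(tokens.join(" "), state);
  return state;
}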

There are probably some subtleties to this that I’m missing, but I think this is a pretty good model to think about it.

The Completion API

The moment you’re working with completion-style APIs such as OpenAI’s or Anthropic’s, abstractions are put in place that make things a little different from this very simple system. The first difference is that you’re not actually sending raw tokens around. The way the GPU looks at the conversation history and the way you look at it are on fundamentally different levels of abstraction. While you could count and manipulate tokens on one side of the equation, extra tokens are being injected into the stream that you can’t see. Some of those tokens come from converting the JSON message representation into the underlying input tokens fed into the machine. But you also have things like tool definitions, which are injected into the conversation in proprietary ways. Then there’s out-of-band information such as cache points.

And beyond that, there are tokens you will never see. For instance, with reasoning models you often don’t see any real reasoning tokens, because some LLM providers try to hide as much as possible so that you can’t retrain your own models with their reasoning state. On the other hand, they might give you some other informational text so that you have something to show to the user. Model providers also love to hide search results and how those results were injected into the token stream. Instead, you only get an encrypted blob back that you need to send back to continue the conversation. All of a sudden, you need to take some information on your side and funnel it back to the server so that state can be reconciled on either end.
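
As a rough picture of the mismatch, here is what a completion-style request looks like from the client side. The shape loosely follows the common JSON message format; the comments mark what the provider injects or withholds behind the scenes:

// Roughly the visible half of a completion-style request. Everything noted in the
// comments happens behind the scenes and never appears in your message list.
const request = {
  model: "some-model",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarize the attached report." },
    // <- wrapped into the provider's prompt template with special role tokens
  ],
  tools: [
    {
      name: "web_search",
      description: "Search the web",
      parameters: { type: "object", properties: { query: { type: "string" } } },
      // <- serialized into the token stream in a provider-specific format
    },
  ],
  // reasoning tokens, provider-side search results, and cache points are not
  // visible here; at best you get opaque or encrypted blobs to round-trip.
};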

In completion-style APIs, each new turn requires resending the entire prompt history. The size of each individual request grows linearly with the number of turns, but the cumulative amount of data sent over a long conversation grows quadratically because each linear-sized history is retransmitted at every step. This is one of the reasons long chat sessions feel increasingly expensive. On the server, the model’s attention cost over that sequence also grows quadratically in sequence length, which is why caching starts to matter.

The Responses API

One of the ways OpenAI tried to address this problem was to introduce the Responses API, which maintains the conversational history on the server (at least in the version with the saved state flag). But now you’re in a bizarre situation where you’re fully dealing with state synchronization: there’s hidden state on the server and state on your side, but the API gives you very limited synchronization capabilities. To this point, it remains unclear to me how long you can actually continue that conversation. It’s also unclear what happens if there is state divergence or corruption. I’ve seen the Responses API get stuck in ways where I couldn’t recover it. It’s also unclear what happens if there’s a network partition, or if one side got the state update but the other didn’t. The Responses API with saved state is quite a bit harder to use, at least as it’s currently exposed.
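
For reference, continuing a stored conversation looks roughly like this with OpenAI’s SDK (parameter names as I understand the Responses API; treat the details as approximate):

// Sketch of server-held conversation state with the Responses API.
// The server keeps the history; you only pass a pointer to it.
import OpenAI from "openai";
const client = new OpenAI();

const first = await client.responses.create({
  model: "gpt-4.1",
  input: "Start analyzing the dataset I described earlier.",
  store: true, // ask the server to persist the conversation state
});

const followUp = await client.responses.create({
  model: "gpt-4.1",
  previous_response_id: first.id, // continue from hidden server-side state
  input: "Now summarize what you found.",
  store: true,
});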

Obviously, for OpenAI it’s great because it allows them to hide more behind-the-scenes state that would otherwise have to be funneled through with every conversation message.

State Sync API

Regardless of whether you’re using a completion-style API or the Responses API, the provider always has to inject additional context behind the scenes—prompt templates, role markers, system/tool definitions, sometimes even provider-side tool outputs—that never appears in your visible message list. Different providers handle this hidden context in different ways, and there’s no common standard for how it’s represented or synchronized. The underlying reality is much simpler than the message-based abstractions make it look: if you run an open-weights model yourself, you can drive it directly with token sequences and design APIs that are far cleaner than the JSON-message interfaces we’ve standardized around. The complexity gets even worse when you go through intermediaries like OpenRouter or SDKs like the Vercel AI SDK, which try to mask provider-specific differences but can’t fully unify the hidden state each provider maintains. In practice, the hardest part of unifying LLM APIs isn’t the user-visible messages—it’s that each provider manages its own partially hidden state in incompatible ways.

It really comes down to how you pass this hidden state around in one form or another. I understand that from a model provider’s perspective, it’s nice to be able to hide things from the user. But synchronizing hidden state is tricky, and none of these APIs have been built with that mindset, as far as I can tell. Maybe it’s time to start thinking about what a state synchronization API would look like, rather than a message-based API.
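
I don’t claim to know what that API should look like, but as a thought experiment it might feel less like “send messages, get a message back” and more like syncing an append-only log with explicit acknowledgements. Everything below is made up:

// Hypothetical sketch of a sync-style API; none of these endpoints or objects exist.
// The conversation is an append-only log; both sides exchange entries (including
// opaque provider-side ones) and acknowledge which revision they have seen.
const session = await llm.openSession({ model: "some-model" });

await session.append([{ kind: "user_text", text: "Hello" }]); // local entries
const { entries, revision } = await session.pull();           // provider entries, some opaque
await session.ack(revision);                                  // confirm what we hold locally

// if the provider drops its copy, the full local log can be replayed to rebuild state
await session.replay(session.log);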

The more I work with these agents, the more I feel like I don’t actually need a unified message API. The core idea of it being message-based in its current form is itself an abstraction that might not survive the passage of time.

Learn From Local First?

There’s a whole ecosystem that has dealt with this kind of mess before: the local-first movement. Those folks spent a decade figuring out how to synchronize distributed state across clients and servers that don’t trust each other, drop offline, fork, merge, and heal. Peer-to-peer sync and conflict-free replicated storage engines all exist because “shared state but with gaps and divergence” is a hard problem that nobody could solve with naive message passing. Their architectures explicitly separate canonical state, derived state, and transport mechanics — exactly the kind of separation missing from most LLM APIs today.

Some of those ideas map surprisingly well to models: KV caches resemble derived state that could be checkpointed and resumed; prompt history is effectively an append-only log that could be synced incrementally instead of resent wholesale; provider-side invisible context behaves like a replicated document with hidden fields.

At the same time though, if the remote state gets wiped because the remote site doesn’t want to hold it for that long, we would want to be in a situation where we can replay it entirely from scratch — which for instance the Responses API today does not allow.

Future Unified APIs

There’s been plenty of talk about unifying message-based APIs, especially in the wake of MCP (Model Context Protocol). But if we ever standardize anything, it should start from how these models actually behave, not from the surface conventions we’ve inherited. A good standard would acknowledge hidden state, synchronization boundaries, replay semantics, and failure modes — because those are real issues. There is always the risk that we rush to formalize the current abstractions and lock in their weaknesses and faults. I don’t know what the right abstraction looks like, but I’m increasingly doubtful that the status-quo solutions are the right fit.

Agent Design Is Still Hard

2025-11-21 08:00:00

I felt like it might be a good time to write about some new things I’ve learned. Most of this is going to be about building agents, with a little bit about using agentic coding tools.

TL;DR: Building agents is still messy. SDK abstractions break once you hit real tool use. Caching works better when you manage it yourself, but differs between models. Reinforcement ends up doing more heavy lifting than expected, and failures need strict isolation to avoid derailing the loop. Shared state via a file-system-like layer is an important building block. Output tooling is surprisingly tricky, and model choice still depends on the task.

Which Agent SDK To Target?

When you build your own agent, you have the choice of targeting an underlying SDK like the OpenAI SDK or the Anthropic SDK, or you can go with a higher level abstraction such as the Vercel AI SDK or Pydantic. The choice we made a while back was to adopt the Vercel AI SDK but only the provider abstractions, and to basically drive the agent loop ourselves. At this point we would not make that choice again. There is absolutely nothing wrong with the Vercel AI SDK, but when you are trying to build an agent, two things happen that we originally didn’t anticipate:

The first is that the differences between models are significant enough that you will need to build your own agent abstraction. We have not found any of the solutions from these SDKs that build the right abstraction for an agent. I think this is partly because, despite the basic agent design being just a loop, there are subtle differences based on the tools you provide. These differences affect how easy or hard it is to find the right abstraction (cache control, different requirements for reinforcement, tool prompts, provider-side tools, etc.). Because the right abstraction is not yet clear, using the original SDKs from the dedicated platforms keeps you fully in control. With some of these higher-level SDKs you have to build on top of their existing abstractions, which might not be the ones you actually want in the end.

We also found it incredibly challenging to work with the Vercel SDK when it comes to dealing with provider-side tools. The attempted unification of messaging formats doesn’t quite work. For instance, the web search tool from Anthropic routinely destroys the message history with the Vercel SDK, and we haven’t yet fully figured out the cause. Also, in Anthropic’s case, cache management is much easier when targeting their SDK directly instead of the Vercel one. The error messages when you get things wrong are much clearer.

This might change, but right now we would probably not use an abstraction when building an agent, at least until things have settled down a bit. The benefits do not yet outweigh the costs for us.

Someone else might have figured it out. If you’re reading this and think I’m wrong, please drop me a mail. I want to learn.

Caching Lessons

The different platforms have very different approaches to caching. A lot has been said about this already, but Anthropic makes you pay for caching. It makes you manage cache points explicitly, and this really changes the way you interact with it from an agent engineering level. I initially found the manual management pretty dumb. Why doesn’t the platform do this for me? But I’ve fully come around and now vastly prefer explicit cache management. It makes costs and cache utilization much more predictable.

Explicit caching allows you to do certain things that are much harder otherwise. For instance, you can split off a conversation and have it run in two different directions simultaneously. You also have the opportunity to do context editing. The optimal strategy here is unclear, but you clearly have a lot more control, and I really like having that control. It also makes it much easier to understand the cost of the underlying agent. You can assume much more about how well your cache will be utilized, whereas with other platforms we found it to be hit and miss.

The way we do caching in the agent with Anthropic is pretty straightforward. One cache point is after the system prompt. Two cache points are placed at the beginning of the conversation, where the last one moves up with the tail of the conversation. And then there is some optimization along the way that you can do.

Because the system prompt and the tool selection now have to be mostly static, we feed a dynamic message later to provide information such as the current time. Otherwise, this would trash the cache. We also leverage reinforcement during the loop much more.
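
To make the cache point placement concrete, here is a minimal sketch of what explicit caching looks like when targeting the Anthropic SDK directly. The model name and content are placeholders, and only two of the cache points are shown:

// Minimal sketch of explicit cache points with the Anthropic SDK.
// One breakpoint after the static system prompt, one that moves with the tail.
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

const olderTurns = []; // earlier turns of the conversation go here

const response = await client.messages.create({
  model: "claude-sonnet-4-5", // placeholder model name
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "Static system prompt and tool instructions...",
      cache_control: { type: "ephemeral" }, // cache point after the system prompt
    },
  ],
  messages: [
    ...olderTurns,
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Current time: 2025-11-21 09:00 UTC. Continue with the task.",
          cache_control: { type: "ephemeral" }, // tail cache point that moves up each turn
        },
      ],
    },
  ],
});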

Reinforcement In The Agent Loop

Every time the agent runs a tool you have the opportunity to not just return data that the tool produces, but also to feed more information back into the loop. For instance, you can remind the agent about the overall objective and the status of individual tasks. You can also provide hints about how the tool call might succeed when a tool fails. Another use of reinforcement is to inform the system about state changes that happened in the background. If you have an agent that uses parallel processing, you can inject information after every tool call when that state changed and when it is relevant for completing the task.
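
Mechanically, this is just appending extra content next to the tool result before the next model call. A rough sketch (the helpers and message shapes are schematic, not tied to any specific SDK):

// Sketch: after each tool call, return the tool output plus a reinforcement message
// that restates the objective, task status, and any background state changes.
// executeTool and the message shapes stand in for whatever your agent loop uses.
async function runToolWithReinforcement(toolCall, state, executeTool) {
  const output = await executeTool(toolCall);

  const reminders = [
    `Objective: ${state.objective}`,
    `Open tasks: ${state.openTasks.join(", ")}`,
  ];
  if (state.backgroundChanges.length > 0) {
    reminders.push(`Note: the environment changed: ${state.backgroundChanges.join("; ")}`);
  }

  return [
    { role: "tool", content: [{ toolCallId: toolCall.id, output }] },
    { role: "user", content: reminders.join("\n") }, // the reinforcement
  ];
}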

Sometimes it’s enough for the agent to self-reinforce. In Claude Code, for instance, the todo write tool is a self-reinforcement tool. All it does is take the list of tasks the agent thinks it should do and echo back what came in. It’s basically just an echo tool; it really doesn’t do anything else. But that is enough to drive the agent forward better than if the tasks and subtasks had only been given at the beginning of the context and too much has happened in the meantime.
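
A self-reinforcing todo tool really can be that small; here is a sketch of the whole thing:

// Sketch of a todo/echo tool: it does nothing but repeat the task list back,
// which puts the plan at the end of the context where the model pays attention to it.
const todoWriteTool = {
  name: "todo_write",
  description: "Record the current task list and the status of each item.",
  parameters: {
    type: "object",
    properties: {
      todos: {
        type: "array",
        items: {
          type: "object",
          properties: {
            task: { type: "string" },
            status: { type: "string", enum: ["pending", "in_progress", "done"] },
          },
        },
      },
    },
  },
  execute: async ({ todos }) => ({ todos }), // just echo the input back
};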

We also use reinforcements to inform the system if the environment changed during execution in a way that’s problematic for the agent. For instance, if our agent fails and retries from a certain step forward but the recovery operates off broken data, we inject a message informing it that it might want to back off a couple of steps and redo an earlier step.

Isolate Failures

If you expect a lot of failures during code execution, there is an opportunity to hide those failures from the context. This can happen in two ways. One is to run tasks that might require iteration individually. You would run them in a subagent until they succeed and only report back the success, plus maybe a brief summary of approaches that did not work. It is helpful for an agent to learn about what did not work in a subtask because it can then feed that information into the next task to hopefully steer away from those failures.

The second option doesn’t exist in all agents or foundation models, but with Anthropic you can do context editing. So far we haven’t had a lot of success with context editing, but we believe it’s an interesting thing we would love to explore more. We would also love to learn if people have success with it. What is interesting about context editing is that you should be able to preserve tokens for further down the iteration loop. You can take out of the context certain failures that didn’t drive towards successful completion of the loop, but only negatively affected certain attempts during execution. But as with the point I made earlier: it is also useful for the agent to understand what didn’t work, but maybe it doesn’t require the full state and full output of all the failures.

Unfortunately, context editing will automatically invalidate caches. There is really no way around it. So it can be unclear when the trade-off of doing that compensates for the extra cost of trashing the cache.

Sub Agents / Sub Inference

As I mentioned a couple of times on this blog already, most of our agents are based on code execution and code generation. That really requires a common place for the agent to store data. Our choice is a file system—in our case a virtual file system—but that requires different tools to access it. This is particularly important if you have something like a subagent or subinference.

You should try to build an agent that doesn’t have dead ends. A dead end is where a task can only continue executing within the sub-tool that you built. For instance, you might build a tool that generates an image, but is only able to feed that image back into one more tool. That’s a problem because you might then want to put those images into a zip archive using the code execution tool. So there needs to be a system that allows the image generation tool to write the image to the same place where the code execution tool can read it. In essence, that’s a file system.

Obviously it has to go the other way around too. You might want to use the code execution tool to unpack a zip archive and then go back to inference to describe all the images so that the next step can go back to code execution and so forth. The file system is the mechanism that we use for that. But it does require tools to be built in a way that they can take file paths to the virtual file system to work with.

So basically an ExecuteCode tool would have access to the same file system as the RunInference tool which could take a path to a file on that same virtual file system.
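
A sketch of that shape, with made-up names; the point is the shared substrate, not the specific API:

// Sketch: one virtual file system shared by every tool so no tool is a dead end.
// runInSandbox, describeWithModel, and generateImageBytes are hypothetical helpers.
const vfs = {
  files: new Map(),
  read(path) { return this.files.get(path); },
  write(path, bytes) { this.files.set(path, bytes); },
};

const tools = {
  // code execution reads and writes artifacts on the shared VFS
  async ExecuteCode({ code }) {
    return runInSandbox(code, { fs: vfs });
  },
  // inference-style tools accept paths into the very same VFS
  async RunInference({ prompt, path }) {
    return describeWithModel(prompt, vfs.read(path));
  },
  // generators drop their results where any other tool can pick them up
  async GenerateImage({ prompt, outputPath }) {
    vfs.write(outputPath, await generateImageBytes(prompt));
    return { path: outputPath };
  },
};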

The Use Of An Output Tool

One interesting thing about how we structured our agent is that it does not represent a chat session. It will eventually communicate something to the user or the outside world, but all the messages that it sends in between are usually not revealed. The question is: how does it create that message? We have one tool which is the output tool. The agent uses it explicitly to communicate to the human. We then use a prompt to instruct it when to use that tool. In our case the output tool sends an email.

But that turns out to pose a few other challenges. One is that it’s surprisingly hard to steer the wording and tone of that output tool compared to just using the main agent loop’s text output as the mechanism to talk to the user. I cannot say why this is, but I think it’s probably related to how these models are trained.

One attempt that didn’t work well was to have the output tool run another quick LLM like Gemini 2.5 Flash to adjust the tone to our preference. But this increases latency and actually reduces the quality of the output. In part, I think the model just doesn’t word things correctly and the subtool doesn’t have sufficient context. Providing more slices of the main agentic context into the subtool makes it expensive and also didn’t fully solve the problem. It also sometimes reveals information in the final output that we didn’t want to be there, like the steps that led to the end result.

Another problem with an output tool is that sometimes it just doesn’t call the tool. One of the ways in which we’re forcing this is we remember if the output tool was called. If the loop ends without the output tool, we inject a reinforcement message to encourage it to use the output tool.
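
That nudge is simple bookkeeping at the end of the loop. Roughly (the helper and tool names are made up):

// Sketch: track whether the output tool was called; if the loop wants to stop
// without it, inject a reinforcement message and give the model another turn.
// runSingleStep stands in for one iteration of your agent loop.
async function runAgent(task, runSingleStep) {
  const MAX_STEPS = 20;
  let outputToolCalled = false;
  const messages = [{ role: "user", content: task }];

  for (let i = 0; i < MAX_STEPS; i++) {
    const { newMessages, toolCalls, done } = await runSingleStep(messages);
    messages.push(...newMessages);
    if (toolCalls.some((c) => c.name === "send_output")) outputToolCalled = true;

    if (done) {
      if (outputToolCalled) break;
      messages.push({
        role: "user",
        content: "You have not communicated the result yet. Use the send_output tool now.",
      });
    }
  }
  return messages;
}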

Model Choice

Overall our choices for models haven’t dramatically changed so far. I think Haiku and Sonnet are still the best tool callers available, so they make for excellent choices in the agent loop. They are also somewhat transparent with regards to what the RL looks like. The other obvious choices are the Gemini models. We so far haven’t found a ton of success with the GPT family of models for the main loop.

For the individual sub-tools, which in part might also require inference, our current choice is Gemini 2.5 if you need to summarize large documents or work with PDFs and things like that. That is also a pretty good model for extracting information from images, in particular because the Sonnet family of models likes to run into a safety filter which can be annoying.

There’s also the fairly obvious realization that token cost alone doesn’t define how expensive an agent is. A better tool caller will do the job in fewer tokens. There are models available today that are cheaper than Sonnet, but they are not necessarily cheaper in a loop.

But all things considered, not that much has changed in the last couple of weeks.

Testing and Evals

We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there’s too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here. Unfortunately, I have to report that at the moment we haven’t found something that really makes us happy. I hope we’re going to find a solution for this because it is becoming an increasingly frustrating aspect of building an agent.

Coding Agent Updates

As for my experience with coding agents, not really all that much has changed. The main new development is that I’m trialing Amp more. In case you’re curious why: it’s not that it’s objectively a better agent than what I’m using, but I really quite like the way they’re thinking about agents from what they’re posting. The interaction of the different sub agents, like the Oracle, with the main loop is beautifully done, and not many other harnesses do this today. It’s also a good way for me to validate how different agent designs work. Amp, similar to Claude Code, really feels like a product built by people who also use their own tool. I do not feel every other agent in the industry does this.

Quick Stuff I Read And Found

That’s just a random assortment of things that I feel might also be worth sharing:

  • What if you don’t need MCP at all?: Mario argues that many MCP servers are overengineered and include large toolsets that consume lots of context. He proposes a minimalist approach for browser-agent use-cases by relying on simple CLI tools (e.g., start, navigate, evaluate JS, screenshot) executed via Bash, which keeps token usage small and workflows flexible. I built a Claude/Amp Skill out of it.
  • The fate of “small” open source: The author argues that the age of tiny, single-purpose open-source libraries is coming to an end, largely because built-in platform APIs and AI tools can now generate simple utilities on demand. Thank fucking god.
  • Tmux is love. There is no article that goes with it, but the TLDR is that Tmux is great. If you have anything that remotely looks like an interactive system that an agent should work with, you should give it some Tmux skills.
  • LLM APIs are a Synchronization Problem. This was a separate realization that was too long for this post, so I wrote a separate one.

Absurd Workflows: Durable Execution With Just Postgres

2025-11-03 08:00:00

It’s probably no surprise to you that we’re building agents somewhere. Everybody does it. Building a good agent, however, brings back some of the historic challenges involving durable execution.

Entirely unsurprisingly, a lot of people are now building durable execution systems. Many of these, however, are incredibly complex and require you to sign up for another third-party service. I generally try to avoid bringing in extra complexity if I can avoid it, so I wanted to see how far I can go with just Postgres. To this end, I wrote Absurd 1, a tiny SQL-only library with a very thin SDK to enable durable workflows on top of just Postgres — no extension needed.

Durable Execution 101

Durable execution (or durable workflows) is a way to run long-lived, reliable functions that can survive crashes, restarts, and network failures without losing state or duplicating work. Durable execution can be thought of as the combination of a queue system and a state store that remembers the most recently seen execution state.

Because Postgres is excellent at queues thanks to SELECT ... FOR UPDATE SKIP LOCKED, you can use it for the queue (e.g., with pgmq). And because it’s a database, you can also use it to store the state.
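
The queue half really is a single query pattern. Here is a minimal sketch with node-postgres and a hypothetical tasks table (this is not Absurd’s actual schema):

// Claim-one-job sketch on top of Postgres using node-postgres.
// The tasks table layout is hypothetical, not what absurd.sql creates.
import pg from "pg";
const pool = new pg.Pool();

async function claimNextTask(queue) {
  const { rows } = await pool.query(
    `UPDATE tasks
        SET status = 'running', started_at = now()
      WHERE id = (
        SELECT id FROM tasks
         WHERE queue = $1 AND status = 'pending'
         ORDER BY created_at
         FOR UPDATE SKIP LOCKED
         LIMIT 1
      )
      RETURNING *`,
    [queue]
  );
  return rows[0] ?? null; // null means nothing to work on right now
}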

The state is important. With durable execution, instead of running your logic in memory, the goal is to decompose a task into smaller pieces (step functions) and record every step and decision. When the process stops (whether it fails, intentionally suspends, or a machine dies) the engine can replay those events to restore the exact state and continue where it left off, as if nothing happened.

Absurd At A High Level

Absurd at the core is a single .sql file (absurd.sql) which needs to be applied to a database of your choice. That SQL file’s goal is to move the complexity of SDKs into the database. SDKs then make the system convenient by abstracting the low-level operations in a way that leverages the ergonomics of the language you are working with.

The system is very simple: A task dispatches onto a given queue from where a worker picks it up to work on. Tasks are subdivided into steps, which are executed in sequence by the worker. Tasks can be suspended or fail, and when that happens, they execute again (a run). The result of a step is stored in the database (a checkpoint). To avoid repeating work, checkpoints are automatically loaded from the state storage in Postgres again.

Additionally, tasks can sleep or suspend for events and wait until they are emitted. Events are cached, which means they are race-free.

With Agents

What is the relationship of agents with workflows? Normally, workflows are DAGs defined by a human ahead of time. AI agents, on the other hand, define their own adventure as they go. That means they are basically a workflow with mostly a single step that iterates over changing state until it determines that it has completed. Absurd enables this by automatically counting up steps if they are repeated:

absurd.registerTask({name: "my-agent"}, async (params, ctx) => {
  let messages = [{role: "user", content: params.prompt}];
  let step = 0;
  while (step++ < 20) {
    const { newMessages, finishReason } = await ctx.step("iteration", async () => {
      return await singleStep(messages);
    });
    messages.push(...newMessages);
    if (finishReason !== "tool-calls") {
      break;
    }
  }
});

This defines a single task named my-agent, and it has just a single step. The return value is the changed state, but the current state is passed in as an argument. Every time the step function is executed, the data is looked up first from the checkpoint store. The first checkpoint will be iteration, the second iteration#2, iteration#3, etc. Each state only stores the new messages it generated, not the entire message history.

If a step fails, the task fails and will be retried. And because of checkpoint storage, if you crash in step 5, the first 4 steps will be loaded automatically from the store. Steps are never retried, only tasks.

How do you kick it off? Simply enqueue it:

await absurd.spawn("my-agent", {
  prompt: "What's the weather like in Boston?"
}, {
  maxAttempts: 3,
});

And if you are curious, this is an example implementation of the singleStep function used above:

Single step function
// This example assumes the Vercel AI SDK: generateText from "ai" and the
// Anthropic provider from "@ai-sdk/anthropic".
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

async function singleStep(messages) {
  const result = await generateText({
    model: anthropic("claude-haiku-4-5"),
    system: "You are a helpful agent",
    messages,
    tools: {
      getWeather: { /* tool definition here */ }
    },
  });

  const newMessages = (await result.response).messages;
  const finishReason = await result.finishReason;

  if (finishReason === "tool-calls") {
    const toolResults = [];
    for (const toolCall of result.toolCalls) {
      /* handle tool calls here */
      if (toolCall.toolName === "getWeather") {
        const toolOutput = await getWeather(toolCall.input);
        toolResults.push({
          toolName: toolCall.toolName,
          toolCallId: toolCall.toolCallId,
          type: "tool-result",
          output: {type: "text", value: toolOutput},
        });
      }
    }
    newMessages.push({
      role: "tool",
      content: toolResults
    });
  }

  return { newMessages, finishReason };
}

Events and Sleeps

And like Temporal and other solutions, you can yield if you want. If you want to come back to a problem in 7 days, you can do so:

await ctx.sleep(60 * 60 * 24 * 7); // sleep for 7 days

Or if you want to wait for an event:

const eventName = `email-confirmation-${userId}`;
try {
  const payload = await ctx.waitForEvent(eventName, {timeout: 60 * 5});
  // handle event payload
} catch (e) {
  if (e instanceof TimeoutError) {
    // handle timeout
  } else {
    throw e;
  }
}

Which someone else can emit:

const eventName = `email-confirmation-${userId}`;
await absurd.emitEvent(eventName, { confirmedAt: new Date().toISOString() });

That’s it!

Really, that’s it. There is really not much to it. It’s just a queue and a state store — that’s all you need. There is no compiler plugin and no separate service or whole runtime integration. Just Postgres. That’s not to throw shade on these other solutions; they are great. But not every problem necessarily needs to scale to that level of complexity, and you can get quite far with much less. Particularly if you want to build software that other people should be able to self-host, that might be quite appealing.

  1. It’s named Absurd because durable workflows are absurdly simple, but have been overcomplicated in recent years.

Regulation Isn’t the European Trap — Resignation Is

2025-10-21 08:00:00

Plenty has been written about how hard it is to build in Europe versus the US. The list is always the same, with little progress: brittle politics, dense bureaucracy, mandatory notaries, endless and rigid KYC and AML processes. Fine. I know, you know.

I’m not here to add another complaint to the pile (but if we meet over a beer or coffee, I’m happy to unload a lot of hilarious anecdotes on you). The unfortunate reality is that most of these constraints won’t change in my lifetime and maybe ever. Europe is not culturally aligned with entrepreneurship, it’s opposed to the idea of employee equity, and our laws reflect that.

What bothers me isn’t the rules — it’s the posture that develops from them within people who should know better. Across the system, everyone points at someone else. If a process takes 10 steps, you’ll find 10 people who feel absolved of responsibility because they can cite 9 other blockers. Friction becomes a moral license to do a mediocre job (while lamenting about it).

The vibe is: “Because the system is slow, I can be slow. Because there are rules, I don’t need judgment. Because there’s risk, I don’t need initiative.” And then we all nod along and nothing moves.

There are excellent people here; I’ve worked with them. But they are fighting upstream against a default of low agency. When the process is bad, too many people collapse into it. Communication narrows to the shortest possible message. Friday after 2pm, the notary won’t reply — and the notary surely will blame labor costs or regulation for why service ends there. The bank will cite compliance for why they don’t need to do anything. The registrar will point at some law that allows them to demand a translation of a document by a court appointed translator. Everyone has a reason. No one owns the outcome.

Meanwhile, in the US, our counsel replies when it matters, even after hours. Bankers answer the same day. The instinct is to enable progress, not enumerate reasons you can’t have it. The goal is the outcome and the rules are constraints to navigate, not a shield to hide behind.

So what’s the point? I can’t fix politics. What I can do: act with agency, and surround myself with people who do the same and speak in support of it. Work with those who start from “how do we make this work?” not “why this can’t work.” Name the absurdities without using them as cover. Be transparent, move anyway and tell people.

Nothing stops a notary from designing an onboarding flow that gets an Austrian company set up in five days — standardized KYC packets, templated resolutions, scheduled signing slots, clear checklists, async updates, a bias for same-day feedback. That could exist right now. It rarely does or falls short.

Yes, much in Europe is objectively worse for builders. We have to accept it. Then squeeze everything you can from what is in your control:

  • Own the handoff. When you’re step 3 of 10, behave like step 10 depends on you and behave like you control all 10 steps. Anticipate blockers further down the line. Move same day. Eliminate ambiguity. Close loops.
  • Default to clarity. Send checklists. Preempt the next two questions. Reduce the number of touches.
  • Model urgency without theatrics. Be calm, fast, and precise. Don’t make your customer chase you.
  • Use judgment. Rules exist and we can’t break them all. But we can work with them and be guided by them.

Select for agency. Choose partners who answer promptly when it’s material and who don’t confuse process with progress.

The trap is not only regulation. It’s the learned helplessness it breeds. If we let friction set our standards, we become the friction. We won’t legislate our way to a US-style environment anytime soon. But we don’t need permission to be better operators inside a bad one.

That’s the contrast and it’s the part we control.


Postscript: Comparing Europe to the US triggers people and I’m conscious of that. Maturity is holding two truths at once: they do some things right and some things wrong, and so do we. You don’t win by talking others down or praying for their failure. I’d rather see both Europe and the US succeed than celebrate Europe failing slightly less.

And no, saying I feel gratitude and happiness when I get a midnight reply doesn’t make me anti-work-life balance (I am not). It means when something is truly time-critical, fast, clear action lifts everyone. The times someone sent a document in minutes, late at night, both sides felt good about it when it mattered. Responsiveness, used with judgment, is not exploitation; it’s respect for outcomes and the relationships we form.

Building an Agent That Leverages Throwaway Code

2025-10-17 08:00:00

In August I wrote about my experiments with replacing MCP (Model Context Protocol) with code. In the time since, I have used that idea to explore non-coding agents at Earendil. And I’m not alone! In the meantime, multiple people have explored this space and I felt it was worth sharing some updated findings. The general idea is pretty simple: agents are very good at writing code, so why don’t we let them write throw-away code to solve problems that are not related to code at all?

I want to show you how and what I’m doing to give you some ideas of what works and why this is much simpler than you might think.

Pyodide is the Dark Horse

The first thing you have to realize is that Pyodide is secretly becoming a pretty big deal for a lot of agentic interactions. What is Pyodide? Pyodide is an open source project that makes a standard Python interpreter available via a WebAssembly runtime. What is neat about it is that it has an installer called micropip that allows it to install dependencies from PyPI. It also targets the emscripten runtime environment, which means there is a pretty good standard Unix setup around the interpreter that you can interact with.

Getting Pyodide to run is shockingly simple if you have a Node environment. You can directly install it from npm. What makes this so cool is that you can also interact with the virtual file system, which allows you to create a persistent runtime environment that interacts with the outside world. You can also get hosted Pyodide at this point from a whole bunch of startups, but you can actually get this running on your own machine and infrastructure very easily if you want to.
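
A rough sketch of that first contact (details such as versions and the exact async variants aside):

// Rough sketch: run Python in Node via the pyodide npm package, touch the
// emscripten virtual file system, and install a package with micropip.
import { loadPyodide } from "pyodide";

const pyodide = await loadPyodide();

// the virtual file system is exposed on pyodide.FS
pyodide.FS.mkdirTree("/data");
pyodide.FS.writeFile("/data/input.txt", "hello from the host");

// micropip can pull pure-Python wheels from PyPI
await pyodide.loadPackage("micropip");
const micropip = pyodide.pyimport("micropip");
await micropip.install("cowsay"); // example package

const result = await pyodide.runPythonAsync(`
open("/data/input.txt").read().upper()
`);
console.log(result); // "HELLO FROM THE HOST"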

The way I found this to work best is if you banish Pyodide into a web worker. This allows you to interrupt it in case it runs into time limits.

A big reason why Pyodide is such a powerful runtime is that Python has an amazing ecosystem of well-established libraries that the models know about. From manipulating PDFs or Word documents to creating images, it’s all there.

File Systems Are King

Another vital ingredient to a code interpreter is having a file system.

Not just any file system though. I like to set up a virtual file system that I intercept so that I can provide it with access to remote resources from specific file system locations. For instance, you can have a folder on the file system that exposes files which are just resources that come from your own backend API. If the agent then chooses to read from those files, you can from outside the sandbox make a safe HTTP request to bring that resource into play. The sandbox itself does not have network access, so it’s only the file system that gates access to resources.

The reason the file system is so good is that agents just know so much about how they work, and you can provide safe access to resources through some external system outside of the sandbox. You can provide read-only access to some resources and write access to others, then access the created artifacts from the outside again.

Now actually doing that is a tad tricky because the emscripten file system is sync, and most of the interesting things you can do are async. The option that I ended up going with is to move the fetch-like async logic into another web worker and use Atomics.wait to block. If your entire Pyodide runtime is in a web worker, that’s not as bad as it looks.

That said, I wish the emscripten file system API was changed to support stack switching instead of this. While it’s now possible to hide async promises behind sync abstractions within Pyodide with call_sync, the same approach does not work for the emscripten JavaScript FS API.

I have a full example of this at the end, but the simplified pseudocode that I ended up with looks like this:

// main thread: wrap a worker so fetch() looks synchronous
fetch(url) {
  const signalBuffer = new SharedArrayBuffer(4);
  const signal = new Int32Array(signalBuffer);
  const { port1, port2 } = new MessageChannel();
  this.worker.postMessage({url, signalBuffer, port: port2}, [port2]);

  Atomics.wait(signal, 0, 0);                   // park until worker flips the signal
  const message = receiveMessageOnPort(port1);  // MessageChannel gives the payload
  port1.close();

  if (message.message.status !== "ok") {
    throw new Error(message.message.error.message);
  }
  return message.message.data;
}

// worker thread: perform async fetch, then wake the main thread
parentPort.on("message", async ({ url, signalBuffer, port }) => {
  const signal = new Int32Array(signalBuffer);
  try {
    const bytes = await fetch(url).then(r => {
      if (!r.ok) throw new Error(`HTTP ${r.status}`);
      return r.arrayBuffer();
    });
    port.postMessage({ status: "ok", data: new Uint8Array(bytes) });
    Atomics.store(signal, 0, 1);          // mark success
  } catch (error) {
    port.postMessage({ status: "error", error: serialize(error) });
    Atomics.store(signal, 0, -1);         // mark failure
  } finally {
    Atomics.notify(signal, 0);            // unblock the waiting main thread
    port.close();
  }
});

Durable Execution

Lastly now that you have agents running, you really need durable execution. I would describe durable execution as the idea of being able to retry a complex workflow safely without losing progress. The reason for this is that agents can take a very long time, and if they interrupt, you want to bring them back to the state they were in. This has become a pretty hot topic. There are a lot of startups in that space and you can buy yourself a tool off the shelf if you want to.

What is a little bit disappointing is that there is no truly simple durable execution system. By that I mean something that just runs on top of Postgres and/or Redis in the same way as, for instance, there is pgmq. (Update: we have since built our own solution to this problem called Absurd)

The easiest way to shoehorn this yourself is to use queues to restart your tasks and to cache away the temporary steps from your execution. Basically, you compose your task from multiple steps and each of the steps just has a very simple cache key. It’s really just that simple:

function myAgenticLoop(taskID, initialState) {
  let stepCount = 0;
  let state = initialState;
  while (stepCount < MAX_STEPS) {
    let cacheKey = `${taskID}:${stepCount}`;
    let cachedState = loadStateFromCache(cacheKey);
    if (cachedState !== null) {
      state = cachedState.state;
    } else {
      state = runAgenticStep(state);
      storeStateInCache(cacheKey, state);
    }
    stepCount++;
    if (reachedEndCondition(state)) {
      break;
    }
  }
  return state;
}

You can improve on this greatly, but this is the general idea. The state is basically the conversation log and whatever else you need to keep around for the tool execution (e.g., whatever was thrown on the file system).

What Other Than Code?

What tools does an agent need that are not code? Well, the code needs to be able to do something interesting so you need to give it access to something. The most interesting access you can provide is via the file system, as mentioned. But there are also other tools you might want to expose. What Cloudflare proposed is connecting to MCP servers and exposing their tools to the code interpreter. I think this is a quite interesting approach and to some degree it’s probably where you want to go.

Some tools that I find interesting:

  • Describe: a tool that just lets the agent run more inference, mostly with files that the code interpreter generated. For instance, if you have a zip file it’s quite fun to see the code interpreter use Python to unpack it. But if that unpacked file then turns out to be a JPG, you will need to go back to inference to understand it.

  • Help: a tool that just … brings up help. Again, can be with inference for basic RAG, or similar. I found it quite interesting to let the AI ask it for help. For example, you want the manual tool to allow a query like “Which Python code should I write to create a chart for the given XLSX file?” On the other hand, you can also just stash away some instructions in .md files on the virtual file system and have the code interpreter read it. It’s all an option.

Putting it Together

If you want to see what this roughly looks like, I vibe-coded a simple version of this together. It uses a made-up example but it does show how a sandbox with very little tool availability can create surprising results: mitsuhiko/mini-agent.

When you run it, it looks up the current IP from a special network drive that triggers an async fetch, and then it (usually) uses pillow or matplotlib to make an image of that IP address. Pretty pointless, but a lot of fun!

What an agent run looks like
Prompt: Figure out the current ip address and make me a picture of it.

Step 1:
Stop reason: tool_use
Tool call: execute_python {
  code: '\n' +
    '# Read the current IP address\n' +
    "with open('/network/current-ip', 'r') as f:\n" +
    '    ip_address = f.read().strip()\n' +
    '    \n' +
    'print(f"Current IP Address: {ip_address}")\n'
}
Tool result: {
  stdout: 'Current IP Address: 255.255.255.255',
  stderr: '',
  success: true
}
[Cache] Saved state to: agent-cache/task-0-step-1.json

Step 2:
Stop reason: tool_use
Tool call: execute_python {
  code: '\n' +
    'import matplotlib.pyplot as plt\n' +
    'import matplotlib.patches as patches\n' +
    'from matplotlib.patches import FancyBboxPatch\n' +
    'import numpy as np\n' +
    '\n' +
    '# Create figure and axis\n' +
    'fig, ax = plt.subplots(1, 1, figsize=(12, 8))\n' +
    'ax.set_xlim(0, 10)\n' +
    'ax.set_ylim(0, 10)\n' +
    "ax.axis('off')\n" +
    '\n' +
    '# Add background gradient\n' +
    'gradient = np.linspace(0, 1, 256).reshape(1, -1)\n' +
    'gradient = np.vstack([gradient] * 256)\n' +
    "ax.imshow(gradient, extent=[0, 10, 0, 10], aspect='auto', cmap='cool', alpha=0.3)\n" +
    '\n' +
    '# Title\n' +
    "ax.text(5, 9, 'Current IP Address', fontsize=28, weight='bold', \n" +
    "        ha='center', va='top', color='#1a1a1a')\n" +
    '\n' +
    '# IP Address Box\n' +
    'fancy_box = FancyBboxPatch((0.5, 4.5), 9, 2.5, \n' +
    '                           boxstyle="round,pad=0.2", \n' +
    "                           edgecolor='#2196F3', facecolor='#E3F2FD',\n" +
    '                           linewidth=3)\n' +
    'ax.add_patch(fancy_box)\n' +
    '\n' +
    '# IP Address Text\n' +
    "ax.text(5, 5.75, ip_address, fontsize=48, weight='bold', \n" +
    "        ha='center', va='center', color='#1976D2', family='monospace')\n" +
    '\n' +
    '# Decorative elements - network nodes\n' +
    'circle_positions = [(2, 2), (8, 2), (2, 8), (8, 8), (5, 1), (5, 9)]\n' +
    'for pos in circle_positions:\n' +
    "    circle = patches.Circle(pos, 0.3, color='#FF5722', alpha=0.7)\n" +
    '    ax.add_patch(circle)\n' +
    '\n' +
    '# Connection lines\n' +
    'for i in range(len(circle_positions)):\n' +
    '    for j in range(i+1, len(circle_positions)):\n' +
    '        x = [circle_positions[i][0], circle_positions[j][0]]\n' +
    '        y = [circle_positions[i][1], circle_positions[j][1]]\n' +
    "        ax.plot(x, y, color='#2196F3', alpha=0.2, linewidth=1)\n" +
    '\n' +
    '# Info text\n' +
    "ax.text(5, 3.8, '🌐 Your Network Location 🌐', fontsize=14, \n" +
    "        ha='center', va='top', color='#555', style='italic')\n" +
    '\n' +
    'plt.tight_layout()\n' +
    "plt.savefig('/output/ip_address.png', dpi=150, bbox_inches='tight', facecolor='white')\n" +
    'print("✅ Picture saved to /output/ip_address.png")\n' +
    'plt.close()\n' +
    '\n' +
    'print(f"\\n📍 Your IP Address: {ip_address}")\n'
}
Tool result: {
  stdout: 'Loading Pillow, contourpy, cycler, fonttools, kiwisolver, matplotlib, numpy, packaging, pyparsing, python-dateutil, pytz, sixLoaded Pillow, contourpy, cycler, fonttools, kiwisolver, matplotlib, numpy, packaging, pyparsing, python-dateutil, pytz, six',
  stderr: 'Traceback (most recent call last):\n' +
    '  File "/lib/python313.zip/_pyodide/_base.py", line 597, in eval_code_async\n' +
    '    await CodeRunner(\n' +
    '    ...<9 lines>...\n' +
    '    .run_async(globals, locals)\n' +
    '  File "/lib/python313.zip/_pyodide/_base.py", line 411, in run_async\n' +
    '    coroutine = eval(self.code, globals, locals)\n' +
    '  File "<exec>", line 8, in <module>\n' +
    '  File "/lib/python3.13/site-packages/matplotlib/pyplot.py", line 1613, in subplots\n' +
    '    fig = figure(**fig_kw)\n' +
    '  File "/lib/python3.13/site-packages/matplotlib/pyplot.py", line 934, in figure\n' +
    '    manager = new_figure_manager(\n' +
    '        num, figsize=figsize, dpi=dpi,\n' +
    '        facecolor=facecolor, edgecolor=edgecolor, frameon=frameon,\n' +
    '        FigureClass=FigureClass, **kwargs)\n' +
    '  File "/lib/python3.13/site-packages/matplotlib/pyplot.py", line 464, in new_figure_manager\n' +
    '    _warn_if_gui_out_of_main_thread()\n' +
    '    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^\n' +
    '  File "/lib/python3.13/site-packages/matplotlib/pyplot.py", line 441, in _warn_if_gui_out_of_main_thread\n' +
    '    canvas_class = cast(type[FigureCanvasBase], _get_backend_mod().FigureCanvas)\n' +
    '                                                ~~~~~~~~~~~~~~~~^^\n' +
    '  File "/lib/python3.13/site-packages/matplotlib/pyplot.py", line 280, in _get_backend_mod\n' +
    '    switch_backend(rcParams._get("backend"))  # type: ignore[attr-defined]\n' +
    '    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^\n' +
    '  File "/lib/python3.13/site-packages/matplotlib/pyplot.py", line 342, in switch_backend\n' +
    '    module = importlib.import_module(cbook._backend_module_name(newbackend))\n' +
    '  File "/lib/python313.zip/importlib/__init__.py", line 88, in import_module\n' +
    '    return _bootstrap._gcd_import(name[level:], package, level)\n' +
    '           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n' +
    '  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import\n' +
    '  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load\n' +
    '  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked\n' +
    '  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked\n' +
    '  File "<frozen importlib._bootstrap_external>", line 1026, in exec_module\n' +
    '  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed\n' +
    '  File "/lib/python3.13/site-packages/matplotlib/backends/backend_webagg.py", line 23, in <module>\n' +
    '    from js import document\n' +
    "ImportError: cannot import name 'document' from 'js' (unknown location)\n",
  success: false
}
[Cache] Saved state to: agent-cache/task-0-step-2.json

Step 3:
Stop reason: tool_use
Tool call: execute_python {
  code: '\n' +
    'from PIL import Image, ImageDraw, ImageFont\n' +
    'import os\n' +
    '\n' +
    '# Read the current IP address\n' +
    "with open('/network/current-ip', 'r') as f:\n" +
    '    ip_address = f.read().strip()\n' +
    '\n' +
    '# Create a new image with a nice gradient-like background\n' +
    'width, height = 1200, 800\n' +
    "image = Image.new('RGB', (width, height), color='white')\n" +
    'draw = ImageDraw.Draw(image)\n' +
    '\n' +
    '# Draw background with gradient effect using rectangles\n' +
    'for i in range(height):\n' +
    '    ratio = i / height\n' +
    '    r = int(30 + (100 * ratio))\n' +
    '    g = int(100 + (150 * ratio))\n' +
    '    b = int(200 + (55 * ratio))\n' +
    '    draw.rectangle([(0, i), (width, i+1)], fill=(r, g, b))\n' +
    '\n' +
    '# Draw main box\n' +
    'box_margin = 100\n' +
    'box_top = 250\n' +
    'box_height = 300\n' +
    'draw.rectangle(\n' +
    '    [(box_margin, box_top), (width - box_margin, box_top + box_height)],\n' +
    '    fill=(240, 248, 255),\n' +
    '    outline=(33, 150, 243),\n' +
    '    width=5\n' +
    ')\n' +
    '\n' +
    '# Draw rounded corners effect\n' +
    'for i in range(20):\n' +
    '    draw.arc([(box_margin, box_top), (box_margin + 40, box_top + 40)], \n' +
    '             0, 90, fill=(33, 150, 243), width=5)\n' +
    '\n' +
    '# Add decorative network nodes\n' +
    'node_radius = 15\n' +
    'nodes = [(150, 150), (1050, 150), (150, 650), (1050, 650), (600, 100), (600, 700)]\n' +
    'for node in nodes:\n' +
    '    draw.ellipse(\n' +
    '        [(node[0] - node_radius, node[1] - node_radius),\n' +
    '         (node[0] + node_radius, node[1] + node_radius)],\n' +
    '        fill=(255, 87, 34),\n' +
    '        outline=(200, 50, 0),\n' +
    '        width=2\n' +
    '    )\n' +
    '\n' +
    '# Draw connection lines\n' +
    'for i, node1 in enumerate(nodes):\n' +
    '    for node2 in nodes[i+1:]:\n' +
    '        draw.line([node1, node2], fill=(33, 150, 243, 50), width=1)\n' +
    '\n' +
    '# Try to use a large default font\n' +
    'try:\n' +
    '    title_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 60)\n' +
    '    ip_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuMono-Bold.ttf", 90)\n' +
    '    subtitle_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 30)\n' +
    'except:\n' +
    '    title_font = ImageFont.load_default()\n' +
    '    ip_font = ImageFont.load_default()\n' +
    '    subtitle_font = ImageFont.load_default()\n' +
    '\n' +
    '# Draw title\n' +
    'draw.text((width // 2, 80), "Current IP Address", font=title_font, \n' +
    '          fill=(255, 255, 255), anchor="mm")\n' +
    '\n' +
    '# Draw IP address\n' +
    'draw.text((width // 2, box_top + box_height // 2), ip_address, font=ip_font, \n' +
    '          fill=(25, 118, 210), anchor="mm")\n' +
    '\n' +
    '# Draw subtitle\n' +
    'draw.text((width // 2, box_top + box_height + 50), "🌐 Your Network Location 🌐", \n' +
    '          font=subtitle_font, fill=(100, 100, 100), anchor="mm")\n' +
    '\n' +
    '# Save the image\n' +
    "output_path = '/output/ip_address.png'\n" +
    'image.save(output_path)\n' +
    'print(f"✅ Picture saved to {output_path}")\n' +
    'print(f"\\n📍 Your IP Address: {ip_address}")\n'
}
Tool result: {
  stdout: 'Pillow already loaded from default channelNo new packages to load✅ Picture saved to /output/ip_address.png📍 Your IP Address: 255.255.255.255',
  stderr: '',
  success: true
}
[Cache] Saved state to: agent-cache/task-0-step-3.json

Step 4:
Stop reason: end_turn

Final result: Perfect! 🎉 I've successfully:

1. **Found your IP address**: **255.255.255.255**
2. **Created a picture** showing your IP address with:
   - A nice gradient blue background
   - Your IP address prominently displayed in the center
   - Decorative network nodes and connection lines
   - Professional styling and formatting

The image has been saved to `/output/ip_address.png` and is ready for you to download!
[Cache] Saved state to: agent-cache/task-0-step-4.json

Total steps: 4

Making 1 file(s) available in ./output:
  ✓ ip_address.png

The same approach has also been leveraged by Anthropic and Cloudflare. There is some further reading that might give you more ideas:

  • Claude Skills is fully leveraging code generation for working with documents or other interesting things. Comes with a (non Open Source) repository of example skills that the LLM and code executor can use: anthropics/skills
  • Cloudflare’s Code Mode which is the idea of creating TypeScript bindings for MCP tools and having the agent write code to use them in a sandbox.

90%

2025-09-29 08:00:00

“I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code”

Dario Amodei

Three months ago I said that AI changes everything. I came to that after plenty of skepticism. There are still good reasons to doubt that AI will write all code, but my current reality is close.

For the infrastructure component I started at my new company, I’m probably north of 90% AI-written code. I don’t want to convince you — just share what I learned. That’s in part because I approached this project differently from my first experiments with AI-assisted coding.

The service is written in Go with few dependencies and an OpenAPI-compatible REST API. At its core, it sends and receives emails. I also generated SDKs for Python and TypeScript with a custom SDK generator. In total: about 40,000 lines, including Go, YAML, Pulumi, and some custom SDK glue.

I set a high bar, especially around being able to operate it reliably. I’ve run similar systems before and knew what I wanted.

Setting it in Context

Some startups are already near 100% AI-generated. I know, because many build in the open and you can see their code. Whether that works long-term remains to be seen. I still treat every line as my responsibility, judged as if I wrote it myself. AI doesn’t change that.

There are no weird files that shouldn’t belong there, no duplicate implementations, and no emojis all over the place. The comments still follow the style I want and, crucially, often aren’t there. I pay close attention to the fundamentals of system architecture, code layout, and database interaction. I’m incredibly opinionated. As a result, there are certain things I don’t let the AI do. I know it won’t reach the point where I could sign off on a commit. That’s why it’s not 100%.

As contrast: another quick prototype we built is a mess of unclear database tables, markdown file clutter in the repo, and boatloads of unwanted emojis. It served its purpose — validate an idea — but wasn’t built to last, and we had no expectation to that end.

Foundation Building

I began in the traditional way: system design, schema, architecture. At this stage I don’t let the AI write code, but I loop it in as a kind of rubber duck. The back-and-forth helps me see mistakes, even if I don’t need or trust the answers.

I did get the foundation wrong once. I initially argued myself into a more complex setup than I wanted. That’s a part I later used the LLM to redo early on and clean up.

For AI-generated or AI-supported code, I now end up with a stack that looks like something I often wanted but found too hard to do by hand:

  • Raw SQL: This is probably the biggest change to how I used to write code. I really like using an ORM, but I don’t like some of its effects. In particular, once you approach the ORM’s limits, you’re forced to switch to handwritten SQL. That mapping is often tedious because you lose some of the powers the ORM gives you. Another consequence is that it’s very hard to find the underlying queries, which makes debugging harder. Seeing the actual SQL in your code and in the database log is powerful. You always lose that with an ORM.

    The fact that I no longer have to write SQL because the AI does it for me is a game changer.

    I also use raw SQL for migrations now. A minimal sketch of what this style looks like follows after this list.

  • OpenAPI first: I tried various approaches here. There are many frameworks you can use. I ended up first generating the OpenAPI specification and then using code generation from there to the interface layer. This approach works better with AI-generated code. The OpenAPI specification is now the canonical source that both the clients and the server shim are based on.
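
To make the raw-SQL point concrete, here is a minimal sketch of what that style can look like in Go with the standard database/sql package. The table, columns, and function are made up for illustration (the post doesn’t show the service’s actual schema); the point is that the exact query sits in the code and shows up verbatim in the database log.

```go
// Minimal sketch: raw SQL with database/sql instead of an ORM.
// Table and column names are hypothetical; the query in the code
// is exactly what the database sees.
package store

import (
	"context"
	"database/sql"
	"time"
)

type Message struct {
	ID        int64
	Recipient string
	Status    string
	CreatedAt time.Time
}

func RecentMessages(ctx context.Context, db *sql.DB, recipient string, limit int) ([]Message, error) {
	const q = `
		SELECT id, recipient, status, created_at
		FROM messages
		WHERE recipient = $1
		ORDER BY created_at DESC
		LIMIT $2`

	rows, err := db.QueryContext(ctx, q, recipient, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []Message
	for rows.Next() {
		var m Message
		if err := rows.Scan(&m.ID, &m.Recipient, &m.Status, &m.CreatedAt); err != nil {
			return nil, err
		}
		out = append(out, m)
	}
	return out, rows.Err()
}
```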

Iteration

Today I use Claude Code and Codex. Each has strengths, but the constant is Codex for code review after PRs. It’s very good at that. Claude is still indispensable when debugging and when a lot of tool access is needed (e.g. why do I have a deadlock, why is there corrupted data in the database, etc.). The two working together is where it’s most magical. Claude might find the data, Codex might understand it better.

I cannot stress enough how bad the code from these agents can be if you’re not careful. While they understand system architecture and how to build something, they can’t keep the whole picture in scope. They will recreate things that already exist. They create abstractions that are completely inappropriate for the scale of the problem.

You constantly need to learn how to bring the right information to the context. For me, this means pointing the AI to existing implementations and giving it very specific instructions on how to follow along.

I generally create PR-sized chunks that I can review. There are two paths to this:

  1. Agent loop with finishing touches: Prompt until the result is close, then clean up.

  2. Lockstep loop: Earlier I went edit by edit. Now I lean on the first method most of the time, keeping a todo list for cleanups before merge.

It requires intuition to know when each approach is more likely to lead to the right results. Familiarity with the agent also helps you understand when a task will not go anywhere, avoiding wasted cycles.

Where It Fails

The most important piece of working with an agent is the same as regular software engineering: you need to understand your state machines, how the system behaves at any point in time, and your database.

It is easy to create systems that appear to behave correctly but have unclear runtime behavior when relying on agents. For instance, the AI doesn’t fully comprehend threading or goroutines. If you don’t keep the bad decisions at bay early on, you won’t be able to operate the system in a stable manner later.

Here’s an example: I asked it to build a rate limiter. It “worked” but lacked jitter and made poor storage decisions. Easy to fix if you know rate limiters, dangerous if you don’t.
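
To illustrate what was missing, here is a small sketch of a fixed-window limiter in Go that adds jitter to the retry hint and makes the storage choice explicit (a plain in-memory map). This is not the service’s actual limiter, just the shape of the fix.

```go
// Sketch of a fixed-window rate limiter with jitter on the retry hint.
// Illustrative only: the storage here is a deliberate, explicit choice
// (an in-memory map), and blocked callers get a jittered wait so they
// don't all retry at the same instant.
package ratelimit

import (
	"math/rand"
	"sync"
	"time"
)

type Limiter struct {
	mu     sync.Mutex
	counts map[string]int
	window time.Time
	limit  int
	span   time.Duration
}

func New(limit int, span time.Duration) *Limiter {
	return &Limiter{counts: map[string]int{}, limit: limit, span: span, window: time.Now()}
}

// Allow reports whether the key may proceed and, if not, how long to wait.
// Up to 25% of the window is added as jitter to the retry hint.
func (l *Limiter) Allow(key string) (bool, time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := time.Now()
	if now.Sub(l.window) >= l.span {
		// Start a fresh window and reset all counters.
		l.counts = map[string]int{}
		l.window = now
	}
	if l.counts[key] < l.limit {
		l.counts[key]++
		return true, 0
	}
	wait := l.span - now.Sub(l.window)
	jitter := time.Duration(0)
	if j := int64(l.span / 4); j > 0 {
		jitter = time.Duration(rand.Int63n(j))
	}
	return false, wait + jitter
}
```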

Agents also operate on conventional wisdom from the internet and in turn do things I would never do myself. They love to use dependencies (particularly outdated ones). They love to swallow errors and take away all tracebacks. I’d rather uphold strong invariants and let code crash loudly when they fail than hide problems. If you don’t fight this, you end up with opaque, unobservable systems.

Where It Shines

For me, this has reached the point where I can’t imagine working any other way. Yes, I could probably have done it without AI. But I would have built a different system in parts because I would have made different trade-offs. This way of working unlocks paths I’d normally skip or defer.

Here are some of the things I enjoyed a lot on this project:

  • Research + code, instead of research and code later: Some things that would have taken me a day or two to figure out now take 10 to 15 minutes.
    It allows me to directly play with one or two implementations of a problem. It moves me from abstract contemplation to hands-on evaluation.

  • Trying out things: I tried three different OpenAPI implementations and approaches in a day.

  • Constant refactoring: The code looks more organized than it would otherwise have been because the cost of refactoring is quite low. You need to know what you’re doing, but if set up well, refactoring becomes easy.

  • Infrastructure: Claude got me through AWS and Pulumi. Work I generally dislike became a few days instead of weeks. It also debugged the setup issues as it was going through them. I barely had to read the docs.

  • Adopting new patterns: While they suck at writing tests, they turned out great at setting up test infrastructure I didn’t know I needed. I got a recommendation on Twitter to use testcontainers for testing against Postgres. The approach runs migrations once and then creates database clones per test, which turns out to be super useful. It would have been quite an involved project to migrate to; Claude did it in an hour for all tests. A rough sketch of the pattern follows after this list.

  • SQL quality: It writes solid SQL that I could never remember how to write myself. I just need to review it, which I can do. To this day I suck at remembering MERGE and WITH when writing SQL by hand.
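
The post doesn’t show the test setup itself, but the pattern it describes usually rests on Postgres template databases: run migrations once into a template, then create a cheap copy per test. A rough sketch under that assumption, with the testcontainers bootstrapping and the migration hook left as placeholders:

```go
// Rough sketch of "migrate once, clone per test" with Postgres template
// databases. The admin connection and runMigrations hook are placeholders;
// in the real setup a testcontainers-managed Postgres would provide them.
package testdb

import (
	"database/sql"
	"fmt"
)

// PrepareTemplate creates a template database and runs migrations into it once.
func PrepareTemplate(admin *sql.DB, runMigrations func(dsn string) error, templateDSN string) error {
	if _, err := admin.Exec(`CREATE DATABASE app_template`); err != nil {
		return err
	}
	return runMigrations(templateDSN) // hypothetical migration hook
}

// CloneForTest creates a fresh database from the template; Postgres copies the
// already-migrated schema, so each test starts from a clean, known state.
// The name is assumed to be a simple identifier.
func CloneForTest(admin *sql.DB, name string) error {
	_, err := admin.Exec(fmt.Sprintf(`CREATE DATABASE %q TEMPLATE app_template`, name))
	return err
}
```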

What does it mean?

Is 90% of code going to be written by AI? I don’t know. What I do know is that, for me, on this project, the answer is already yes. I’m part of that growing subset of developers who are building real systems this way.

At the same time, for me, AI doesn’t own the code. I still review every line, shape the architecture, and carry the responsibility for how it runs in production. But the sheer volume of what I now let an agent generate would have been unthinkable even six months ago.

That’s why I’m convinced this isn’t some far-off prediction. It’s already here — just unevenly distributed — and the number of developers working like this is only going to grow.

That said, none of this removes the need to actually be a good engineer. If you let the AI take over without judgment, you’ll end up with brittle systems and painful surprises (data loss, security holes, unscalable software). The tools are powerful, but they don’t absolve you of responsibility.