2025-07-30 08:00:00
Using Claude Code and other agentic coding tools has become all the rage. Not only are these tools getting millions of downloads, they are also gaining features that help streamline workflows. As you know, I got very excited about agentic coding back in May, and I’ve spent considerable time since then trying out the many new features that have been added.
But oddly enough, very little of what I attempted actually stuck. Most of my experiments didn’t last, and I thought it might be interesting to share what didn’t work. This doesn’t mean these approaches can’t work or are bad ideas; it just means I didn’t manage to make them work for me. Maybe there’s something in these failures for others to learn from.
The best way to think about the approach that I use is:
Non-working automations turn out to be quite common. Either I can’t get myself to use them, I forget about them, or I end up fine-tuning them endlessly. For me, deleting a failed workflow helper is crucial. You don’t want unused Claude commands cluttering your workspace and confusing others.
So I end up doing the simplest thing possible most of the time: just talk to the machine more, give it more context, keep the audio input going, and dump my train of thought into the prompt. That is 95% of my workflow. The rest is mostly good use of copy/paste.
Slash commands allow you to preload prompts to have them readily available in a session. I expected these to be more useful than they ended up being. I do use them, but many of the ones that I added I ended up never using.
There are some limitations with slash commands that make them less useful than they could be. One limitation is that there’s only one way to pass arguments, and it’s unstructured. This proves suboptimal in practice for my uses. Another issue I keep running into with Claude Code is that if you do use a slash command, the argument to the slash command for some reason does not support file-based autocomplete.
To make them work better, I often ask Claude to use the current Git state to determine which files to operate on. For instance, I have a command in this blog that fixes grammar mistakes. It operates almost entirely from the current git status context because providing filenames explicitly is tedious without autocomplete.
Here is one of the few slash commands I actually do use:
## Context
- git status: !`git status`
- Explicitly mentioned file to fix: "$ARGUMENTS"
## Your task
Based on the above information, I want you to edit the mentioned file or files
for grammar mistakes. Make a backup (eg: change file.md to file.md.bak) so I
can diff it later. If the backup file already exists, delete it.
If a blog post was explicitly provided, edit that; otherwise, edit the ones
that have pending changes or are untracked.
My workflow now assumes that Claude can determine which files I mean from the Git status virtually every time, making explicit arguments largely unnecessary.
Here are some of the many slash commands that I built at one point but ended up not using:
/fix-bug
: I had a command that instructed Claude to fix bugs by pulling issues from GitHub and adding extra context. But I saw no meaningful improvement over simply mentioning the GitHub issue URL and voicing my thoughts about how to fix it.

/commit
: I tried getting Claude to write good commit messages, but they never matched my style. I stopped using this command, though I haven’t given up on the idea entirely.

/add-tests
: I really hoped this would work. My idea was to have Claude skip tests during development, then use an elaborate reusable prompt to generate them properly at the end. But this approach wasn’t consistently better than automatic test generation, which I’m still not satisfied with overall.

/fix-nits
: I had a command to fix linting issues and run formatters. I stopped using it because it never became muscle memory, and Claude already knows how to do this. I can just tell it “fix lint” in the CLAUDE.md file without needing a slash command.

/next-todo
: I track small items in a to-do.md file and had a command to pull the next item and work on it. Even here, workflow automation didn’t help much. I use this command far less than expected.

So if I’m using fewer slash commands, what am I doing instead?
Copy/paste is really, really useful because of how fuzzy LLMs are. For instance, I maintain link collections that I paste in when needed. Sometimes I fetch files proactively, drop them into a git-ignored folder, and mention them. It’s simple, easy, and effective. You still need to be somewhat selective to avoid polluting your context too much, but compared to having it spelunk in the wrong places, more text doesn’t harm as much.
I tried hard to make hooks work, but I haven’t seen any efficiency gains from them yet. I think part of the problem is that I use yolo mode. I wish hooks could actually manipulate what gets executed. The only way to guide Claude today is through denies, which don’t work in yolo mode. For instance, I tried using hooks to make it use uv instead of regular Python, but I was unable to do so. Instead, I ended up preloading executables on the PATH that override the default ones, steering Claude toward the right tools.
For instance, this is really my hack for making it use `uv run python` instead of `python` more reliably:
#!/bin/sh
echo "This project uses uv, please use 'uv run python' instead."
exit 1
I really just have a bunch of these in `.claude/interceptors` and preload that folder onto `PATH` before launching Claude:
CLAUDE_BASH_MAINTAIN_PROJECT_WORKING_DIR=1 \
PATH="`pwd`/.claude/interceptors:${PATH}" \
claude --dangerously-skip-permissions
I also found it hard to hook into the right moment. I wish I could run formatters at the end of a long edit session. Currently, you must run formatters after each Edit tool operation, which often forces Claude to re-read files, wasting context. Even with the Edit tool hook, I’m not sure if I’m going to keep using it.
I’m actually really curious whether people manage to get good use out of hooks. I’ve seen some discussions on Twitter that suggest there are some really good ways of making them work, but I just went with much simpler solutions instead.
I was initially very bullish on Claude’s print mode. I tried hard to have Claude generate scripts that used print mode internally. For instance, I had it create a mock data loading script — mostly deterministic code with a small inference component to generate test data using Claude Code.
The challenge is achieving reliability, which hasn’t worked well for me yet. Print mode is slow and difficult to debug. So I use it far less than I’d like, despite loving the concept of mostly deterministic scripts with small inference components. Whether using the Claude SDK or the command-line print flag, I haven’t achieved the results I hoped for.
I’m drawn to Print Mode because inference is too much like a slot machine. Many programming tasks are actually quite rigid and deterministic. We love linters and formatters because they’re unambiguous. Anything we can fully automate, we should. Using an LLM for tasks that don’t require inference is the wrong approach in my book.
That’s what makes print mode appealing. If only it worked better. Use an LLM for the commit message, but regular scripts for the commit and gh pr commands. Make mock data loading 90% deterministic with only 10% inference.
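As a minimal sketch of that commit-message idea (assuming the `claude` CLI and its `-p` print flag are on PATH; the script itself is hypothetical, not something I actually ship), the only inference step is writing the message, and plain git does the rest:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: inference only for the commit message,
plain git for everything else."""
import subprocess


def run(*args: str) -> str:
    # Run a command and return its stdout as text.
    return subprocess.run(list(args), check=True, capture_output=True, text=True).stdout


def main() -> None:
    diff = run("git", "diff", "--staged")
    if not diff.strip():
        raise SystemExit("nothing staged")
    # The only inference step: ask Claude (print mode) for a message.
    message = run(
        "claude", "-p",
        "Write a single concise commit message line for this diff:\n\n" + diff,
    ).strip()
    # Everything else stays deterministic: regular git performs the commit.
    subprocess.run(["git", "commit", "-m", message], check=True)


if __name__ == "__main__":
    main()
```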
I still use it, but I see more potential than I am currently leveraging.
I use the task tool frequently for basic parallelization and context isolation. Anthropic recently launched an agents feature meant to streamline this process, but I haven’t found it easier to use.
Sub-tasks and sub-agents enable parallelism, but you must be careful. Tasks that don’t parallelize well — especially those mixing reads and writes — create chaos. Outside of investigative tasks, I don’t get good results. While sub-agents should preserve context better, I often get better results by starting new sessions, writing thoughts to Markdown files, or even switching to o3 in the chat interface.
What’s interesting about workflow automation is that without rigorous rules that you consistently follow as a developer, simply taking time to talk to the machine and give clear instructions outperforms elaborate pre-written prompts.
For instance, I don’t use emojis or commit prefixes. I don’t enforce templates for pull requests either. As a result, there’s less structure for me to teach the machine.
I also lack the time and motivation to thoroughly evaluate all my created workflows. This prevents me from gaining confidence in their value.
Context engineering and management remain major challenges. Despite my efforts to help agents pull the right data from various files and commands, they don’t yet succeed reliably. They pull in too much or too little. Long sessions lead to forgotten context from the beginning. Whether done manually or with slash commands, the results feel too random. It’s hard enough with ad-hoc approaches, but static prompts and commands make it even harder.
The rule I have now is that if I do want to automate something, I must have done it a few times already, and then I evaluate whether the agent actually gets better results through my automation. There’s no exact science to it, but right now I mostly measure it by letting the agent do the same task three times and looking at the variance manually, the yardstick being a simple question: would I accept the result?
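In practice that can be as simple as a small throwaway script. A sketch of what that might look like (assuming the `claude` CLI’s `-p` print flag; the prompt file and output paths are made up):

```python
#!/usr/bin/env python3
"""Hypothetical sketch: run the same automation prompt three times and
keep the outputs around so the variance can be reviewed by hand."""
import pathlib
import subprocess

PROMPT = pathlib.Path("prompts/fix-grammar.md").read_text()  # made-up path

for attempt in range(1, 4):
    result = subprocess.run(
        ["claude", "-p", PROMPT],
        check=True, capture_output=True, text=True,
    )
    out = pathlib.Path(f"eval/attempt-{attempt}.md")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(result.stdout)
    print(f"wrote {out}")  # manual review: would I accept this result?
```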
Forcing myself to evaluate the automation has another benefit: I’m less likely to just blindly assume it helps me.
Because there is a big hidden risk with automation through LLMs: it encourages mental disengagement. When you stop thinking like an engineer, quality drops, time gets wasted, and you stop understanding and learning. LLMs are already bad enough in that regard as it is, but whenever I lean on automation I notice it becomes even easier to disengage. I tend to overestimate the agent’s capabilities over time. There are real dragons there!
You can still review things as they land, but it becomes increasingly harder to do so later. While LLMs are reducing the cost of refactoring, the cost doesn’t drop to zero, and regressions are common.
2025-07-26 08:00:00
Last November I wrote a post about how the programming interface of threads beats the one of async/await. In May, Mark Shannon brought up the idea of virtual threads for Python on Python’s discussion board and also referred back to that article. At EuroPython we had a chat about the topic, and that reminded me that I never got around to writing part two of that article.
The first thing to consider is that async/await did actually produce one very good outcome for Python: it exposed many more people to concurrent programming. By putting a syntax element into the language itself, it brought the problem of concurrency in front of far more programmers. The unfortunate side effect is that it requires very complex internal machinery that leaks out of the runtime to the user, and it requires colored functions.
Threads, on the other hand, are in many ways a much simpler concept, but the threading APIs that have proliferated all over the place over the last couple of generations leave a lot to be desired. Without doubt, async/await in many ways improved on that.
One key property of async/await in Python is that nothing really happens until you hit an await: you’re guaranteed not to be suspended in between. Unfortunately, the recent free-threading work makes that guarantee rather pointless, because you still need to write code that is aware of other threads, so now we carry the complexity of both the async ecosystem and the threading system at all times.
This is a good moment to rethink whether we might have a better path in front of us by fully embracing threads.
Another really positive thing that came out of async in Python was that a lot of experimentation happened to improve the ergonomics of those APIs. The most important innovation has been the idea of structured concurrency. Structured concurrency is all about disallowing a task from outliving its parent. That is a really good property because a task then has an explicit relationship to its parent task, which makes the flow of information (such as context variables) much clearer than traditional threads and thread-local variables, where threads have effectively no real relationship to their parents.
Unfortunately, task groups (the implementation of structured concurrency in Python) are a rather recent addition, and their rather strict requirements on cancellation are often not sufficiently implemented in libraries. To understand why this matters, you have to understand how structured concurrency works. Basically, when you spawn a task as a child of another task, any sibling task that fails will also cause the cancellation of all the other ones. This requires robust cancellation.
And robust cancellation is really hard to do when some of those tasks involve real threads. For instance, the very popular `aiofiles` library uses a thread pool to move I/O operations onto an I/O thread, since there is no really good cross-platform way to get true async I/O behavior out of regular files. However, cancellation is not supported. That causes a problem: if you spawn multiple tasks, some of which are blocking on a read (with `aiofiles`) that would only succeed once another one of those tasks concludes, you can actually deadlock yourself in the presence of cancellations. This is not a hypothetical problem. There are, in fact, quite a few ways to end up in a situation where the presence of `aiofiles` in a task group causes the interpreter not to shut down properly. Worse, the exception that was actually caught by the task group stays invisible until the blocking read on the other thread pool has been interrupted by a signal such as a keyboard interrupt. This is a pretty disappointing developer experience.
In many ways, what we really want is to go back to the drawing board and say, “What does a world look like that only ever would have used threads with a better API?”
So if we have only threads, then we are back to some performance challenges that motivated asyncio in the first place. The solution for this will involve virtual threads. You can read all about them in the previous post.
One of the key parts of enabling virtual threads is also a commitment to handling many of the challenges with async I/O directly as part of the runtime. That means that if there is a blocking operation, we will have to ensure that the virtual thread is put back to the scheduler, and another one has a chance to run.
But that alone will feel a little bit like a regression because we also want to ensure that we do not lose structured concurrency.
Let’s start simple: what does Python code look like where we download an arbitrary number of URLs sequentially? It probably looks a bit like this:
def download_all(urls):
    results = {}
    for url in urls:
        results[url] = fetch_url(url)
    return results
No, this is intentionally not using any async or await, because this is not what we want. We want the most simple thing: blocking APIs.
The general behavior of this is pretty simple: we are going to download a bunch of URLs, but if any one of them fails, we’ll basically abort and raise an exception, and will not continue downloading any other ones. The results that we have collected up to that point are lost.
But how would we stick with this but introduce parallelism? How can we download more than one at a time? If a language were to support structured concurrency and virtual threads, we could achieve something similar with some imaginary syntax like this:
def download_all(urls):
    results = {}
    await:
        for url in urls:
            async:
                results[url] = fetch_url(url)
    return results
I’m intentionally using await and async here, but you can see from the usage that it’s actually inverted compared to what we have today. Here is what this would do: each `async:` block runs its body in a new virtual thread, and the enclosing `await:` block waits for all threads spawned inside it to finish before execution continues; if any of them fails, the others are cancelled and the failure propagates.
Behind the scenes, something like this would happen:
from functools import partial

def download_all(urls):
    results = {}
    with ThreadGroup():
        def _thread(url):
            results[url] = fetch_url(url)
        for url in urls:
            ThreadGroup.current.spawn(partial(_thread, url))
    return results
Note that all threads here are virtual threads. They behave like threads, but they might be scheduled on different kernel threads. If any one of those spawned threads fails, the thread group itself fails and also prevents further spawn calls from taking place. A spawn on a failed thread group would also no longer be permitted.
In the grand scheme of things, this is actually quite beautiful. Unfortunately, it does not match all that well to Python. This syntax would be unexpected because Python does not really have an existing concept of a hidden function declaration. Python’s scoping also prevents this from working all that well. Because Python doesn’t have syntax for variable declarations, Python actually only has a single scope for functions. This is quite unfortunate because, for instance, it means that a helper declared in a loop body cannot really close over the loop iteration variable.
Regardless, I think the important thing you should take away from this is that this type of programming does not require thinking about futures. Even though it could support futures, you can actually express a whole lot of programming code without needing to defer to an abstraction like that.
As a result, there are far fewer concepts to keep in mind when working with a system like this. I do not have to expose a programmer to futures or promises, async tasks, or anything like that.
Now, I don’t think that such a particular syntax would fit well into Python. And it is somewhat debatable if automatic thread groups are the right solution. You could also model this after what we have with async/await and make thread groups explicit:
from functools import partial

def download_and_store(results, url):
    results[url] = fetch_url(url)

def download_all(urls):
    results = {}
    with ThreadGroup() as g:
        for url in urls:
            g.spawn(partial(download_and_store, results, url))
    return results
This largely still has the same behavior, but it uses a little bit more explicit operations and it does require you to create more helper functions. But it still fully avoids having to work with promises or futures.
What is so important about this entire concept is that it moves a lot of the complexity of concurrent programming where it belongs: into the interpreter and the internal APIs. For instance, the dictionary in `results` has to be locked for this to work. Likewise, the APIs that `fetch_url` would use need to support cancellation, and the I/O layer needs to suspend the virtual thread and go back to the scheduler. But for the majority of programmers, all of this is hidden.
I also think that some of the APIs really aged badly for supporting well-behaved concurrent systems. For instance, I very much prefer Rust’s idea of enclosing values in a mutex over carrying a mutex somewhere on the side.
Also, semaphores are an incredibly potent tool to limit concurrency and to create more stable systems. Something like this could also become part of a thread group, directly limiting how many spawns can run simultaneously:
from functools import partial

def download_and_store(results_mutex, url):
    result = fetch_url(url)
    with results_mutex.lock() as results:
        results.store(url, result)

def download_all(urls):
    results = Mutex(MyResultStore())
    with ThreadGroup(max_concurrency=8) as g:
        for url in urls:
            g.spawn(partial(download_and_store, results, url))
    return results
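Python has no such value-enclosing mutex today. A minimal sketch of what the `Mutex` used above might look like (names and API entirely assumed, mirroring the imaginary example rather than any existing class):

```python
import threading
from contextlib import contextmanager


class Mutex:
    """Hypothetical value-enclosing mutex, similar in spirit to Rust's
    Mutex<T>: the wrapped value is only reachable while the lock is held."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    @contextmanager
    def lock(self):
        # Hand out the protected value only for the duration of the block.
        with self._lock:
            yield self._value
```

With something like this, the lock and the data it protects travel together instead of the lock being carried somewhere on the side.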
There will be plenty of reasons to use futures and they would continue to hang around. One way to get a future is to hold on to the return value of the `spawn` method:
from functools import partial

def download_all(urls):
    futures = []
    with ThreadGroup() as g:
        for url in urls:
            futures.append((url, g.spawn(partial(fetch_url, url))))
    return {url: future.result() for (url, future) in futures}
One big question is whether `spawn` should work if there is no thread group. For instance, in Trio, which is a Python async library, the decision was made that you always have to have the equivalent of a thread group (they call it a nursery) to spawn an operation. I think that this is a very sensible policy, but there are situations where you cannot really do that. I can imagine various alternatives, such as having a default thread group for background tasks that is implicitly joined when the process shuts down. However, I think the most important thing is to bring as much of the intended behavior to the default APIs.
In such a system, what would be the future of async/await? Well, that is up for debate, but it seems reasonable to find ways to keep existing asynchronous code working, while new code should no longer need it.
I would like you to consider this a conversation starter about virtual threads rather than a fully fleshed-out idea. There are a lot of open questions, particularly in the context of Python, but the idea of no longer having to deal with colored functions really appeals to me and I hope we can explore it a bit.
2025-07-20 08:00:00
This post is addressed to the Python community, one I am glad to be a member of.
I’m a product of my community. A decade ago I wrote about how much I owed the Python community. Recently I found myself reminiscing again. This year at EuroPython I even gave a brief lightning talk recalling my time in the community — it made me tear up a little.
There were two reasons for this trip down memory lane. First, I had the opportunity to be part of the new Python documentary, which brought back a flood of memories (good and bad). Second, I’ve found myself accidentally pulled towards agentic coding and vibe coders [1]. Over the last month and a half I have spoken with so many people about AI and programming and realized that a growing number of them are people I might not, in the past, have described as “programmers.” Even on the way to the conference I had the pleasure of a multi-hour discussion on the train with an air traffic controller who ventured into programming because of ChatGPT, to make his life easier.
I’m not sure where I first heard it, but I like the idea that you are what you do. If you’re painting (even your very first painting) you are a painter. Consequently if you create a program, by hand or with the aid of an agent, you are a programmer. Many people become programmers essentially overnight by picking up one of these tools.
Heading to EuroPython this year I worried that the community that shaped me might not be receptive to AI and agentic programming. Some of that fear felt warranted: over the last year I saw a number of dismissive posts in my circles about using AI for programming. Yet I have also come to realize that acceptance of AI has shifted significantly. More importantly there is pretty wide support of the notion that newcomers will and should be writing AI-generated code.
That matters, because my view is that AI will not lead to fewer programmers. In fact, the opposite seems likely. AI will bring more people into programming than anything else we have done in the last decade.
For the Python community in particular, this is a moment to reflect. Python has demonstrated its inclusivity repeatedly — think of how many people have become successful software engineers through outreach programs (like PyLadies) and community support. I can credit much of my own early career to learning from others on the Python IRC channels.
We need to pay close attention to vibe coding. Not because it might produce lower‑quality code, but because if we don’t intentionally welcome the next generation learning through these tools, they will miss out on important lessons many of us learned the hard way. It would be a mistake to treat them as outcasts or “not real” programmers. Remember that many of our first programs did not have functions, were a mess of GOTOs, and were copy/pasted together.
Every day someone becomes a programmer because they figured out how to make ChatGPT build something. Lucky for us: in many of those cases the AI picks Python. We should treat this as an opportunity and anticipate an expansion in the kinds of people who might want to attend a Python conference. Yet many of these new programmers are not even aware that programming communities and conferences exist. It’s in the Python community’s interest to find ways to pull them in.
Consider this: I can name the person who brought me into Python. But if you were brought in via ChatGPT or a programming agent, there may be no human there — just the AI. That lack of human connection is, I think, the biggest downside. So we will need to compensate: to reach out, to mentor, to create on‑ramps. To instil the idea that you should be looking for a community, because the AI won’t do that for you. We need to turn a solitary interaction with an AI into a shared journey with a community, and move newcomers towards learning the important lessons about engineering. We do not want a generation of developers held captive by companies building vibe-coding tools that have little incentive to let their users break free from those shackles.
[1] I’m using “vibe coders” here to mean people who give in to having the machine program for them. I believe that many programmers will start this way before they transition to more traditional software engineering.
2025-07-03 08:00:00
If you’ve been following me on Twitter, you know I’m not a big fan of MCP (Model Context Protocol) right now. It’s not that I dislike the idea; I just haven’t found it to work as advertised. In my view, MCP suffers from two major flaws: it isn’t really composable, because most composition has to happen through inference, and it demands far too much context.
A quick experiment makes this clear: try completing a GitHub task with the GitHub MCP, then repeat it with the `gh` CLI tool. You’ll almost certainly find the latter uses context far more efficiently and you get to your intended result quicker.
I want to address some of the feedback I’ve received on my stance on this. I evaluated MCP extensively in the context of agentic coding, where its limitations were easiest to observe. One piece of feedback is that MCP might not make a ton of sense for general code generation, because models are already very good at that, but that it makes a lot of sense for end-user applications, like, say, automating a domain-specific task in a financial company. Another is that I need to look at the world of the future, where models will be able to reach many more tools and handle much more complex tasks.
My current take is that my data indicates current MCP will always be harder to use than writing code, primarily due to the reliance on inference. If you look at today’s approaches for pushing towards higher tool counts, the proposals all include a layer of filtering: you pass all your tools to an LLM and ask it to filter them down based on the task at hand. So far, not many better approaches have been proposed.
The main reason I believe this will most likely also hold true — that you shouldn’t be using MCP in its current form even for non-programming, domain-specific tasks — is that even in those cases code generation just is the better choice because of the ability to compose.
The way to think about this problem is that when you don’t have an AI and you’re solving a problem as a software engineer, your tool of choice is code. Perhaps as a non-software engineer, code is out of reach. Many, many tasks people do by hand are actually automatable through software. The challenge is finding someone to write that software. If you’re working in a niche environment and you’re not a programmer yourself, you might not pick up a programming book to learn how to code, and you might not find a developer willing to build you a custom piece of software for your specific problem. And yes, maybe your task requires some inference, but many don’t need it at every step.
There is a reason we say “to replace oneself with a shell script”: it’s because that has been happening for a long time. With LLMs and programming, the idea is that rather than replacing yourself with a shell script, you’re replacing yourself with an LLM. But you run into three problems: cost, speed, and general reliability. All of these need to be dealt with before we can even think about tool usage or MCP. We need to figure out how to ensure that our automated task actually works correctly at scale.
The key to automation is really to automate things that will happen over and over. You’re not going to automate a one-shot change that will never recur. You’re going to automate the things where the machine can truly give you a productivity boost: you do it once or twice, figure out how to make it work, and then have the machine repeat it a thousand times. For that repetition, there’s a very strong argument for always using code. If we instruct the machine to use inference every time, it might work, particularly for small tasks, but it requires validation, which can take almost as long as doing the task in the first place. Getting an LLM to calculate for you sort of works, but it’s much better for the LLM to write the Python code that does the calculation. Why? First, you can review the formula rather than the calculated result. We can write it ourselves, or we can use an LLM as a judge to figure out whether the approach is correct. You don’t really have to validate that Python calculates correctly; you can rely on that. So, by opting for code generation to solve the task, we get a little closer to being able to verify and validate the process ourselves, rather than hoping the LLM inferred correctly.
This obviously goes way beyond calculation. Take, for instance, this blog. I converted this entire blog from reStructuredText to Markdown recently. I put this conversion off for a really long time, partly because I was a little too lazy. But also, when I was lazy enough to consider deploying an LLM for it, I just didn’t trust it to do the conversion itself without regressing somewhere. I was worried that if it ran out of context, it might start hallucinating text or change wording slightly. It’s just that I worried about subtle regressions too much.
I still used an LLM for it, but I asked it to do that transformation in a different way: through code.
I asked the LLM to perform the core transformation from reStructuredText to Markdown but I also asked it to do this in a way that uses the underlying AST (Abstract Syntax Tree). So, I instructed it to parse the reStructuredText into an actual reStructuredText AST, then convert that to a Markdown AST, and finally render it to HTML, just like it did before. This gave me an intermediate transformation step and a comparable end result.
Then, I asked it to write a script that compares the old HTML with the new HTML and performs the diffing after some basic cleanup it deemed necessary for comparison. I asked it to consider what kinds of conversion errors were actually acceptable. So it read through its own scripts to see where it might not match the original output due to known technical limitations (e.g., footnotes render differently between the Markdown library I’m using and the reStructuredText library, so even if the syntax converts correctly, the HTML would look different). I asked it to compensate for this in that script.
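To make the shape of that comparison step concrete, here is a heavily simplified sketch (the directory layout and cleanup rules are invented; the real script was generated by the agent for my specific setup and did considerably more):

```python
#!/usr/bin/env python3
"""Hypothetical sketch: compare old (reST-rendered) and new (Markdown-rendered)
HTML after normalizing differences that are known to be acceptable."""
import difflib
import pathlib
import re


def normalize(html: str) -> list[str]:
    # Put each tag on its own line and drop attributes that legitimately
    # differ between the two renderers (e.g. footnote ids and classes).
    html = re.sub(r">\s*<", ">\n<", html)
    html = re.sub(r'\s+(id|class)="[^"]*"', "", html)
    return [line.strip() for line in html.splitlines() if line.strip()]


if __name__ == "__main__":
    old_dir = pathlib.Path("build/old")   # made-up layout
    new_dir = pathlib.Path("build/new")
    for old_file in sorted(old_dir.glob("*.html")):
        new_file = new_dir / old_file.name
        diff = list(difflib.unified_diff(
            normalize(old_file.read_text()),
            normalize(new_file.read_text()),
            fromfile=str(old_file), tofile=str(new_file), lineterm="",
        ))
        if diff:
            print(f"{old_file.name}: {len(diff)} differing lines")
```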
After that was done, I asked it to create a third script, which I could run over the output of hundreds of files to analyze the differences and feed them back into the agentic loop for another iteration step.
Then I kicked this off in a loop. I did not give it all the posts at once; I started with 10 until the differences were low, and then had it run over all of them. It did this for maybe 30 minutes or so until I came back and found it in a pretty acceptable state.
What’s key about this transformation is not so much that the LLM was capable of pulling it off, but that I actually trusted the process at the end because I could review the approach. Not only that, I also asked another LLM what it thought of the code the first one wrote, and of the changes. That gave me much higher confidence that nothing was going to lose data. It felt right to me. It felt like a mechanical process that was fundamentally correct, and I was able to observe it and do spot checks. At worst, the regressions would be minor Markdown syntax errors; the text itself wouldn’t have been corrupted.
Another key point is that the cost of inference in this process scales with the number of iteration steps and the sample size, not with how many documents I want to convert overall. Eventually I just had it run over all documents every time, but running it over 15 docs vs. 150 docs is more or less the same effort, because the final LLM-based analysis step did not have many more things to review (it already skipped over all the minor differences in the files).
This is a long-winded way of saying that this entire transformation went through code. It’s a pipeline that starts with human input, produces code, does an LLM as a judge step and iterates. And you can take this transformation and apply it to a general task as well.
To give an example, one MCP you might be using is Playwright. I find it very hard to replace Playwright with a code approach for all cases because what you’re essentially doing is remotely controlling your browser. The task you’re giving it largely involves reading the page, understanding what’s on it, and clicking the next button. That’s the kind of scenario where it’s very hard to eliminate inference at each step.
However, if you already know what the page is — for instance, if you’re navigating your own app you’re working on — then you can actually start telling it to write a Playwright Python script instead and run that. This script can perform many of those steps sequentially without any inference. I’ve noticed that this approach is significantly quicker, and because it understands your code, it still generally produces correct results. It doesn’t need to navigate, read page contents, find a button, or press an input in real-time. Instead, it will write a single Python script that automates the entire process in one go, requiring very little context by comparison.
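For instance, instead of having an MCP click through the page one inference step at a time, the agent can emit a plain Playwright script and run it. A rough sketch (the URL, selectors, and assertion are placeholders for whatever your app actually exposes):

```python
#!/usr/bin/env python3
"""Hypothetical sketch: a one-shot Playwright script that replaces an
inference-per-step browser session for a flow you already understand."""
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:5000/login")    # placeholder URL
    page.fill("#email", "test@example.com")     # placeholder selectors
    page.fill("#password", "hunter2")
    page.click("button[type=submit]")
    page.wait_for_url("**/dashboard")
    assert page.locator("h1").inner_text() == "Dashboard"
    browser.close()
```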
This process is repeatable. Once the script is written, I can execute it 100, 200, or even 300 times without requiring any further inference. This is a significant advantage that an MCP typically cannot offer. It’s incredibly challenging to get an LLM to understand generic, abstract MCP tool calls. I wish I could, for example, embed an MCP client directly into a shell script, allowing me to run remote MCP services efficiently via code generation, but actually doing that is incredibly hard because the tools are not written with non-inference-based automation in mind.
Also, ironic as it is: I’m a human, not an MCP client. I can run and debug a script; I can’t even figure out how to reliably make MCP calls by hand. It’s always a gamble and incredibly hard to debug. I love using the little tools that Claude Code generates for itself along the way while writing code. Some of those I had it turn into long-term additions to my development process.
I don’t know. But it’s an interesting moment to think about what we could do to make code generation for purposeful agentic coding better. The weird thing is that MCP is actually pretty great when it works. But in its current form it feels too much like a dead end that cannot be scaled up, particularly for automation at scale, because it relies too much on inference.
So maybe we need to look for a better abstraction that combines what MCP is great at with code generation. For that we might need to build better sandboxes, and maybe start looking at how we can expose APIs in ways that allow an agent to do some sort of fan-out/fan-in for inference. Effectively, we want to do as much as we can in generated code, and then use the magic of LLMs after the bulk code execution to judge what we did.
I can also imagine that it might be quite interesting to do code generation in a way that also provides enough context for an LLM to explain in human language to a non programmer what the script is doing. That might enable these flows to be used by human users that are not developers themselves.
In any case I can only encourage people to bypass MCP and to explore what else is possible. LLMs can do so much more if you give them the power to write code.
Here are some more posts you might want to read or videos you might want to watch:
2025-06-21 08:00:00
I’m currently evaluating how different models perform when generating XML versus JSON. Not entirely unexpectedly, XML is doing quite well — except for one issue: the models frequently return invalid XML. That made it difficult to properly assess the quality of the content itself, independent of how well the models serialize data. So I needed a sloppy XML parser.
After a few failed attempts at getting Claude to just fix up my XML parsing in different ways (it tried html5lib, the html lxml parser, etc.), all of which resulted in various kinds of amusing failures, I asked Claude to ultrathink and write me a proper XML library from scratch. I gave it some basic instructions for what this should look like and it one-shotted something.
Afterwards I prompted it ~20 more times to do various smaller fixes in response to me (briefly) reviewing and using it, and to create an extensive test suite.
While that was taking place I had 4o create a logo. After that I quickly converted it into an SVG with Illustrator and had Claude make it theme-aware for dark and light modes, which it did perfectly.
On top of that, Claude fully set up CI and even remotely controlled my browser to configure the trusted PyPI publisher for the package for me.
In summary, here is what AI did:
- It wrote ~1100 lines of code for the parser
- It wrote ~1000 lines of tests
- It configured the entire Python package, CI, PyPI publishing
- Generated a README, drafted a changelog, designed a logo, made it theme-aware
- Did multiple refactorings to make me happier
The initial prompt that started it all (including typos):
I want you to implement a single-file library here that parses XML sloppily. It should implement two functions:

- `stream_parse` which yields a stream of events (use named tuples) for the XML stream
- `tree_parse` which takes the output of `stream_parse` and assembles an element tree. It should default to xml.etree.ElementTree and optoinally allow you to provide lxml too (or anything else)

It should be fast but use pre-compiled regular expressions to parse the stream. The idea here is that the output comes from systems that just pretend to speak XML but are not sufficiently protected against bad outoput (eg: LLMs)

So for instance `&amp;` should turn into `&` but if `&x` is used (which is invalid) it should just retain as `&x` in the output. Additionally if something is an invalid CDATA section we just gracefully ignore it. If tags are incorrectly closed within a larger tag we recover the structure. For instance `<foo><p>a<p>b</foo>` will just close the internal structures when `</foo>` comes around.

Use ultrathink. Break down the implementation into

- planning
- api stubs
- implementation

Use sub tasks and sub agents to conserve context
Now if you look at that library, you might not find it amazingly beautiful. It probably is a bit messy and might have a handful of bugs I haven’t found yet. It however works well enough for me right now for what I’m doing and it definitely unblocked me. In total it worked for about 30-45 minutes to write the initial implementation while I was doing something else. I kept prompting it for a little while to make some progress as I ran into issues using it.
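To give a sense of the shape of the API, usage looks roughly like this (a sketch derived from the prompt above; the module name and the exact event shape are assumptions, not documented behavior):

```python
# Hypothetical usage sketch of the two-function API described in the prompt.
from sloppy_xml import stream_parse, tree_parse  # module name assumed

broken = "<foo><p>a<p>b</foo>"  # LLM-style output with badly closed tags

# Low level: a stream of named-tuple events, even for malformed input.
for event in stream_parse(broken):
    print(event)

# High level: assemble an ElementTree despite the sloppy structure.
root = tree_parse(stream_parse(broken))
print(root.tag)  # "foo"
```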
If you want to look at what it looks like:
To be clear: this isn’t an endorsement of using models for serious Open Source libraries. This was an experiment to see how far I could get with minimal manual effort, and to unstick myself from an annoying blocker. The result is good enough for my immediate use case and I also felt good enough to publish it to PyPI in case someone else has the same problem.
Treat it as a curious side project which says more about what’s possible today than what’s necessarily advisable.
Postscriptum: Yes, I did slap an Apache 2 license on it. Is that even valid when there’s barely a human in the loop? A fascinating question, but not one I’m eager to figure out myself. It is, however, certainly something we’ll all have to confront sooner or later.
2025-06-17 08:00:00
This week I spent time with friends letting agents go wild to see what we could build in 24 hours. I took some notes for myself to reflect on that experience. I won’t bore you with another vibe-coding post, but you can read Peter’s post about how it went.
As fun as it was, it was also frustrating in entirely predictable ways. It became a meme how much I hated working with Xcode for this project. It got me thinking: this has been an unacceptable experience for a long time, but with programming agents, the pain becomes measurable.
When I first dove into programming I found the idea of RTFM quite hilarious. “Why are you asking dumb questions, just read it up.” The unfortunate reality is that the manual often doesn’t exist — or is wrong. In fact, we as engineers are quite willing to subject each other to completely inadequate tooling, bad or missing documentation and ridiculous API footguns all the time. “User error” is what we used to call this; nowadays it’s a “skill issue”. It puts the blame on the user and absolves the creator, at least momentarily. For APIs it can be random crashes if you use a function wrong; for programs it can be an impossible-to-navigate UI or a lack of error messages. There are many different ways in which we humans get stuck.
What agents change about this is that I can subject them to something I wouldn’t really want to subject other developers to: measurement. I picked the language for my current project by running basic evals, and it worked well. I learned from that that there are objectively better and worse languages for my particular problem. The choice, however, is not just about how much the AI knows about the language from its training corpus. It’s also tooling, the inherent capabilities of the language, ecosystem churn and other aspects.
Using agents to measure code quality is great because agents don’t judge me, but they do judge the code they are writing. Not all agents will swear, but they will express frustration with libraries when loops don’t go well, or give up. That opens up an opportunity to measure not agent performance, but the health of a project.
We should pay more attention to how healthy engineering teams are, and that starts with the code base. Using agents, we can put numbers to it in a way we cannot with humans (or only very slowly and expensively). We can figure out, in fairly objective ways, how successful agents are at using the things we are creating, which is in many ways a proxy for how humans experience working with the code. Getting together with fresh souls to walk them through a tutorial or some tasks is laborious and expensive. Getting an agent that has never seen a codebase to start using a library is repeatable, rather cheap, fast and, if set up the right way, very objective. It also takes the emotion out of it, and you can run the experiment multiple times.
Now obviously we can have debates over whether the type of code we would write with an agent is objectively beautiful, or whether the way agents execute tools creates the right kind of tools. This is a debate worth having. Right at this very moment, though, what programming agents need to be successful is rather well aligned with what humans need.
So what works better than other things? For now these are basic indicators, for agents and humans alike:
Good test coverage: they help with future code writing but they also greatly help preventing regressions. Hopefully no surprise to anyone. I would add though that this is not just for the tests, but also for examples and small tools that a user and agent can run to validate behavior manually.
Good error reporting: a compiler, tool or an API that does not provide good error reporting is a bad tool. I have been harping on this for years when working at Sentry, but with agents it becomes even clearer that this investment pays off. It also means errors should be where they can be found. If errors are hidden in an obscure log neither human nor agent will find it.
High ecosystem stability: if your ecosystem churns a lot, if APIs keep changing you will not just upset humans, you will also slow down the agent. It will find outdated docs, examples and patterns and it will slow down / write bad code.
Few superfluous abstractions: too many layers just make data flow and refactoring expensive. We might even want to start questioning the value proposition of (most) ORMs today because of how much harder they make things.
Everything needs to be fast and user friendly: The quicker tools respond (and the less useless output they produce) the better. Crashes are tolerable; hangs are problematic. `uv`, for instance, is a much better experience in Python than the rest of the ecosystem, even though most of the ecosystem still points at `pip`. Agents are super happy to use and keep using `uv` because they get good information out of it and low failure rates.
A good dev environment: If stuff only reproduces in CI you have to move your agent into CI. That’s not a good experience. Give your agent a way to run Docker locally. If you write a backend, make sure there is a database to access and introspect, don’t just mock it out (badly). Deferring things into a CI flow is not an option. It’s also important that it’s clear when the devenv is broken vs the code is broken. For both human and agent it can be hard to distinguish this if the tooling is not set up correctly.
When an agent struggles, so does a human. There is a lot of code and tooling out there which is objectively not good but for one reason or another became dominant. If you want to start paying attention to technology choices, or you want to start writing your own libraries, you can now use agents to evaluate the developer experience.
Because so can your users. I can confidently say it’s not just me that does not like Xcode, my agent also expresses frustration — measurably so.