
I'm currently located in Austria and working as a Director of Engineering for Sentry. Aside from that I do open source development.

Tools: Code Is All You Need

2025-07-03 08:00:00

If you've been following me on Twitter, you know I'm not a big fan of MCP (Model Context Protocol) right now. It's not that I dislike the idea; I just haven't found it to work as advertised. In my view, MCP suffers from two major flaws:

  1. It isn’t truly composable. Most composition happens through inference.
  2. It demands too much context. You must supply significant upfront input, and every tool invocation consumes even more context than simply writing and running code.

A quick experiment makes this clear: try completing a GitHub task with the GitHub MCP, then repeat it with the gh CLI tool. You'll almost certainly find that the latter uses context far more efficiently and gets you to your intended result more quickly.

But MCP is the Future!

I want to address some of the feedback I've received on my stance on this. I evaluated MCP extensively in the context of agentic coding, where its limitations were easiest to observe. One piece of feedback is that MCP might not make a ton of sense for general code generation, because models are already very good at that, but that it makes a lot of sense for end-user applications, like, say, automating a domain-specific task in a financial company. Another is that I need to look at the world of the future, where models will be able to reach many more tools and handle much more complex tasks.

My current take is that my data indicates that MCP in its current form will always be harder to use than writing code, primarily due to the reliance on inference. If you look at today's approaches for pushing towards higher tool counts, the proposals all include a layer of filtering: you pass all your tools to an LLM and ask it to filter them down based on the task at hand. So far, not many better approaches have been proposed.

The main reason I believe this will most likely continue to hold true — that you shouldn't be using MCP in its current form even for non-programming, domain-specific tasks — is that even in those cases code generation is simply the better choice because of its ability to compose.

Replace Yourself With A Shellscript

The way to think about this problem is that when you don't have an AI and you're solving a problem as a software engineer, your tool of choice is code. Perhaps as a non-software engineer, code is out of reach. Many, many tasks people do by hand are actually automatable through software. The challenge is finding someone to write that software. If you're working in a niche environment and you're not a programmer yourself, you might not pick up a programming book to learn how to code, and you might not find a developer willing to provide you with a custom piece of software to solve your specific problem. And yes, maybe your task requires some inference, but many don't need it all the time.

There is a reason we say “to replace oneself with a shell script”: it's because that has been happening for a long time. With LLMs and programming, the idea is that rather than replacing yourself with a shell script, you're replacing yourself with an LLM. But you run into three problems: cost, speed, and general reliability. All of these are problems we need to deal with before we can even think of tool usage or MCP. We need to figure out how to ensure that our automated task actually works correctly at scale.

Automation at Scale

The key to automation is to automate things that will happen over and over. You're not going to automate a one-shot change that will never recur. You're going to start automating the things where the machine can truly give you a productivity boost: you do it once or twice, figure out how to make it work, and then have the machine repeat it a thousand times. For that repetition, there's a very strong argument for always using code. If we instruct the machine to use inference to do it, it might work, particularly for small tasks, but it requires validation which can take almost as long as doing the task in the first place. Getting an LLM to calculate for you sort of works, but it's much better for the LLM to write the Python code that does the calculation. Why? Because you can review the formula rather than the calculated result. We can check it ourselves, or we can use an LLM as a judge to figure out if the approach is correct. You don't really have to validate that Python calculates correctly; you can rely on that. So, by opting for code generation to solve the task, we get a little closer to being able to verify and validate the process ourselves, rather than hoping the LLM inferred correctly.
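To make that concrete, here is the kind of thing I mean; a minimal, hypothetical sketch (the numbers and formula are mine, not from any real task):

    # Hypothetical illustration: reviewing a formula beats trusting an inferred number.
    # Instead of asking the LLM what 10,000 at 3.5% compounds to over 10 years, have it
    # emit code like this, where the formula is visible and checkable.

    def compound(principal: float, rate: float, years: int) -> float:
        # Final value after `years` of annual compounding at `rate`.
        return principal * (1 + rate) ** years

    if __name__ == "__main__":
        # The formula is what you review; that Python does the arithmetic correctly is
        # something you can rely on.
        print(f"{compound(10_000, 0.035, 10):.2f}")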

This obviously goes way beyond calculation. Take, for instance, this blog. I converted this entire blog from reStructuredText to Markdown recently. I put this conversion off for a really long time, partly because I was a little too lazy. But also, when I was lazy enough to consider deploying an LLM for it, I just didn't trust it to do the conversion itself without regressing somewhere. I was worried that if it ran out of context, it might start hallucinating text or change wording slightly. It's just that I worried about subtle regressions too much.

I still used an LLM for it, but I asked it to do that transformation in a different way: through code.

LLM to Code to LLM

  1. I asked the LLM to perform the core transformation from reStructuredText to Markdown but I also asked it to do this in a way that uses the underlying AST (Abstract Syntax Tree). So, I instructed it to parse the reStructuredText into an actual reStructuredText AST, then convert that to a Markdown AST, and finally render it to HTML, just like it did before. This gave me an intermediate transformation step and a comparable end result.

  2. Then, I asked it to write a script that compares the old HTML with the new HTML and performs the diffing after some basic cleanup it deemed necessary for comparison. I asked it to consider what kinds of conversion errors were actually acceptable. It read through its own scripts to see where it might not match the original output due to known technical limitations (e.g., footnotes render differently between the Markdown library I'm using and the reStructuredText library, so even if the syntax converts correctly, the HTML will look different). I asked it to compensate for this in that script.

  3. After that was done, I asked it to create a third script, which I could run over the output of hundreds of files to analyze the differences and feed them back into the agentic loop for another iteration step.

Then I kicked this off in a loop. I did not provide all the posts at once; I started with 10 until the differences were low and then had it run over all of them. It did this for maybe 30 minutes or so until I came back and found it in a pretty acceptable state.
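To illustrate the comparison idea from step 2: a minimal sketch could look like the following, assuming the old and new HTML live in parallel old_html/ and new_html/ directories with one file per post. The script the agent actually wrote did more cleanup and knew which differences were acceptable.

    # Minimal sketch of the step-2 comparison idea; the directory layout is assumed.
    import difflib
    import re
    from pathlib import Path

    def normalize(html: str) -> str:
        # Collapse whitespace so the diff focuses on content, not formatting.
        return re.sub(r"\s+", " ", html).strip()

    def diff_file(name: str) -> list[str]:
        old = normalize(Path("old_html", name).read_text())
        new = normalize(Path("new_html", name).read_text())
        return list(difflib.unified_diff(old.split(), new.split(), lineterm="", n=2))

    if __name__ == "__main__":
        for path in sorted(Path("old_html").glob("*.html")):
            delta = diff_file(path.name)
            if delta:
                print(f"{path.name}: {len(delta)} differing tokens")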

What's key about this transformation is not so much that the LLM was capable of pulling it off, but that I actually trusted the process at the end because I could review the approach. Not only that, I also asked another LLM what it thought of the code the first one wrote, and of the changes. That gave me much higher confidence that what was going on would not lose data. It felt right to me. It felt like a mechanical process that was fundamentally correct, and I was able to observe it and do spot checks. At worst, the regressions were minor Markdown syntax errors; the text itself wouldn't have been corrupted.

Another key point is that because the per-iteration inference is rather constant, the cost of inference in this process scales with the number of iteration steps and the sample size, but it doesn't depend on how many documents I want to convert overall. Eventually I just had it run over all documents every time, but running it over 15 docs vs 150 docs is more or less the same effort, because the final LLM-based analysis step did not have that many more things to review (it already skipped over all minor differences in the files).

MCP Cannot Do That

This is a long-winded way of saying that this entire transformation went through code. It's a pipeline that starts with human input, produces code, does an LLM-as-a-judge step, and iterates. And you can take this approach and apply it to more general tasks as well.

To give an example, one MCP you might be using is Playwright. I find it very hard to replace Playwright with a code approach for all cases because what you're essentially doing is remotely controlling your browser. The task you're giving it largely involves reading the page, understanding what's on it, and clicking the next button. That's the kind of scenario where it's very hard to eliminate inference at each step.

However, if you already know what the page is — for instance, if you're navigating an app you're working on — then you can instead tell it to write a Playwright Python script and run that. This script can perform many of those steps sequentially without any inference. I've noticed that this approach is significantly quicker, and because it understands your code, it still generally produces correct results. It doesn't need to navigate, read page contents, find a button, or press an input in real time. Instead, it writes a single Python script that automates the entire process in one go, requiring very little context by comparison.
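As a rough illustration, such a generated script might look like this; the URL, selectors and credentials are made up, the point is that the whole flow runs without any per-step inference:

    # Hypothetical sketch: instead of driving the browser step by step through MCP tool
    # calls, the agent writes one Playwright script that runs the whole flow at once.
    from playwright.sync_api import sync_playwright

    def run_signin_flow() -> None:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto("http://localhost:3000/login")
            page.fill("input[name=email]", "dev@example.com")
            page.fill("input[name=password]", "devpassword")
            page.click("button[type=submit]")
            # Because the agent knows the app's code, it can assert on a known element
            # instead of reading and interpreting the page at runtime.
            page.wait_for_selector("text=Dashboard")
            browser.close()

    if __name__ == "__main__":
        run_signin_flow()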

This process is repeatable. Once the script is written, I can execute it 100, 200, or even 300 times without requiring any further inference. This is a significant advantage that an MCP typically cannot offer. It's incredibly challenging to get an LLM to understand generic, abstract MCP tool calls. I wish I could, for example, embed an MCP client directly into a shell script, allowing me to run remote MCP services efficiently via code generation, but actually doing that is incredibly hard because the tools are not written with non-inference-based automation in mind.

Also, as ironic as it is: I'm a human, not an MCP client. I can run and debug a script; I cannot even figure out how to reliably do MCP calls. It's always a gamble and incredibly hard to debug. I love the little tools that Claude Code generates while writing code. Some of those I had it convert into long-term additions to my development process.

Where does this take us?

I don't know. But it's an interesting moment to think about what we could do to make code generation for purposeful agentic coding better. The weird thing is that MCP is actually pretty great when it works. But in its current form it feels too much like a dead end that cannot be scaled up, particularly for automation at scale, because it relies on inference too much.

So maybe we need to look for a better abstraction that combines what MCP is great at with code generation. For that we might need to build better sandboxes, and maybe start looking at how we can expose APIs in ways that allow an agent to do some sort of fan-out/fan-in for inference. Effectively we want to do as much as we can in generated code, and then use the magic of LLMs after bulk code execution to judge what we did.

I can also imagine that it might be quite interesting to do code generation in a way that also provides enough context for an LLM to explain, in human language, to a non-programmer what the script is doing. That might enable these flows to be used by people who are not developers themselves.

In any case I can only encourage people to bypass MCP and to explore what else is possible. LLMs can do so much more if you give them the power to write code.

Further Reading

Here are some more posts you might want to read or videos you might want to watch:

  • My Agentic Coding Talk where I go into this topic a bit.
  • Drew Breunig's post “How to fix your context” which covers some attempts to improve MCP tool selection if you cannot avoid it.
  • Manuel Odendahl's excellent “MCPs are Boring” talk from AI Engineer that was one of the first to point to the challenges with MCP.

My First Open Source AI Generated Library

2025-06-21 08:00:00

I'm currently evaluating how different models perform when generating XML versus JSON. Not entirely unexpectedly, XML is doing quite well — except for one issue: the models frequently return invalid XML. That made it difficult to properly assess the quality of the content itself, independent of how well the models serialize data. So I needed a sloppy XML parser.

After a few failed attempts at getting Claude to just fix up my XML parsing in different ways (it tried html5lib, lxml's HTML parser, etc.), which all resulted in various kinds of amusing failures, I asked Claude to ultrathink and write me a proper XML library from scratch. I gave it some basic instructions on what this should look like and it one-shotted something. Afterwards I prompted it ~20 more times to do various smaller fixes in response to my (brief) reviews and my use of it, and to create an extensive test suite.

While that was taking place I had 4o create a logo. After that I quickly converted it into an SVG with Illustrator and had Claude make it theme-aware for dark and light modes, which it did perfectly.

On top of that, Claude fully set up CI and even remotely controlled my browser to configure the trusted PyPI publisher for the package for me.

In summary, here is what AI did:

  • It wrote ~1100 lines of code for the parser

  • It wrote ~1000 lines of tests

  • It configured the entire Python package, CI, PyPI publishing

  • Generated a README, drafted a changelog, designed a logo, made it theme-aware

  • Did multiple refactorings to make me happier

The initial prompt that started it all (including typos):

I want you to implement a single-file library here that parses XML sloppily. It should implement two functions:

  • stream_parse which yields a stream of events (use named tuples) for the XML stream
  • tree_parse which takes the output of stream_parse and assembles an element tree. It should default to xml.etree.ElementTree and optoinally allow you to provide lxml too (or anything else)

It should be fast but use pre-compiled regular expressions to parse the stream. The idea here is that the output comes from systems that just pretend to speak XML but are not sufficiently protected against bad outoput (eg: LLMs)

So for instance &amp; should turn into & but if &x is used (which is invalid) it should just retain as &x in the output. Additionally if something is an invalid CDATA section we just gracefully ignore it. If tags are incorrectly closed within a larger tag we recover the structure. For instance <foo><p>a<p>b</foo> will just close the internal structures when </foo> comes around.

Use ultrathink. Break down the implementation into

  1. planning
  2. api stubs
  3. implementation

Use sub tasks and sub agents to conserve context
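Going by that prompt, usage would look roughly like this; the module name is hypothetical and the API the model actually produced may differ in details:

    # Rough usage sketch based on the prompt above; stream_parse and tree_parse are the
    # names from the prompt, the import path is an assumption.
    from sloppy_xml import stream_parse, tree_parse  # hypothetical import path

    sloppy = "<result><item>a & b<item>c</result>"  # unclosed tags, bare ampersand

    for event in stream_parse(sloppy):       # yields a stream of named-tuple events
        print(event)

    tree = tree_parse(stream_parse(sloppy))  # assembles an ElementTree from the events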

Now if you look at that library, you might not find it amazingly beautiful. It probably is a bit messy and might have a handful of bugs I haven't found yet. It however works well enough for me right now for what I'm doing and it definitely unblocked me. In total it worked for about 30-45 minutes to write the initial implementation while I was doing something else. I kept prompting it for a little while to make some progress as I ran into issues using it.

If you want to look at what it looks like:

To be clear: this isn't an endorsement of using models for serious Open Source libraries. This was an experiment to see how far I could get with minimal manual effort, and to unstick myself from an annoying blocker. The result is good enough for my immediate use case and I also felt good enough to publish it to PyPI in case someone else has the same problem.

Treat it as a curious side project which says more about what's possible today than what's necessarily advisable.


Postscriptum: Yes, I did slap an Apache 2 license on it. Is that even valid when there's barely a human in the loop? A fascinating question, but not one I'm eager to figure out myself. It is, however, certainly something we'll all have to confront sooner or later.

We Can Just Measure Things

2025-06-17 08:00:00

This week I spent time with friends letting agents go wild to see what we could build in 24 hours. I took some notes for myself to reflect on the experience. I won't bore you with another vibecoding post, but you can read Peter's post about how that went.

As fun as it was, it was also frustrating in other, entirely predictable ways. It became a meme about how much I hated working with Xcode for this project. This got me thinking: this has been an entirely unacceptable experience for a long time, but with programming agents, the pain becomes measurable.

When I first dove into programming I found the idea of RTFM quite hilarious. “Why are you asking dumb questions, just read it up.” The unfortunate reality is that the manual often doesn't exist — or is wrong. In fact, we as engineers are quite willing to subject each other to completely inadequate tooling, bad or missing documentation, and ridiculous API footguns all the time. “User error” is what we used to call this; nowadays it's a “skill issue”. It puts the blame on the user and absolves the creator, at least momentarily. For APIs it can be random crashes if you use a function wrong; for programs it can be an impossible-to-navigate UI or a lack of error messages. There are many different ways in which we humans get stuck.

What agents change about this is that I can subject them to something I wouldn't really want to subject other developers to: measuring. I picked the language for my current project by running basic evals, and it worked well. I learned from that that there are objectively better and worse languages when it comes to my particular problem. The choice however is not just about how much the AI knows about the language from the corpus of examples during training. It's also tooling, the inherent capabilities of the language, ecosystem churn, and other aspects.

Using agents to measure code quality is great because agents don't judge me, but they do judge the code they are writing. Not all agents will swear, but they will express frustration with libraries when loops don't go well, or give up. That opens up an opportunity to bring some measurement not to agent performance, but to the health of a project.

We should pay more attention to how healthy engineering teams are, and that starts with the code base. Using agents we can put numbers to it in ways we cannot with humans (or only in a very slow and expensive way). We can figure out how successful agents are at using the things we are creating in rather objective ways, which is in many ways a proxy for how humans experience working with the code. Getting together with fresh souls to walk them through a tutorial or some tasks is laborious and expensive. Getting agents that have never seen a codebase to start using a library is repeatable, rather cheap, fast and, if set up the right way, very objective. It also takes the emotion out of it, and you can run the experiment multiple times.

Now obviously we can have debates over whether the type of code we would write with an agent is objectively beautiful, or whether the way agents execute tools creates the right type of tools. This is a debate worth having. Right at this very moment, though, what programming agents need to be successful is rather well aligned with what humans need.

So what works better than other things? For now these are basic indicators, for agents and humans alike:

  • Good test coverage: tests help with future code writing, but they also greatly help prevent regressions. Hopefully no surprise to anyone. I would add though that this is not just about the tests themselves, but also about examples and small tools that a user and agent can run to validate behavior manually.

  • Good error reporting: a compiler, tool or API that does not provide good error reporting is a bad tool. I have been harping on this for years while working at Sentry, but with agents it becomes even clearer that this investment pays off. It also means errors should be where they can be found. If errors are hidden in an obscure log, neither human nor agent will find them.

  • High ecosystem stability: if your ecosystem churns a lot, if APIs keep changing, you will not just upset humans, you will also slow down the agent. It will find outdated docs, examples and patterns, and it will slow down and write bad code.

  • Few superfluous abstractions: too many layers just make data flow and refactoring expensive. We might even want to start questioning the value proposition of (most) ORMs today because of how much harder they make things.

  • Everything needs to be fast and user friendly: the quicker tools respond (and the less useless output they produce) the better. Crashes are tolerable; hangs are problematic. uv for instance is a much better experience in Python than any of the rest of the ecosystem, even though most of the ecosystem points at pip. Agents are super happy to use and keep using uv because they get good info out of it and low failure rates.

  • A good dev environment: if stuff only reproduces in CI, you have to move your agent into CI. That's not a good experience. Give your agent a way to run Docker locally. If you write a backend, make sure there is a database to access and introspect; don't just mock it out (badly). Deferring things into a CI flow is not an option. It's also important that it's clear when the dev environment is broken versus when the code is broken. For both human and agent it can be hard to distinguish the two if the tooling is not set up correctly.

When an agent struggles, so does a human. There is a lot of code and tooling out there which is objectively not good, but which for one reason or another became dominant. If you want to start paying attention to technology choices, or you want to start writing your own libraries, you can now use agents to evaluate the developer experience.

Because so can your users. I can confidently say it's not just me that does not like Xcode, my agent also expresses frustration — measurably so.

Agentic Coding Recommendations

2025-06-12 08:00:00

There is currently an explosion of people sharing their experiences with agentic coding. After my last two posts on the topic, I received quite a few questions about my own practices. So, here goes nothing.

Preface

For all intents and purposes, here's what I do: I predominantly use Claude Code with the cheaper Max subscription for $100 a month 1. That works well for several reasons:

  • I exclusively use the cheaper Sonnet model. It's perfectly adequate for my needs, and in fact, I prefer its outputs over the more expensive Opus model.

  • I optimize my tool usage to be token efficient. I avoid screenshots and browser interactions wherever possible. More on that later.

My general workflow involves assigning a job to an agent (which effectively has full permissions) and then waiting for it to complete the task. I rarely interrupt it, unless it's a small task. Consequently, the role of the IDE — and the role of AI in the IDE — is greatly diminished; I mostly use it for final edits. This approach has even revived my usage of Vim, which lacks AI integration.

One caveat: I expect this blog post to age very poorly. The pace of innovation here is insane; what was true a month ago barely holds true today. That's why I'm sticking to concepts I believe have staying power.

If you want to see a small session of me working on an Open Source library with it, I have a recording you can watch.

The Basics

I disable all permission checks. Which basically means I run claude --dangerously-skip-permissions. More specifically I have an alias called claude-yolo set up. Now you can call that irresponsible, and there are definitely risks with it, but you can manage those risks by moving your dev env into Docker. I will however say that if you watch it do its thing a bit, it even works surprisingly well without dockerizing. YMMV.

MCP. This is a term you cannot avoid. It basically is a standardized protocol to give agents access to more tools. Honestly: at this point I barely use it, but I do use it. The reason I barely use it is that Claude Code is very capable of just running regular tools. So MCP for me is really only needed if I have to give Claude access to something that it finds too hard to use otherwise. A good example of this is playwright-mcp for browser automation. I use it because I haven't found anything better yet. But for instance when I want my agent to poke around in my database, I just let it use whatever it finds to be available. In my case it loves to use psql and that's more than good enough.

In general I really only start using MCP if the alternative is too unreliable. That's because MCP servers themselves are sometimes not super reliable, and they are an extra thing that can go wrong. I try to keep things very simple. My custom tools are normal scripts that it just runs.

Choice Of Language

I've evaluated agent performance across different languages for my workload, and if you can choose your language, I strongly recommend Go for new backend projects. Several factors strongly favor Go:

  • Context system: Go provides a capable copy-on-write data bag that explicitly flows through the code execution path, similar to contextvars in Python or .NET's execution context. Its explicit nature greatly simplifies things for AI agents. If the agent needs to pass stuff to any call site, it knows how to do it.

  • Test caching: Surprisingly crucial for efficient agentic loops. In Rust, agents sometimes fail because they misunderstand cargo test's invocation syntax. In Go, tests run straightforwardly and incrementally, significantly enhancing the agentic workflow. It does not need to figure out which tests to run, go does.

  • Go is sloppy: Rob Pike famously described Go as suitable for developers who aren't equipped to handle a complex language. Substitute “developers” with “agents,” and it perfectly captures why Go's simplicity benefits agentic coding.

  • Structural interfaces: interfaces in Go are structural. If a type has the methods an interface expects, then it conforms. This is incredibly easy for LLMs to “understand”. There is very little surprise for the agent.

  • Go has low ecosystem churn: Go's entire ecosystem embraces backwards compatibility and explicit version moves. This greatly reduces the likelihood of AI generating outdated code — starkly contrasting with JavaScript's fast-moving ecosystem, for instance.

For comparison, Python — my initial choice — often poses significant challenges. Agents struggle with Python's magic (e.g. pytest's fixture injection) or complex runtime issues (e.g. the wrong event loop when working with async), frequently producing incorrect code that even the agentic loop has trouble resolving. Python also has practical performance problems. I don't mean that it writes slow code; I mean that the agent loop is really slow. That's because the agent loves to spawn processes and test scripts, and it can take quite a while for the interpreter to boot up and initialize the entire application.

On the frontend I settled on Tailwind, React with TanStack's query and router, as well as Vite. I'm not amazingly happy with it, but I found it better than the alternatives. Tailwind and Vite are great, no complaints there. TanStack's file-based router does not make me super happy, in part because it likes to have dollar signs in the file names and those really like to confuse the agent. For instance it's quite common that it tries to edit $param.tsx but edits the file .tsx instead, because $param gets swallowed by shell interpolation. It's a minor thing, but a very annoying one.

Tools, Tools, Tools

Regardless of language, effective tooling is paramount. Key rules:

  • Anything can be a tool. A shell script can be a tool, an MCP server can be a tool, a log file can be a tool. If the agent can interact or observe it, it counts.

  • Tools need to be fast. The quicker they respond (and the less useless output they produce) the better. Crashes are tolerable; hangs are problematic.

  • Tools need to be user friendly! Tools must clearly inform agents of misuse or errors to ensure forward progress.

  • Tools need to be protected against an LLM chaos monkey using them completely wrong. There is no such thing as user error or undefined behavior!

  • Tools need to provide the right debuggability and observability.

Example: I place critical tools into a Makefile. A make dev command starts my services via a process manager. I made two key changes to what I normally do:

  1. I had to protect the process manager against being spawned a second time when it's already running. I use a fork of shoreman which writes a pidfile. When spawned a second time it errors and logs “services already running” (see the sketch after this list). Why does that matter? Because the agent sometimes does not know if the server is already running and otherwise happily spawns it a second time, resulting in two versions of the service bound to the same port.

  2. I am not only printing the output to my terminal, I'm also always logging it to a file. This allows the agent to read the log file to diagnose what's happening.
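The actual guard lives in my (shell-based) shoreman fork, but the pidfile idea itself is simple; here is a minimal Python sketch of the mechanism:

    # Minimal Python sketch of the pidfile-guard idea; the real guard lives in a fork of
    # shoreman (a shell script), this only illustrates the mechanism.
    import os
    import sys
    from pathlib import Path

    PIDFILE = Path(".dev-services.pid")

    def ensure_single_instance() -> None:
        if PIDFILE.exists():
            pid = int(PIDFILE.read_text())
            try:
                os.kill(pid, 0)  # signal 0 only checks whether the process is alive
            except OSError:
                PIDFILE.unlink()  # stale pidfile from a previous run that died
            else:
                print("error: services are already running", file=sys.stderr)
                sys.exit(1)
        PIDFILE.write_text(str(os.getpid()))

    if __name__ == "__main__":
        ensure_single_instance()
        # ... start the process manager / services here ...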

Here is an example agentic interaction that demonstrates this rather well:

⏺ Great! The formatting looks good. Now let me run the development server to test the
   dynamic onboarding:

⏺ Bash(make dev)
  ⎿  Error: error: services are already running
     make: *** [dev] Error 1

⏺ The development server is already running. Let me check the logs to see the frontend URL:

⏺ Bash(make tail-log)
  ⎿ 01:06:55 backend    | Starting server on localhost:8080
    01:07:39 backend    | [Running: sh -c 'cd backend && go run cmd/server/main.go serve']
    01:07:39 backend    | Starting server on localhost:8080
    08:45:34 backend    | "GET http://localhost:8080/api/users/me HTTP/1.1" from 127.0.0.1:61804 - 200
    39.252666ms
    … +96 lines (ctrl+r to expand)

⏺ Perfect! I can see the frontend is running on http://localhost:3000/ and the latest changes
   have been hot-reloaded. Let me test the dynamic onboarding functionality using playwright:

In general, logging is super important. For instance my app currently has a sign-in and register flow that sends an email to the user. In debug mode (which the agent runs in), the email is just logged to stdout. This is crucial! It allows the agent to complete a full sign-in with a remote-controlled browser without extra assistance. It knows that emails are being logged thanks to a CLAUDE.md instruction, and it automatically consults the log for the necessary link to click.
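The pattern is straightforward; here is a minimal sketch of it in Python (the app in question is written in Go, so this is purely illustrative):

    # Hypothetical sketch of the "log emails instead of sending them in debug mode" idea.
    import logging
    import smtplib
    from email.message import EmailMessage

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("mailer")
    DEBUG = True  # the agent runs the app in debug mode

    def send_email(to: str, subject: str, body: str) -> None:
        if DEBUG:
            # Goes to stdout / the dev log, so the agent can fish the sign-in link out of it.
            log.info("email to %s: %s\n%s", to, subject, body)
            return
        msg = EmailMessage()
        msg["To"] = to
        msg["Subject"] = subject
        msg.set_content(body)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)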

Does it count as a tool? In my book it does.

It's All About Speed

Agentic coding's inefficiency largely arises from inference cost and suboptimal tool usage. Let me reiterate: quick, clear tool responses are vital. What we did not talk about yet is that some tools are "emergent," temporarily written by agents themselves. Quick compilation and execution significantly boost productivity of the agent. So how can we help it?

With the right instructions, it must be possible for the AI to create a new tool by following existing conventions, and to do so very quickly. This is necessary because you want the AI to write some new code and run it. There is a big difference in the quality and speed of the flow between a tool that takes 3ms to run and one that compiles for 5 seconds, needs another minute to boot and connect to the database and Kafka broker, and then dumps 100 lines of nonsensical log output.

If your stuff is indeed slow, then consider vibe-coding a daemon that you can dynamically load stuff into. As an example, Sentry takes too long to reload code and too long to restart. To trial some agentic coding there, my workaround was a module that watches a file system location and just imports and executes all Python modules placed there, then writes the output into a log it can cat. That's not perfect, but it was a significant help for the agent to evaluate some basic code in the context of the application.
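A minimal sketch of that workaround might look like this; the paths are made up and the real module was specific to the Sentry codebase:

    # Rough sketch: watch a directory, import and execute any Python module dropped into
    # it, and append its output to a log the agent can cat.
    import contextlib
    import importlib.util
    import time
    from pathlib import Path

    WATCH_DIR = Path("/tmp/agent-scratch")
    LOG_FILE = Path("/tmp/agent-scratch.log")

    def run_module(path: Path) -> None:
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        with LOG_FILE.open("a") as log, contextlib.redirect_stdout(log):
            print(f"--- running {path.name} ---")
            spec.loader.exec_module(module)

    def watch() -> None:
        # Meant to be started from inside the running application (e.g. a background
        # thread), so the executed snippets can poke at the app's own code and state.
        seen: set[Path] = set()
        WATCH_DIR.mkdir(exist_ok=True)
        while True:
            for path in sorted(WATCH_DIR.glob("*.py")):
                if path not in seen:
                    seen.add(path)
                    run_module(path)
            time.sleep(0.5)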

Balancing log verbosity is crucial: informative yet concise logs optimize token usage and inference speed, avoiding unnecessary costs and rate limits. If you cannot find the balance, provide some easy to turn knobs for the AI to control.

In an ideal setup you get useful log output as a natural byproduct of the agent writing code. Getting observability from the first shot of code generation beats writing code, failing to run it, and only then going back to a debug loop where debug information is added.

Stability and Copy/Paste

Stable ecosystems are what you really want. LLMs are great with Go and they love to use Flask, because those are quite stable ecosystems with little churn. The same thing is true for your codebase. The AI likes to leave all kinds of breadcrumbs lying around when writing code that can turn into confusion later. For instance I have seen agents leave useful comments about why they chose one path over another. If you willy-nilly let the AI upgrade libraries so that some of those decisions no longer make sense, you might now have the AI continue following a now-outdated pattern.

In theory this should be the same for agents and humans, but the reality is that agents make upgrades so “cheap” that it's tempting to just let the AI do it and see if tests still pass. I do not find this to be a successful path at all. Be even more conservative about upgrades than before.

Likewise with AI I strongly prefer more code generation over using more dependencies. I wrote about why you should write your own code before, but the more I work with agentic coding, the more I am convinced of this.

Write Simple Code

Simple code significantly outperforms complex code in agentic contexts. I just recently wrote about ugly code and I think in the context of agents this is worth re-reading. Have the agent do “the dumbest possible thing that will work”.

  • Prefer functions with clear, descriptive and longer than usual function names over classes.

  • Avoid inheritance and overly clever hacks.

  • Use plain SQL. I mean it. You get excellent SQL out of agents and they can match the SQL they write with the SQL logs. That beats them min-maxing your ORM's capabilities and getting lost in the SQL output in a log.

  • Keep important checks local. You really want to make sure that permission checks are very clear to the AI, and that they take place where the AI can see them (a minimal sketch follows this list). Hiding permission checks in another file or some config file will almost guarantee that the AI forgets to add permission checks when adding new routes.
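Here is a minimal sketch of what those last two points can look like together; the schema and the ownership rule are made up:

    # Minimal sketch of "plain SQL, permission check kept where the AI can see it".
    import sqlite3

    def list_projects(db: sqlite3.Connection, user_id: int) -> list[tuple]:
        # The permission rule (only the owner's projects) sits right in the query,
        # not hidden in an ORM layer or a separate policy file.
        return db.execute(
            "SELECT id, name FROM projects WHERE owner_id = ? ORDER BY name",
            (user_id,),
        ).fetchall()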

Make It Parallelizable

Agents aren't exceptionally fast individually, but parallelization boosts overall efficiency. Find a way to manage shared states like the file system, databases, or Redis instances so that you can run more than one. Avoid them, or find a way to quickly segment stuff out.

Your initial shared state is just the file system and a second check-out will do. But really I don't have an amazing solution yet. There are some good initial attempts. For instance one of the tools to watch is container-use. It's an MCP server that instructs Claude or other agents to run their experiments entirely in Docker.

Then there are tools like Cursor's background agents and Codex, which are moving this entire process into CI, which will be interesting. So far I don't think this is working for me yet, but let's see again in a month.

Learn To Refactor

Agentic coding alters refactoring priorities. Agents handle tasks effectively until project complexity surpasses some manageable threshold. “Too big” here is defined by the total amount of stuff the agent has to consider. So for instance you can vibe-code your frontend together for a while, but eventually you reach the point where you absolutely need to tell it to make a component library. Why? Because if the total Tailwind class mess is scattered across 50 files, you will find it very hard to get the AI to do redesigns or extract components without major regressions.

An agentic workflow encourages good code maintenance and refactoring at the right moment. You don't want to do it too early and you definitely do not want to do it too late.

What Next?

Agentic coding is rapidly evolving, and my workflow today may look dramatically different tomorrow. What's clear though is that integrating agents into your development process can unlock significant productivity gains. I encourage you to keep experimenting. The tools and techniques will evolve, but the core principles — simplicity, stability, observability and smart parallelization — will remain essential.

Ultimately, the goal is not just to leverage agents to write code faster, but to write better, more maintainable, and resilient code. Already today the code looks nothing like the terrible slop from a few months ago. Stay adaptable, and happy coding!

  1. This is not an advertisement for Claude Code. It's just the agent I use at the moment. What else is there? Alternatives that are similar in their user experience are OpenCode, goose, Codex and many others. There are also Devin and Cursor's background agents, but they work a bit differently in that they run in the cloud.

GenAI Criticism and Moral Quandaries

2025-06-10 08:00:00

I've received quite a bit of feedback on the last thing I wrote about AI, particularly around the idea that I'm too quick to brush aside criticism. Given that Glyph — who I respect a lot — wrote a lengthy piece on why he's largely opting out of AI, with some thoughtfully articulated criticism, I thought it would be a good opportunity to respond.

Focusing on Code

For this discussion, I'm focusing on AI as a tool for generating text and code — not images, video, or music. My perspective is that there's a clear difference between utilitarian outputs (code, simple text) and creative outputs that are meant to evoke emotion (art, music, well-articulated writing, etc.). For example, when I get an email from a real estate broker, I expect clear information, not art. Similarly, when I add something to a virtual shopping cart, I don't care how artistic the code is that makes it work. In fact, even today, without AI, I'd rather not know.

So, like Glyph, I want to focus on code.

Quality of Output and Adoption

If you read my earlier post, you probably picked up that I see a lot of potential in AI. That hasn't always been my stance, and I intend to remain critical, but right now I'm quite positive about its usefulness. That is in a stark contrast to Glyph's experience.

He writes:

My experiences of genAI are all extremely bad, but that is barely even anecdata. Their experiences are neutral-to-positive. Little scientific data exists. How to resolve this?

I can't judge Glyph's experiences, and I don't want to speculate about why they differ from mine. I've certainly had my own frustrations with AI tools.

The difference, I think, is that I've learned over time how to use these tools more effectively, and that's led to better results. For me, it's not just “neutral-to-positive” — it's been astonishingly positive. As I write this, my agent is fixing code in another window for me. I recorded a video of it fixing issues in a library if you want to see what this looks like.

Glyph also argues that adoption is being forced by management and that people would not voluntarily use it:

Despite this plethora of negative experiences, executives are aggressively mandating the use of AI6. It looks like without such mandates, most people will not bother to use such tools, so the executives will need muscular policies to enforce its use.7

This doesn't match what I've seen. In my experience, people are adopting AI on their own, often before their companies are even aware.

Even at Sentry the adoption of AI happened among employees before the company even put money behind it. In fact, my memory is that only when a growing number of AI invoices showed up in IC expenses did we realize how widespread adoption had become. This was entirely ground-up. Some of my non-techy friends even need to hide their AI usage from their employers, because some companies try to prevent the adoption of AI, yet they pay for it themselves to help with their work. Some of them pay for the expensive ChatGPT subscription even!

Yes, there are companies like Shopify that put AI on their banners and are mandating this, but there are probably many more companies that leverage AI via a secret grassroots adoption.

Enjoying Programming

Glyph makes the point that with LLMs the work shifts toward code review, which is not an enjoyable part for him. For me code review is a fact of life and part of the job. That's just what we do as programmers. I don't do it because I want the person that wrote the code to grow and become a better programmer; I do it because I want code to be merged. That does not mean I do not care about the career opportunities or skills of the other person; I do! But that's an effort all on its own. Sometimes it takes place in a code review; most of the time, however, it happens in a one-on-one setting. The reality is that we're often not in the mindset of wanting personal growth when receiving review comments either.

Now I admit that I do a lot more code review than programming at the moment, but I also find it quite enjoyable. On the one hand, the novelty of a machine programming hasn't worn off yet; on the other hand, it's a very patient recipient of feedback and change requests. You just tell it stuff; you don't spend much time thinking about how the other person is going to respond, or whether it's a good idea to nitpick a small thing and put extra load on them. It's quite freeing really, and it does feel different to me than a regular code review.

So is programming still enjoyable if I don't hit the keys? For me, yes. I still write code, just less of it, and it doesn't diminish the satisfaction at all. I'm still in control, and the quality still depends on the effort I put into guiding the tool.

Energy, Climate and Stealing

Glyph doesn't talk too much about the economics and the climate impact, but he does mention them. My stance on this is rather simple: margins will erode, there will be a lot of competition, we all will pay for the inference necessary, and someone will make money. Energy usage will go up, but we need more energy even without AI as we electrify our cars. AI might change this trajectory slightly, but we had a climate problem before all of this and we will have, give or take, the same climate problem until we shift towards more renewable energy. In fact, this newly increased energy consumption might actually do us a great service here. Solar is already the cheapest energy solution 1 on the market, and if we need more, that's quite likely the source we will build more of. Particularly now that the cost of energy storage is also going down quickly.

As for copyright and “stealing”: I've always felt that copyright terms are too long, scraping is beneficial, and sharing knowledge is a net positive for humanity. That's what drew me to Open Source in the first place. Glyph argues that scrapers are more aggressive now, but I'm not sure that is actually true. I think there are just more of them. We got so used to it being mostly a handful of search engines doing the scraping, which lowered its cost for everyone. I tend to think that more competition is good here and we might just have to accept it for a little while.

Educational Impact

I addressed this in my previous article, but I believe LLMs have significant potential to improve learning. Glyph disagrees, partly because of concerns about cheating and the belief that AI will make it worse:

LLMs are making academic cheating incredibly rampant. […] For learning, genAI is a forklift at the gym. […] But it was within those inefficiencies and the inconveniences of the academic experience that real learning was, against all odds, still happening in schools.

I disagree strongly here. This is where I have the most first-hand experience, considering time spent with AI. Since the early days of ChatGPT, I've used LLMs extensively for learning. That's because I'm not great at learning from books, and I have found LLMs to make the process much more enjoyable and helpful to me.

To give you some idea of how useful this can be, here is an excellent prompt that Darin Gordon shared for getting a GPT to act as a teacher of algorithms using the Socratic method: socratic_fp_learning.md. It even works super well if you dumb it down. I had it explain to my son how hash tables work, and I modified the prompt to help him understand entropy. It's surprisingly effective.

Now, that does not do much about the cheating part. But surely in a situation where students cheat, it wasn't about learning in the first place; it was about passing a test. That has not much to do with learning, but with performance assessment. When you feel the need to cheat, you probably did not learn something properly in the first place. AI might just make these pre-existing problems more visible, and even Glyph acknowledged that.

AI may complicate things for educators in the near term, but it can also offer real improvements. Either way, education needs reform to adapt to present realities.

Fatigue and Surrender

Glyph concludes by sharing that the pace of change is overwhelming him and opting out feels like the only sane response. I understand that. The pace of AI advancement can make anyone feel like they're falling behind and I too feel like that sometimes.

I offer a different view: just assume AI will win out and we will see agents! Then the path that takes us to that future is less relevant. Many of the things that are currently asking for people's attention are going to look different in a few years — or might not even exist any longer. I initially used GitHub Copilot, just to move to Cursor, and now to mostly move to Claude Code; maybe I will be back with Cursor's background agents in a month. First there was v0, then there was Lovable; who knows what there will be in a year. But the path for me is pretty clear: it's going towards me working together with the machine. I find that thought very calming, and it takes out the stress. Taking a positive view gives you a form of excited acceptance of the future.

In Closing

I really don't want to dismiss anyone's concerns. I just feel that, for me, the utility of these tools has become obvious enough that I don't feel the need to argue or justify my choices anymore.

AI Changes Everything

2025-06-04 08:00:00

At the moment I'm working on a new project. Even over the last two months, the way I do this has changed profoundly. Where I used to spend most of my time in Cursor, I now mostly use Claude Code, almost entirely hands-off.

Do I program any faster? Not really. But it feels like I've gained 30% more time in my day because the machine is doing the work. I alternate between giving it instructions, reading a book, and reviewing the changes. If you had told me even just six months ago that I'd prefer being an engineering lead to a virtual programmer intern over hitting the keys myself, I would not have believed it. I can go make a coffee, and progress still happens. I can be at the playground with my youngest while work continues in the background. Even as I'm writing this blog post, Claude is doing some refactorings.

While all this is happening, I've found myself reflecting a lot on what AI means to the world and I am becoming increasingly optimistic about our future. It's obvious now that we're undergoing a tremendous shift. AI is out of the bottle, and there's no putting it back. Even if we halted all progress today, froze the weights, halted the training, the systems already out there would still reshape how we live, work, learn, and communicate to one another.

What took longer to accept, however, is just how profound that change really is. As an engineer coming from a world of deterministic things, who deeply values the craft of engineering, the messiness of what agents are doing took a while to digest. It took me a while to even warm up to tool usage by AI in the first place — just two years ago I was convinced AI might kill my wife. In those two years however we've come incredibly far. We have reached the point where, even if we stopped here (and there is no indication we will), AI is already a new substrate for a lot of new innovation, ideas and creations, and I'm here for it. It has moved beyond being a curious tool.

Never before have I seen a technology surface in everyday life so quickly, so widely. Smartphone adoption felt slow in comparison. Today I can't finish a commute or coffee without spotting someone chatting with ChatGPT. I've had conversations with baristas, hairdressers, parents at the playground — people who aren't what I would consider “tech-savvy” — telling me how AI changed the way they write letters, search for recipes, help their kids with homework, or translate documents. The ripple effect is already massive. And still, the majority of the world hasn't touched these tools yet. Entire communities, professions, and economies are yet to begin exploring their transformation.

That's what makes this moment feel so strange — half revolutionary, half prelude. And yet, oddly, there are so many technologists who are holdouts. How could techies reject this change? Thomas Ptacek's piece “My AI Skeptic Friends Are All Nuts” really resonated with me. It takes a humorous stab at the pushback against AI coming from my very own circles. Why is it that so many people I've respected in tech for years — engineers, open source contributors — are the ones most resistant to what's happening? We've built something beyond what we imagined, and instead of being curious, many are dismissive and deny its capabilities. What is that?

Of course the implications are vast and real and the rapid development forces big questions. What does this mean for the education of our children? If AI can teach, explain, and personalize lessons better than a classroom of thirty ever could, what becomes of schools as we know them? And if kids grow up expecting to interact with intelligence — not just absorb content — how do we teach them to reason, create, and collaborate in ways that leverage this power without becoming dependent on it?

On the global stage, there are also ramifications that seem more fundamental than in previous cycles. It does not feel like the rise of search engines or social media, where the rest of the world was satisfied with being a consumer of US infrastructure. This feels more like the invention of the steam engine. Once it existed, there was no future without it. No country could afford to stay on the sidelines. But steam engines also quickly became commoditized, and there was plenty of competition among manufacturers. It was just too obvious a technological leap. With AI, every nation, every corporation will want its own models, its own control, its own stake in the future.

And so, as I alternate between delegating tasks to Claude and reading something thoughtful in between, I can't help but feel excited about being there when we're at the beginning of something irreversible and expansive.

I understand why it's tempting to be cynical or fearful. For sure the job of programmers and artists will change, but they won't vanish. I feel like all my skills that I learned as a programmer are more relevant than ever, just with a new kind of tool. Likewise the abundance of AI generated art also makes me so much more determined that I want to hire an excellent designer as soon as I can. People will always value well crafted products. AI might raise the bar for everybody all at once, but it's the act of careful deliberation and very intentional creation that sets you apart from everybody else.

Sure, I may have good personal reasons to be optimistic. But the more time I spend with these tools, the more I believe that optimism is the more reasonable stance for everyone. AI can dramatically increase human agency when used well. It can help us communicate across cultures. It can democratize access to knowledge. It can accelerate innovation in medicine, science, and engineering.

Right now it's messy and raw, but the path is clear: we are no longer just using machines, we are now working with them. And while it's early, I think we'll look back at this decade the way past generations looked at electricity or the printing press — not as a curiosity, but as the moment when everything changed.

I encourage you not to meet that moment with cynicism or fear: meet it with curiosity, responsibility, and the conviction that this future will be bright and worth embracing.