2025-07-30 08:00:00
Using Claude Code and other agentic coding tools has become all the rage. Not only are these tools getting millions of downloads, they are also gaining features that help streamline workflows. As you know, I got very excited about agentic coding back in May, and I’ve spent considerable time since then trying out the many new features that have been added.
But oddly enough, very little of what I attempted actually stuck. Most of my experiments didn’t last, and I thought it might be interesting to share what didn’t work. This doesn’t mean these approaches can’t work or are bad ideas; it just means I didn’t manage to make them work for me. Maybe there’s something in these failures for others to learn from.
The best way to think about the approach that I use is:
Non-working automations turn out to be quite common. Either I can’t get myself to use them, I forget about them, or I end up fine-tuning them endlessly. For me, deleting a failed workflow helper is crucial. You don’t want unused Claude commands cluttering your workspace and confusing others.
So I end up doing the simplest thing possible most of the time: just talk to the machine more, give it more context, keep the audio input going, and dump my train of thought into the prompt. That is 95% of my workflow. The rest is mostly good use of copy/paste.
Slash commands allow you to preload prompts to have them readily available in a session. I expected these to be more useful than they ended up being. I do use them, but many of the ones that I added I ended up never using.
There are some limitations with slash commands that make them less useful than they could be. One limitation is that there’s only one way to pass arguments, and it’s unstructured. This proves suboptimal in practice for my uses. Another issue I keep running into with Claude Code is that if you do use a slash command, the argument to the slash command for some reason does not support file-based autocomplete.
To make them work better, I often ask Claude to use the current Git state to determine which files to operate on. For instance, I have a command in this blog that fixes grammar mistakes. It operates almost entirely from the current git status context because providing filenames explicitly is tedious without autocomplete.
Here is one of the few slash commands I actually do use:
## Context
- git status: !`git status`
- Explicitly mentioned file to fix: "$ARGUMENTS"
## Your task
Based on the above information, I want you to edit the mentioned file or files
for grammar mistakes. Make a backup (eg: change file.md to file.md.bak) so I
can diff it later. If the backup file already exists, delete it.
If a blog post was explicitly provided, edit that; otherwise, edit the ones
that have pending changes or are untracked.
My workflow now assumes that Claude can determine which files I mean from the Git status virtually every time, making explicit arguments largely unnecessary.
Here are some of the many slash commands that I built at one point but ended up not using:
/fix-bug
: I had a command that instructed Claude to fix bugs by pulling issues from GitHub and adding extra context. But I saw no meaningful improvement over simply mentioning the GitHub issue URL and voicing my thoughts about how to fix it.

/commit
: I tried getting Claude to write good commit messages, but they never matched my style. I stopped using this command, though I haven’t given up on the idea entirely.

/add-tests
: I really hoped this would work. My idea was to have Claude skip tests during development, then use an elaborate reusable prompt to generate them properly at the end. But this approach wasn’t consistently better than automatic test generation, which I’m still not satisfied with overall.

/fix-nits
: I had a command to fix linting issues and run formatters. I stopped using it because it never became muscle memory, and Claude already knows how to do this. I can just tell it “fix lint” in the CLAUDE.md file without needing a slash command.

/next-todo
: I track small items in a to-do.md file and had a command to pull the next item and work on it. Even here, workflow automation didn’t help much. I use this command far less than expected.

So if I’m using fewer slash commands, what am I doing instead?
Copy/paste is really, really useful because of how fuzzy LLMs are. For instance, I maintain link collections that I paste in when needed. Sometimes I fetch files proactively, drop them into a git-ignored folder, and mention them. It’s simple, easy, and effective. You still need to be somewhat selective to avoid polluting your context too much, but compared to having it spelunk in the wrong places, more text doesn’t harm as much.
I tried hard to make hooks work, but I haven’t seen any efficiency gains from them yet. I think part of the problem is that I use yolo mode. I wish hooks could actually manipulate what gets executed. The only way to guide Claude today is through denies, which don’t work in yolo mode. For instance, I tried using hooks to make it use uv instead of regular Python, but I was unable to do so. Instead, I ended up preloading executables on the PATH that override the default ones, steering Claude toward the right tools.
For instance, this is really my hack for making it use `uv run python` instead of `python` more reliably:
#!/bin/sh
echo "This project uses uv, please use 'uv run python' instead."
exit 1
I really just have a bunch of these in `.claude/interceptors` and preload that folder onto `PATH` before launching Claude:
CLAUDE_BASH_MAINTAIN_PROJECT_WORKING_DIR=1 \
PATH="`pwd`/.claude/interceptors:${PATH}" \
claude --dangerously-skip-permissions
I also found it hard to hook into the right moment. I wish I could run formatters at the end of a long edit session. Currently, you must run formatters after each Edit tool operation, which often forces Claude to re-read files, wasting context. Even with the Edit tool hook, I’m not sure if I’m going to keep using it.
I’m actually really curious whether people manage to get good use out of hooks. I’ve seen some discussions on Twitter that suggest there are some really good ways of making them work, but I just went with much simpler solutions instead.
I was initially very bullish on Claude’s print mode. I tried hard to have Claude generate scripts that used print mode internally. For instance, I had it create a mock data loading script — mostly deterministic code with a small inference component to generate test data using Claude Code.
The challenge is achieving reliability, which hasn’t worked well for me yet. Print mode is slow and difficult to debug. So I use it far less than I’d like, despite loving the concept of mostly deterministic scripts with small inference components. Whether using the Claude SDK or the command-line print flag, I haven’t achieved the results I hoped for.
I’m drawn to Print Mode because inference is too much like a slot machine. Many programming tasks are actually quite rigid and deterministic. We love linters and formatters because they’re unambiguous. Anything we can fully automate, we should. Using an LLM for tasks that don’t require inference is the wrong approach in my book.
That’s what makes print mode appealing. If only it worked better. Use an LLM for the commit message, but regular scripts for the commit and gh pr commands. Make mock data loading 90% deterministic with only 10% inference.
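As a minimal sketch of that commit-message idea (assuming the `claude` CLI and its `-p` print flag are on PATH; the script itself is hypothetical, not something I actually ship), the only inference step is writing the message, and plain git does the rest:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: inference only for the commit message,
plain git for everything else."""
import subprocess


def run(*args: str) -> str:
    # Run a command and return its stdout as text.
    return subprocess.run(list(args), check=True, capture_output=True, text=True).stdout


def main() -> None:
    diff = run("git", "diff", "--staged")
    if not diff.strip():
        raise SystemExit("nothing staged")
    # The only inference step: ask Claude (print mode) for a message.
    message = run(
        "claude", "-p",
        "Write a single concise commit message line for this diff:\n\n" + diff,
    ).strip()
    # Everything else stays deterministic: regular git performs the commit.
    subprocess.run(["git", "commit", "-m", message], check=True)


if __name__ == "__main__":
    main()
```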
I still use it, but I see more potential than I am currently leveraging.
I use the task tool frequently for basic parallelization and context isolation. Anthropic recently launched an agents feature meant to streamline this process, but I haven’t found it easier to use.
Sub-tasks and sub-agents enable parallelism, but you must be careful. Tasks that don’t parallelize well — especially those mixing reads and writes — create chaos. Outside of investigative tasks, I don’t get good results. While sub-agents should preserve context better, I often get better results by starting new sessions, writing thoughts to Markdown files, or even switching to o3 in the chat interface.
What’s interesting about workflow automation is that without rigorous rules that you consistently follow as a developer, simply taking time to talk to the machine and give clear instructions outperforms elaborate pre-written prompts.
For instance, I don’t use emojis or commit prefixes. I don’t enforce templates for pull requests either. As a result, there’s less structure for me to teach the machine.
I also lack the time and motivation to thoroughly evaluate all my created workflows. This prevents me from gaining confidence in their value.
Context engineering and management remain major challenges. Despite my efforts to help agents pull the right data from various files and commands, they don’t yet succeed reliably. They pull in too much or too little. Long sessions lead to forgotten context from the beginning. Whether done manually or with slash commands, the results feel too random. It’s hard enough with ad-hoc approaches, but static prompts and commands make it even harder.
The rule I have now is that if I do want to automate something, I must have done it a few times already, and then I evaluate whether the agent actually gets better results through my automation. There’s no exact science to it, but right now I mostly measure it by letting the agent do the same task three times and looking at the variance manually, the yardstick being a simple question: would I accept the result?
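In practice that can be as simple as a small throwaway script. A sketch of what that might look like (assuming the `claude` CLI’s `-p` print flag; the prompt file and output paths are made up):

```python
#!/usr/bin/env python3
"""Hypothetical sketch: run the same automation prompt three times and
keep the outputs around so the variance can be reviewed by hand."""
import pathlib
import subprocess

PROMPT = pathlib.Path("prompts/fix-grammar.md").read_text()  # made-up path

for attempt in range(1, 4):
    result = subprocess.run(
        ["claude", "-p", PROMPT],
        check=True, capture_output=True, text=True,
    )
    out = pathlib.Path(f"eval/attempt-{attempt}.md")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(result.stdout)
    print(f"wrote {out}")  # manual review: would I accept this result?
```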
Forcing myself to evaluate the automation has another benefit: I’m less likely to just blindly assume it helps me.
Because there is a big hidden risk with automation through LLMs: it encourages mental disengagement. When you stop thinking like an engineer, quality drops, time gets wasted, and you stop understanding and learning. LLMs are already bad enough in that regard as it is, but whenever I lean on automation I notice it becomes even easier to disengage. I tend to overestimate the agent’s capabilities over time. There are real dragons there!
You can still review things as they land, but it becomes increasingly harder to do so later. While LLMs are reducing the cost of refactoring, the cost doesn’t drop to zero, and regressions are common.
2025-07-26 08:00:00
Last November I wrote a post about how the programming interface of threads beats the one of async/await. In May, Mark Shannon brought up the idea of virtual threads for Python on Python’s discussion board and also referred back to that article. At EuroPython we had a chat about the topic, and that reminded me that I never got around to writing part two of that article.
The first thing to consider is that async/await did actually produce one very good outcome for Python: it exposed many more people to concurrent programming. By putting a syntax element into the language itself, it brought the problem of concurrency in front of far more programmers. The unfortunate side effect is that it requires very complex internal machinery that leaks out of the runtime to the user, and it requires colored functions.
Threads, on the other hand, are in many ways a much simpler concept, but the threading APIs that have proliferated all over the place over the last couple of generations leave a lot to be desired. Without doubt, async/await in many ways improved on that.
One key property of async/await in Python is that nothing really happens until you hit an await: you’re guaranteed not to be suspended in between. Unfortunately, the recent free-threading work makes that guarantee rather pointless, because you still need to write code that is aware of other threads, so now we carry the complexity of both the async ecosystem and the threading system at all times.
This is a good moment to rethink whether we might have a better path in front of us by fully embracing threads.
Another really positive thing that came out of async in Python was that a lot of experimentation happened to improve the ergonomics of those APIs. The most important innovation has been the idea of structured concurrency. Structured concurrency is all about disallowing a task from outliving its parent. That is a really good property because a task then has an explicit relationship to its parent task, which makes the flow of information (such as context variables) much clearer than traditional threads and thread-local variables, where threads have effectively no real relationship to their parents.
Unfortunately, task groups (the implementation of structured concurrency in Python) are a rather recent addition, and their rather strict requirements on cancellation are often not sufficiently implemented in libraries. To understand why this matters, you have to understand how structured concurrency works. Basically, when you spawn a task as a child of another task, any sibling task that fails will also cause the cancellation of all the other ones. This requires robust cancellation.
And robust cancellation is really hard to do when some of those tasks involve real threads. For instance, the very popular `aiofiles` library uses a thread pool to move I/O operations onto an I/O thread, since there is no really good cross-platform way to get true async I/O behavior out of regular files. However, cancellation is not supported. That causes a problem: if you spawn multiple tasks, some of which are blocking on a read (with `aiofiles`) that would only succeed once another one of those tasks concludes, you can actually deadlock yourself in the presence of cancellations. This is not a hypothetical problem. There are, in fact, quite a few ways to end up in a situation where the presence of `aiofiles` in a task group causes the interpreter not to shut down properly. Worse, the exception that was actually caught by the task group stays invisible until the blocking read on the other thread pool has been interrupted by a signal such as a keyboard interrupt. This is a pretty disappointing developer experience.
In many ways, what we really want is to go back to the drawing board and say, “What does a world look like that only ever would have used threads with a better API?”
So if we have only threads, then we are back to some performance challenges that motivated asyncio in the first place. The solution for this will involve virtual threads. You can read all about them in the previous post.
One of the key parts of enabling virtual threads is also a commitment to handling many of the challenges with async I/O directly as part of the runtime. That means that if there is a blocking operation, we will have to ensure that the virtual thread is put back to the scheduler, and another one has a chance to run.
But that alone will feel a little bit like a regression because we also want to ensure that we do not lose structured concurrency.
Let’s start simple: what does Python code look like where we download an arbitrary number of URLs sequentially? It probably looks a bit like this:
def download_all(urls):
    results = {}
    for url in urls:
        results[url] = fetch_url(url)
    return results
No, this is intentionally not using any async or await, because this is not what we want. We want the most simple thing: blocking APIs.
The general behavior of this is pretty simple: we are going to download a bunch of URLs, but if any one of them fails, we’ll basically abort and raise an exception, and will not continue downloading any other ones. The results that we have collected up to that point are lost.
But how would we stick with this but introduce parallelism? How can we download more than one at a time? If a language were to support structured concurrency and virtual threads, we could achieve something similar with some imaginary syntax like this:
def download_all(urls):
    results = {}
    await:
        for url in urls:
            async:
                results[url] = fetch_url(url)
    return results
I’m intentionally using await and async here, but you can see from the usage that it’s actually inverted compared to what we have today. Here is what this would do: each `async:` block runs its body in a new virtual thread, and the enclosing `await:` block waits for all threads spawned inside it to finish before execution continues; if any of them fails, the others are cancelled and the failure propagates.
Behind the scenes, something like this would happen:
from functools import partial

def download_all(urls):
    results = {}
    with ThreadGroup():
        def _thread(url):
            results[url] = fetch_url(url)
        for url in urls:
            ThreadGroup.current.spawn(partial(_thread, url))
    return results
Note that all threads here are virtual threads. They behave like threads, but they might be scheduled on different kernel threads. If any one of those spawned threads fails, the thread group itself fails and also prevents further spawn calls from taking place. A spawn on a failed thread group would also no longer be permitted.
In the grand scheme of things, this is actually quite beautiful. Unfortunately, it does not match all that well to Python. This syntax would be unexpected because Python does not really have an existing concept of a hidden function declaration. Python’s scoping also prevents this from working all that well. Because Python doesn’t have syntax for variable declarations, Python actually only has a single scope for functions. This is quite unfortunate because, for instance, it means that a helper declared in a loop body cannot really close over the loop iteration variable.
Regardless, I think the important thing you should take away from this is that this type of programming does not require thinking about futures. Even though it could support futures, you can actually express a whole lot of programming code without needing to defer to an abstraction like that.
As a result, there are far fewer concepts to keep in mind when working with a system like this. I do not have to expose a programmer to futures or promises, async tasks, or anything like that.
Now, I don’t think that such a particular syntax would fit well into Python. And it is somewhat debatable if automatic thread groups are the right solution. You could also model this after what we have with async/await and make thread groups explicit:
from functools import partial

def download_and_store(results, url):
    results[url] = fetch_url(url)

def download_all(urls):
    results = {}
    with ThreadGroup() as g:
        for url in urls:
            g.spawn(partial(download_and_store, results, url))
    return results
This largely still has the same behavior, but it uses a little bit more explicit operations and it does require you to create more helper functions. But it still fully avoids having to work with promises or futures.
What is so important about this entire concept is that it moves a lot of the complexity of concurrent programming where it belongs: into the interpreter and the internal APIs. For instance, the dictionary in `results` has to be locked for this to work. Likewise, the APIs that `fetch_url` would use need to support cancellation, and the I/O layer needs to suspend the virtual thread and go back to the scheduler. But for the majority of programmers, all of this is hidden.
I also think that some of the APIs really aged badly for supporting well-behaved concurrent systems. For instance, I very much prefer Rust’s idea of enclosing values in a mutex over carrying a mutex somewhere on the side.
Also, semaphores are an incredibly potent tool to limit concurrency and to create more stable systems. Something like this could also become part of a thread group, directly limiting how many spawns can run simultaneously:
from functools import partial

def download_and_store(results_mutex, url):
    result = fetch_url(url)
    with results_mutex.lock() as results:
        results.store(url, result)

def download_all(urls):
    results = Mutex(MyResultStore())
    with ThreadGroup(max_concurrency=8) as g:
        for url in urls:
            g.spawn(partial(download_and_store, results, url))
    return results
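Python has no such value-enclosing mutex today. A minimal sketch of what the `Mutex` used above might look like (names and API entirely assumed, mirroring the imaginary example rather than any existing class):

```python
import threading
from contextlib import contextmanager


class Mutex:
    """Hypothetical value-enclosing mutex, similar in spirit to Rust's
    Mutex<T>: the wrapped value is only reachable while the lock is held."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    @contextmanager
    def lock(self):
        # Hand out the protected value only for the duration of the block.
        with self._lock:
            yield self._value
```

With something like this, the lock and the data it protects travel together instead of the lock being carried somewhere on the side.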
There will be plenty of reasons to use futures and they would continue to hang around. One way to get a future is to hold on to the return value of the `spawn` method:
from functools import partial

def download_all(urls):
    futures = []
    with ThreadGroup() as g:
        for url in urls:
            futures.append((url, g.spawn(partial(fetch_url, url))))
    return {url: future.result() for (url, future) in futures}
One big question is whether `spawn` should work if there is no thread group. For instance, in Trio, which is a Python async library, the decision was made that you always have to have the equivalent of a thread group (they call it a nursery) to spawn an operation. I think that this is a very sensible policy, but there are situations where you cannot really do that. I can imagine various alternatives, such as having a default thread group for background tasks that is implicitly joined when the process shuts down. However, I think the most important thing is to bring as much of the intended behavior to the default APIs.
In such a system, what would be the future of async/await? Well, that is up for debate, but it seems reasonable to find ways to keep existing asynchronous code working, while new code should no longer need it.
I would like you to consider this a conversation starter about virtual threads rather than a fully fleshed-out idea. There are a lot of open questions, particularly in the context of Python, but the idea of no longer having to deal with colored functions really appeals to me and I hope we can explore it a bit.
2025-07-20 08:00:00
This post is addressed to the Python community, one I am glad to be a member of.
I’m a product of my community. A decade ago I wrote about how much I owed the Python community. Recently I found myself reminiscing again. This year at EuroPython I even gave a brief lightning talk recalling my time in the community — it made me tear up a little.
There were two reasons for this trip down memory lane. First, I had the opportunity to be part of the new Python documentary, which brought back a flood of memories (good and bad). Second, I’ve found myself accidentally pulled towards agentic coding and vibe coders [1]. Over the last month and a half I have spoken with so many people about AI and programming and realized that a growing number of them are people I might not, in the past, have described as “programmers.” Even on the way to the conference I had the pleasure of a multi-hour discussion on the train with an air traffic controller who ventured into programming because of ChatGPT, to make his life easier.
I’m not sure where I first heard it, but I like the idea that you are what you do. If you’re painting (even your very first painting) you are a painter. Consequently if you create a program, by hand or with the aid of an agent, you are a programmer. Many people become programmers essentially overnight by picking up one of these tools.
Heading to EuroPython this year I worried that the community that shaped me might not be receptive to AI and agentic programming. Some of that fear felt warranted: over the last year I saw a number of dismissive posts in my circles about using AI for programming. Yet I have also come to realize that acceptance of AI has shifted significantly. More importantly there is pretty wide support of the notion that newcomers will and should be writing AI-generated code.
That matters, because my view is that AI will not lead to fewer programmers. In fact, the opposite seems likely. AI will bring more people into programming than anything else we have done in the last decade.
For the Python community in particular, this is a moment to reflect. Python has demonstrated its inclusivity repeatedly — think of how many people have become successful software engineers through outreach programs (like PyLadies) and community support. I can credit much of my own early career to learning from others on the Python IRC channels.
We need to pay close attention to vibe coding. Not because it might produce lower‑quality code, but because if we don’t intentionally welcome the next generation learning through these tools, they will miss out on important lessons many of us learned the hard way. It would be a mistake to treat them as outcasts or “not real” programmers. Remember that many of our first programs did not have functions, were a mess of GOTOs, and were copy/pasted together.
Every day someone becomes a programmer because they figured out how to make ChatGPT build something. Lucky for us: in many of those cases the AI picks Python. We should treat this as an opportunity and anticipate an expansion in the kinds of people who might want to attend a Python conference. Yet many of these new programmers are not even aware that programming communities and conferences exist. It’s in the Python community’s interest to find ways to pull them in.
Consider this: I can name the person who brought me into Python. But if you were brought in via ChatGPT or a programming agent, there may be no human there — just the AI. That lack of human connection is, I think, the biggest downside. So we will need to compensate: to reach out, to mentor, to create on‑ramps. To instil the idea that you should be looking for a community, because the AI won’t do that for you. We need to turn a solitary interaction with an AI into a shared journey with a community, and move newcomers towards learning the important lessons about engineering. We do not want a generation of developers held captive by companies building vibe-coding tools that have little incentive to let their users break free from those shackles.
[1] I’m using “vibe coders” here to mean people who give in to having the machine program for them. I believe that many programmers will start this way before they transition to more traditional software engineering.
2025-07-03 08:00:00
If you’ve been following me on Twitter, you know I’m not a big fan of MCP (Model Context Protocol) right now. It’s not that I dislike the idea; I just haven’t found it to work as advertised. In my view, MCP suffers from two major flaws: it isn’t really composable, because most composition has to happen through inference, and it demands far too much context.
A quick experiment makes this clear: try completing a GitHub task with the GitHub MCP, then repeat it with the `gh` CLI tool. You’ll almost certainly find the latter uses context far more efficiently and you get to your intended result quicker.
I want to address some of the feedback I’ve received on my stance on this. I evaluated MCP extensively in the context of agentic coding, where its limitations were easiest to observe. One piece of feedback is that MCP might not make a ton of sense for general code generation, because models are already very good at that, but that it makes a lot of sense for end-user applications, like, say, automating a domain-specific task in a financial company. Another is that I need to look at the world of the future, where models will be able to reach many more tools and handle much more complex tasks.
My current take is that my data indicates current MCP will always be harder to use than writing code, primarily due to the reliance on inference. If you look at today’s approaches for pushing towards higher tool counts, the proposals all include a layer of filtering: you pass all your tools to an LLM and ask it to filter them down based on the task at hand. So far, not many better approaches have been proposed.
The main reason I believe this will most likely also hold true — that you shouldn’t be using MCP in its current form even for non-programming, domain-specific tasks — is that even in those cases code generation just is the better choice because of the ability to compose.
The way to think about this problem is that when you don’t have an AI and you’re solving a problem as a software engineer, your tool of choice is code. Perhaps as a non-software engineer, code is out of reach. Many, many tasks people do by hand are actually automatable through software. The challenge is finding someone to write that software. If you’re working in a niche environment and you’re not a programmer yourself, you might not pick up a programming book to learn how to code, and you might not find a developer willing to build you a custom piece of software for your specific problem. And yes, maybe your task requires some inference, but many don’t need it at every step.
There is a reason we say “to replace oneself with a shell script”: it’s because that has been happening for a long time. With LLMs and programming, the idea is that rather than replacing yourself with a shell script, you’re replacing yourself with an LLM. But you run into three problems: cost, speed, and general reliability. All of these need to be dealt with before we can even think about tool usage or MCP. We need to figure out how to ensure that our automated task actually works correctly at scale.
The key to automation is really to automate things that will happen over and over. You’re not going to automate a one-shot change that will never recur. You’re going to automate the things where the machine can truly give you a productivity boost: you do it once or twice, figure out how to make it work, and then have the machine repeat it a thousand times. For that repetition, there’s a very strong argument for always using code. If we instruct the machine to use inference every time, it might work, particularly for small tasks, but it requires validation, which can take almost as long as doing the task in the first place. Getting an LLM to calculate for you sort of works, but it’s much better for the LLM to write the Python code that does the calculation. Why? First, you can review the formula rather than the calculated result. We can write it ourselves, or we can use an LLM as a judge to figure out whether the approach is correct. You don’t really have to validate that Python calculates correctly; you can rely on that. So, by opting for code generation to solve the task, we get a little closer to being able to verify and validate the process ourselves, rather than hoping the LLM inferred correctly.
This obviously goes way beyond calculation. Take, for instance, this blog. I converted this entire blog from reStructuredText to Markdown recently. I put this conversion off for a really long time, partly because I was a little too lazy. But also, when I was lazy enough to consider deploying an LLM for it, I just didn’t trust it to do the conversion itself without regressing somewhere. I was worried that if it ran out of context, it might start hallucinating text or change wording slightly. It’s just that I worried about subtle regressions too much.
I still used an LLM for it, but I asked it to do that transformation in a different way: through code.
I asked the LLM to perform the core transformation from reStructuredText to Markdown but I also asked it to do this in a way that uses the underlying AST (Abstract Syntax Tree). So, I instructed it to parse the reStructuredText into an actual reStructuredText AST, then convert that to a Markdown AST, and finally render it to HTML, just like it did before. This gave me an intermediate transformation step and a comparable end result.
Then, I asked it to write a script that compares the old HTML with the new HTML and performs the diffing after some basic cleanup it deemed necessary for comparison. I asked it to consider what kinds of conversion errors were actually acceptable. So it read through its own scripts to see where it might not match the original output due to known technical limitations (e.g., footnotes render differently between the Markdown library I’m using and the reStructuredText library, so even if the syntax converts correctly, the HTML would look different). I asked it to compensate for this in that script.
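To make the shape of that comparison step concrete, here is a heavily simplified sketch (the directory layout and cleanup rules are invented; the real script was generated by the agent for my specific setup and did considerably more):

```python
#!/usr/bin/env python3
"""Hypothetical sketch: compare old (reST-rendered) and new (Markdown-rendered)
HTML after normalizing differences that are known to be acceptable."""
import difflib
import pathlib
import re


def normalize(html: str) -> list[str]:
    # Put each tag on its own line and drop attributes that legitimately
    # differ between the two renderers (e.g. footnote ids and classes).
    html = re.sub(r">\s*<", ">\n<", html)
    html = re.sub(r'\s+(id|class)="[^"]*"', "", html)
    return [line.strip() for line in html.splitlines() if line.strip()]


if __name__ == "__main__":
    old_dir = pathlib.Path("build/old")   # made-up layout
    new_dir = pathlib.Path("build/new")
    for old_file in sorted(old_dir.glob("*.html")):
        new_file = new_dir / old_file.name
        diff = list(difflib.unified_diff(
            normalize(old_file.read_text()),
            normalize(new_file.read_text()),
            fromfile=str(old_file), tofile=str(new_file), lineterm="",
        ))
        if diff:
            print(f"{old_file.name}: {len(diff)} differing lines")
```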
After that was done, I asked it to create a third script, which I could run over the output of hundreds of files to analyze the differences and feed them back into the agentic loop for another iteration step.
Then I kicked this off in a loop. I did not give it all the posts at once; I started with 10 until the differences were low, and then had it run over all of them. It did this for maybe 30 minutes or so until I came back and found it in a pretty acceptable state.
What’s key about this transformation is not so much that the LLM was capable of pulling it off, but that I actually trusted the process at the end because I could review the approach. Not only that, I also asked another LLM what it thought of the code the first one wrote, and of the changes. That gave me much higher confidence that nothing was going to lose data. It felt right to me. It felt like a mechanical process that was fundamentally correct, and I was able to observe it and do spot checks. At worst, the regressions would be minor Markdown syntax errors; the text itself wouldn’t have been corrupted.
Another key point is that the cost of inference in this process scales with the number of iteration steps and the sample size, not with how many documents I want to convert overall. Eventually I just had it run over all documents every time, but running it over 15 docs vs. 150 docs is more or less the same effort, because the final LLM-based analysis step did not have many more things to review (it already skipped over all the minor differences in the files).
This is a long-winded way of saying that this entire transformation went through code. It’s a pipeline that starts with human input, produces code, does an LLM as a judge step and iterates. And you can take this transformation and apply it to a general task as well.
To give an example, one MCP you might be using is Playwright. I find it very hard to replace Playwright with a code approach for all cases because what you’re essentially doing is remotely controlling your browser. The task you’re giving it largely involves reading the page, understanding what’s on it, and clicking the next button. That’s the kind of scenario where it’s very hard to eliminate inference at each step.
However, if you already know what the page is — for instance, if you’re navigating your own app you’re working on — then you can actually start telling it to write a Playwright Python script instead and run that. This script can perform many of those steps sequentially without any inference. I’ve noticed that this approach is significantly quicker, and because it understands your code, it still generally produces correct results. It doesn’t need to navigate, read page contents, find a button, or press an input in real-time. Instead, it will write a single Python script that automates the entire process in one go, requiring very little context by comparison.
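For instance, instead of having an MCP click through the page one inference step at a time, the agent can emit a plain Playwright script and run it. A rough sketch (the URL, selectors, and assertion are placeholders for whatever your app actually exposes):

```python
#!/usr/bin/env python3
"""Hypothetical sketch: a one-shot Playwright script that replaces an
inference-per-step browser session for a flow you already understand."""
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:5000/login")    # placeholder URL
    page.fill("#email", "test@example.com")     # placeholder selectors
    page.fill("#password", "hunter2")
    page.click("button[type=submit]")
    page.wait_for_url("**/dashboard")
    assert page.locator("h1").inner_text() == "Dashboard"
    browser.close()
```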
This process is repeatable. Once the script is written, I can execute it 100, 200, or even 300 times without requiring any further inference. This is a significant advantage that an MCP typically cannot offer. It’s incredibly challenging to get an LLM to understand generic, abstract MCP tool calls. I wish I could, for example, embed an MCP client directly into a shell script, allowing me to run remote MCP services efficiently via code generation, but actually doing that is incredibly hard because the tools are not written with non-inference-based automation in mind.
Also, ironic as it is: I’m a human, not an MCP client. I can run and debug a script; I can’t even figure out how to reliably make MCP calls by hand. It’s always a gamble and incredibly hard to debug. I love using the little tools that Claude Code generates for itself along the way while writing code. Some of those I had it turn into long-term additions to my development process.
I don’t know. But it’s an interesting moment to think about what we could do to make code generation for purposeful agentic coding better. The weird thing is that MCP is actually pretty great when it works. But in its current form it feels too much like a dead end that cannot be scaled up, particularly for automation at scale, because it relies too much on inference.
So maybe we need to look for a better abstraction that combines what MCP is great at with code generation. For that we might need to build better sandboxes, and maybe start looking at how we can expose APIs in ways that allow an agent to do some sort of fan-out/fan-in for inference. Effectively, we want to do as much as we can in generated code, and then use the magic of LLMs after the bulk code execution to judge what we did.
I can also imagine that it might be quite interesting to do code generation in a way that also provides enough context for an LLM to explain in human language to a non programmer what the script is doing. That might enable these flows to be used by human users that are not developers themselves.
In any case I can only encourage people to bypass MCP and to explore what else is possible. LLMs can do so much more if you give them the power to write code.
Here are some more posts you might want to read or videos you might want to watch:
2025-06-21 08:00:00
I’m currently evaluating how different models perform when generating XML versus JSON. Not entirely unexpectedly, XML is doing quite well — except for one issue: the models frequently return invalid XML. That made it difficult to properly assess the quality of the content itself, independent of how well the models serialize data. So I needed a sloppy XML parser.
After a few failed attempts at getting Claude to just fix up my XML parsing in different ways (it tried html5lib, the html lxml parser, etc.), all of which resulted in various kinds of amusing failures, I asked Claude to ultrathink and write me a proper XML library from scratch. I gave it some basic instructions for what this should look like and it one-shotted something.
Afterwards I prompted it ~20 more times to do various smaller fixes in response to me (briefly) reviewing and using it, and to create an extensive test suite.
While that was taking place I had 4o create a logo. After that I quickly converted it into an SVG with Illustrator and had Claude make it theme-aware for dark and light modes, which it did perfectly.
On top of that, Claude fully set up CI and even remotely controlled my browser to configure the trusted PyPI publisher for the package for me.
In summary, here is what AI did:
- It wrote ~1100 lines of code for the parser
- It wrote ~1000 lines of tests
- It configured the entire Python package, CI, PyPI publishing
- Generated a README, drafted a changelog, designed a logo, made it theme-aware
- Did multiple refactorings to make me happier
The initial prompt that started it all (including typos):
I want you to implement a single-file library here that parses XML sloppily. It should implement two functions:

- `stream_parse` which yields a stream of events (use named tuples) for the XML stream
- `tree_parse` which takes the output of `stream_parse` and assembles an element tree. It should default to xml.etree.ElementTree and optoinally allow you to provide lxml too (or anything else)

It should be fast but use pre-compiled regular expressions to parse the stream. The idea here is that the output comes from systems that just pretend to speak XML but are not sufficiently protected against bad outoput (eg: LLMs)

So for instance `&amp;` should turn into `&` but if `&x` is used (which is invalid) it should just retain as `&x` in the output. Additionally if something is an invalid CDATA section we just gracefully ignore it. If tags are incorrectly closed within a larger tag we recover the structure. For instance `<foo><p>a<p>b</foo>` will just close the internal structures when `</foo>` comes around.

Use ultrathink. Break down the implementation into

- planning
- api stubs
- implementation

Use sub tasks and sub agents to conserve context
Now if you look at that library, you might not find it amazingly beautiful. It probably is a bit messy and might have a handful of bugs I haven’t found yet. It however works well enough for me right now for what I’m doing and it definitely unblocked me. In total it worked for about 30-45 minutes to write the initial implementation while I was doing something else. I kept prompting it for a little while to make some progress as I ran into issues using it.
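To give a sense of the shape of the API, usage looks roughly like this (a sketch derived from the prompt above; the module name and the exact event shape are assumptions, not documented behavior):

```python
# Hypothetical usage sketch of the two-function API described in the prompt.
from sloppy_xml import stream_parse, tree_parse  # module name assumed

broken = "<foo><p>a<p>b</foo>"  # LLM-style output with badly closed tags

# Low level: a stream of named-tuple events, even for malformed input.
for event in stream_parse(broken):
    print(event)

# High level: assemble an ElementTree despite the sloppy structure.
root = tree_parse(stream_parse(broken))
print(root.tag)  # "foo"
```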
If you want to look at what it looks like:
To be clear: this isn’t an endorsement of using models for serious Open Source libraries. This was an experiment to see how far I could get with minimal manual effort, and to unstick myself from an annoying blocker. The result is good enough for my immediate use case and I also felt good enough to publish it to PyPI in case someone else has the same problem.
Treat it as a curious side project which says more about what’s possible today than what’s necessarily advisable.
Postscriptum: Yes, I did slap an Apache 2 license on it. Is that even valid when there’s barely a human in the loop? A fascinating question, but not one I’m eager to figure out myself. It is, however, certainly something we’ll all have to confront sooner or later.
2025-06-17 08:00:00
This week I spent time with friends letting agents go wild to see what we could build in 24 hours. I took some notes for myself to reflect on that experience. I won’t bore you with another vibe-coding post, but you can read Peter’s post about how it went.
As fun as it was, it was also frustrating in entirely predictable ways. It became a meme how much I hated working with Xcode for this project. It got me thinking: this has been an unacceptable experience for a long time, but with programming agents, the pain becomes measurable.
When I first dove into programming I found the idea of RTFM quite hilarious. “Why are you asking dumb questions, just read it up.” The unfortunate reality is that the manual often doesn’t exist — or is wrong. In fact, we as engineers are quite willing to subject each other to completely inadequate tooling, bad or missing documentation and ridiculous API footguns all the time. “User error” is what we used to call this; nowadays it’s a “skill issue”. It puts the blame on the user and absolves the creator, at least momentarily. For APIs it can be random crashes if you use a function wrong; for programs it can be an impossible-to-navigate UI or a lack of error messages. There are many different ways in which we humans get stuck.
What agents change about this is that I can subject them to something I wouldn’t really want to subject other developers to: measurement. I picked the language for my current project by running basic evals, and it worked well. I learned from that that there are objectively better and worse languages for my particular problem. The choice, however, is not just about how much the AI knows about the language from its training corpus. It’s also tooling, the inherent capabilities of the language, ecosystem churn and other aspects.
Using agents to measure code quality is great because agents don’t judge me, but they do judge the code they are writing. Not all agents will swear, but they will express frustration with libraries when loops don’t go well, or give up. That opens up an opportunity to measure not agent performance, but the health of a project.
We should pay more attention to how healthy engineering teams are, and that starts with the code base. Using agents, we can put numbers to it in a way we cannot with humans (or only very slowly and expensively). We can figure out, in fairly objective ways, how successful agents are at using the things we are creating, which is in many ways a proxy for how humans experience working with the code. Getting together with fresh souls to walk them through a tutorial or some tasks is laborious and expensive. Getting an agent that has never seen a codebase to start using a library is repeatable, rather cheap, fast and, if set up the right way, very objective. It also takes the emotion out of it, and you can run the experiment multiple times.
Now obviously we can have debates over whether the type of code we would write with an agent is objectively beautiful, or whether the way agents execute tools creates the right kind of tools. This is a debate worth having. Right at this very moment, though, what programming agents need to be successful is rather well aligned with what humans need.
So what works better than other things? For now these are basic indicators, for agents and humans alike:
Good test coverage: they help with future code writing but they also greatly help preventing regressions. Hopefully no surprise to anyone. I would add though that this is not just for the tests, but also for examples and small tools that a user and agent can run to validate behavior manually.
Good error reporting: a compiler, tool or an API that does not provide good error reporting is a bad tool. I have been harping on this for years when working at Sentry, but with agents it becomes even clearer that this investment pays off. It also means errors should be where they can be found. If errors are hidden in an obscure log neither human nor agent will find it.
High ecosystem stability: if your ecosystem churns a lot, if APIs keep changing you will not just upset humans, you will also slow down the agent. It will find outdated docs, examples and patterns and it will slow down / write bad code.
Few superfluous abstractions: too many layers just make data flow and refactoring expensive. We might even want to start questioning the value proposition of (most) ORMs today because of how much harder they make things.
Everything needs to be fast and user friendly: The quicker tools respond (and the less useless output they produce) the better. Crashes are tolerable; hangs are problematic. `uv`, for instance, is a much better experience in Python than the rest of the ecosystem, even though most of the ecosystem still points at `pip`. Agents are super happy to use and keep using `uv` because they get good information out of it and low failure rates.
A good dev environment: If stuff only reproduces in CI you have to move your agent into CI. That’s not a good experience. Give your agent a way to run Docker locally. If you write a backend, make sure there is a database to access and introspect, don’t just mock it out (badly). Deferring things into a CI flow is not an option. It’s also important that it’s clear when the devenv is broken vs the code is broken. For both human and agent it can be hard to distinguish this if the tooling is not set up correctly.
When an agent struggles, so does a human. There is a lot of code and tooling out there which is objectively not good but for one reason or another became dominant. If you want to start paying attention to technology choices, or you want to start writing your own libraries, you can now use agents to evaluate the developer experience.
Because so can your users. I can confidently say it’s not just me that does not like Xcode, my agent also expresses frustration — measurably so.