Armin Ronacher

I'm currently located in Austria and working as a Director of Engineering for Sentry. Aside from that I do open source development.

Rss preview of Blog of Armin Ronacher

996

2025-09-04 08:00:00

“Amazing salary, hackerhouse in SF, crazy equity. 996. Our mission is OSS.” — Gregor Zunic

“The current vibe is no drinking, no drugs, 9-9-6, […].” — Daksh Gupta

“The truth is, China’s really doing ‘007’ now—midnight to midnight, seven days a week […] if you want to build a $10 billion company, you have to work seven days a week.” — Harry Stebbings

I love work. I love working late nights, hacking on things. This week I didn’t go to sleep before midnight once. And yet…

I also love my wife and kids. I love long walks, contemplating life over good coffee, and deep, meaningful conversations. None of this would be possible if my life was defined by 12 hour days, six days a week. More importantly, a successful company is not a sprint, it’s a marathon.

And that is when it is your own company! When you devote 72 hours a week to someone else’s startup, you need to think hard about that arrangement. I find it highly irresponsible for a founder to promote that model. As a founder, you are not an employee, and your risks and leverage are fundamentally different.

I will always advocate for putting the time in because it is what brought me happiness. Intensity, and giving a shit about what I’m doing, will always matter to me. But you don’t measure that by the energy you put in, or the hours you’re sitting in the office, but the output you produce. Burning out on twelve-hour days, six days a week, has no prize at the end. It’s unsustainable, it shouldn’t be the standard and it sure as hell should not be seen as a positive sign of a company.

I’ve pulled many all-nighters, and I’ve enjoyed them. I still do. But they’re enjoyable in the right context, for the right reasons, and when that is a completely personal choice, not the basis of company culture.

And that all-nighter? It comes with a fucked up and unproductive morning the day after.

When someone promotes a 996 work culture, we should push back.

Passkeys and Modern Authentication

2025-09-02 08:00:00

There is an ongoing trend in the industry to move people away from username and password towards passkeys. The intentions here are good, and I would assume that this has a significant net benefit for the average consumer. At the same time, the underlying standard has some peculiarities. These enable behaviors by large corporations, employers, and governments that are worth thinking about.

Attestations

One potential source of problems here is the attestation system. It allows the authenticator to provide more information about what it is to the website that you’re authenticating with. In particular it is what tells a website if you have a Yubikey plugged in versus something like 1password. This is the mechanism by which the Austrian government, for instance, prevents you from using an Open Source or any other software-based authenticator to sign in to do your taxes, access medical records or do anything else that is protected by eID. Instead you have to buy a whitelisted hardware token.

Attestations themselves are not used by software authenticators today, or anything that syncs. Both Apple and Google do not expose attestation data in their own software authenticators (Keychain and Google Authenticator) for consumer passkeys. However, they will pass through attestation data from hardware tokens just fine. Both of them also, to the best of my knowledge, expose attestation data for enterprises through Mobile Device Management.

One could make the argument that it is unlikely that attestation data will be used at scale to create vendor lock-in. However, I’m not sufficiently convinced that this won’t create sub-ecosystems where we see exactly that happening. If for no other reason, this API exists and it has already been used to restrict keys for governmental sign-in systems.

Auth Lock-in

One slightly more concerning issue today is that there is effectively no way to export private keys between password managers. You need to enroll all of your ecosystems individually into a password manager. An attempt by an open source password manager to reveal private keys to the user was ruled insecure and should not be supported. Taking agency away from the user like this is not an accident. You can also see this in the passkey export specification, which comes with a protocol that, while enabling exports in principle, encourages a system-to-system transfer that does not hand the user’s credentials over to the user. 1

This might be for good intentions, but it also creates problems. As someone recently trying to leave the Apple ecosystem step by step, I have noticed how many services are now bound to an iCloud-based passkey. Particularly when it comes to Apple, this fear is not entirely unwarranted. Sign-in with Apple using non-shared email addresses makes it very hard to migrate to Android unless you retain an iCloud subscription.

Obviously, one could pay for an authenticator like 1Password, which at least is ecosystem independent. However, not everybody is in a situation where they can afford to pay for basic services like password managers.

Sneaky Onboarding

One reason passkeys are adopted so widely today is that enrollment happens automatically for many people. I discovered that non-technical family members now all have passkeys for some services, and they did not even notice creating them. A notable example is Amazon. After every sign-in, it attempts to enroll you into a passkey automatically without clear notification. It just brings up the fingerprint prompt, and users will instinctively touch it.

If you use different types of devices to authenticate — for instance, a Windows and an iOS device — you may eventually have both authenticators associated. This now covers the devices you already use. However, it can make moving to a completely different ecosystem later much harder.

We Are Run By Corporations

For many years now, people have been losing access to their Google accounts every day, never to regain it. Google is well known for terminating accounts without stating any reasons. With that comes the loss of access to your data. With passkeys stored in that account, you also lose your credentials for third-party websites.

There is no legal recourse for this and no mechanism for appeal. You just have to hope that you’re a good citizen and not doing anything that would upset Google’s account flagging systems.

As a sufficiently technical person, you might weigh the risks, but others will not. Many years ago, I tried to help another family gain access to their child’s Facebook account after the child had passed away. Even then, it was a bureaucratic nightmare, and there was little support from Facebook to make it happen. There is a real risk that access becomes much harder for families. This is particularly true in situations where someone is incapacitated or dead. The more we move away from basic authentication systems, the worse this becomes. It’s also really inconvenient when you are not on your own devices. Signing into my accounts on my children’s devices has turned from a straightforward process into an incredibly frustrating experience. I find myself juggling all kinds of different apps and flows.

Complexity and Gatekeepers Everywhere

Every once in a while, I find myself in a situation where I have very little foundation to build on. This is mostly just because of a hobby. I like to see how things work and build them from scratch. Increasingly, that has become harder. Many username and password authentication schemes have been replaced with OAuth sign-ins over the years. Nowadays, some services are moving towards passkeys, though most places do not enforce these yet. If you want to build an operating system from scratch, or even just build a client yourself, you often find yourself needing to do a lot of yak-shaving. All this work is necessary just to get basic things working.

I think this is at least something to be wary of. It doesn’t mean that bad things will necessarily happen, but there is potential for loss of individual agency.

An accelerated version of this has been seen with email. Accessing your own personal IMAP account from Google today has been significantly restricted under security arguments. Getting OAuth credentials that can access someone’s IMAP accounts with their approval has become increasingly harder. It is also very costly.

Username and password authentication has largely been removed. Even the app-specific passwords on Google are now entirely undocumented. They are no longer exposed in the settings unless you know the link 2.

What Does Any Of This Mean?

I don’t know. I am both a user of passkeys and generally wary of making myself overly dependent on tech giants and complex solutions. I’m noticing an increased reliance and potential loss of access to my own data. This does abstractly concern me. Not to the degree that it changes anything I’m doing, but still. As annoying as managing usernames and passwords was, I don’t think I have ever spent so much time authenticating on a daily basis. The systems that we now need to interface with for authentication are vast and complex.

This might just be the path we’re going. However, it is also one where we maybe want to reflect a little bit on whether this is really what we want.

Edit: I reworded the statement about passkey exports to not misrepresent the original comment on GitHub.

  1. The details can be debated, but the protocol explicitly does not permit a user to just hold on to a symmetrically encrypted export (or even a plain text one). The best option is the HPKE scheme.

  2. This OAuth dependency also puts Open Source projects in an interesting situation. For instance, the Thunderbird client ships with OAuth credentials for Google when you download it from Mozilla. However, if you self-compile it, you don’t have that access.

Your MCP Doesn’t Need 30 Tools: It Needs Code

2025-08-18 08:00:00

I wrote a while back about why code performs better than MCP (Model Context Protocol) for some tasks. In particular, I pointed out that if you have command line tools available, agentic coding tools seem very happy to use those. In the meantime, I learned a few more things that put some nuance to this. There are a handful of challenges with CLI-based tools that are rather hard to resolve and require further examination.

In this blog post, I want to present the (not so novel) idea that an interesting approach is an MCP server that exposes a single tool which accepts programming code as its input.

CLI Challenges

The first and most obvious challenge with CLI tools is that they are sometimes platform-dependent, version-dependent, and at times undocumented. As a result, I routinely see failures the first time an agent uses a tool.

A good example of this is when the tool usage requires non-ASCII or control-character string inputs. For instance, Sonnet and Opus are both sometimes unsure how to feed newlines or control characters via shell arguments. This is unfortunate but, ironically, not unique to shell tools. For instance, when you program with C and compile it, trailing newlines are needed. At times, agentic coding tools really struggle with appending an empty line to the end of a file, and you can find some quite impressive tool loops to work around this issue.

This becomes particularly frustrating when your tool is absolutely not in the training set and uses unknown syntax. In that case, getting agents to use it can become quite a frustrating experience.

Another issue is that in some agents (Claude Code in particular), there is an extra pass taking place for shell invocations: the security preflight. Before executing a tool, Claude also runs it through the fast Haiku model to determine if the tool will do something dangerous and avoid the invocation. This further slows down tool use when multiple turns are needed.

In general, doing multiple turns is very hard with CLI tools because you need to teach the agent how to manage sessions. A good example of this is when you ask it to use tmux for remote-controlling an LLDB session. It’s absolutely capable of doing it, but it can lose track of the state of its tmux session. During some tests, I ended up with it renaming the session halfway through, forgetting that it had a session (and thus not killing it).

This is particularly frustrating because the failure case can be that it starts from scratch or moves on to other tools just because it got a small detail wrong.

Composability

Unfortunately, when moving to MCP, you immediately lose the ability to compose without inference (at least today). One of the reasons lldb can be remote-controlled with tmux at all is that the agent manages to compose quite well. How does it do that? It uses basic tmux commands such as tmux send-keys to send inputs or tmux capture-pane to get the output, which don’t require a lot of extra tooling. It then chains commands like sleep and tmux capture-pane to ensure it doesn’t read output too early. Likewise, when it starts to fail with encoding more complex characters, it sometimes changes its approach and might even use base64 -d.
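To make that composition concrete, here is a rough Python sketch of the kind of one-liners the agent ends up chaining. It only builds the command strings and does not run anything. The helper names (`tmux_send_keys`, `tmux_capture`, `chained`, `b64_shell_payload`) and the session name are invented for illustration; the only things taken from the real tools are `tmux send-keys`, `tmux capture-pane -p`, `sleep`, and `base64 -d`.

```python
import base64
import shlex
from typing import List


def tmux_send_keys(session: str, text: str) -> List[str]:
    """Build the argv for typing `text` into a tmux session and pressing Enter."""
    return ["tmux", "send-keys", "-t", session, text, "Enter"]


def tmux_capture(session: str) -> List[str]:
    """Build the argv for printing the visible pane contents to stdout."""
    return ["tmux", "capture-pane", "-p", "-t", session]


def chained(session: str, command: str, delay: float = 1.0) -> str:
    """Compose send, wait, capture into a single shell one-liner."""
    parts = [
        shlex.join(tmux_send_keys(session, command)),
        f"sleep {delay}",  # give the program time to produce output
        shlex.join(tmux_capture(session)),
    ]
    return " && ".join(parts)


def b64_shell_payload(text: str) -> str:
    """Encode text so newlines and control characters survive shell quoting."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"printf %s {shlex.quote(encoded)} | base64 -d"


if __name__ == "__main__":
    print(chained("debug", "bt"))
    print(b64_shell_payload("line one\nline two\t\x07"))
```

Every step here is plain string composition in a language the model already knows, which is why it needs no extra inference round trips between the steps.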

The command line really isn’t just one tool — it’s a series of tools that can be composed through a programming language: bash. The most interesting uses are when you ask it to write tools that it can reuse later. It will start composing large scripts out of these one-liners. All of that is hard with MCP today.

Better Approach To MCP?

It’s very clear that there are limits to what these shell tools can do. At some point, you start to fight those tools. They are in many ways only as good as their user interface, and some of these user interfaces are just inherently tricky. For instance, when evaluated, tmux performs better than GNU screen, largely because the command-line interface of tmux is better and less error-prone. But either way, it requires the agent to maintain a stateful session, and it’s not particularly good at this today.

What is stateful out of the box, however, is MCP. One surprisingly useful way of running an MCP server is to make it an MCP server with a single tool (the ubertool) which is just a Python interpreter that runs eval() with retained state. It maintains state in the background and exposes tools that the agent already knows how to use.

I have run this experiment in a few ways now; the one that is public is pexpect-mcp. It’s an MCP that exposes a single tool called pexpect_tool. The name is, however, in many ways a misnomer. It’s not really a pexpect tool: it’s a Python interpreter running out of a virtualenv that has pexpect installed.

What is pexpect? It is the Python port of the ancient expect command-line tool which allows one to interact with command-line programs through scripts. The documentation describes expect as a “program that ‘talks’ to other interactive programs according to a script.”

What is special about pexpect is that it’s old, has a stable API, and has been used all over the place. You could wrap expect or pexpect with lots of different MCP tools like pexpect_expect, pexpect_sendline, pexpect_spawn, and more. That’s because the pexpect.spawn class exposes 36 different API functions! That’s a lot. But many of these cannot be used in isolation well anyway. Take this motivating example from the docs:

child = pexpect.spawn('scp foo user@example.com:.')
child.expect('Password:')
child.sendline(mypassword)

Even the most basic use here involves three chained tool calls. And that doesn’t include error handling, which one might also want to encode.

So instead, a much more interesting way to have this entire thing run is to just have the command language to the MCP be Python. The MCP server turns into a stateful Python interpreter, and the tool just lets it send Python code that is evaluated with the same state as before. There is some extra support in the MCP server to make the experience more reliable (like timeout support), but for the most part, the interface is to just send Python code. In fact, the exact script from above is what an MCP client is expected to send.
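The core of such a server can be sketched in a few lines. This is not the code from pexpect-mcp: it is a minimal illustration that assumes nothing beyond the standard library, and it leaves out the MCP wiring and timeout handling entirely. The class name `StatefulInterpreter` is made up.

```python
import ast
import contextlib
import io
import traceback


class StatefulInterpreter:
    """Runs code snippets against one namespace that survives between calls."""

    def __init__(self):
        self.namespace = {"__name__": "__session__"}

    def run(self, code: str) -> str:
        """Execute `code` and return captured stdout, plus the repr of a
        trailing expression (like the interactive REPL does)."""
        buffer = io.StringIO()
        try:
            tree = ast.parse(code, mode="exec")
            last_expr = None
            # Split off a trailing expression so its value can be reported.
            if tree.body and isinstance(tree.body[-1], ast.Expr):
                last_expr = ast.Expression(tree.body.pop().value)
            with contextlib.redirect_stdout(buffer):
                exec(compile(tree, "<session>", "exec"), self.namespace)
                if last_expr is not None:
                    value = eval(
                        compile(last_expr, "<session>", "eval"), self.namespace
                    )
                    if value is not None:
                        print(repr(value))
        except Exception:
            # Report errors as text instead of killing the session.
            buffer.write("Error: " + traceback.format_exc(limit=1))
        return buffer.getvalue()


if __name__ == "__main__":
    session = StatefulInterpreter()
    session.run("counter = 1")
    print(session.run("counter + 1"))
```

The important property is that `self.namespace` outlives each call: a `child` created in one snippet is still there in the next, and an exception in one snippet does not throw away what earlier snippets set up.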

The tool description just says this:

Execute Python code in a pexpect session. Can spawn processes and interact with
them.

Args:
  `code`: Python code to execute. Use 'child' variable to interact with the
  spawned process. The pexpect library is already imported. Use
  `pexpect.spawn(...)` to spawn something.
  `timeout`: Optional timeout in seconds. If not provided, uses global
  `TIMEOUT` (default 30s).

Example:
  child = pexpect.spawn('lldb ./mytool')
  child.expect("(lldb)")

Returns:
  The result of the code execution or an error message.

This works because the interface to the MCP is now not just individual tools it has never seen — it’s a programming language that it understands very well, with additional access to an SDK (pexpect) that it has also seen and learned all the patterns from. We’re relegating the MCP to do the thing that it does really well: session management and guiding the tool through a built-in prompt.

More importantly, the code that it writes is very similar to what it might put into a reusable script. There is so little plumbing in the actual MCP that you can tell the agent after the session to write a reusable pexpect script from what it learned in the session. That works because all the commands it ran are just Python — they’re still in the context, and the lift from that to a reusable Python script is low.

Do It, Then Do It Again

Now I don’t want to bore you too much with lots of Claude output, but I took a crashing demo app that Mario wrote and asked it to debug with LLDB through pexpect_tool. Here is what that looked like:

> Please debug ./demo-buggy with lldb

⏺ I'll help you debug the ./demo-buggy executable with lldb. Let me start by examining what we have and then launch lldb to debug it.

⏺ List(~/Development/pexpect-mcp)
  ⎿  Listed 28 paths (ctrl+r to expand)

⏺ Now let me use the pexpect tool to start an lldb debugging session with the demo-buggy executable:

⏺ pexpect - pexpect_tool (MCP)(code: "import pexpect\n\n# Spawn lldb with the demo-buggy executable\nchild = pexpect.spawn('lldb
                              ./demo-buggy')\nchild.expect(\"(lldb)\")\nprint(\"LLDB started successfully\")\nprint(child.before.decode() + child.after.decode())")
  ⎿  Code executed successfully

⏺ pexpect - pexpect_tool (MCP)(code: "# Let's run the program to see what happens\nchild.sendline('run')\nchild.expect_list([pexpect.TIMEOUT, \"(lldb)\",
                              \"Process.*exited\"], timeout=5)\nprint(\"Output after 'run' command:\")\nprint(child.before.decode() + child.after.decode())")
  ⎿  Error: 'str' object has no attribute 'search'

⏺ Let me fix the pexpect usage:

⏺ pexpect - pexpect_tool (MCP)(code: "# Let's run the program to see what happens\nchild.sendline('run')\ntry:\n    index = child.expect([pexpect.TIMEOUT,
                              pexpect.EOF, \"(lldb)\", \"Process.*exited\"], timeout=5)\n    print(f\"Matched pattern index: {index}\")\n    print(\"Output after
                              'run' command:\")\n    print(child.before.decode() + child.after.decode())\nexcept Exception as e:\n    print(f\"Exception: {e}\")\n
                                 print(\"Current buffer:\")\n    print(child.before.decode() if child.before else \"No before content\")")
  ⎿  Code executed successfully

⏺ pexpect - pexpect_tool (MCP)(code: "# Let's check what's in the buffer and set a breakpoint first\nchild.sendline('breakpoint set --name
                              main')\nchild.expect(\"(lldb)\")\nprint(\"Breakpoint set output:\")\nprint(child.before.decode() + child.after.decode())")
  ⎿  Code executed successfully

⏺ pexpect - pexpect_tool (MCP)(code: "# Now run the program\nchild.sendline('run')\nchild.expect(\"(lldb)\")\nprint(\"Run output:\")\nprint(child.before.decode() +
                               child.after.decode())")
  ⎿  Code executed successfully

⏺ pexpect - pexpect_tool (MCP)(code: "# Let's examine the source code first to understand what we're debugging\nchild.sendline('source
                              list')\nchild.expect(\"(lldb)\")\nprint(\"Source listing:\")\nprint(child.before.decode() + child.after.decode())")
  ⎿  Code executed successfully

⏺ Let me also look at the source code directly to understand what we're debugging:

⏺ Read(demo-buggy.c)
  ⎿  Read 77 lines (ctrl+r to expand)

⏺ Perfect! Now I can see the bugs in the code. Let me continue with the lldb debugging session to demonstrate finding these issues:

⏺ pexpect - pexpect_tool (MCP)(code: "# Continue execution to hit the crash\nchild.sendline('continue')\nchild.expect(\"(lldb)\")\nprint(\"Continue
                              output:\")\nprint(child.before.decode() + child.after.decode())")
  ⎿  Code executed successfully

⏺ pexpect - pexpect_tool (MCP)(code: "# Let's examine the backtrace to see where the crash
                              occurred\nchild.sendline('bt')\nchild.expect(\"(lldb)\")\nprint(\"Backtrace:\")\nprint(child.before.decode() +
                              child.after.decode())")
  ⎿  Code executed successfully

⏺ pexpect - pexpect_tool (MCP)(code: "# Let's examine the current frame and variables\nchild.sendline('frame variable')\nchild.expect(\"(lldb)\")\nprint(\"Frame
                              variables:\")\nprint(child.before.decode() + child.after.decode())")
  ⎿  Code executed successfully

⏺ pexpect - pexpect_tool (MCP)(code: "# Let's look at the student structure to see the NULL pointer\nchild.sendline('p
                              *s')\nchild.expect(\"(lldb)\")\nprint(\"Student structure:\")\nprint(child.before.decode() + child.after.decode())")
  ⎿  Code executed successfully

⏺ pexpect - pexpect_tool (MCP)(code: "# Let's specifically look at the scores pointer\nchild.sendline('p s->scores')\nchild.expect(\"(lldb)\")\nprint(\"Scores
                              pointer:\")\nprint(child.before.decode() + child.after.decode())")
  ⎿  Code executed successfully

⏺ pexpect - pexpect_tool (MCP)(code: "# Let's also examine the disassembly around the crash\nchild.sendline('disassemble
                              --frame')\nchild.expect(\"(lldb)\")\nprint(\"Disassembly:\")\nprint(child.before.decode() + child.after.decode())")
  ⎿  Code executed successfully

⏺ pexpect - pexpect_tool (MCP)(code: "# Exit lldb\nchild.sendline('quit')\nchild.expect(pexpect.EOF)\nprint(\"LLDB session ended\")")
  ⎿  Error: Operation timed out after 15 seconds

⏺ pexpect - pexpect_tool (MCP)(code: "# Force quit\nchild.close()\nprint(\"LLDB session closed\")")
  ⎿  Code executed successfully

⏺ Based on the debugging session with lldb, I've identified the main issues in ./demo-buggy:

  Key Findings:

  1. Segmentation Fault at line 26 (demo-buggy.c:26): The program crashes when trying to write to s->scores[s->num_scores] because s->scores is NULL.
  2. Root Cause: In create_student() function at line 19, the scores array allocation is commented out and set to NULL instead:
  s->scores = NULL;  // This will cause a segfault when we try to write to it
  3. Additional Bug: In calculate_average() at line 34, there's an off-by-one error:
  for (int i = 0; i <= s->num_scores; i++) {  // BUG: <= should be <

  Debug Session Summary:

  - The program crashes immediately when add_score(alice, 95.5) is called
  - The backtrace shows the crash occurs in the add_score function
  - The s->scores pointer is NULL, causing the segfault when dereferencing
  - The crash happens before reaching the second bug in calculate_average()

  The lldb session successfully identified the exact location and cause of the crash, demonstrating how to use the debugger to trace segmentation faults back to
  their source.

Afterwards I asked it to dump it into a reusable Python script to be run later:

The prompt that wrote the script
> now that you wrote this pexpect script, could you dump out a python script
  file with those pscript commands so we can use it again later?

⏺ Write(debug_demo.py)
  ⎿  Wrote 95 lines to debug_demo.py
     #!/usr/bin/env python3
     """
     Debug script for demo-buggy using pexpect and lldb.
     This script automates the debugging session to identify the segfault.
     """
     import pexpect
     import sys
     def debug_demo_buggy():
     … +85 lines (ctrl+r to expand)

⏺ Created debug_demo.py - you can now run python debug_demo.py to
   replay the entire debugging session automatically.

And from a fresh session we can ask it to execute it once more:

> please run debug_demo.py with uv run to figure out what's going on

⏺ Bash(uv run debug_demo.py)
  ⎿  Starting lldb debugging session for ./demo-buggy
     ✓ LLDB started successfully
     … +155 lines (ctrl+r to expand)

⏺ The debug script successfully ran and identified the issue! Here's what it found:

  Main Problem: The demo-buggy executable crashes with a segfault in the add_score()
  function because s->scores is NULL (not allocated memory).

  Root Cause: Line 18 in create_student() is commented out, which means the scores
  array is never allocated.

  Fix: Uncomment line 18 in the create_student() function to allocate memory for the
  scores array.

  Additional Issue: There's also an off-by-one error in the calculate_average()
  function.

  The debugging session shows that:
  1. LLDB successfully launched and set a breakpoint at main
  2. The program stopped at the breakpoint
  3. It identified that alice and bob Student pointers were created
  4. The crash occurs when trying to access unallocated memory for scores

  This demonstrates that the pexpect MCP tool is working correctly for debugging C
  programs with LLDB!

That again works because the code it writes into the MCP is very close to the code that it would write into a Python script. And the difference is meaningful. The initial debug takes about 45 seconds on my machine and uses about 7 tool calls. The re-run with the dumped playbook takes one tool call and finishes in less than 5 seconds. Most importantly: that script is standalone. I can run it as a human, even without the MCP!

Novel Things

Now the above example works beautifully because these models just know so much about pexpect. That’s hardly surprising in a way. So how well does this work when the code that it should write is entirely unknown to it? Well, not quite as well. However, and this is the key part, because the meta input language is Python, it means that the total surface area that can be exposed from an ubertool is pretty impressive.

A general challenge with MCP today is that the more tools you have, the more you’re contributing to context rot. You’re also limited to rather low amounts of input. On the other hand, if you have an MCP that exposes a programming language, it also indirectly exposes a lot of functionality that it knows from its training.

For instance, one of the really neat parts about this is that it knows dir(), globals(), repr(), and other stuff. Heck, it even knows about sys._getframe(). This means that you can give it very rudimentary instructions about how its sandbox operates and what it might want to do to learn more about what is available to it as needed. You can also tell it in the prompt that there is a function it can run to learn more about what’s available when it needs help!
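As an illustration of such a discovery function, here is a small sketch built on `inspect`. The `describe` helper and the `query_users` example are hypothetical names, not part of any existing MCP server; the point is only that a few lines of introspection let the agent list what the sandbox offers.

```python
import inspect


def describe(namespace: dict) -> str:
    """Summarise the public names in a namespace, one per line."""
    lines = []
    for name in sorted(namespace):
        if name.startswith("_"):
            continue  # skip private and dunder entries
        obj = namespace[name]
        if callable(obj):
            try:
                signature = str(inspect.signature(obj))
            except (TypeError, ValueError):
                signature = "(...)"  # some builtins expose no signature
            line = f"{name}{signature}"
            doc = (inspect.getdoc(obj) or "").split("\n")[0]
            if doc:
                line += f": {doc}"
            lines.append(line)
        else:
            lines.append(f"{name} = {repr(obj)[:60]}")
    return "\n".join(lines)


def query_users(shard: int, limit: int = 10):
    """Hypothetical helper: return up to `limit` users from one shard."""
    return []


if __name__ == "__main__":
    print(describe({"query_users": query_users, "TIMEOUT": 30}))
```

In a setup like this, the tool description could simply mention that calling `describe(globals())` lists everything available, and the agent can pull that in only when it needs it rather than paying for it in every prompt.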

So when you build something that is completely novel, at least the programming language is known. You can, for instance, write a tiny MCP that dumps out the internal state of your application, provides basic query helpers for your database that support your sharding setup, or provides data reading APIs. It will discover all of this anyway from reading the code, but now it can also use a stateful Python or JavaScript session to run these tools and explore more.

This is also a fun feature when you want to ask the agent to debug the MCP itself. Because Python and JavaScript are so powerful, you can, for instance, also ask it to debug the MCP’s state itself when something went wrong.

Security Is A Sham

The elephant in the room for all things agentic coding is security. Claude mostly doesn’t delete your machine and maybe part of that is the Haiku preflight security check. But isn’t all of this a sham anyway? I generally love to watch how Claude and other agents maneuver their way around protections in pretty creative ways. Clearly it’s potent and prompt-injectable. By building an MCP that just runs eval(), we might be getting rid of some of the remaining safety here.

But does it matter? We are seemingly okay with it writing code and running tests, which is the same kind of bad as running eval(). I’m sure the day of reckoning will come for all of us, but right now we’re living in this world where protections don’t matter and we can explore what these things can do.

I’m honestly not sure how to best protect these things. They are pretty special in that they are just inherently unsafe and impossible to secure. Maybe the way to really protect them would be to intercept every system call and have some sort of policy framework/sandbox around the whole thing. But even in that case, what prevents an ever more clever LLM from circumventing all these things? It has internet access, it can be prompt-injected, and all interfaces we have for them are just too low-level to support protection well.

So to some degree, I think the tail risks of code execution are here to stay. But I would argue that they are not dramatically worse when the MCP executes Python code. In this particular case, consider that pexpect itself runs programs. There is little point in securing the MCP if what the MCP can run is any bash command.

Taking It Further

As interesting as the pexpect case is, that was not my original motivation. What I started to look into is replacing Playwright’s MCP with an MCP that just exposes the Playwright API via JavaScript. This is an experiment I have been running for a while, and the results are somewhat promising but also not promising enough yet. If you want to play with it, the MCP is called “playwrightess” and is pretty simple. It just lets the agent execute JavaScript code against a sync playwright client. Same idea. Here, the tool usage is particularly nice because it gets down from ~30 tool definitions to 1:

Playwrightess Session Output
> navigate to localhost:5001 and summarize all the article links you see

⏺ I'll navigate to localhost:5001 and summarize the article links I find there.

  playwrightess-mcp - playwright_eval (MCP)(code: "await page.goto('http://localhost:5001')")
  ⎿  {
       "success": true,
       "result": "undefined",
     … +5 lines (ctrl+r to expand)

  playwrightess-mcp - playwright_eval (MCP)(code: "const articleLinks = await page.$$eval('a[href*=\"article\"],
                                           a[href*=\"/post\"], a[href*=\"/blog\"], article a, .article a, .post a, .blog a',
                                            links => links.map(link => ({ text: link.textContent?.trim(), href: link.href
                                           }))).catch(() => []); if (articleLinks.length === 0) { const allLinks = await
                                           page.$$eval('a', links => links.map(link => ({ text: link.textContent?.trim(),
                                           href: link.href })).filter(link => link.text && link.text.length > 5));
                                           console.log('All links found:', allLinks); return allLinks; } return
                                           articleLinks;")
  ⎿  {
       "success": true,
       "result": [
     … +101 lines (ctrl+r to expand)

⏺ Here are the article links from localhost:5001:

  Recent Blog Posts:
  - Your MCP Doesn't Need 30 Tools: It Needs Code (2025/8/18) - /2025/8/18/code-mcps/
  - In Support Of Shitty Types (2025/8/4) - /2025/8/4/shitty-types/
  - Agentic Coding Things That Didn't Work (2025/7/30) - /2025/7/30/things-that-didnt-work/
  - From Async/Await to Virtual Threads (2025/7/26) - /2025/7/26/virtual-threads/
  - Welcoming The Next Generation of Programmers (2025/7/20) - /2025/7/20/the-next-generation/
  - Tools: Code Is All You Need (2025/7/3) - /2025/7/3/tools/
  - My First Open Source AI Generated Library (2025/6/21) - /2025/6/21/my-first-ai-library/
  - We Can Just Measure Things (2025/6/17) - /2025/6/17/measuring/
  - Agentic Coding Recommendations (2025/6/12) - /2025/6/12/agentic-coding/
  - GenAI Criticism and Moral Quandaries (2025/6/10) - /2025/6/10/genai-criticism/

  The site appears to be Armin Ronacher's blog focused on programming topics, particularly around AI/ML tools, coding
  practices, and software development. There's also a "Next »" link indicating more articles on page 2.

The other thing that is just much nicer about this approach is how many more ways it has to funnel data out. For instance, console.log output from both the browser and the playwright script is forwarded back to the agent automatically, so the agent never has to ask for it. It also has a state variable that it can use to accumulate extra information between calls, which it uses liberally if you, for instance, ask it to collect data from multiple pages of a paginated listing. It can do that without any further inference, because the loop happens within JavaScript.
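To make the two properties concrete, here is a minimal Python sketch of an eval-style tool. It is not how playwrightess is implemented (that one is JavaScript), and the names `EvalTool` and `run` are made up for illustration. It only shows the shape of the idea: a `state` dict that survives between calls, and printed output that is captured and returned with every result.

```python
import contextlib
import io


class EvalTool:
    """Hypothetical single-tool interface: run code, keep state between calls."""

    def __init__(self):
        # Namespace shared by all snippets; `state` persists across calls.
        self.namespace = {"state": {}}

    def run(self, code):
        buffer = io.StringIO()
        try:
            # Capture everything the snippet prints so it is returned unasked.
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception as exc:
            return {"success": False, "error": repr(exc), "output": buffer.getvalue()}
        return {"success": True, "output": buffer.getvalue()}


tool = EvalTool()
# First call: the loop over "pages" runs inside the snippet, no inference per page.
tool.run(
    "state['links'] = []\n"
    "for page in range(3):\n"
    "    state['links'].append(f'/page/{page}')"
)
# Second call: the accumulated state is still there.
result = tool.run("print(len(state['links']))")
```

Running arbitrary code through `exec` has exactly the tail risks discussed earlier, so this is a sketch of the interface and not of a sandbox.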

Same with pexpect — you can easily get it to dump out a script for later that circumvents a lot of MCP calls with something it already saw. Particularly when you are debugging a gnarly issue and you need to restart the debugging more than once, that shows some promise. Does it perform better than Playwright MCP? Not in the current form, but I want to see if this idea can be taken further. It is quite verbose in the scripts that it writes, and it is not really well tuned between screenshots and text extraction.

In Support Of Shitty Types

2025-08-04 08:00:00

You probably know that I love Rust and TypeScript, and I’m a big proponent of good typing systems. One of the reasons I find them useful is that they enable autocomplete, which is generally a good feature. Having a well-integrated type system that makes sense and gives you optimization potential for memory layouts is generally a good idea.

From that, you’d naturally think this would also be great for agentic coding tools. There’s clearly some benefit to it. If you have an agent write TypeScript and the agent adds types, it performs well. I don’t know if it outperforms raw JavaScript, but at the very least it doesn’t seem to do any harm.

But most agentic tools don’t have access to an LSP (Language Server Protocol) server. In my experiments, the agentic coding tools that do have LSP access (with type information available) haven’t meaningfully benefited from it. The protocol slows things down and pollutes the context significantly. Also, the models haven’t been trained sufficiently to understand how to work with this information. Just getting a type check failure from the compiler in text form yields better results.

What you end up with is an agent coding loop that, without type checks enabled, results in the agent making forward progress by writing code and putting types somewhere. As long as this compiles to some version of JavaScript (if you use Bun, much of it ends up type-erased), it creates working code. And from there it continues. But that’s bad progress—it’s the type of progress where it needs to come back after and clean up the types.

It’s curious because types are obviously being written but they’re largely being ignored. If you do put the type check into the loop, my tests actually showed worse performance. That’s because the agent manages to get the code running, and only after it’s done does it run the type check. Only then, maybe at a much later point, does it realize it made type errors. Then it starts fixing them, maybe goes in a loop, and wastes a ton of context. If you make it do the type checks after every single edit, you end up eating even more into the context.

This gets really bad when the types themselves are incredibly complicated and non-obvious. TypeScript has arcane expression functionality, and some libraries go overboard with complex constructs (e.g., conditional types). LLMs have little clue how to read any of this. For instance, if you give it access to the .d.ts files from TanStack Router and the forward declaration stuff it uses for the router system to work properly, it doesn’t understand any of it. It guesses, and sometimes guesses badly. It’s utterly confused. When it runs into type errors, it performs all kinds of manipulations, none of which are helpful.

Python typing has an even worse problem, because there we have to work with a very complicated ecosystem where different type checkers cannot even agree on how type checking should work. That means that the LLM, at least from my testing, is not even fully capable of understanding how to resolve type check errors from type checkers other than mypy. It’s not universally bad, but if you actually end up with a complex type checking error that you cannot resolve yourself, it is shocking how often the LLM is also not able to fully figure out what’s going on, or at least needs multiple attempts.

As a shining example of types adding a lot of value we have Go. Go’s types are much less expressive and very structural. Things conform to interfaces purely by having certain methods. The LLM does not need to understand much to comprehend that. Also, the types that Go has are rather strictly enforced. If they are wrong, it won’t compile. Because Go has a much simpler type system that doesn’t support complicated constructs, it works much better—both for LLMs to understand the code they produce and for the LLM to understand real-world libraries you might give to an LLM.

I don’t really know what to do with this, but these behaviors suggest there’s a lot more value in best-effort type systems or type hints like JSDoc. Because at least as far as the LLM is concerned, it doesn’t need to fully understand the types, it just needs to have a rough understanding of what type some object probably is. For the LLM it’s more important that the type name in the error message aligns with the type name in source.

I think it’s an interesting question whether this behavior of LLMs today will influence future language design. I don’t know if it will, but I think it gives a lot of credence to some of the decisions that led to languages like Go and Java. As critical as I have been in the past about their rather simple approaches to problems and having a design that maybe doesn’t hold developers in a particularly high regard, I now think that they actually are measurably in a very good spot. There is more elegance to their design than I gave it credit for.

Agentic Coding Things That Didn’t Work

2025-07-30 08:00:00

Using Claude Code and other agentic coding tools has become all the rage. Not only is it getting millions of downloads, but these tools are also gaining features that help streamline workflows. As you know, I got very excited about agentic coding in May, and I’ve tried many of the new features that have been added. I’ve spent considerable time exploring everything on my plate.

But oddly enough, very little of what I attempted I ended up sticking with. Most of my attempts didn’t last, and I thought it might be interesting to share what didn’t work. This doesn’t mean these approaches won’t work or are bad ideas; it just means I didn’t manage to make them work. Maybe there’s something to learn from these failures for others.

Rules of Automation

The best way to think about the approach that I use is:

  1. I only automate things that I do regularly.
  2. If I create an automation for something that I do regularly, but then I stop using the automation, I consider it a failed automation and I delete it.

Non-working automations turn out to be quite common. Either I can’t get myself to use them, I forget about them, or I end up fine-tuning them endlessly. For me, deleting a failed workflow helper is crucial. You don’t want unused Claude commands cluttering your workspace and confusing others.

So I end up doing the simplest thing possible most of the time: just talk to the machine more, give it more context, keep the audio input going, and dump my train of thought into the prompt. And that is 95% of my workflow. The rest might be good use of copy/paste.

Slash Commands

Slash commands allow you to preload prompts to have them readily available in a session. I expected these to be more useful than they ended up being. I do use them, but many of the ones that I added I ended up never using.

There are some limitations with slash commands that make them less useful than they could be. One limitation is that there’s only one way to pass arguments, and it’s unstructured. This proves suboptimal in practice for my uses. Another issue I keep running into with Claude Code is that if you do use a slash command, the argument to the slash command for some reason does not support file-based autocomplete.

To make them work better, I often ask Claude to use the current Git state to determine which files to operate on. For instance, I have a command in this blog that fixes grammar mistakes. It operates almost entirely from the current git status context because providing filenames explicitly is tedious without autocomplete.

Here is one of the few slash commands I actually do use:

## Context

- git status: !`git status`
- Explicitly mentioned file to fix: "$ARGUMENTS"

## Your task

Based on the above information, I want you to edit the mentioned file or files
for grammar mistakes.  Make a backup (eg: change file.md to file.md.bak) so I
can diff it later.  If the backup file already exists, delete it.

If a blog post was explicitly provided, edit that; otherwise, edit the ones
that have pending changes or are untracked.

My workflow now assumes that Claude can determine which files I mean from the Git status virtually every time, making explicit arguments largely unnecessary.

Here are some of the many slash commands that I built at one point but ended up not using:

  • /fix-bug: I had a command that instructed Claude to fix bugs by pulling issues from GitHub and adding extra context. But I saw no meaningful improvement over simply mentioning the GitHub issue URL and voicing my thoughts about how to fix it.
  • /commit: I tried getting Claude to write good commit messages, but they never matched my style. I stopped using this command, though I haven’t given up on the idea entirely.
  • /add-tests: I really hoped this would work. My idea was to have Claude skip tests during development, then use an elaborate reusable prompt to generate them properly at the end. But this approach wasn’t consistently better than automatic test generation, which I’m still not satisfied with overall.
  • /fix-nits: I had a command to fix linting issues and run formatters. I stopped using it because it never became muscle memory, and Claude already knows how to do this. I can just tell it “fix lint” in the CLAUDE.md file without needing a slash command.
  • /next-todo: I track small items in a to-do.md file and had a command to pull the next item and work on it. Even here, workflow automation didn’t help much. I use this command far less than expected.

So if I’m using fewer slash commands, what am I doing instead?

  1. Speech-to-text. Cannot stress this enough but talking to the machine means you’re more likely to share more about what you want it to do.
  2. I maintain some basic prompts and context for copy-pasting at the end or the beginning of what I entered.

Copy/paste is really, really useful because of how fuzzy LLMs are. For instance, I maintain link collections that I paste in when needed. Sometimes I fetch files proactively, drop them into a git-ignored folder, and mention them. It’s simple, easy, and effective. You still need to be somewhat selective to avoid polluting your context too much, but compared to having it spelunk in the wrong places, more text doesn’t harm as much.

Hooks

I tried hard to make hooks work, but I haven’t seen any efficiency gains from them yet. I think part of the problem is that I use yolo mode. I wish hooks could actually manipulate what gets executed. The only way to guide Claude today is through denies, which don’t work in yolo mode. For instance, I tried using hooks to make it use uv instead of regular Python, but I was unable to do so. Instead, I ended up preloading executables on the PATH that override the default ones, steering Claude toward the right tools.

For instance, this is really my hack for making it use uv run python instead of python more reliably:

#!/bin/sh
echo "This project uses uv, please use 'uv run python' instead."
exit 1

I really just have a bunch of these in .claude/interceptors and preload that folder onto PATH before launching Claude:

CLAUDE_BASH_MAINTAIN_PROJECT_WORKING_DIR=1 \
    PATH="`pwd`/.claude/interceptors:${PATH}" \
    claude --dangerously-skip-permissions

I also found it hard to hook into the right moment. I wish I could run formatters at the end of a long edit session. Currently, you must run formatters after each Edit tool operation, which often forces Claude to re-read files, wasting context. Even with the Edit tool hook, I’m not sure if I’m going to keep using it.

I’m actually really curious whether people manage to get good use out of hooks. I’ve seen some discussions on Twitter that suggest there are some really good ways of making them work, but I just went with much simpler solutions instead.

Claude Print Mode

I was initially very bullish on Claude’s print mode. I tried hard to have Claude generate scripts that used print mode internally. For instance, I had it create a mock data loading script — mostly deterministic code with a small inference component to generate test data using Claude Code.

The challenge is achieving reliability, which hasn’t worked well for me yet. Print mode is slow and difficult to debug. So I use it far less than I’d like, despite loving the concept of mostly deterministic scripts with small inference components. Whether using the Claude SDK or the command-line print flag, I haven’t achieved the results I hoped for.

I’m drawn to Print Mode because inference is too much like a slot machine. Many programming tasks are actually quite rigid and deterministic. We love linters and formatters because they’re unambiguous. Anything we can fully automate, we should. Using an LLM for tasks that don’t require inference is the wrong approach in my book.

That’s what makes print mode appealing. If only it worked better. Use an LLM for the commit message, but regular scripts for the commit and gh pr commands. Make mock data loading 90% deterministic with only 10% inference.
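A sketch of that split in Python: everything is deterministic except one call that asks the model for a commit message. It assumes the `claude` CLI is on PATH and is invoked with its `-p` print flag, with the diff piped in on stdin. The prompt wording and the function names are illustrative, and `run` is injectable so the script can be exercised without invoking a model.

```python
import subprocess


def run_command(args, stdin=None):
    """Run a command and return its stdout; raises if it exits non-zero."""
    done = subprocess.run(
        args, input=stdin, capture_output=True, text=True, check=True
    )
    return done.stdout


def commit_staged(run=run_command):
    # Deterministic: collect the staged diff.
    diff = run(["git", "diff", "--cached"])
    if not diff.strip():
        return None  # nothing staged, nothing to do

    # Inference: the only step that needs a model. The diff goes in via stdin.
    prompt = "Write a one-line commit message for this diff."
    message = run(["claude", "-p", prompt], stdin=diff).strip()

    # Deterministic again: the actual commit is a plain git call.
    run(["git", "commit", "-m", message])
    return message
```

Keeping the inference step behind a single function call also makes it the only place that needs retries or validation when the output is not usable.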

I still use it, but I see more potential than I am currently leveraging.

Sub Tasks and Sub Agents

I use the task tool frequently for basic parallelization and context isolation. Anthropic recently launched an agents feature meant to streamline this process, but I haven’t found it easier to use.

Sub-tasks and sub-agents enable parallelism, but you must be careful. Tasks that don’t parallelize well — especially those mixing reads and writes — create chaos. Outside of investigative tasks, I don’t get good results. While sub-agents should preserve context better, I often get better results by starting new sessions, writing thoughts to Markdown files, or even switching to o3 in the chat interface.

Does It Help?

What’s interesting about workflow automation is that without rigorous rules that you consistently follow as a developer, simply taking time to talk to the machine and give clear instructions outperforms elaborate pre-written prompts.

For instance, I don’t use emojis or commit prefixes. I don’t enforce templates for pull requests either. As a result, there’s less structure for me to teach the machine.

I also lack the time and motivation to thoroughly evaluate all my created workflows. This prevents me from gaining confidence in their value.

Context engineering and management remain major challenges. Despite my efforts to help agents pull the right data from various files and commands, they don’t yet succeed reliably. They pull in too much or too little. Long sessions lead to forgotten context from the beginning. Whether done manually or with slash commands, the results feel too random. It’s hard enough with ad-hoc approaches, but static prompts and commands make it even harder.

The rule I have now is that if I do want to automate something, I must have done it a few times already, and then I evaluate whether the agent gets any better results through my automation. There’s no exact science to it, but I mostly measure that right now by letting it do the same task three times and looking at the variance manually as measured by: would I accept the result.

Keeping The Brain On

Forcing myself to evaluate the automation has another benefit: I’m less likely to just blindly assume it helps me.

Because there is a big hidden risk with automation through LLMs: it encourages mental disengagement. When you stop thinking like an engineer, quality drops, time gets wasted and you don’t understand and learn. LLMs are already bad enough as they are, but whenever I lean in on automation I notice that it becomes even easier to disengage. I tend to overestimate the agent’s capabilities with time. There are real dragons there!

You can still review things as they land, but it becomes increasingly harder to do so later. While LLMs are reducing the cost of refactoring, the cost doesn’t drop to zero, and regressions are common.

From Async/Await to Virtual Threads

2025-07-26 08:00:00

Last November I wrote a post about how the programming interface of threads beats the one of async/await. In May, Mark Shannon brought up the idea of virtual threads for Python on Python’s discussion board and also referred back to that article that I wrote. At EuroPython we had a chat about that topic and that reminded me that I just never came around to writing part two of that article.

How We Got Here

The first thing to consider is that async/await did actually produce one very good outcome for Python: by introducing a syntax element into the language, it has exposed many more people to concurrent programming. The unfortunate side effect is that it requires very complex internal machinery that leaks out of the runtime and into user code, and it requires colored functions.

Threads, on the other hand, are in many ways a much simpler concept, but the threading APIs that have proliferated all over the place over the last couple of generations leave a lot to be desired. Without doubt, async/await in many ways improved on that.

One key part of how async/await works in Python is that nothing really happens until you call await. Between two awaits, you’re guaranteed not to be suspended. Unfortunately, recent changes with free-threading make that guarantee rather pointless, because you still need to write code that is aware of other threads. So now we have the complexity of both the async ecosystem and the threading system at all times.
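A small illustration of that guarantee on a single-threaded event loop (the function names are made up for this example): a read-modify-write with no await in between cannot be interleaved with other tasks, while the same code with a suspension point in the middle loses updates.

```python
import asyncio

counter = 0


async def increment_without_await():
    global counter
    current = counter
    # No await between read and write: no other task can run in between.
    counter = current + 1


async def increment_with_await():
    global counter
    current = counter
    await asyncio.sleep(0)  # suspension point: other tasks may run here
    counter = current + 1


async def run_many(worker, n=10):
    global counter
    counter = 0
    await asyncio.gather(*(worker() for _ in range(n)))
    return counter


safe = asyncio.run(run_many(increment_without_await))  # all 10 increments land
racy = asyncio.run(run_many(increment_with_await))     # increments get lost
```

Once the same data can also be touched from other threads, the first variant is no longer safe either, which is the point made above.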

This is a good moment to rethink if we maybe have a better path in front of us by fully embracing threads.

Structured, Virtual Threads

Another really positive thing that came out of async in Python was that a lot of experimentation was made to improve the ergonomics of those APIs. The most important innovation has been the idea of structured concurrency. Structured concurrency is all about the idea of disallowing one task to outlive its parent. And this is also a really good feature because it allows, for instance, a task to also have a relationship to the parent task, which makes the flow of information (such as context variables) much clearer than traditional threads and thread local variables do, where threads have effectively no real relationships to their parents.

Unfortunately, task groups (the implementation of structured concurrency in Python) are a rather recent addition, and many libraries do not sufficiently implement the rather strict cancellation requirements that come with them. To understand why this matters, you have to understand how structured concurrency works. Basically, when you spawn tasks as children of another task, the failure of any one of them will also cause the cancellation of all its siblings. This requires robust cancellations.

And robust cancellations are really hard to do when some of those tasks involve real threads. For instance, the very popular aiofiles library uses a thread pool to move I/O operations into an I/O thread since there is no really good way on different platforms to get real async I/O behavior out of standard files. However, cancellations are not supported. That causes a problem: if you spawn multiple tasks, some of which are blocking on a read (with aiofiles) that would only succeed if another one of those tasks concludes, you can actually end up deadlocking yourself in the light of cancellations. This is not a hypothetical problem. There are, in fact, quite a few ways to end up in a situation where the presence of aiofiles in a task group will cause an interpreter not to shut down properly. Worse, the exception that was actually caught by the task group will be invisible until the blocking read on the other thread pool has been interrupted by a signal like a keyboard interrupt. This is a pretty disappointing developer experience.

In many ways, what we really want is to go back to the drawing board and say, “What does a world look like that only ever would have used threads with a better API?”

Only Threads

So if we have only threads, then we are back to some performance challenges that motivated asyncio in the first place. The solution for this will involve virtual threads. You can read all about them in the previous post.

One of the key parts of enabling virtual threads is also a commitment to handling many of the challenges with async I/O directly as part of the runtime. That means that if there is a blocking operation, we will have to ensure that the virtual thread is put back to the scheduler, and another one has a chance to run.

But that alone will feel a little bit like a regression because we also want to ensure that we do not lose structured concurrency.

The Better API

Let’s start simple: what does Python code look like where we download an arbitrary number of URLs sequentially? It probably looks a bit like this:

def download_all(urls):
    results = {}

    for url in urls:
        results[url] = fetch_url(url)

    return results

No, this is intentionally not using any async or await, because this is not what we want. We want the most simple thing: blocking APIs.

The general behavior of this is pretty simple: we are going to download a bunch of URLs, but if any one of them fails, we’ll basically abort and raise an exception, and will not continue downloading any other ones. The results that we have collected up to that point are lost.

But how would we stick with this but introduce parallelism? How can we download more than one at a time? If a language were to support structured concurrency and virtual threads, we could achieve something similar with some imaginary syntax like this:

def download_all(urls):
    results = {}

    await:
        for url in urls:
            async:
                results[url] = fetch_url(url)

    return results

I’m intentionally using await and async here, but you can see from the usage that it’s actually inverted compared to what we have today. Here is what this would do:

  • await: this creates a structured thread group. Any spawned thread (async) within it attaches to this thread group and is awaited. If any of the threads fails, future spawns are blocked and existing threads are told to cancel.
  • async: you can think of this as being a function declaration that is paired with a spawn. The entire body thus runs in another task. Because there is a parent/child relationship of threads, the child inherits the context of the parent. This is also how edge and level cancellation can travel to threads.

Behind the scenes, something like this would happen:

from functools import partial

def download_all(urls):
    results = {}

    with ThreadGroup():
        def _thread(url):
            results[url] = fetch_url(url)

        for url in urls:
            ThreadGroup.current.spawn(partial(_thread, url))

    return results

Note that all threads here are virtual threads. They behave like threads, but they might be scheduled on different kernel threads. If any one of those spawned threads fails, the thread group itself fails, and any further spawn on the failed group is no longer permitted.

In the grand scheme of things, this is actually quite beautiful. Unfortunately, it does not map all that well onto Python. This syntax would be unexpected because Python does not really have an existing concept of a hidden function declaration. Python’s scoping also prevents this from working all that well. Because Python doesn’t have syntax for variable declarations, Python actually only has a single scope for functions. This is quite unfortunate because, for instance, it means that a helper declared in a loop body cannot really close over the loop iteration variable.
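The scoping problem is easy to reproduce in today's Python. In this small example, every helper defined in the loop shares the same `i`, so all of them see its final value unless the value is bound explicitly:

```python
def make_callbacks():
    callbacks = []
    for i in range(3):
        def callback():
            return i  # looks up i when called, not when defined
        callbacks.append(callback)
    return callbacks


def make_callbacks_bound():
    callbacks = []
    for i in range(3):
        def callback(i=i):  # default argument captures the current value
            return i
        callbacks.append(callback)
    return callbacks


late = [cb() for cb in make_callbacks()]         # [2, 2, 2]
bound = [cb() for cb in make_callbacks_bound()]  # [0, 1, 2]
```

An implicit `async:` block inside a loop would run into the same trap as the first variant, since the body may only execute after the loop variable has moved on.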

Regardless, I think the important thing you should take away from this is that this type of programming does not require thinking about futures. Even though it could support futures, you can actually express a whole lot of programming code without needing to defer to an abstraction like that.

As a result, there are much fewer concepts that one has to consider when working with a system like this. I do not have to expose a programmer to the concept of futures or promises, async tasks, or anything like that.

API Compromises

Now, I don’t think that such a particular syntax would fit well into Python. And it is somewhat debatable if automatic thread groups are the right solution. You could also model this after what we have with async/await and make thread groups explicit:

from functools import partial

def download_and_store(results, url):
    results[url] = fetch_url(url)

def download_all(urls):
    results = {}

    with ThreadGroup() as g:
        for url in urls:
            g.spawn(partial(download_and_store, results, url))

    return results

This largely still has the same behavior, but it uses a little bit more explicit operations and it does require you to create more helper functions. But it still fully avoids having to work with promises or futures.
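`ThreadGroup` does not exist in Python today. To get a feel for the semantics, here is a toy version built on regular OS threads. It is a sketch under heavy assumptions: no virtual threads, no cancellation of siblings that are already running, and `fetch_url` is a stand-in. What it does show is the structured part: no child outlives the block, a failure in a child surfaces at the end of the block, and a failed group refuses further spawns.

```python
import threading
from functools import partial


class ThreadGroup:
    """Toy thread group: joins all children on exit, re-raises the first failure."""

    def __init__(self):
        self._threads = []
        self._errors = []
        self._lock = threading.Lock()

    def __enter__(self):
        return self

    def spawn(self, func):
        with self._lock:
            if self._errors:
                raise RuntimeError("thread group has failed; no further spawns")
        thread = threading.Thread(target=self._run, args=(func,))
        self._threads.append(thread)
        thread.start()

    def _run(self, func):
        try:
            func()
        except Exception as exc:
            with self._lock:
                self._errors.append(exc)

    def __exit__(self, exc_type, exc, tb):
        # No child outlives the block: wait for every spawned thread.
        for thread in self._threads:
            thread.join()
        # A child failure takes precedence over a refused spawn in the body.
        if self._errors:
            raise self._errors[0]
        return False


def fetch_url(url):
    # Stand-in for a real blocking download.
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"<body of {url}>"


def download_and_store(results, url):
    results[url] = fetch_url(url)


def download_all(urls):
    results = {}
    with ThreadGroup() as g:
        for url in urls:
            g.spawn(partial(download_and_store, results, url))
    return results
```

Everything that makes the real proposal interesting (cheap threads, cancellation that reaches into blocking I/O) is exactly what this toy leaves out, because it needs support from the runtime.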

Complexity Goes Where It Belongs

What is so important about this entire concept is that it moves a lot of the complexity of concurrent programming where it belongs: into the interpreter and the internal APIs. For instance, the dictionary in results has to be locked for this to work. Likewise, the APIs that fetch_url would use need to support cancellation and the I/O layer needs to suspend the virtual thread and go back to the scheduler. But for the majority of programmers, all of this is hidden.

I also think that some of the APIs really aged badly for supporting well-behaved concurrent systems. For instance, I very much prefer Rust’s idea of enclosing values in a mutex over carrying a mutex somewhere on the side.

Also, semaphores are an incredibly potent system to limit concurrency and to create more stable systems. Something like this could also become a part of a thread group, directly limiting how many spawns can happen simultaneously.

from functools import partial

def download_and_store(results_mutex, url):
    result = fetch_url(url)
    with results_mutex.lock() as results:
        results.store(url, result)

def download_all(urls):
    results = Mutex(MyResultStore())

    with ThreadGroup(max_concurrency=8) as g:
        for url in urls:
            g.spawn(partial(download_and_store, results, url))

    return results
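A `Mutex` that encloses its value can be approximated in today's Python with a lock and a context manager. This is a sketch with made-up names (`Mutex`, `MyResultStore`, `worker`), using plain `threading` instead of a thread group so that it runs as-is:

```python
import threading
from contextlib import contextmanager


class Mutex:
    """Owns a value; the only way to reach it is through lock()."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    @contextmanager
    def lock(self):
        with self._lock:
            yield self._value


class MyResultStore:
    def __init__(self):
        self.items = {}

    def store(self, url, result):
        self.items[url] = result


results = Mutex(MyResultStore())


def worker(n):
    with results.lock() as store:
        store.store(f"url-{n}", n)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with results.lock() as store:
    stored = dict(store.items)
```

Unlike in Rust, nothing stops the reference from escaping the `with` block, so this enforces the discipline by convention only.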

Futures

There will be plenty of reasons to use futures and they would continue to hang around. One way to get a future is to hold on to the return value of the spawn method:

from functools import partial

def download_all(urls):
    futures = []
    with ThreadGroup() as g:
        for url in urls:
            # partial binds url now; a bare lambda would look up the loop
            # variable only when the thread runs and might see a later URL
            futures.append((url, g.spawn(partial(fetch_url, url))))

    return {url: future.result() for (url, future) in futures}
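The closest thing in today's standard library is `concurrent.futures.ThreadPoolExecutor`, whose `submit` returns a future in the same way. This sketch uses a stand-in `fetch_url`. The semantics differ from a thread group in one important way: a failure in one task does not cancel the others, it only surfaces when `result()` is called.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_url(url):
    # Stand-in for a real blocking download.
    return f"<body of {url}>"


def download_all(urls):
    futures = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url in urls:
            # submit() takes the arguments directly, so url is bound now.
            futures.append((url, pool.submit(fetch_url, url)))

    # Leaving the with-block waits for all submitted work to finish.
    return {url: future.result() for (url, future) in futures}


pages = download_all(["a", "b", "c"])
```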

Spawn Without Thread Groups

One big question is if spawn should work if there is no thread group. For instance, in Trio, which is a Python async library, the decision was made that you have to always have the equivalent of a thread group — they call it a nursery — to spawn an operation. I think that this is a very sensible policy, but there are situations where you cannot really do that. I can imagine various different alternatives for this, such as having a default thread group hang around for background tasks that is implicitly joined when the process is shutting down. However, I think the most important thing is to bring as much of the intended behavior to the default APIs.

Future of Async Await?

Inside such a system, what would be the future of async/await? Well, that is up for debate. It does seem rather reasonable to find ways to keep asynchronous functionality working for already existing code, but I do think it would be entirely unnecessary for code written in the future.

I would like you to consider this as a conversation starter about virtual threads and less about a fully fleshed out idea. There are a lot of questions open about this, particularly in the context of Python, but the idea of no longer having to deal with colored functions really appeals to me and I hope we can explore it a bit.