RSS preview of Blog of The Practical Developer

Apple Took 50 Years for 3 CEOs — GUI Agents Went from Paper to Production in One

2026-04-22 12:56:34

Yesterday, Apple announced a landmark succession: Tim Cook steps down as CEO to become Executive Chairman, with John Ternus taking over on September 1. In its 50-year history, Apple has had just three CEOs: Jobs, Cook, Ternus.

Three people. Fifty years. Transitions spaced more than a decade apart.

Now consider the AI Agent space: one year ago, most people were still debating whether AI could operate a computer at all. Today, there are open-source projects delivering usable on-device solutions.

This article breaks down the technical evolution of GUI Agents — using Mano-P, our open-source project, as a concrete example of what it takes to go from training to on-device deployment.

What Is a GUI Agent?

A GUI Agent's core mission: let AI operate a computer's graphical interface the way a human does — recognizing screen elements, understanding task intent, and executing clicks, typing, and drag-and-drop operations.

There are currently two main technical approaches:

Approach       | Mechanism                                                     | Strength                     | Limitation
API/DOM-driven | Reads interface structure via accessibility APIs or DOM trees | Precise element targeting    | Depends on app-specific interfaces
Pure vision    | Understands UI from screenshots alone                         | Works across any application | Higher demand on visual comprehension

Mano-P takes the pure vision route. Designed for Mac, it's an on-device GUI Agent — "Mano" means "hand" in Spanish, "P" stands for Person. AI for Personal. It runs entirely locally; no data leaves the device.

Mano-P Open Source Architecture

Training: Bidirectional Self-Reinforcement Learning

The training pipeline follows a three-stage progressive framework:

Stage 1: SFT (Supervised Fine-Tuning)
    ↓  Build foundational capabilities
Stage 2: Offline Reinforcement Learning
    ↓  Learn strategy optimization from historical data
Stage 3: Online Reinforcement Learning
    ↓  Continuously improve through real-environment interaction

Stage 1 — SFT: Supervised fine-tuning on high-quality GUI operation datasets. The model learns basic interface understanding and action mapping — ground-truth capability building.

Stage 2 — Offline RL: Uses collected interaction trajectories to optimize policies via reinforcement learning. Extracts success/failure signals from historical operations without requiring live environment interaction, keeping training costs manageable.

Stage 3 — Online RL: Interacts with real GUI environments, adjusting strategy based on live feedback. The key challenge here is balancing exploration (trying new operation paths) with exploitation (reinforcing proven strategies).

Inference: Think-Act-Verify Loop

The inference mechanism uses a think-act-verify cycle:

while not task_complete:
    # Think: analyze current screen, plan next action
    thought = model.think(screenshot, task_context)

    # Act: execute GUI operation (click, type, scroll)
    action = model.act(thought)
    execute(action)

    # Verify: capture a fresh screenshot, check result
    screenshot = capture_screen()
    verified = model.verify(screenshot, expected_state)

    if not verified:
        task_context.update(error_info)  # feed the failure back into Think

This gives the Agent self-correction capability. In real desktop environments, unexpected popups, loading delays, and dynamic element repositioning are common — the verify step catches these before errors cascade.

Core capabilities span four areas: complex GUI automation, cross-system data integration, long-task planning and execution, and intelligent report generation.

Benchmark Performance

OSWorld: Mano-P's 72B model achieves 58.2% success rate, ranking #1 among specialized GUI agent models. Second place scores 45.0%. OSWorld simulates real OS environments with cross-application tasks including file operations, browser interactions, and office software workflows.

WebRetriever Protocol I: Scores 41.7 NavEval, surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3). This benchmark focuses on web information retrieval and interaction.

Mano-P Benchmark Overview

Edge Deployment: 4B Model Running On-Device

On-device deployment is a core feature of Mano-P. Here's the 4B quantized model (w4a16) performance on M4 Pro:

Metric        | Value
Prefill Speed | 476 tokens/s
Decode Speed  | 76 tokens/s
Peak Memory   | 4.3 GB

The w4a16 quantization scheme — 4-bit weights with 16-bit activations — strikes a practical balance: 4-bit weights dramatically reduce memory footprint while 16-bit activations preserve numerical precision during inference.
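As a rough sanity check (illustrative arithmetic only, not official figures), the weight footprint of a 4B-parameter model under w4a16 works out like this:

```python
params = 4e9       # 4B parameters
weight_bits = 4    # "w4": 4-bit weights

# Raw weight storage in GB
weights_gb = params * weight_bits / 8 / 1e9
print(weights_gb)  # 2.0
```

The remaining headroom up to the 4.3 GB peak covers 16-bit activations ("a16"), the KV cache, and runtime overhead.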

Hardware requirement: Apple M4 chip + 32 GB RAM. Fully local execution — your screen data never leaves your device.

Getting Started

Open-sourced under the Apache 2.0 license:

# Install
brew tap HanningWang/tap && brew install mano-cua

GitHub: https://github.com/Mininglamp-AI/Mano-P

Wrapping Up

From the three-stage progressive training framework, to think-act-verify inference, to w4a16 quantization enabling edge deployment — the path from "concept" to "locally usable" GUI Agents is becoming clear.

Apple took 50 years and three leaders. The GUI Agent space went from academic papers to open-source tools in roughly one year. These are two fundamentally different timescales.

For developers, Mano-P — Apache 2.0 licensed, runnable on a local Mac — is already a starting point for exploration and experimentation.

Tim Cook Steps Down — Is the Mac Becoming the Next AI Agent Platform?

2026-04-22 12:55:56

On April 20, Apple dropped a bombshell.

On April 20, Apple announced that Tim Cook will transition from CEO to Executive Chairman, with hardware engineering SVP John Ternus taking over on September 1. In its 50-year history, Apple has now had just three CEOs.

Cook's 14-year tenure defined two eras: making Apple the world's most valuable company, and driving the historic transition from Intel to Apple Silicon. Ternus's background is telling — he's not from the software or services side. He's Apple's hardware engineering chief, the person who shipped Apple Silicon. Choosing a hardware engineer as CEO is Apple signaling that hardware innovation remains the priority for the next decade.

This signal is especially interesting in the context of AI. For the past few years, AI development and deployment has been virtually synonymous with "NVIDIA GPUs + Windows/Linux." The Mac has been a non-factor in the AI ecosystem. But Apple Silicon is changing that — more and more developers are running AI workloads on Mac, and it's no longer just experimentation.

Why the Mac Couldn't Do AI Before

The answer is straightforward: the CUDA ecosystem. NVIDIA GPUs + CUDA have effectively monopolized AI training and inference infrastructure. Apple and NVIDIA parted ways after 2016 — Macs haven't shipped with NVIDIA GPUs since. Without CUDA, major deep learning frameworks (PyTorch, TensorFlow) treated Mac as a second-class citizen — technically supported, but performance-limited.

AI practitioners defaulted to Windows desktops or Linux servers. Mac was fine for writing code, but running models meant SSH-ing into a remote machine.

What Apple Silicon Changed

The M1 chip in 2020 was the inflection point. Apple Silicon's Unified Memory Architecture broke the traditional CPU-GPU separation — CPU and GPU share a single memory pool, eliminating the need to shuttle data between them. This design has natural advantages for AI inference:

  • No VRAM bottleneck: 32 GB or more of unified memory is directly available for model inference, unlike traditional GPUs constrained by dedicated VRAM
  • Superior power efficiency: Lower power consumption at equivalent compute, enabling MacBooks to run models on battery
  • Growing ecosystem: Apple launched MLX, a machine learning framework optimized for Apple Silicon; PyTorch now officially supports the MPS backend
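On the PyTorch side, opting into the MPS backend is a one-line device selection (a minimal sketch; it falls back to CPU where MPS is unavailable):

```python
import torch

# Use Apple's Metal (MPS) backend when available, otherwise fall back to CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(512, 512, device=device)
y = x @ x  # the matmul runs on the Apple GPU when device is "mps"
print(y.shape)  # torch.Size([512, 512])
```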

From M1 through M4, each generation has delivered meaningful improvements in AI inference performance. With M4 and 32 GB RAM, Macs can now smoothly run models that previously required dedicated GPU servers.
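A back-of-the-envelope way to see why 32 GB of unified memory matters (the headroom figure here is an assumption for illustration, not a measured number):

```python
def fits_in_unified_memory(params_billions: float, bytes_per_param: float,
                           ram_gb: float, headroom_gb: float = 8.0) -> bool:
    """Rough check: do the model weights fit after reserving headroom
    for the OS, KV cache, and other apps? Illustrative numbers only."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb <= ram_gb - headroom_gb

# A 4B model at ~0.5 bytes/param (4-bit weights) needs ~2 GB of weights
print(fits_in_unified_memory(4, 0.5, 32))    # True
# A 72B model at the same precision needs ~36 GB and does not fit in 32 GB
print(fits_in_unified_memory(72, 0.5, 32))   # False
```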

A Real-World Example: GUI Agents on Mac

To make this concrete, consider GUI Agents — a fast-growing area in AI where models directly observe the screen, understand interface elements, and operate mouse and keyboard to complete complex computer tasks. These applications demand real-time local responsiveness, making them a natural fit for Mac deployment.

Mano-P is our open-source GUI Agent built specifically for Mac. "Mano" comes from the Spanish word for "hand," "P" stands for Person — AI for Personal. It uses pure vision — no accessibility APIs, no DOM parsing, just screenshot understanding. Everything runs locally on Mac; no data leaves the device.

Mano-P Open Source Architecture

How Does It Perform on Apple Silicon?

The question everyone cares about: is Apple Silicon actually fast enough for AI Agents?

OSWorld Benchmark (the standard end-to-end evaluation for GUI Agents): Mano-P's 72B model achieves 58.2% success rate, ranking #1. Second place scores 45.0% — a gap of over 13 percentage points.

WebRetriever Protocol I: Mano-P scores 41.7 NavEval, surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3).

Mano-P Benchmark Overview

Local inference performance — Mano-P's 4B quantized model (w4a16) on M4 Pro:

Metric        | Value
Prefill Speed | 476 tokens/s
Decode Speed  | 76 tokens/s
Peak Memory   | 4.3 GB

At 4.3 GB peak memory on a 32 GB Mac, you can run the Agent alongside your IDE, browser, Slack, and everything else without breaking a sweat.

Hardware requirement: Apple M4 chip + 32 GB RAM.

Technical Overview

Training: Bidirectional self-reinforcement learning with three progressive stages — SFT → Offline RL → Online RL.

Inference: Think-act-verify loop. Analyze the screen state, execute an action, verify the result. If something unexpected happens (popup, loading delay), the system self-corrects.

Core capabilities: Complex GUI automation, cross-system data integration, long-task planning and execution, intelligent report generation.

brew tap HanningWang/tap && brew install mano-cua

Open-sourced under Apache 2.0: https://github.com/Mininglamp-AI/Mano-P

The Mac AI Ecosystem Is Taking Shape

Mano-P is our contribution, but it's one data point. The bigger picture:

  • MLX gives developers an efficient way to run models on Apple Silicon
  • Ollama and LM Studio make running open-source LLMs on Mac as easy as installing an app
  • Core ML continues to improve, with Apple investing in on-device AI infrastructure

The old consensus — "doing AI means Windows/Linux + NVIDIA" — is loosening. Not because the Mac is replacing GPU servers for large-scale training, but because for inference, personal development, and on-device applications, the Mac is becoming a genuinely viable platform.

Apple just chose a hardware engineer as CEO. The Mac's AI capabilities are only going up from here. We've experienced this trend firsthand building GUI Agents on Mac, and we're excited to see more developers explore this direction.

the next software stack needs more than code generation

2026-04-22 12:55:17

Most people in software are staring at the wrong milestone. Models write API handlers, unit tests, and migrations fast enough that typing isn't the limiting factor anymore. In a world of high-concurrency agents, the act of writing code is no longer the bottleneck. That part of the problem is finished.

The real trouble starts the moment that code lands. Why was this change made? Which requirement forced it? And who actually checked the risky paths in the auth flow? You can still answer those questions today, but it takes a kind of technical archaeology—digging through PR threads, Slack messages, and documentation that was out of date the day it was written. That workflow held up while humans set the pace. It breaks the moment you stop being the bottleneck.

the velocity trap

Most teams run AI-assisted development through a loop of prompt, branch, code, review, and merge. At low volume, it holds up. Then usage increases. You start seeing changes that look fine but carry no clear origin story. A feature flag shows up in production with a name nobody recognizes. An environment variable gets added "just to make something work" and stays there for six months because nobody is sure what it’s gating.

Then we have a growing crowd of "psychosis coders" who think they are shipping masterpieces because they saw an agent move a cursor. They hit approve the second the diff looks plausible, never noticing the trail of empty TODO comments, shallow mocks, and tests that don't actually assert anything meaningful. They are shipping "passable" trash masquerading as velocity.

Maintaining real quality at agentic speeds requires a gauntlet. In my own work, I have to run Model B against Model A like a caffeine-fueled nitpicker for ten rounds just to reach consensus. Then Model C does the same dance. This cross-model review is mandatory to maintain velocity without the system collapsing into a pile of actual slop.

But even this gauntlet is a patch, not a solution. We are burning a mountain of tokens to force quality through a pipe that was never meant to handle it. This is "Approval Theater" as a survival strategy. No: neither your carefully crafted markdown files, nor your prompt engineering, nor your harness stacking solves this.

why clean merges still fail

Agent A updates PricingEngine::price() to apply a discount based on User::join_date.

Agent B removes join_date from User and introduces a UserMetadata lookup that returns Option<NaiveDate>.

The pricing path now depends on a value that may not exist. In the failure case, the lookup returns None, and a later fallback resolves that missing value to Money::default(), producing 0.00.

Both changes compile. Both pass their unit tests. Because they don't touch the same lines of code, Git merges them without a single conflict.

In production, the pricing logic fails. Revenue doesn't drop to zero. That would be obvious. It becomes inconsistent instead. Some users are charged correctly. Others hit the missing metadata path and get a zero price. Support tickets appear first. Finance notices the reconciliation mismatch three weeks later.

You're left trying to unwind two changes that were never evaluated together. Each was correct in isolation; the failure only existed in the interaction. A human developer might have caught that by holding the context in their head, but that assumption doesn't scale when dozens of agents are moving at once.
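A minimal sketch of that interaction (Python stand-ins for the article's Rust types; every name and number here is illustrative):

```python
from datetime import date
from typing import Optional

# Agent B's change: join_date moved off the user into a metadata lookup
# that can legitimately return None
user_metadata: dict[int, Optional[date]] = {
    1: date(2020, 6, 1),   # migrated user
    2: None,               # metadata row never backfilled
}

def price(user_id: int, base: float) -> float:
    # Agent A's change: a tenure discount keyed on join_date
    join_date = user_metadata.get(user_id)
    if join_date is None:
        return 0.0         # the Money::default() fallback from the article
    years = (date(2026, 4, 22) - join_date).days // 365
    return base * max(0.5, 1 - 0.05 * years)

print(price(1, 100.0))  # a correctly discounted price
print(price(2, 100.0))  # 0.0 -- the silent failure mode
```

Each function is correct against its own tests; only the composition is broken, and text-level merging never evaluates the composition.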

the idempotency crisis

There is a deeper, uglier problem with agents and Git: retries. When a prompt fails or a network timeout hits, an agent often tries again. In a standard Git flow, this leads to double-commits, "dirty" working directories, or a messed-up HEAD state that requires a human to untangle. Then come additional worktrees, agents that never check whether they're on the right branch in the right tree, and agents that ignore the documentation paths you've specified and pollute the repository root with stray markdown files.

Git wasn't built for idempotent operations from a thousand concurrent workers. It was built for a human at a terminal who can see when a command failed. If the next stack doesn't have request-level idempotency built into the storage layer, you aren't building a system; you're building a race condition.

files are the wrong primitive now

Git shows you what changed in the text, but it doesn't show you why. You see two files modified, but you can’t see the requirement that triggered the edit. We review diffs and guess at intent.

Agents don't operate on files; they operate on relationships. A discount rule depends on a user attribute; a billing flow depends on an auth decision. When we take that rich graph of intent and flatten it into files, we lose the fidelity of the work. This mismatch leads to "clean" merges that are semantically murky, repeated edits to the same symbols, and retries that converge on something other than what we actually meant to build.

building the floor

I'm building a stack that treats intent as the primary object, not the diff. It's not one tool. It's a set of components doing work Git was never designed for.

aivcs is the version control core: a 9-crate Rust workspace. It uses blake3 for content-addressed hashing and groups changes around intent as an Episode instead of scattering them across commits. An Episode carries the requirement that triggered the change, the symbols actually touched, and the evidence (tests, benchmarks, profiles) attached when the work lands. It can import Git history as a baseline and export structured Episodes back into a branch, so teams don’t have to migrate all at once.

trstr is the parsing layer. It’s spec-grounded, not grammar-by-example. When an agent edits a symbol, the system knows what that symbol is, not just which bytes moved. Tree-sitter is built for editor features. This needs stricter guarantees.

sqry handles symbol-level indexing. It builds the graph from a rule like “apply a legacy discount” to every call site, call chain, and dependent type that touches it. That’s what lets an Episode carry semantic scope instead of a file list. It’s also how you catch the PricingEngine / UserMetadata class of failure before merge.

wsmux is the concurrency layer: a CRDT over the code graph. When dozens of agents edit the same repository, the merge surface isn’t text. It’s operations on symbols and relationships. wsmux makes those edits converge instead of producing two clean merges that disagree at runtime.

The storage layer is idempotent by construction. The same operation with the same content and intent resolves to the same Episode. Retries don’t duplicate work. A thousand workers hitting a flaky network stop being a race condition.
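A toy illustration of that property (stdlib blake2b standing in for blake3; every name here is hypothetical, not aivcs's actual API):

```python
import hashlib

def episode_id(intent: str, content: bytes) -> str:
    # Content-addressed ID: same intent + same content -> same Episode
    h = hashlib.blake2b(digest_size=16)
    h.update(intent.encode())
    h.update(b"\x00")  # separator so (intent, content) pairs can't collide
    h.update(content)
    return h.hexdigest()

store: dict[str, bytes] = {}

def commit(intent: str, content: bytes) -> str:
    eid = episode_id(intent, content)
    store.setdefault(eid, content)  # a retry is a no-op, not a duplicate
    return eid

first = commit("apply legacy discount", b"...diff...")
retry = commit("apply legacy discount", b"...diff...")  # retry after a timeout
print(first == retry, len(store))  # True 1
```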

This doesn’t replace Git. It sits alongside it.

The goal is simple: when something changes, you can answer why without digging through history. Decisions travel with the change. Evidence is attached when the change is made, not reconstructed later.

The system remembers what changed. It should also remember why.

The bottleneck moved. The stack didn’t. That gap is where the risk lives.

I Tried Vinext: The Build Felt Slightly Better

2026-04-22 12:55:00

I have been building a full-stack mentor app with Bun, tRPC, and Next.js. The project is not very large, but it has the usual pieces you would expect in a modern full-stack application: authentication flow, dashboard pages, scheduled classes, signup, server-side logic, and API communication through tRPC.

The development experience with Next.js was smooth overall. I was not struggling with the framework, and I was not trying to move away from it because of one major problem. But during regular development, one small thing kept coming back into my mind: the production build time.

It was not painfully slow. It did not block me for minutes. But when I was making small changes, testing them, and running builds again and again, even a few seconds started to matter. A build that feels fine once can start feeling heavy when it becomes part of your daily feedback loop.

That is what made me try Vinext.

I was not trying to write a deep benchmark report. I was not trying to prove that one framework is better than another. I simply wanted to take the same kind of app, run it through Vinext, and see how the build experience felt.

This post is just that: a practical view from one developer trying Vinext with an app built using Bun and tRPC.

Why Build Time Started to Matter

Build time is one of those things we usually ignore until it starts interrupting our rhythm.

When a project is small, we expect everything to feel quick. We expect the development server to start fast, changes to reflect quickly, and production builds to finish without making us wait too long. But modern full-stack frameworks do a lot of work behind the scenes. They compile client code, prepare server code, collect route information, generate static pages where needed, and perform type checks or framework-specific analysis.

None of that work is wrong. In fact, most of it is useful. The problem is that during active development, the feeling of speed matters almost as much as the actual number.

If I run a build once before deployment, five seconds may not feel like a big deal. But if I run builds many times while testing changes, comparing outputs, or checking production behavior, the waiting becomes more noticeable.

That was the situation in my project. The Next.js build was acceptable, but I wanted to see whether Vinext could make that loop feel lighter.

The App Setup

The app I tested was a mentor platform built with a modern TypeScript stack. The important pieces were:

  • Bun as the runtime and package manager
  • Next.js as the original framework
  • tRPC for type-safe API communication
  • A few practical app routes such as dashboard, scheduled classes, and signup

This matters because I was not testing Vinext with an empty starter template. I wanted to try it against the type of app I actually build.

At the same time, I want to be clear: this was still a small project. The result should be understood as a local observation, not as a universal statement about every Next.js or Vinext application.

The Next.js Build Output

Here is the build output I got from Next.js:

$ next build
▲ Next.js 16.1.6 (Turbopack)

✓ Compiled successfully in 4.7s
✓ Finished TypeScript in 2.3s
✓ Collecting page data using 7 workers in 343.7ms
✓ Generating static pages using 7 workers in 178.8ms

The first number that stood out to me was the 4.7s compile time. After that, the build still had to finish TypeScript, collect page data, and generate static pages.

To be fair, this is not a bad build time. For many projects, this would be completely acceptable. But for the size of this project, I expected the build to feel a little quicker.

❗️The important detail here is that I am not calling this a complete benchmark of Next.js. I am only looking at the build output from my own project and describing how it felt during repeated development.

What I Wanted to See From Vinext

Before trying Vinext, I had a simple question:

Can a Vite-powered Next.js-like framework make this build feel lighter?

That was the entire motivation. I was not expecting a miracle. I was not expecting the build to drop from seconds to milliseconds. I just wanted to see whether the experience felt cleaner and whether the output showed any practical improvement.

Vinext was interesting to me because it tries to keep a familiar mental model while using a different build pipeline. For someone who already works with Next.js, that matters. A tool becomes easier to try when it does not ask you to forget everything you already know.

Since my app was already using Bun and tRPC, this was also a useful test for my actual workflow.

Trying Vinext

I ran the project with Vinext and looked at the build output carefully. Again, this was not a formal benchmark suite. I did not run multiple cold builds, average them, or test across different machines.

This was a direct local test: run the build, observe the output, compare the visible numbers, and see how the process feels.

For a first experiment, that was enough. Sometimes you do not need a perfect benchmark to notice whether a tool feels worth exploring further.

Vinext Performs Slightly Better

Here is the Vinext build output from my test:

$ vinext build

[1/5] analyze client references...
[2/5] analyze server references... → built in 567ms
[3/5] build rsc environment...     → built in 899ms
[4/5] build client environment...  → built in 1.69s
[5/5] build ssr environment...     → built in 743ms

Build complete.

When I added the visible Vinext build stages, the total came to roughly 3.9s. Compared with the 4.7s compile number I saw from Next.js, this was not a huge difference, but it was still noticeable.
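Summing the stage times visible in that output (stage 1 prints no duration, so the real total is slightly higher):

```python
# Stages 2-5 from the vinext build output above, in milliseconds
stage_times_ms = [567, 899, 1690, 743]

total_s = sum(stage_times_ms) / 1000
print(round(total_s, 1))  # 3.9
```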

The difference was around 0.8s based on those visible numbers. On paper, that may look small. In practice, it was enough to make the build feel slightly lighter.

That is the main point of this post.

Vinext did not completely change the experience. It did not make the build instant. But it did give me a slightly better build result in this project, and the output made the process feel clearer.

The Build Output Felt Easier to Read

One thing I liked about Vinext was how the build process was separated into visible stages.

Instead of only looking at one big build step, I could see what was happening:

  • client references were analyzed
  • server references were analyzed
  • the RSC environment was built
  • the client environment was built
  • the SSR environment was built

This made the build output feel more understandable. I could immediately see where time was being spent and how the framework was preparing different parts of the app.

That kind of clarity is useful. Even when the time difference is small, good output helps developers understand the system better.

Why I Am Calling It "Slightly Better"

I am intentionally using the phrase "slightly better" because that is the honest way to describe this result.

The improvement was real in my test, but it was not dramatic. I do not want to turn a small local observation into a huge claim. Next.js is still mature, powerful, and widely used. Vinext is still something I am exploring.

But small improvements are still worth noticing.

In daily development, performance is not only about large wins. Sometimes it is about shaving off small delays, making output easier to understand, and reducing the friction that appears when you repeat the same task many times.

That is where Vinext felt good to me.

What I Liked About Vinext

The first thing I liked was the build clarity. The output showed each stage separately, and that made the process easier to follow.

The second thing was the slight improvement in build time. A difference of less than a second may not sound exciting, but when the build already runs in only a few seconds, even a small improvement is visible.

The third thing was that the setup felt promising with Bun and tRPC. I did not have to think about it as a completely different way of building apps. It still felt close to the Next.js mental model, which made the experiment more comfortable.

The fourth thing was the feeling of direction. Vinext feels like an attempt to bring the speed and simplicity people like from Vite into a full-stack React framework style. That idea itself is interesting.

What I Would Not Claim Yet

There are a few things I do not want to overstate.

I am not saying Vinext is always faster than Next.js. I tested one project, in one setup, from my local environment.

I am not saying everyone should migrate immediately. A framework decision is not only about build time. It is also about stability, documentation, deployment, ecosystem support, debugging, and long-term maintenance.

I am also not saying a 0.8s improvement will matter equally to every developer. For some people, it may not matter at all. For others, especially those who care about tight feedback loops, it may be enough to explore further.

The honest statement is simple: in my app built with Bun and tRPC, Vinext gave me a slightly better build experience.

Where Vinext Still Needs Improvement

Vinext still needs more improvement before I would speak about it as a confident production choice.

It needs more real-world testing. It needs stronger documentation. It needs more examples. It also needs better out-of-the-box support for different project structures and use cases.

That is not a negative point. That is how open-source tools grow.

Every framework starts with rough parts. The difference between a tool that disappears and a tool that becomes useful is often the community around it. If developers only wait for tools to become perfect, those tools move slowly. But if developers try them, find issues, fix what they need, and send pull requests, the tool improves faster.

Vinext feels like the kind of project where that mindset matters.

If something is missing and the project is open source, the answer should not always be to complain or wait. Fork it. Understand the code. Add the support you need. Open a pull request.

This is where I think the mindset of many Indian developers starts falling apart. We use open-source tools every day, but we often stay only on the consumer side. We depend on frameworks, libraries, runtimes, CLIs, and deployment tools, but when we find a gap, we rarely take the next step of contributing back.

That is a bigger topic, and I want to write about it separately. But Vinext reminded me of it clearly: if we want better tools, we should not only wait for someone else to improve them. We should also build the habit of participating.

Final Thought

For my app built with Bun and tRPC, Vinext gave me a slightly better build experience than what I saw with Next.js. The improvement was not huge, but it was enough to make me interested.

The Next.js build was acceptable. The Vinext build felt a little lighter. The output was clearer, and the staged build process made the experience easier to understand.

I would not say everyone should migrate immediately. I would say Vinext is worth trying if you are curious about Vite-powered builds in a Next.js-like setup.

It still needs improvement, but that is also the point. Good tools do not become mature only because people wait for them to become perfect. They improve when developers test them, report issues, and contribute back.

Playwright MCP burns 114k tokens for one workflow. Here's why, and what to do about it.

2026-04-22 12:50:04

A recent r/ClaudeAI post measured a single Playwright MCP workflow at 114,000 tokens. Not a complex task — a 7-step navigation + form submission that ran in under a minute. Same workflow as a compiled tap.run: zero tokens.

This isn't "Playwright MCP is bad." It's a structural property of running an LLM at runtime versus compile time.

Where the tokens go

Each Playwright MCP call sends back to the model:

  • The current page's accessibility tree (~5-15K tokens for a typical SPA)
  • A screenshot encoded as base64 (~2-8K tokens depending on quality)
  • The console output since last call
  • The action result + any error context

The model needs all of that to decide the next action. So a 7-step workflow = 7 × (~15K) = ~100K tokens. Add the schema injection at session start (~1.3K per tool, ~28 tools loaded eagerly = ~36K) and the total lands in the neighborhood of the 114K observed; per-step payloads vary with page complexity, so these figures are approximate.

The optimisations help — DOM compression, accessibility-only modes, smaller screenshots — but the per-step cost is still proportional to page complexity. Add interaction depth and the cost goes up linearly.

The compiler alternative

The insight tap forge is built on: most browser automation is a known workflow. You're not exploring; you're executing the same task on the same site, repeatedly. The LLM is only needed to figure out the workflow the first time. After that, it's overhead.

# First time — LLM authors the program
$ tap forge https://example.com/login → submit
✓ Inspected: form#login, 3 fields
✓ Verified: redirect to /dashboard, status 200
✓ Saved: example/login.tap.js   (47 lines of JavaScript)

# Forever after — no LLM, no tokens
$ tap example login user=alice pass=xxx   # 200ms, $0.00
$ tap example login user=alice pass=xxx   # 200ms, $0.00

For the workflow that cost 114K tokens with Playwright MCP, the equivalent .tap.js file is ~80 lines. It runs in 200ms. Token cost: 0 (after the one-time forge).

When each makes sense

Playwright MCP wins when:

  • The workflow is unique each time (agent exploration)
  • The site changes structure between runs (no stable program possible)
  • You're prototyping and don't yet know what you want to extract

Compiled taps win when:

  • You run the same workflow more than ~5 times
  • The site's structural pattern is stable (~95% of sites — most A/B tests don't change DOM, just CSS)
  • You need monitoring (deterministic output = row count is a signal)
  • You need offline execution

The break-even is low. Even at $5/MTok, one Playwright MCP run of 100K tokens = $0.50. Five runs = $2.50; by twenty runs you've spent more than the entire $9/mo Hacker tier of a compiler-based tool.
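The same arithmetic, spelled out (prices from the paragraph above; purely illustrative):

```python
tokens_per_run = 100_000
usd_per_mtok = 5.00  # $ per million tokens

cost_per_run = tokens_per_run / 1_000_000 * usd_per_mtok
print(cost_per_run)      # 0.5
print(5 * cost_per_run)  # 2.5  -- five runs
print(9 / cost_per_run)  # 18.0 -- runs needed to exceed a $9/mo flat tier
```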

Two structural differences worth understanding

1. Output consistency. When the same site/same prompt produces slightly different extractions across runs (the LLM is non-deterministic), monitoring is structurally hard. Row count fluctuation isn't noise — it's the model. With a compiled tap, row count fluctuation IS noise, and you can alert on it.

2. Failure detection. Playwright MCP detects failure reactively — the tool call returns an error, the LLM sees it, retries with a different approach. By the time you notice, tokens are spent and time is lost. Compiled taps detect failure proactively via fingerprint diffing — tap doctor checks if the page structure changed BEFORE the run fires. If drifted, the run doesn't even start.
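A sketch of what proactive drift detection can look like (the hashing scheme here is an assumption for illustration, not tap doctor's actual implementation):

```python
import hashlib

def fingerprint(structure: str) -> str:
    # Hash the structural skeleton (tags, form fields), not content or CSS,
    # so styling-only A/B changes don't trigger false drift alarms
    return hashlib.sha256(structure.encode()).hexdigest()

saved = fingerprint("form#login > input[name=user] + input[name=pass] + button")

# Before each run: re-derive the live page's skeleton and compare
live = fingerprint("form#login > input[name=user] + input[name=pass] + button")
if live != saved:
    raise RuntimeError("page structure drifted; skipping run")
print("structure unchanged, safe to run")
```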

The benchmark question

Honest comparison: Playwright MCP has been the most flexible browser-agent setup for the past year. The 114K tokens is the price you pay for that flexibility. If your workflow is varied enough that you need it, pay it. If your workflow is the same automation run 1,000 times, paying it is leaving money on the table.

The broader pattern: every browser-agent tool faces this LLM-at-runtime vs LLM-at-compile-time tradeoff. The question isn't "which tool is better" — it's "does my workload repeat enough to amortize a one-time compile?"

For most production scrapers, the answer is yes.

Day 8: I Built a Timeline to See the Life of Data

2026-04-22 12:49:08

Once I added context…

I still felt something was missing.

Data wasn’t static.

It changed over time.

So I asked:

What if I could see the history of a piece of data?

🧵 The Idea

A Data Timeline

📊 What It Shows

For each user:

  • When data was created
  • When it was updated
  • Whether it came from Stripe / Postmark
  • Who triggered it (admin/system)
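As a minimal sketch, one event record per change covers those four fields (all names here are assumptions, not the post's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TimelineEvent:
    user_id: int
    action: str      # "created" or "updated"
    source: str      # e.g. "stripe", "postmark", "app"
    actor: str       # "admin" or "system"
    at: datetime

events = [
    TimelineEvent(42, "created", "app", "system",
                  datetime(2026, 4, 1, tzinfo=timezone.utc)),
    TimelineEvent(42, "updated", "stripe", "system",
                  datetime(2026, 4, 20, tzinfo=timezone.utc)),
]

# The timeline is just these records sorted by timestamp
for e in sorted(events, key=lambda e: e.at):
    print(e.at.date(), e.action, "via", e.source, "by", e.actor)
```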

🤯 Why This Is Powerful

Instead of guessing:

“What happened here?”

You can now see:

“Here’s exactly what happened.”

🔍 Debugging Becomes Easier

This also changed how I debug things.

Instead of:

  • digging through logs
  • checking multiple systems

I can just:

Look at the timeline.

🧠 Unexpected Benefit

This isn’t just a dev feature.

It’s also:

  • Audit-friendly
  • Compliance-friendly
  • Support-friendly