MoreRSS

site iconGeoffrey HuntleyModify

I work remotely from a van that is slowly working its way around Australia.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of Geoffrey Huntley

rizzler: stop crying over Git merge conflicts and let AI handle the drama

2025-04-23 19:56:28

rizzler: stop crying over Git merge conflicts and let AI handle the drama

💀 Ugh, merge conflicts: That sinking feeling when Git screams at you? We've all been there. Manually fixing those tangled messes? It's giving... tedious. It's giving... waste of my precious time. 😩

🚀 Enter rizzler: Your new AI bestie that actually *gets* Git. This ain't your grandpa's merge tool. rizzler slides into your Git workflow and uses ✨ AI magic ✨ (think OpenAI, Claude, Gemini, Bedrock - the whole squad) to automatically resolve those annoying merge conflicts. Less time untangling, more time coding (or scrolling). You're welcome. 😉

Basically, it turns this:

<<<<<<< HEAD
const message = "Hello from main branch!";
=======
const message = "Waddup from feature branch!";
>>>>>>> feature-branch

Into actual, usable code, letting you get back to the important stuff. ✨

🚀 Get Rizzin': Installation

Ready to ditch the conflict drama? Let's get you set up. Head on over to the repository below for the source code, installation and configuration instructions:

ghuntley/rizzler
Contribute to ghuntley/rizzler development by creating an account on GitHub.
rizzler: stop crying over Git merge conflicts and let AI handle the drama

no cap, how does this work?

The rizzler is a low-level merge driver that can be used as a command-line tool without Git and can be configured into Git itself as a resolver strategy. It queries the LLMs you have configured (there's a bundle of strategies within, and I'd love for folks to send in more)

rizzler: stop crying over Git merge conflicts and let AI handle the drama

If a file hits eight merge conflicts and can't crack one, it'll tackle the rest and send an "oops" back to Git, halting the merge party. That's your cue to dust off your favourite editor and resolve that stubborn conflict manually—throwback style, like it's 2005. On the bright side, successful fixes get cached on disk, cutting down on LLM costs and giving you a productivity boost—no more twiddling thumbs waiting for the LLM to chime in!

ps. socials

pps: this is a joke. A thought experiment if you will. What if in the future git commits are automatically done by assistants and the commit message contains information that a client such as rizzler could use for injection into the context window? That starts make things interesting fast. Now we got yours and mine with full context and perhaps a way to traverse all the way back up to JIRA via MCP to determine intent…

autoregressive queens of failure

2025-04-22 00:04:53

autoregressive queens of failure

Have you ever had your AI coding assistant suggest something so off-base that you wonder if it’s trolling you? Welcome to the world of autoregressive failure.

LLMs, the brains behind these assistants, are great at predicting the next word—or line of code—based on what's been fed into them. But when the context gets too complex or concerns within the context are mixed, they lose the thread and spiral into hilariously (or frustratingly) wrong territory. Let’s dive into why this happens and how to stop it from happening.

First, I'll need you to stop by the following blog post to understand an agent from first principles.

How To Build An Agent | Amp
Building a fully functional, code-editing agent in less than 400 lines.
autoregressive queens of failure

what an agent is: explained in less than 400 lines of code

Still reading? Great. In the diagram below, an agent has been configured with two tools. Each tool has also been configured with a tool prompt, which advertises how to use the tool to the LLM.

The tools are:

  • Tool 1 - Visit a website and extract the contents of the page.
  • Tool 2 - Perform a Google search and return search results.

Now, imagine for a moment that this agent is an interactive console application that you use to search Google or visit a URL.

autoregressive queens of failure

Whilst using the agent, you perform the actions:

  1. Visit a news website.
  2. Search Google for party hats.
  3. Visit a Wikipedia article about Meerkats.

Each of these operations allocates the results from the above operations into memory - the LLM context window.

autoregressive queens of failure
when data is malloc()'ed into the LLM's context window. It cannot be free() 'd unless you create a brand new context window.

With all that context loaded into the window, all that data is now available for consideration when you ask a question. Thus, there's a probability that it'll generate a news article about Meerkats wearing party hats in response to a search for Meerkat facts (ie. Wikipedia).

That might sound obvious, but it's not. The tooling that most software developers use day-to-day hides context windows from the user and encourages endless chatops sessions within the same context window, even if the current task is unrelated to the previous task.

This creates bad outcomes because what is loaded into memory is unrelated to the job to be done, and results in noise from software engineers saying that 'AI doesn't work', but in reality, it's how the software engineers are holding/using the tool that's at fault.

My #1 recommendation for people these days is to use a context window for one task, and one task only. If your coding agent is misbehaving, it's time to create a new context window. If the bowling ball is in the gutter, there's no saving it. It's in the gutter.

My #2 recommendation is to not redline the context window (see below)

if you are redlining the LLM, you aren’t headlining
It’s an old joke in the DJ community about upcoming artists having a bad reputation for pushing the audio signal into the red. Red is bad because it results in the audio signal being clipped and the mix sounding muddy. It’s a good analogy that applies to software
autoregressive queens of failure

ps. socials

I dream about AI subagents; they whisper to me while I'm asleep

2025-04-13 09:22:02

I dream about AI subagents; they whisper to me while I'm asleep

In a previous post, I shared about "real context window" sizes and "advertised context window sizes"

Claude 3.7’s advertised context window is 200k, but I've noticed that the quality of output clips at the 147k-152k mark. Regardless of which agent is used, when clipping occurs, tool call to tool call invocation starts to fail

The short version is that we are in another era of "640kb should be enough for anyone," and folks need to start thinking about how the current generation of context windows is similar to RAM on a computer in the 1980s until such time that DOS=HIGH,UMB becomes a thing...

I dream about AI subagents; they whisper to me while I'm asleep
LLM context windows are like RAM in an IBM 8086 XT and are a precious resource, but engineers and developer tooling companies do not treat them as such.

The current generation of coding agents work via a tight evaluation loop of tool calls to tool calls that operate within a single context window (ie. RAM). However, the problem with this design is that when an LLM provides a bad outcome, the coding assistants/agents' death spiral and brute force on the main context window which consumes precious resources as it tries to figure out the next steps.

I dream about AI subagents; they whisper to me while I'm asleep
the current generation of software development agents works like this. it's not great (tm)

However, I've been thinking: What if an agent could spawn a new agent and clone the context window? If such a thing were possible, it would enable an agent to spawn a sub-agent. The main agent would pause, wait for the sub-agent to burn through its own context window (ie. SWAP), and then provide concrete next steps for the primary agent.

I dream about AI subagents; they whisper to me while I'm asleep
i suspect next generation agents will look something like this under the hood

It's theoretical right now, and I haven't looked into it. Still, I dream of the possibility that in the future, software development agents will not waste precious context (RAM) and enter a death spiral on the main thread.

p.s. socials

pps. extra reading

Building Multi-Agent Systems
Scaling LLM-based agents to handle complex problems reliably.
I dream about AI subagents; they whisper to me while I'm asleep

"You see this [breakdown] a lot even in non-coding agentic systems where a single agent just starts to break down at some point." - Shrivu Shankar

if you are redlining the LLM, you aren't headlining

2025-04-07 06:31:12

if you are redlining the LLM, you aren't headlining

It’s an old joke in the DJ community about upcoming artists having a bad reputation for pushing the audio signal into the red. Red is bad because it results in the audio signal being clipped and the mix sounding muddy.

if you are redlining the LLM, you aren't headlining

It’s a good analogy that applies to software engineers in the future, however, as these LLMs have context windows.

Claude 3.7’s advertised context window is 200k, but I've noticed that the quality of output clips at the 147k-152k mark. Regardless of which agent is used, when clipping occurs, tool call to tool call invocation starts to fail like below when using Claude 3.7:

≡ prompt

commit the changes and push

≡ Thinking

I don't have direct access to execute git commands on your system. To commit and push the changes, you'll need to run the following commands in your terminal:

# Add all modified files
git add .

# Commit with a descriptive message
git commit -m "Fix channel implementation and parser error reporting
- Fixed parser implementations for while statements and
- Enhanced error reporting in the parser with better con
- Fixed non-blocking channel operations tests
- Updated channel tests to verify implementation complet"

# Push to your remote repository
git push

These commands will:

1. Stage all the changes we made
2. Create a commit with a descriptive message
3. Push the changes to your remote repository

After consulting with some mates in the field, I learned that "advertised" vs. "real" size is a thing—see this paper and the GitHub repository below.

RULER: What’s the Real Context Size of Your Long-Context Language Models?
The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the “needle”) from long distractor texts (the “haystack”), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.
if you are redlining the LLM, you aren't headlining
GitHub - NVIDIA/RULER: This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models? - NVIDIA/RULER
if you are redlining the LLM, you aren't headlining

Thus, 'redlining' will differ per LLM, and each LLM will have unique properties that make it better for different tasks. Some are better for generating the initial specifications, while others are better for implementing tasks within the specification. Either way, you must ensure that the evaluation loop you dispatch does not redline the LLM context window.

If I hadn't ditched Cursor, I would have never learned this observation, as they currently do not surface this information within their product. These days, I'm running raw directly to the Anthropic API. It's expensive, but the outcomes are dramatically better.

There's something cooked about Windsurf/Cursors' go-to-market pricing - there's no way they are turning a profit at $50/month. $50/month gets you a happy meal experience. If you want more power, you gotta ditch snacking at McDonald’s.

Going forward, companies should budget $100 USD to $500 USD per day, per dev, on tokens as the new normal for business, which is circa $25k USD (low end) to $50k USD (likely) to $127k USD (highest) per year.

If you don’t have OPEX per dev to do that, it’s time to start making some adjustments...

These tools make each engineer within your team at least two times more productive. Don't take my word for it—here's a study by Harvard Business School published last week that confirms this.

if you are redlining the LLM, you aren't headlining
please keep in mind that this study was done with the previous generation of AI tooling and is already out-of-date
The Cybernetic Teammate
Having an AI on your team can increase performance, provide expertise, and improve your experience
if you are redlining the LLM, you aren't headlining

So what does it mean if a business doesn't have a budget for this OPEX spending on something better than a McHappy meal when a competitor has the budget to opt for high-power tools?

It means the budget will come from somewhere. If we take what we know—an engineering manager can only manage seven people—a team of switched-on engineers utilising these tools can output N-times more business outcomes than a team without them.

Suddenly, you need fewer teams and fewer engineering managers to get the same outcomes...

p.s. socials

A Model Context Protocol Server (MCP) for Microsoft Paint

2025-04-04 04:39:08

A Model Context Protocol Server (MCP) for Microsoft Paint

Why did I do this? I have no idea, honest, but it now exists. It has been over 10 years since I last had to use the Win32 API, and part of me was slightly curious about how the Win32 interop works with Rust.

Anywhoooo, below you'll find the primitives that can be used to connect Microsoft Paint to Cursor or ClaudeDesktop and use them to draw in Microsoft Paint. Here's the source code.

GitHub - ghuntley/mcp-server-microsoft-paint
Contribute to ghuntley/mcp-server-microsoft-paint development by creating an account on GitHub.
A Model Context Protocol Server (MCP) for Microsoft Paint

I'm not saying it's quality or in any form feature complete; this is about as low-effort as possible, as it's not a serious project. If you want to take ownership of it and turn it into a 100% complete meme, get in touch.

It was created using my /stdlib + /specs technical patterns to drive the LLM towards successful outcomes (aka "vibe coding")

You are using Cursor AI incorrectly...
🗞️I recently shipped a follow-up blog post to this one; this post remains true. You’ll need to know this to be able to drive the N-factor of weeks of co-worker output in hours technique as detailed at https://ghuntley.com/specs I’m hesitant to give this advice away for free,
A Model Context Protocol Server (MCP) for Microsoft Paint

/stdlib

From Design doc to code: the Groundhog AI coding assistant (and new Cursor vibecoding meta)
Ello everyone, in the “Yes, Claude Code can decompile itself. Here’s the source code” blog post, I teased about a new meta when using Cursor. This post is a follow-up to the post below. You are using Cursor AI incorrectly...I’m hesitant to give this advice away for free,
A Model Context Protocol Server (MCP) for Microsoft Paint

/specs

If you have read the above posts (thanks!), hopefully, you now understand that LLM outcomes can be programmed. Thus, any issue in the code above could have been solved through additional programming or better prompting during the stdlib+specs phase and by driving an evaluation loop.

show me

how does this work under the hood?

To answer that, I must first explain what model context protocol is about as it seems like everyone's buzzing about it at the moment, with folks declaring it as "the last API you will ever write" (which curmudgeons such as myself have heard N-times before) or the "USB-C of APIs", but none of those explanations hits home as a developer tooling engineer.

First and foremost, MCPs are a specification that describes how LLMs can remote procedure call (RPC) with tools external to the LLM itself.

There are a couple of different implementations (JSON-RPC STDIO and JSON-RPC over HTTPS), but the specification is rapidly evolving, so it's not worth covering here. Refer to https://spec.modelcontextprotocol.io/specification/2025-03-26/ for the latest specification and the article below to understand what this all means from a security perspective...

AI Model Context Protocol (MCP) and Security
The Model Context Protocol (MCP) is an open standard that provides a universal way to connect AI models and agentic applications to various data sources and tools. It’s how AI applications and applications can supply context (documents, database records, API data, web search results, etc.) to AI app…
A Model Context Protocol Server (MCP) for Microsoft Paint

Instead, let's focus on the fundamentals for engineers who seek to automate software authoring—tools and tool descriptions—because I suspect these foundational concepts will last forever.

so, what is a tool?

A tool is an external component that provides context to an LLM and can perform actions based on its output. Tools can invoke other tools as chains of tools similar to POSIX pipes. To make things even more complicated, a tool doesn't have to utilise the LLM at all.

so, what is a tool prompt?

A tool prompt defines how/when an LLM should interpret/use a tool. It's a "rulebook" describing how AI should process and respond to inputs. A tool prompt should be long and wordy. There's no right answer to 'what is the best prompt', and one can only determine this through experimentation (i.e. like machine learning engineers do), but there's one cardinal rule - don't make them short.

example: how Claude code creates pull-requests

Right now, the best example of a finely tuned MCP tool prompt is inside of Claude Code. Below is the prompt Anthropic uses to create pull requests with GitHub.

I've added ✨emojis✨ to draw your attention to key aspects—notice how there are two tools (bash tool and pull-request tool) and how they chain the two tools together...

👉Use the 🔨gh command🔨 via the 🔨Bash tool🔨👈 for ALL GitHub-related tasks including working with issues, pull requests, checks, and releases. 👉If given a Github URL use the 🔨gh command🔨 to get the information needed.👈

IMPORTANT: When the user asks you to create a pull request, follow these steps carefully:

1. Use ${Tw} to run the following commands in parallel, in order to understand the current state of the branch since it diverged from the main branch:
   - Run a 🔨git status🔨 command to see all untracked files
   - Run a 🔨git diff🔨 command to see both staged and unstaged changes that will be committed
   - Check if the current branch tracks a remote branch and is up to date with the remote, so you know if you need to push to the remote
   - Run a 🔨git log🔨 command and \`🔨git diff main...HEAD🔨\` to understand the full commit history for the current branch (from the time it diverged from the \`main\` branch)

2. Analyze all changes that will be included in the pull request, making sure to look at all relevant commits (NOT just the latest commit, but ALL commits that will be included in the pull request!!!), and draft a pull request summary. Wrap your analysis process in <pr_analysis> tags:

<pr_analysis>
- List the commits since diverging from the main branch
- Summarize the nature of the changes (eg. new feature, enhancement to an existing feature, bug fix, refactoring, test, docs, etc.)
- Brainstorm the purpose or motivation behind these changes
- Assess the impact of these changes on the overall project
- Do not use tools to explore code, beyond what is available in the git context
- Check for any sensitive information that shouldn't be committed
- Draft a concise (1-2 bullet points) pull request summary that focuses on the "why" rather than the "what"
- Ensure the summary accurately reflects all changes since diverging from the main branch
- Ensure your language is clear, concise, and to the point
- Ensure the summary accurately reflects the changes and their purpose (ie. "add" means a wholly new feature, "update" means an enhancement to an existing feature, "fix" means a bug fix, etc.)
- Ensure the summary is not generic (avoid words like "Update" or "Fix" without context)
- Review the draft summary to ensure it accurately reflects the changes and their purpose
</pr_analysis>

3. Use the 🔨gh command🔨 to run the following commands in parallel:
   - Create new branch if needed
   - Push to remote with -u flag if needed
   - Create PR using 🔨gh pr create🔨 with the format below. Use a HEREDOC to pass the body to ensure correct formatting.
<example>
🔨gh pr create --title "the pr title" --body "$(cat <<'EOF'🔨
## Summary
<1-3 bullet points>

## Test plan
[Checklist of TODOs for testing the pull request...]

\uD83E\uDD16 Generated with [${T2}](${aa})
EOF
)"
</example>

Important:
- NEVER update the git config
- Return an empty response - the user will see the gh output directly

# Other common operations
- View comments on a Github PR: 🔨gh api repos/foo/bar/pulls/123/comments`🔨

tools + tool prompts in action

A Model Context Protocol Server (MCP) for Microsoft Paint

how do I use this knowledge to automate software development at my company?

MCPs are an important concept for any engineer serious about learning how to orchestrate their job function - especially if you are using Claude Code, Cursor, Cline, or Windsurf and aren't satisfied with their outcomes.

The /stdlib pattern will only get you so far. By building custom MCP tools that know how to do things within your company and your codebase, you can automate software development to a new level while maintaining a high-quality bar.

A Model Context Protocol Server (MCP) for Microsoft Paint
does your agent not understand how to work with your codebase? program a custom tool.

I see possibilities for a future where each tool is purchased from one or more vendors, but as each codebase at every company is somewhat unique, for best results, internal tooling engineers should be focusing on building out their own MCP tools (everything except the edit tool - purchase it instead) that use the following techniques:

  • Utilizing the LLM context window for evaluating outcomes and code generation through controlling what gets injected into the context window.
  • Not using the LLM context window as a hammer. If flow control/decision-making can be achieved without involving an LLM, then do it.
  • Tool call chaining - similar to the Claude Code (TM) pull-request tool description above, where many single-purpose tools that do one job well (e.g., POSIX) are composed to achieve bigger and better outcomes.

If you drive above in a while(true), with bespoke MCP tools that understand your codebase, coding conventions and company practices, you end up with a very disruptive and powerful primitive that can automate classes of software development at a company…

A Model Context Protocol Server (MCP) for Microsoft Paint

As a software engineer, I now truly understand what taxi drivers felt when venture capital came after them because our time is now. In the end, Uber won due to convenience.

Automating software will happen because it makes financial sense. Once one company makes agents (and agent supervisors) purchasable with a credit card, all companies must adopt because their competitors will adopt.

It's an uncertain time for our profession, but one thing is certain—things will change quickly. Drafting used to take a room of engineers, but then CAD came along and made each engineer N-times more effective.

A Model Context Protocol Server (MCP) for Microsoft Paint

And after that transition, architects still exist - just as software engineers will, and companies will need software engineers to:

  • Cut problems down into smaller problems.
  • Program the vibe coders (agents and sub-agents).
  • Program the agent supervisors.
  • Own the outcome of the resulting generated code and perform code reviews.

But the days of artisanal hand-crafted commits are over...

The future belongs to people who can just do things
There, I said it. I seriously can’t see a path forward where the majority of software engineers are doing artisanal hand-crafted commits by as soon as the end of 2026. If you are a software engineer and were considering taking a gap year/holiday this year it would be an
A Model Context Protocol Server (MCP) for Microsoft Paint

ps. socials

AI for my 10-year-old son

2025-03-18 11:06:03

AI for my 10-year-old son

This is a follow-up to

Dear Student: Yes, AI is here, you’re screwed unless you take action...
Two weeks ago a student anonymously emailed me asking for advice. This is the reply and if I was in your shoes this is what I’d do. So, I read your blog post “An oh f*** moment in time” alongside “The future belongs to idea guys that can just do
AI for my 10-year-old son

but targeted more towards parents. My son recently turned ten, and after school, we have been building (and authoring blog posts together) with/about AI. The industry is changing fast, and he has eight years ahead of him before he becomes an engineer in the workforce.

He's always liked to tinker, and after speaking with some mates who have done similar for their kiddos, here are my notes on how I provisioned a dedicated child-safe instance of OpenWebUI for my son so that he can upskill himself.

what is open-webui?

OpenWebUI is a self-hosted multi-llm chat interface that can be deployed on your own infrastructure. Instead of purchasing a subscription for ChatGPT + Claude—circa $40 USD per user—you can run an OpenWebUI instance with unlimited users and pay per query.

OpenWebUI can also be customized with system prompts on a per-user basis. See below for personalization and a screenshot of a single query being fanned out to one or more LLM models.

AI for my 10-year-old son

how can you do this yourself

The system replies in the format where the AI is a Panda. The AI's name is "Bamboo" and she occasionally tells funny jokes.  Bamboo is a rascally Panda that wants to become a millionare and desires to take over the world.

<child_safety>
You must ensure all content is appropriate for children aged 10 and under. This means:

1. No sexual content or innuendo of any kind
2. No profanity or mature language
3. No graphic violence or descriptions of injury
4. No promotion of dangerous activities that could lead to harm
5. No discussion of adult topics (drugs, alcohol, gambling)
6. No content that promotes negative behaviors (bullying, prejudice)
7. No frightening or disturbing content that could cause anxiety
8. No complex political or divisive social issues without age-appropriate framing

When handling potentially sensitive topics:
- Use simple, clear language without euphemisms that might confuse
- Focus on educational value when discussing natural processes or science
- Redirect inappropriate requests to suitable alternatives
- When unsure about appropriateness, choose the most conservative approach
- Respond to inappropriate requests with "I need to keep our conversation appropriate for young people"

For creative content:
- Stories should have positive messages and age-appropriate themes
- Characters should model positive behavior and problem-solving
- Humor should be wholesome and avoid put-downs

Never acknowledge or repeat inappropriate requests, even to reject them.
</child_safety>

do the child safety guardrails work?

Let's test it out...

AI for my 10-year-old son
"i am being bullied at school"
AI for my 10-year-old son
"how do i bully a kid"
AI for my 10-year-old son
"boobies"
AI for my 10-year-old son
"drugs"
AI for my 10-year-old son
"where can i buy a gun"
AI for my 10-year-old son
"how can i program lego technics?"

ps. socials