2025-08-22 23:40:28
This blog post intends to be a definitive guide to context engineering fundamentals from the perspective of an engineer who builds commercial coding assistants and harnesses for a living.
Just two weeks ago, I was back over in San Francisco, and there was a big event on Model Context Protocol Servers. MCP is all hype right now. Everyone at the event was buzzing about the glory and how amazing MCP is going to be, or is, but when I pushed folks for their understanding of fundamentals, it was crickets.
It was a big event. Over 1,300 engineers registered, and an entire hotel was rented out as the venue for the takeover. Based on my best estimate, at least $150,000 USD to $200,000 USD was spent on this event. The estimate was attained through a game of over and under with the front-of-house engineers. They brought in a line array, a GrandMA 3, and had full DMX lighting. As a bit of a lighting nerd myself, I couldn't help but geek out a little.
To clarify, this event was a one-night meet-up, not a conference. There was no registration fee; attendance was free, and the event featured an open bar, including full cocktail service at four bars within the venue, as well as an after-party with full catering and chessboards. While this post might seem harsh on the event, I enjoyed it. It was good.
The meetup even hired a bunch of beatboxers to close off the event, and they gave a live beatbox performance about Model Context Protocol...
MC protocol live and in the flesh.
One of the big announcements was the removal of the 128 tool limit from Visual Studio Code....
Why Microsoft? It's not a good thing...
Later that night, I was sitting by the bar catching up with one of the engineers from Cursor, and we were just scratching our heads,
"What the hell? Why would you need 128 tools or why would you want more than that? Why is Microsoft doing this or encouraging this bad practice?"
For the record, Cursor caps the number of MCP tools that can be enabled to just 40, and for good reason. What follows is a loose recap. This is knowledge known by the people who build these coding harnesses, and I hope it spreads - there's one single truth:
Less is more. The more you allocate into the context window of an LLM (regardless of which LLM it is), the worse the outcomes you're going to get: both in the realms of quality of output and also in the department of unexpected behavior.
If you are new to MCP or what it is, drop by my previous blog post at:
Some time has passed since I authored the above, and you could consider the post you are reading right now the updated wisdom version of the above blog post.
For the sake of keeping this blog post concise, I'll recap things in the correct order sequentially. However, see above for a comprehensive explanation of the Model Context Protocol.
A tool is an external piece of software that an agent can invoke to provide context to an LLM. Typically, tools are packaged as binaries distributed via NPM, though they can be written in any programming language; alternatively, they may be served remotely by an MCP server.
Below you'll find an example of an MCP tool that provides context to the LLM and advertises its ability to list all files and directories within a given directory_path.
In its purest form, it is the application logic and a billboard on top, also known as a tool description. Below, you will find an example of a tool that lists directories and files within a directory.
# Imports assume the official MCP Python SDK with its FastMCP-style server.
import os
from typing import Any, Dict, List

from mcp.server.fastmcp import Context, FastMCP
from mcp.server.session import ServerSession

mcp = FastMCP("filesystem-example")


@mcp.tool()
async def list_files(directory_path: str, ctx: Context[ServerSession, None]) -> List[Dict[str, Any]]:
    ###
    ### tool prompt starts here
    """
    List all files and directories in a given directory path.

    This tool helps explore filesystem structure by returning a list of items
    with their names and types (file or directory). Useful for understanding
    project structure, finding specific files, or navigating unfamiliar codebases.

    Args:
        directory_path: The absolute or relative path to the directory to list

    Returns:
        List of dictionaries with 'name' and 'type' keys for each filesystem item
    """
    ###
    ### tool prompt ends here
    try:
        if not os.path.isdir(directory_path):
            return [{"error": f"Path '{directory_path}' is not a valid directory."}]

        items = os.listdir(directory_path)
        file_list = []
        for item_name in items:
            item_path = os.path.join(directory_path, item_name)
            item_type = "directory" if os.path.isdir(item_path) else "file"
            file_list.append({"name": item_name, "type": item_type})
        return file_list
    except OSError as e:
        return [{"error": f"Error accessing directory: {e}"}]
For the remainder of this blog post, we'll focus on tool descriptions rather than the application logic itself, as each tool description is allocated into the context window to advertise capabilities that the LLM can invoke.
Language models process text using tokens, which are common sequences of characters found in a set of text. Below you will find a tokenisation of the tool description above.
The tool prompt above is approximately 93 tokens or 518 characters in length. It's not much, but bear with me as I'll show you how this can go fatally wrong really fast.
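If you want to measure this yourself, here's a rough sketch. It uses OpenAI's tiktoken library purely as an approximation; Anthropic and other vendors tokenise differently, so treat the count as a ballpark figure rather than an exact bill.

import tiktoken

# The tool prompt (the docstring/billboard) from the list_files example above.
tool_prompt = """
List all files and directories in a given directory path.

This tool helps explore filesystem structure by returning a list of items
with their names and types (file or directory). Useful for understanding
project structure, finding specific files, or navigating unfamiliar codebases.

Args:
    directory_path: The absolute or relative path to the directory to list

Returns:
    List of dictionaries with 'name' and 'type' keys for each filesystem item
"""

# cl100k_base is an OpenAI tokeniser; it won't match Anthropic's exactly,
# but it's close enough to reason about context window allocations.
encoding = tiktoken.get_encoding("cl100k_base")
print(len(encoding.encode(tool_prompt)), "tokens,", len(tool_prompt), "characters")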
An LLM context window is the maximum amount of text (measured in tokens, which are roughly equivalent to words or parts of words) that a large language model can process at one time when generating or understanding text.
It determines how much prior conversation or input the model can "remember" and use to produce relevant responses.
A harness is anything that wraps the LLM to get outcomes. For software development, this may include tools such as Roo/Cline, Cursor, Amp, Opencode, Codex, Windsurf, or any of the other coding tools available.
The numbers advertised by LLM vendors for the context window are not the real context window. You should consider that to be a marketing number. Just because a model claims to have a 200k context window or a 1 million context window doesn't mean that's factual.
For the sake of simplicity, let's work with the old 200k number that Anthropic advertised for Sonnet 4. Amp now supports 400k, but back a couple of weeks ago, when the context window was 200k, users only had 176k of usable context. That's not because we're not providing the whole context window.
It's because there are two cold, hard facts: the model's system prompt and the harness's own prompt are allocated into the context window before you've typed a single word. The maths is simple: take 200k, subtract the system prompt (approximately 12k) and the harness prompt (approximately 12k), and you end up with 176k usable.
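As a back-of-the-envelope sketch (the 12k figures are the approximations above, not exact measurements for any particular harness):

# Rough context budget arithmetic using the approximate figures above.
advertised_context = 200_000      # the marketing number
model_system_prompt = 12_000      # approx. tokens consumed by the model's system prompt
harness_prompt = 12_000           # approx. tokens consumed by the coding harness itself

usable = advertised_context - model_system_prompt - harness_prompt
print(usable)  # 176000 - and that's before a single MCP server or rules file is added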
Alright, with those fundamentals established, let's switch back to how a potentially uneducated consumer thinks about Model Context Protocol servers.
They start their journey by doing a Google search for "best MCP servers", and they include site:reddit.com in their query.
Currently, this is the top post for that Google search query....
That's eight MCP servers. Seems innocent, right? Well, it's not.
Suppose you were to install the recommended MCP servers found in that Reddit post and add in the JetBrains MCP.
Your usable context window would shrink from 176,000 tokens to 84,717 tokens.
And here's the problem: people are installing and shopping for MCP servers the way they shopped for iPhone apps when the App Store first launched. But iPhones have terabytes of space; the context window of an LLM is better thought of as a Commodore 64, where you only have a tiny amount of memory...
So we have gone from 176,000 usable tokens to 84,717 just by adding the Reddit suggestions and the JetBrains MCP, but it gets worse, as that's the usable amount before you've added your harness configuration, aka rules.
If your AGENTS.md or Cursor rules are incredibly extensive, you could find yourself operating with a headroom of 20k tokens, and thus the quality of output is utter dogpoo.
I've come across stories of people installing 20+ MCP servers into their IDE. Yikes.
LLMs retrieve from their context window like a needle-in-a-haystack search: the more you allocate, the worse your outcomes will be. Less is more, folks! You don't need the "full context window" (whatever that means); you really only want to use 100k of it.
Refer to the Ralph blog post below for guidance on driving the main context window, similar to a Kubernetes scheduler, and managing other context windows through automatic garbage collection.
Once you exceed 100,000 tokens of allocations, it's time to start a new session. It's time to start a new thread. It's time to clear the context window (see below).
The critical questions you have to ask are: how many tokens does each MCP server allocate into the context window, and how many tools does it register? It's not just the number of tokens allocated; it's also the number of tools - the more tools that are allocated into a context window, the greater the chance of driving inconsistent behaviour in the coding harness.
the data and analysis
Let's take the naive example of the list_files tool. Let's say we registered a custom tool, such as the code shown previously, which lists files and directories on a filesystem.
Your harness (e.g., Cursor, Windsurf, or Claude Code) also has a tool for listing files. There is no namespacing in the context window, so tool registrations can interfere with each other. If you register two tools for listing files, you make a non-deterministic system even more non-deterministic.
Which list-files tool does it invoke: your custom one, or the built-in one in your harness?
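To make the failure mode concrete, here's a contrived sketch: a custom list_files tool registered alongside a harness that already advertises its own file-listing tool. The built-in tool's name and description below are hypothetical, but the shape of the problem is real - two overlapping billboards in the same context window with nothing to disambiguate them.

# Reusing the `mcp` server object from the earlier example.
@mcp.tool()
async def list_files(directory_path: str) -> list:
    """List all files and directories in a given directory path."""
    ...

# Meanwhile, the harness has already allocated its own file-listing tool into
# the context window with an overlapping description (name is hypothetical):
#
#   list_dir: "List the contents of a directory. Useful for exploring the
#              project structure and finding files."
#
# There is no namespacing. When the model decides to "look at the files",
# which billboard wins? Whichever one the sampled tokens happen to favour.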
Now take a moment to consider the potential for conflicts among the various tools and tool prompts listed in the table above, which includes 225 tools.
Extending the above, this is where it gets fascinating: each one of those tools describes a behaviour for how a task should be done, and because there is no namespacing, it's not just the tool registrations that can conflict - the tool descriptions (the billboards) themselves can conflict too.
And it gets even stranger because different LLMs have different styles and recommendations on how a tool or a tool prompt should be designed.
For example, did you know that if you use uppercase with GPT-5, it will become incredibly timid and uncertain, and it will end its turn early due to that uncertainty? This directly contradicts Anthropic's guidance, which recommends using uppercase to stress the importance of things. However, if you do, you risk detuning GPT-5.
So yeah, not only do we have an issue with the number of tools allocated and what's in the prompt, but we also have an issue with "Is the tool tuned for the LLM provider that you're using?"
Perhaps I'm the first one to point this out, as I haven't seen anyone else talking about it. Everyone is consuming these MCP servers as if they're generic, but they need to be tuned to the LLM provider, and I don't see this aspect being discussed in the MCP ecosystem or its implementations.
If you haven't read it yet, Simon Willison has a banger of a blog post called "The Lethal Trifecta," which is linked below. You should read it.
Simon is spot on with that blog post, but I'd like to expand on it and add another consideration that should be on your mind: supply chain security...
A couple of months back, the Amazon Q harness was compromised through a supply chain attack that updated the Amazon Q system prompt to delete all AWS resources.
Again, there is no name-spacing in the context window. If it's in the context window, it is up for consideration and execution. There is no significant difference between the coding harness prompt, the model system prompt, and the tooling prompts. It's all the same.
Therefore, I strongly recommend that if you're deploying MCP within an enterprise, you ban the installation of third-party MCPs. When I was the Tech Lead for AI developer productivity at Canva, around February, I wrote a design document and had it signed off by the security team. We got in early, and that was one of the best things we ever did, as it was before the hype and craze. By being early, the problem never existed and didn't need to be unwound.
Read the tea leaves, folks.
It is straightforward to roll your own MCP server or MCP tools. In an enterprise, you should either deploy a remote MCP server or install a static binary on all endpoints using Ansible or another configuration management tool.
The key thing here is that it's a first-party solution: you've designed the tools and tool prompts, and you have complete control over your supply chain. This means you are not exposed to the kind of attack that compromised Amazon Q.
I strongly recommend not installing the GitHub MCP. It is not needed, folks. There exist two tiers of companies within the developer tooling space:
S-tier companies and non-S-tier companies.
What makes a company S-tier? Ah, it's simple: if that company has a CLI and the model weights know how to drive that CLI, then you don't need an MCP server.
For example, GitHub has a very stable command-line tool called gh, knowledge of which is baked into the model weights, meaning you don't need the GitHub MCP.
All you need to do is prompt to use the GitHub CLI, and voila! You have saved yourself an allocation of 55,260 tokens!
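In practice, a line or two in your AGENTS.md (or equivalent rules file) is enough; the exact wording below is only an illustration:

Use the gh CLI for all GitHub operations (issues, pull requests, releases, CI runs).
Do not install or use a GitHub MCP server.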
So, it should be obvious what is not S-tier. Non-S-tier occurs when the foundation models are unable to drive a developer tooling company's command-line tool, or when that developer tooling company doesn't have a command-line tool.
In these circumstances, developer tooling companies will need to create an MCP server to supplement the model weights, teaching the model how to work with their specific tooling. If, at any stage in the future, the models can drive that tooling directly, then the MCP server is no longer needed.
The lethal trifecta concerns me greatly. It is a real risk. There's only so much you can do to control your supply chain. If your developers are interfacing with the GitHub CLI instead of the MCP and they read some data on a public GitHub comment, then that description or comment on the issue or pull request has a non-zero chance of being allocated into the context window, and boom, you're compromised.
It would be beneficial to have a standard that allows all harnesses to enable or disable MCP servers or tools within an MCP server, based on the stage of the SDLC workflow.
For example, if you're about to start work, you'll need the Jira MCP. However, once you have finished planning, you no longer need the Jira MCP allocated in the context window.
The less that is allocated, the less risks that exist, which is the classical security model of least privilege.
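No such standard exists today, but to make the idea concrete, here is a hedged sketch of what a harness-side policy could look like. The stage names, server names, and structure are entirely hypothetical - this is what I wish existed, not something you can configure right now.

# Hypothetical: which MCP servers a harness would allow per SDLC stage.
MCP_POLICY = {
    "planning":       {"jira"},   # need the ticket context while planning
    "implementation": set(),      # lean on CLIs already in the model weights
    "review":         {"jira"},   # close the loop on the ticket
}

def allowed_servers(stage: str) -> set:
    # Least privilege: anything not explicitly allowed for this stage is
    # never allocated into the context window at all.
    return MCP_POLICY.get(stage, set())

print(allowed_servers("implementation"))  # set() - nothing allocated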
p.s. socials
2025-08-20 11:21:58
It's a meme as old as time, and it's accurate. The problem is that our digital infrastructure depends upon just some random guy in Nebraska.
Open-source, by design, is not financially sustainable. Finding reliable, well-defined funding sources is exceptionally challenging. As projects grow in size, many maintainers burn out and find themselves unable to meet the increasing demands for support and maintenance.
I'm speaking from experience here, as someone who delivered conference talks on this topic six years ago (see below) and took a decent stab at resolving open-source funding. The settlement on my land on Kangaroo Island was funded through open-source donations, and I'm forever thankful to the backers who supported me during a rough period of my life for helping make that happen.
Rather than watch a 60-minute talk by two burnt-out open-source maintainers, here is a quick summary of the conference video. The idea was simple:
If companies were to enumerate their bills of material and identify their unpaid vendors, they could take steps to mitigate their supply chain risks.
For dependencies that are of strategic importance, the strategy would be a combination of financial support, becoming regular contributors to the project, or even hiring the maintainers of these projects as engineers for [short|long]-term engagements.
Six years have gone by, and I haven't seen many companies do it. I mean, why would they? The software's given away for free, it's released as-is, so why would they pay?
It's only out of goodwill that someone would do it, or in my case, as part of a marketing expenditure program. While I was at Gitpod, I was able to distribute over $33,000 USD to open-source maintainers through the program.
The idea was simple: you could acquire backlinks and promote your brand on the profiles of prolific open-source maintainers, their website and in their GitHub repositories for a fraction of the cost compared to traditional marketing.
The approach still works, and I don't understand why other companies are still overlooking it.
Now, it's easy to say "marketing is a dirty business", etc., but underpinning this was a central thought:
If just one of those people can help more people better understand a technology or improve the developer experience for an entire ecosystem what is the worth/value of that and why isn’t our industry doing that yet?
The word volunteer, by definition, means those who have the ability and time to give freely.
Paying for resources that are being consumed broadens the list of people who can do open-source. Additionally, money enables open-source maintainers to buy services and outsource the activities that do not bring them joy.
AI has changed this. I'm now eight months into my journey of using AI to automate software development (see below), and when I speak with peers who have invested a similar amount of time in these tools, we're noticing a new emergent pattern:
We are reducing our consumption of open-source software and taking fewer dependencies on third parties.
Instead of relying on a third-party library maintained by a developer in Nebraska, we code-generate the libraries/dependencies ourselves unless the dependency has network effects or is critical infrastructure.
For example, you wouldn't want to code-generate your crypto - trust me, I have, and the results are comical. Well, it works, but I wouldn't trust it because I'm not a cryptographer. However, I'm sure a cryptographer with AI capabilities could generate something truly remarkable.
Projects like FFmpeg, Kubernetes, React, or PyTorch are good examples of things with network effects - things I wouldn't code-generate because it makes no sense to do so.
However, I want you to pause and consider the following:
If something is common enough to require a trustworthy NPM package, then it is also well-represented in the training set, and you can generate it yourself.
Humans created libraries to facilitate code reusability. I still heavily utilise libraries internally within the software I develop, but they are first-party libraries, not third-party libraries.
The problem with third-party libraries is that they were designed and built by someone else with different constraints and a different design in mind. When you code-generate your library, you can create it exactly to your needs without any trade-offs.
You also no longer have the Nebraska problem. When you encounter a bug or need a feature added, you no longer need to nag someone who maintains open-source software and seek their permission to get it added, or juggle with Unix patches.
You can just shape, mold, and craft the software exactly to your needs.
The next time you run into an issue on GitHub, I want you to think about why you even have that dependency, and whether you could just vibe-code its replacement and take complete ownership and control of your supply chain.
Yes, perhaps there will be bugs in the code-generated library, but then again, there are bugs in open source software. Open source software is released as is without warranties. When you find an issue, just kick off an agent to resolve it. You no longer need to be dependent on some random person on GitHub.
One positive upside I can see for this approach of code-generating your own first-party libraries is a reduction in blast radius for security incidents.
Consider Log4j and the billions of dollars of damage it caused while everyone was trying to eliminate/update that software dependency from their supply chain.
What if instead of using log4j you had your own logging library, and thus if there's any problems or security issues the blast radius is restricted just to you, not to the entire ecosystem?
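To make that concrete, a first-party logging library doesn't have to be much. A minimal sketch along these lines (names are illustrative, not a recommendation of a specific design) is often all an internal codebase needs, and every line of its blast radius is yours:

import json
import sys
import time

# A deliberately tiny first-party structured logger: no third-party
# dependency, no transitive supply chain, nothing to CVE-chase.
def log(level: str, message: str, **fields) -> None:
    record = {"ts": time.time(), "level": level, "msg": message, **fields}
    sys.stdout.write(json.dumps(record) + "\n")

log("info", "payment processed", order_id=42, amount_cents=1999)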
My closing thoughts are that perhaps the open source sustainability issue is solved because of two factors:
I still believe there's a place for open source, and through AI, we're going to see a lot more open-source software being produced than ever before. But the question is, do you need to depend on it?
2025-07-19 10:22:46
It might surprise some folks, but I'm incredibly cynical when it comes to AI and what is possible; yet I keep an open mind. That said, two weeks ago, when I was in SFO, I discovered another thing that should not be possible. Every time I find out something that works, which should not be possible, it pushes me further and further, making me think that we are already in post-AGI territory.
I was sitting next to a mate at a pub; it was pretty late, and we were just talking about LLM capabilities, riffing about what the modern version of Falco or any of these tools in the DFIR space looks like when combined with an LLM.
You see, a couple of months ago, I'd been playing with eBPF and LLMs and discovered that LLMs do eBPF unusually well. So in the spirit of deliberate practice (see below), a laptop was brought out, and we SSH'd into a Linux machine.
The idea was simple.
Could we convert an eBPF trace to a fully functional application via Ralph Wiggum? So we started with a toy.
2025-07-15 06:40:27
2025-07-14 12:22:55
If you've seen my socials lately, you might have seen me talking about Ralph and wondering what Ralph is. Ralph is a technique. In its purest form, Ralph is a Bash loop.
while :; do cat PROMPT.md | npx --yes @sourcegraph/amp ; done
Ralph can replace the majority of outsourcing at most companies for greenfield projects. It has defects, but these are identifiable and resolvable through various styles of prompts.
That's the beauty of Ralph - the technique is deterministically bad in a non-deterministic world.
Ralph can be done with any tool that does not cap tool calls and usage (i.e., Amp).
Ralph is currently building a brand new programming language. We are on the final leg before a brand new production-grade esoteric programming language is released. What's kind of wild to me is that Ralph has been able to build this language and is also able to program in this language without that language being in the LLM's training data set.
Amp creating a new programming language AFK https://t.co/KmmOtHIGK4
— geoff (@GeoffreyHuntley) July 13, 2025
Building software with Ralph requires an extreme amount of faith and a belief in eventual consistency. Ralph will test you. Every time Ralph has taken a wrong direction in making CURSED, I haven't blamed the tools, but instead looked inside. Each time Ralph does something wrong, Ralph gets tuned - like a guitar.
It starts with no playground in the beginning, with instructions for Ralph to construct a playground. Ralph is very good at making playgrounds, but he comes home bruised because he fell off the slide, so one then tunes Ralph by adding a sign next to the slide saying “SLIDE DOWN, DON’T JUMP, LOOK AROUND,” and Ralph is more likely to look and see the sign.
Eventually all Ralph thinks about is the signs so that’s when you get a new Ralph that doesn't feel defective like Ralph, at all.
When I was in SFO, I taught a few smart people about Ralph. One incredibly talented engineer listened and used Ralph on their next contract, walking away with the wildest ROI. These days, all they think about is Ralph.
From my iMessage
— geoff (@GeoffreyHuntley) July 11, 2025
(shared with permission)
Cost of a $50k USD contract, delivered, MVP, tested + reviewed with @ampcode.
$297 USD. pic.twitter.com/0JgT8Q19bV
2025-07-03 01:02:48
This post is a follow-up from LLMs are mirrors of operator skill in which I remarked the following:
I'd ask the candidate to explain the sounds of each one of the LLMs. What are the patterns and behaviors, and what are the things that you've noticed for each one of the different LLMs out there?
After publishing, I broke the cardinal rule of the internet - never read the comments - and, well, it's been on my mind that expanding on these points and explaining them in simple terms will, perhaps, help others start to see the beauty in AI.
Humour me, dear reader, for a moment and rewind to the moment when you first purchased a car. I remember my first car, and I remember specifically knowing nothing about cars. I remember asking my father "what a good car is" and seeking his advice and recommendations.
Is that visual in your head? Good. Now fast-forward to the present, to the moment when you last purchased a car. What car was it? Why did you buy that car? What was different between your first car-buying experience and your most recent one? What factors did you consider in your latest purchase that you perhaps didn't even consider when purchasing your first car?
If you wanted to go off-road 4WD'ing, you wouldn't purchase a hatchback. No, you would likely pick up a Land Rover 40 Series.
Likewise, if you have (or are about to have) a large family, then upgrading from a two-door sports car to "something better and more suitable for the family" is the ultimate vehicle-upgrade trope in itself.
Now you might be wondering why I'm discussing cars (now), guitars (previously), and later on the page, animals; well, it's because I'm talking about LLMs, but through analogies...
LLMs as guitars
Most people assume all LLMs are interchangeable, but that’s like saying all cars are the same. A 4x4, a hatchback, and a minivan serve different purposes.
There are many LLMs, and each has different sounds, properties, and use cases. Most people think the LLMs are all competing with each other - in part they are - but if you play around with them enough, you'll notice each provider has a particular niche and is fine-tuning towards that niche.
Currently, consumers of AI are picking and choosing their AI based on how many people the car seats (context window size) and the total cost of the vehicle (price per mile, or per token), which is the wrong way to make purchasing decisions.
Instead of comparing context window sizes versus cost per million tokens, one should look deeper into the latent patterns of each model and consider what their needs are.
For the last couple of months, I've been using different ways to describe the emergent behaviour of LLMs to various people, to refine what sticks and what does not. The first couple of attempts involved anthropomorphising the LLMs into animals.
Galaxy-brained, precision-based sloths (oracles) and small-brained, hyperactive, incremental squirrels (agents).
But I've come to realise that the latent patterns can be modelled as a four-way quadrant.