Irrational Exuberance

By Will Larson, CTO at Carta, who writes about software engineering and has authored several books, including "An Elegant Puzzle."

Building internal agents

2026-01-02 01:00:00

A few weeks ago, in Facilitating AI adoption at Imprint, I mentioned the internal agent workflows we are developing. This is not the core of Imprint (our core is powering co-branded credit card programs), and I wanted to document how a company like ours is developing these internal capabilities.

Building on that post’s ideas, like a company-public library of the prompts powering internal workflows, I wanted to write up some of the interesting problems and approaches we’ve worked through as we’ve evolved our workflows, split into a series of shorter posts:

  1. Skill support
  2. Progressive disclosure and large files
  3. Context window compaction
  4. Evals to validate workflows
  5. Logging and debuggability
  6. Subagents
  7. Code-driven vs LLM-driven workflows
  8. Triggers
  9. Iterative prompt and skill refinement

In the same spirit as the original post, I’m not writing these as an industry expert unveiling best practices; rather, these are just the things that we’ve specifically learned along the way. If you’re developing internal frameworks as well, then hopefully you’ll find something interesting in these posts.

Building your intuition for agents

As more folks have read these notes, a recurring response has been, “How do I learn this stuff?” Although I haven’t spent time evaluating whether this is the best way to learn, I can share what I have found effective:

  1. Read a general primer on how Large Language Models work, such as AI Engineering by Chip Huyen. A brief tutorial helps too; you don’t need the ability to create an LLM yourself, just a mental model of what they’re capable of
  2. Build a script that uses a basic LLM API to respond to a prompt
  3. Extend that script to support tool calling for some basic tools like searching files in a local git repository (or whatever); a sketch of steps 2 and 3 follows this list
  4. Implement a tool_search tool along the lines of Anthropic Claude’s tool_search, which uses a separate context window to evaluate your current context window against available skills and return only the relevant skills to be used within your primary context window
  5. Implement a virtual file system, such that tools can operate on references to files that are not within the context window. Also add a series of tools to operate on that virtual file system like load_file, grep_file, or whatnot
  6. Support Agent Skills, particularly a load_skills tool and enhancing the prompt with available skills
  7. Write a post-workflow eval that runs automatically after each workflow and evaluates the quality of the workflow run
  8. Add context-window compaction support to keep context windows below a defined size. Make sure that some of your tool responses are large enough to threaten your context window’s limit, so that you’re forced to solve that problem
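To make steps 2 and 3 concrete, here is a minimal sketch using the Anthropic Python SDK (any model provider’s API works similarly); the tool name, schema, model string, and prompt are all illustrative, and actually executing the tool is left out:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "search_repo",
    "description": "Search files in a local git repository for a pattern.",
    "input_schema": {
        "type": "object",
        "properties": {"pattern": {"type": "string"}},
        "required": ["pattern"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute whatever model you have access to
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where do we configure webhook retries?"}],
)

# tool_use blocks are the model asking you to run a tool and send back a
# tool_result message; looping on that exchange is the core of every later step.
for block in response.content:
    if block.type == "tool_use":
        print(f"Model wants {block.name} called with {block.input}")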

After working through the implementation of each of these features, I think you will have a strong foundation in how to build and extend these kinds of systems. The only missing piece is supporting code-driven agents, but unfortunately I think it’s hard to demonstrate the need for code-driven agents in simple examples, because LLM-driven agents are sufficiently capable to solve most contrived examples.

Why didn’t you just use X?

There are many existing agent frameworks, including the OpenAI Agents SDK and Anthropic’s Claude Agent SDK. Ultimately, I think these are fairly thin wrappers, and that you’ll learn a lot more by implementing the equivalent yourself, but I’m less confident that you’re better off long-term building your own framework.

My general recommendation would be to build your own to throw away, and then try to build on top of one of the existing frameworks if you find any meaningful limitations. That said, I really don’t regret the decision to build our own, because it’s just so simple from a code perspective.

Final thoughts

I think every company should be doing this work internally, very much including companies that aren’t doing any sort of direct AI work in their product. It’s very fun work to do, there’s a lot of room for improvement, and having an engineer or two working on this is a relatively cheap option to derisk things if AI-enhanced techniques continue to improve as rapidly in 2026 as they did in 2025.

Building an internal agent: Iterative prompt and skill refinement

2026-01-02 00:30:00

Some of our internal workflows are being used quite frequently, and usage reveals gaps in the current prompts, skills, and tools. Here is how we’re working to iterate on these internal workflows.

This is part of the Building an internal agent series.

Why does iterative refinement matter?

When companies push on AI-led automation, specifically meaning LLM agent-driven automation, there are two major goals. First is the short-term goal of increasing productivity. That’s a good goal. Second, and I think even more important, is the long-term goal of helping their employees build a healthy intuition for how to use various kinds of agents to accomplish complex tasks.

If we see truly remarkable automation benefits from the LLM wave of technology, they’re not going to come from the first wave of specific tools we build, but from the output of a new class of LLM-informed users and developers. There is nowhere you can simply acquire that talent; instead, it’s talent that you have to develop in-house, and involving more folks in the iterative refinement of LLM-driven systems is the most effective approach that I’ve encountered.

How are we enabling iterative refinement?

We’ve taken a handful of different approaches here, all of which are currently in use. From earliest to latest, our approaches have been:

  1. Being responsive to feedback is our primary mechanism for solving issues. This means both responding quickly in an internal #ai channel and skimming through workflows each day to see how humans are interacting, for better and for worse, with the agents. This is the most valuable ongoing source of improvement.

  2. Owner-led refinement has been our intended primary mechanism, although in practice it’s more of a secondary mechanism. We store our prompts in Notion documents, where they can be edited by their owners in real time. Permissions vary on a per-document basis, but most prompts are editable by anyone at the company, as we try to facilitate rapid learning.

    Editable prompts alone aren’t enough; these prompts also need to be discoverable. To address that, whenever an action is driven by a workflow, we include a link to the prompt. For example, a Slack message sent by a chatbot will include a link to the prompt, as will a comment posted to Jira.

  3. Claude-enhanced, owner-led refinement, using the Datadog MCP to pull logs into the repository where the skills live, has been fairly effective, although mostly as a technique used by the AI Engineering team rather than directly by owners. Skills are a bit of a platform, as they are used by many different workflows, so it may be inevitable that they are maintained by a central team rather than by workflow owners.

  4. Dashboard tracking shows how often each workflow runs and the errors associated with those runs. We also track how often each tool is used, including how frequently each skill is loaded.

My guess is that we will continue to add more refinement techniques as we go, without being able to get rid of any of the existing ones. This is sort of disappointing–I’d love to have the same result with fewer–but I think we’d be worse off if we cut any of them.

Next steps

What we don’t do yet, but what is the necessary next step to making this truly useful, is including a subjective post-workflow eval that determines whether the workflow was effective. While we have evals to evaluate workflows, this would use evals to evaluate individual workflow runs, which would provide a very useful level of detail.

How it’s going

In our experience thus far, there are roughly three workflow archetypes: chatbots, very well-understood iterative workflows (e.g. applying the :merged: reacji to merged PRs, as discussed in the code-driven workflows post), and not-yet-well-understood workflows.

Once we’ve built a code-driven workflow, it has always worked well for us, because by that point we’ve built a very focused, well-understood solution. Conversely, chatbots are an extremely broad, amorphous problem space, and I think post-run evals will provide a high-quality dataset to improve them iteratively, with a small amount of human-in-the-loop effort to nudge the evolution of their prompts and skills.

The open question, for us anyway, is how we do a better job of identifying and iterating on the not-yet-well-understood workflows, ideally without requiring a product engineer to understand and implement each of them individually. We’ve not scalably cracked this one yet, and I do think scalably cracking it is the key to whether these internal agents end up merely somewhat useful (frequently performed tasks performed by many people eventually get automated) or truly transformative (a significant percentage of tasks, even infrequent ones performed by a small number of people, get automated).

Building an internal agent: Subagent support

2026-01-01 01:45:00

Most of the extensions to our internal agent have been the direct result of running into a problem that I couldn’t elegantly solve within our current framework. Evals, compaction, and large-file handling all fit into that category. Subagents, allowing an agent to initiate other agents, are in a different category: I’ve frequently thought that we needed subagents, and then always found an alternative that felt more natural.

Eventually, I decided to implement them anyway, because it seemed like an interesting problem to reason through. Eventually I would need them… right? (Aside: I did, indeed, eventually use subagents to support code-driven workflows invoking LLMs.)

This is part of the Building an internal agent series.

Why subagents matter

“Subagents” is the name for allowing your agents to invoke other agents, each with its own system prompt, available tools, and context window. Some of the reasons you’re likely to consider subagents:

  1. They provide an effective strategy for context window management. You could give subagents access to uploaded files, and then ask them to extract specific data from those files, without polluting your primary agent’s context window with the files’ content
  2. You could use subagents to support concurrent work. For example, you could allow invocation of multiple subagents at once, and then join on the completion of all of them (a minimal sketch follows this list). If your agent workflows are predominantly constrained by network IO (e.g. to model evaluation APIs), then this could significantly reduce the wall-clock time to complete your workflows
  3. I think you could convince yourself that there are some security benefits to performing certain operations in subagents with less access. I don’t actually believe that’s meaningfully better, but you could at least introduce friction by ensuring that retrieving external resources and accessing internal resources can only occur in mutually isolated subagents
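To make the second reason concrete, here is a minimal sketch of fanning work out to subagents concurrently; run_subagent is a hypothetical async helper that invokes one subagent and returns its final response:

import asyncio

async def summarize_documents(run_subagent, files):
    # Each subagent gets its own context window and a single file to work on,
    # and all of them run concurrently since the work is network-bound.
    tasks = [
        run_subagent("summarizer", prompt="Summarize this file.", files=[f])
        for f in files
    ]
    # Join on the completion of all subagents before the parent workflow continues.
    return await asyncio.gather(*tasks)

The parent workflow then resumes with all of the results in hand, paying roughly one subagent’s latency rather than the sum of them.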

Of all these reasons, I think that either the first or the second will be most relevant to the majority of internal workflow developers.

How we implemented subagents

Our implementation for subagents is quite straightforward:

  1. We define subagents in subagents/*.yaml, where each subagent has a prompt, allowed tools (or the option to inherit all tools from the parent agent), and a subset of the configurable fields from our agent configuration
  2. Each agent is configured to allow specific subagents, e.g. the planning subagent
  3. Agents invoke subagents via the subagent(agent_name, prompt, files) tool, which lets them decide which virtual files are accessible within the subagent, as well as the user prompt passed to it (the subagent already has a default system prompt within its configuration); a sketch of this dispatch follows the list
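Here is a minimal sketch of that dispatch; the YAML field names and run_agent_loop are hypothetical stand-ins for our actual harness:

import yaml

def load_subagent_config(agent_name):
    # Each subagent is defined in subagents/<name>.yaml with a system prompt
    # and either an allowed-tools list or a flag to inherit the parent's tools.
    with open(f"subagents/{agent_name}.yaml") as f:
        return yaml.safe_load(f)

def subagent_tool(agent_name, user_prompt, files, parent_tools, run_agent_loop):
    config = load_subagent_config(agent_name)
    if config.get("inherit_tools"):
        tools = parent_tools
    else:
        tools = {name: parent_tools[name] for name in config.get("tools", [])}
    # The subagent runs in a fresh context window: its configured system prompt,
    # the caller-supplied user prompt, and only the virtual files passed in.
    return run_agent_loop(
        system_prompt=config["prompt"],
        user_prompt=user_prompt,
        tools=tools,
        virtual_files=files,
    )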

This has worked fairly well. For example, it supported the quick addition of planning and think subagents, which the parent agent can use to refine its work. We further refactored the harness so that running a top-level agent is equivalent to running a subagent: effectively, every agent is a subagent, and so forth.

How this has worked / what next

To be totally honest, I just haven’t found subagents to be particularly important to our current workflows. However, user-facing latency is a bit of an invisible feature: it doesn’t seem to matter at all until, at some point, it starts subtly creating undesirable user behavior (e.g. starting a different task before checking the response). So I believe the latency reduction will, long-term, be subagents’ biggest advantage for us.

Addendum: as alluded to in the introduction, this subagent functionality ended up being extremely useful when we introduced code-driven workflows, as it allows handing off control to the LLM for a very specific determination before returning control to the code.

Building an internal agent: Code-driven vs LLM-driven workflows

2026-01-01 01:30:00

When I started this project, I knew deep in my heart that we could get an LLM plus tool usage to solve arbitrarily complex workflows. I still believe this is possible, but I’m no longer convinced it’s actually a good solution. Some problems are just vastly simpler, cheaper, and faster to solve with software. This post talks about our approach to supporting both code-driven and LLM-driven workflows, and why we decided it was necessary.

This is part of the Building an internal agent series.

Why determinism matters

When I joined Imprint, we already had a channel where folks would share pull requests for review. It wasn’t required to add pull requests to that channel, but it was often the fastest way to get someone to review them, particularly for cross-team pull requests.

I often start my day by skimming that channel for pull requests that need a review, and I quickly realized that a pull request would often get reviewed and merged without anyone adding the :merged: reacji to the message. This felt inefficient, but also extraordinarily minor, and not the kind of thing I want to complain about. Instead, I pondered how I could solve it without requiring additional human labor.

So, I added an LLM-powered workflow to solve this. The prompt was straightforward:

  1. Get the last 10 messages in the Slack channel
  2. For each one, if there was exactly one GitHub pull request URL, extract that URL
  3. Use the GitHub MCP to check the status of each of those URLs
  4. Add the :merged: reacji to messages where the associated pull request was merged or closed

This worked so well! So, so well. Except, ahh, except that it sometimes decided to add :merged: to pull requests that weren’t merged. Then no one would look at those pull requests. So, it worked in concept–so much smart tool usage!–but in practice it actually didn’t solve the problem I was trying to solve: erroneous additions of the reacji meant folks couldn’t evaluate whether to look at a given pull request in the channel based on the reacji’s presence.

(As an aside, some people really don’t like the term reacji. Don’t complain to me about it, this is what Slack calls them.)

How we implemented support for code-driven workflows

Our LLM-driven workflows are orchestrated by a software handler. That handler works something like:

  1. A trigger comes in, and the handler selects which configuration corresponds to the trigger
  2. The handler uses that configuration and trigger to pull the associated prompt, load the approved tools, and generate the list of available virtual files (e.g. files attached to a Jira issue or Slack message)
  3. The handler sends the prompt and available tools to an LLM, then coordinates tool calls based on the LLM’s response, including e.g. making virtual files available to tools. The handler also has termination conditions where it prevents excessive tool usage, and so on
  4. Eventually the LLM will stop recommending tools, and the final response from the LLM will be used or discarded depending on the configuration (e.g. configuration can determine whether the final response is sent to Slack); a compressed sketch of this loop follows the list
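Here is a compressed sketch of that loop, written against the Anthropic Python SDK, though the harness could sit on any model API; the configuration fields, dispatch_tool helper, and tool-call budget are illustrative, and virtual-file handling is elided:

import anthropic

MAX_TOOL_CALLS = 25  # illustrative termination condition

def run_workflow(config, tool_definitions, dispatch_tool, messages):
    client = anthropic.Anthropic()
    for _ in range(MAX_TOOL_CALLS):
        response = client.messages.create(
            model=config["model"],
            max_tokens=4096,
            system=config["prompt"],
            tools=tool_definitions,
            messages=messages,
        )
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:
            # The LLM stopped recommending tools; use or discard this final
            # response depending on the workflow's configuration.
            return response
        # Record the assistant turn, run each requested tool, and feed results back.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": dispatch_tool(b.name, b.input)}
            for b in tool_uses
        ]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("Exceeded the tool-call budget")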

We updated our configuration to allow workflows to run in one of two modes:

# this is default behavior if omitted
coordinator: llm

# this is code-driven workflow
coordinator: script
coordinator_script: scripts/pr_merged.py

When the coordinator is set to script, custom Python is used to determine which tools are called, instead of the LLM-driven handler loop. That Python code has access to the same tools, trigger data, and virtual files as the LLM-handling code. It can use the subagent tool to invoke an LLM where useful (and that subagent can have full access to tools as well), but LLM control only occurs when explicitly desired.
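For a sense of shape, here is a hedged sketch of what a script like scripts/pr_merged.py could look like; the tool helpers (slack_history, pr_status, add_reaction) and the trigger fields are hypothetical stand-ins for our real tool interfaces:

import re

PR_URL = re.compile(r"https://github\.com/\S+/pull/\d+")

def run(trigger, tools):
    # 1. Get the last 10 messages in the review channel.
    messages = tools.slack_history(channel=trigger.channel_id, limit=10)
    for message in messages:
        urls = PR_URL.findall(message.text)
        # 2. Only act on messages containing exactly one pull request URL.
        if len(urls) != 1:
            continue
        # 3. Check the pull request's status deterministically via the GitHub tool.
        status = tools.pr_status(urls[0])
        # 4. Add the reacji only when the pull request is actually merged or closed.
        if status in ("merged", "closed"):
            tools.add_reaction(channel=trigger.channel_id,
                               timestamp=message.ts, name="merged")

Because the status check and the reaction are plain code, the failure mode from the prompt version, confidently reacting to an unmerged pull request, simply can’t happen.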

This means that these scripts, which are written and checked in by our software engineers and go through code review and so on, have the same permissions and capabilities as the LLM, although given that it’s just code, any given commit could also introduce a new dependency, etc.

How’s it working? / Next steps?

Altogether, this has worked very well for complex workflows. I would describe it as a “solution of frequent resort”, where we use code-driven workflows as a progressive enhancement for workflows where LLM prompts and tools aren’t reliable or quick enough. We still start all workflows using the LLM, which works for many cases. When we do rewrite, Claude Code can almost always rewrite the prompt into the code workflow in one shot.

Even as models get more powerful, relying on them narrowly in cases where we truly need intelligence, rather than for iterative workflows, seems like a long-term addition to our toolkit.

Building an internal agent: Logging and debuggability

2026-01-01 01:15:00

Agents are extremely impressive, but they also introduce a lot of non-determinism, and non-determinism means sometimes weird things happen. To combat that, we’ve needed to instrument our workflows to make it possible to debug why things are going wrong.

This is part of the Building an internal agent series.

Why logging matters

Whenever an agent does something sub-optimal, folks flag it as a bug. Often, the “bug” is ambiguity in the prompt that led to sub-optimal tool usage. That makes me feel better, but it doesn’t make the folks relying on these tools feel any better: they just expect the tools to work.

This means that debugging unexpected behavior is a significant part of rolling out agents internally, and it’s important to make it easy enough to do frequently. If it takes too much time or effort, or requires too many permissions, then your agents simply won’t get used.

How we implemented logging

Our agents run in an AWS Lambda, so the very first pass at logging was simply printing to standard out, which was captured in the Lambda’s logs. This worked OK for the very first steps, but it also meant that I had to log into AWS every time something went wrong, and many engineers didn’t even know where to find the logs.

The second pass was creating the #ai-logs channel, where every workflow run shared its configuration, the tools used, and the AWS URL where its logs could be found. This was a step up, but still required a bunch of log spelunking to answer basic questions.

The third pass, which is our current implementation, was integrating Datadog’s LLM Observability, which provides an easy-to-use mechanism to view each span within the LLM workflow, making it straightforward to debug nuanced issues without digging through a bunch of logs. This is a massive improvement.
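For readers who haven’t used it, here is a rough sketch of what instrumenting a workflow looks like; the enable() arguments and decorator names reflect my understanding of Datadog’s LLM Observability Python SDK (ddtrace) and should be checked against their docs, and the workflow body is a stand-in:

from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import tool, workflow

LLMObs.enable(ml_app="internal-agents")  # groups spans under one app in the Datadog UI

@tool
def slack_history(channel, limit):
    # Each decorated tool call shows up as its own span under the workflow span.
    return []  # stub standing in for the real Slack tool

@workflow
def handle_trigger(trigger):
    messages = slack_history(trigger["channel_id"], limit=10)
    result = f"saw {len(messages)} messages"  # stand-in for the real agent loop
    # Attach the workflow's input and output to the span to make debugging easier.
    LLMObs.annotate(input_data=trigger, output_data=result)
    return result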

It’s also worth noting that the Datadog integration made it easy to introduce dashboarding for our internal efforts, which has been a very helpful, previously missing ingredient in our work.

How is it working? / What’s next?

I’ll be honest: the Datadog LLM observability toolkit is just great. The only problem I have at this point is that we mostly constrain Datadog accounts to folks within the technology organization, so workflow debugging isn’t very accessible to folks outside that organization. However, in practice there are very few folks who would be actively debugging these workflows who don’t already have access, so it’s more of a philosophical issue than a practical one.

Building an internal agent: Evals to validate workflows

2026-01-01 01:00:00

Whenever a new pull request is submitted to our agent’s GitHub repository, we run a bunch of CI/CD operations on it. We run an opinionated linter, we run typechecking, and we run a bunch of unit tests. All of these work well, but none of them test entire workflows end-to-end. For that end-to-end testing, we introduced an eval pipeline.

This is part of the Building an internal agent series.

Why evals matter

The harnesses that run agents have a lot of interesting nuance, but they’re generally pretty simple: some virtual file management, some tool invocation, and some context window management. However, it’s very easy to create prompts that don’t work well, despite the correctness of all the underlying pieces. Evals are one tool to solve that, exercising your prompts and tools together and grading the results.

How we implemented it

I had the good fortune to lead Imprint’s implementation of Sierra for chat and voice support, and I want to acknowledge that their approach has deeply informed my view of what does and doesn’t work well here.

The key components of Sierra’s approach are:

  1. Sierra implements agents as a mix of React-inspired code that provides tools and progressively disclosed context, and a harness that runs that software.
  2. Sierra allows your code to assign tags to conversations, such as "otp-code-sent" or "lang-spanish", which can be used for filtering conversations, as well as for other use cases discussed shortly.
  3. Every tool implemented for a Sierra agent has both a true and a mock implementation (see the sketch after this list). For example, for a tool that searches a knowledge base, the true version would call its API directly, and the mock version would return a static (or locally generated) result for use in testing.
  4. Sierra calls their eval implementation "simulations." You can create any number of simulations, either in code or via the UI-driven functionality.
  5. Every evaluation has an initial prompt, metadata about the situation that's available to the software harness running the agent, and criteria for deciding whether the evaluation succeeds.
  6. These evaluation criteria are both subjective and objective. The subjective criteria use an "agent as judge" to assess whether certain conditions were met (e.g. was the response friendly?). The objective criteria are whether specific tags ("login-successful") were, or were not ("login-failed"), added to a conversation.
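To illustrate the third point, here is a hedged sketch of a tool with a true and a mock implementation; the knowledge-base endpoint and the selection flag are hypothetical:

import requests

def search_knowledge_base_real(query):
    # True implementation: call the knowledge base's API directly.
    resp = requests.get("https://kb.example.internal/search", params={"q": query})
    resp.raise_for_status()
    return resp.json()["results"]

def search_knowledge_base_mock(query):
    # Mock implementation: return a static result so evals are cheap and stable.
    return [{"title": "Card replacement policy", "url": "https://kb.example.internal/123"}]

def get_tools(use_mocks):
    # Evals run against the mocks; production workflows get the real integrations.
    impl = search_knowledge_base_mock if use_mocks else search_knowledge_base_real
    return {"search_knowledge_base": impl}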

As for our approach, we basically just reimplemented Sierra’s, as it has worked well for us. For example, the following image shows the configuration for an eval we run.

[Image: YAML configuration for an eval, showing a Slack reaction Jira workflow test with expected tools and evaluation criteria]

Then whenever a new PR is opened, these run automatically along with our other automation.

[Image: GitHub Actions bot comment showing eval results, with 6 failed tests and tool mismatch details]

While we largely followed the map laid out by Sierra’s implementation, we did diverge on the tags concept. For objective evaluation, we rely exclusively on which tools are, or are not, called. Sierra’s tag implementation is more flexible, but since our workflows are predominantly prompt-driven rather than code-driven, it’s not an easy concept for us to adopt.

Altogether, following this standard implementation worked well for us.

How is it working?

OK, this is working well, but not nearly as well as I hoped it would. The core challenge is the non-determinism introduced by these eval tests: in practice, there’s very strong signal when they all fail, and strong signal when they all pass, but most runs land somewhere in between. A big part of that is sloppy eval prompts and sloppy mock tool results, and I’m pretty confident I could get them passing more reliably with some effort (e.g. I did get our Sierra tests almost always passing by tuning them closely, although even they aren’t perfectly reliable).

The biggest issue is that our reliance on prompt-driven workflows rather than code-driven workflows introduces a lot of non-determinism, which I don’t have a way to solve without the aforementioned prompt and mock tuning.

What’s next?

There are three obvious follow-ups:

  1. More tuning on prompts and mocked tool calls to make the evals more probabilistically reliable
  2. I’m embarrassed to say it out loud, but I suspect we need to retry failed evals to see if they pass, e.g. “at least once in three tries,” to make this something we can introduce as a blocking mechanism in our CI/CD (a small sketch of this follows the list)
  3. This highlights a general limitation of LLM-driven workflows, and I suspect that I’ll have to move more complex workflows away from LLM-driven coordination to get them to work more consistently
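For the second item, the gating itself is simple; here is a sketch, where run_eval is a hypothetical callable that runs one eval and reports whether it passed:

def passes_with_retries(run_eval, attempts=3):
    # "Pass at least once in three tries" gating for a non-deterministic eval.
    for attempt in range(1, attempts + 1):
        if run_eval():
            return True
        print(f"Eval failed on attempt {attempt} of {attempts}")
    return False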

Altogether, I’m very glad that we introduced evals; they are an essential mechanism for us to evaluate our workflows, but we’ve found them difficult to consistently operationalize as a blocking tool rather than as directionally relevant context.