2026-01-02 01:00:00
A few weeks ago in Facilitating AI adoption at Imprint, I mentioned our internal agent workflows that we are developing. This is not the core of Imprint–our core is powering co-branded credit card programs–and I wanted to document how a company like ours is developing these internal capabilities.
Building on that post’s ideas like a company-public prompt library for the prompts powering internal workflows, I wanted to write up some of the interesting problems and approaches we’ve taken as we’ve evolved our workflows, split into a series of shorter posts:
In the same spirit as the original post, I’m not writing these as an industry expert unveiling best practice; rather, these are just the things that we’ve specifically learned along the way. If you’re developing internal frameworks as well, then hopefully you’ll find something interesting in these posts.
As more folks have read these notes, a recurring response has been, “How do I learn this stuff?” Although I haven’t spent time evaluating if this is the best way to learn, I can share what I have found effective:
- a tool_search tool along the lines of Anthropic Claude’s tool_search, which uses a separate context window to evaluate your current context window against available skills and return only the relevant skills to be used within your primary context window (a rough sketch follows below)
- tools like load_file, grep_file, or whatnot
- a load_skills tool, and enhancing the prompt with available skills

After working through the implementation of each of these features, I think you will have a strong foundation in how to build and extend these kinds of systems. The only missing piece is supporting code-driven agents, but unfortunately I think it’s hard to demonstrate the need for code-driven agents in simple examples, because LLM-driven agents are sufficiently capable to solve most contrived examples.
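To make the tool_search idea concrete, here is a rough sketch of that pattern: a second model call, in its own context window, picks out the relevant skills so only those get loaded into the primary agent’s context. The skill-file layout, model name, and prompt below are illustrative assumptions, not a description of Anthropic’s feature or any particular implementation.

```python
# Hypothetical sketch of a tool_search-style helper: a second model call, in
# its own context window, decides which skills are relevant to the task at
# hand, so only those are loaded into the primary agent's context. The skill
# file layout, model name, and prompt are illustrative assumptions.
import json
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set


def tool_search(task_summary: str, skills_dir: str = "skills") -> list[str]:
    """Return the names of skills relevant to the current task."""
    # Assume each skill is a markdown file whose first line is a short description.
    skills = {
        path.stem: path.read_text().splitlines()[0]
        for path in Path(skills_dir).glob("*.md")
    }
    prompt = (
        f"Task summary:\n{task_summary}\n\n"
        "Available skills (name: description):\n"
        + "\n".join(f"- {name}: {desc}" for name, desc in skills.items())
        + "\n\nReply with only a JSON array of the relevant skill names ([] if none)."
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    # Fragile parsing is fine for a sketch; a real implementation would validate.
    return json.loads(response.content[0].text)
```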
There are many existing agent frameworks, including the OpenAI Agents SDK and Anthropic’s Claude Agent SDK. Ultimately, I think these are fairly thin wrappers, and that you’ll learn a lot more by implementing these yourself, but I’m less confident that you’re better off long-term building your own framework.
My general recommendation would be to build your own to throw away, and then try to build on top of one of the existing frameworks if you find any meaningful limitations. That said, I really don’t regret the decision to build our own, because it’s just so simple from a code perspective.
I think every company should be doing this work internally, very much including companies that aren’t doing any sort of direct AI work in their product. It’s very fun work to do, there’s a lot of room for improvement, and having an engineer or two working on this is a relatively cheap option to derisk things if AI-enhanced techniques continue to improve as rapidly in 2026 as they did in 2025.
2026-01-02 00:30:00
Some of our internal workflows are being used quite frequently, and usage reveals gaps in the current prompts, skills, and tools. Here is how we’re working to iterate on these internal workflows.
This is part of the Building an internal agent series.
When companies push on AI-led automation, specifically meaning LLM agent-driven automation, there are two major goals. First is the short-term goal of increasing productivity. That’s a good goal. Second, and I think even more importantly, is the long-term goal of helping their employees build a healthy intuition for how to use various kinds of agents to accomplish complex tasks.
If we see truly remarkable automation benefits from the LLM wave of technology, it’s not going to come from the first wave of specific tools we build, but from the output of a new class of LLM-informed users and developers. There is nowhere you can simply acquire that talent; instead, it’s talent you have to develop in-house, and involving more folks in iterative refinement of LLM-driven systems is the most effective approach that I’ve encountered.
We’ve taken a handful of different approaches here, all of which are currently in use. From earliest to latest, our approaches have been:
Being responsive to feedback is our primary mechanism for solving issues. This means both responding quickly in an internal #ai channel and skimming through workflows each day to see humans interacting, for better and for worse, with the agents. This is the most valuable ongoing source of improvement.
Owner-led refinement has been our intended primary mechanism, although in practice it’s more of the secondary mechanism. We store our prompts in Notion documents, where they can be edited by their owners in real-time. Permissions vary on a per-document basis, but most prompts are editable by anyone at the company, as we try to facilitate rapid learning.
Editable prompts alone aren’t enough: these prompts also need to be discoverable. To address that, whenever an action is driven by a workflow, we include a link to the prompt. For example, a Slack message sent by a chat bot will include a link to the prompt, as will a comment in Jira.
Claude-enhanced, owner-led refinement–using the Datadog MCP to pull logs into the repository where the skills live–has been fairly effective, although mostly as a technique used by the AI Engineering team rather than directly by owners. Skills are a bit of a platform, as they are used by many different workflows, so it may be inevitable that they are maintained by a central team rather than by workflow owners.
Dashboard tracking shows how often each workflow runs and errors associated with those runs. We also track how often each tool is used, including how frequently each skill is loaded.
My guess is that we will continue to add more refinement techniques as we go, without being able to get rid of any of the existing ones. This is sort of disappointing–I’d love to have the same result with fewer–but I think we’d be worse off if we cut any of them.
What we don’t do yet, but which is the necessary next step to making this truly useful, is including a subjective post-workflow eval that determines whether the workflow was effective. While we have evals to evaluate workflows, this would be using evals to evaluate individual workflow runs, which would provide a very useful level of detail.
In our experience thus far, there are roughly three workflow archetypes: chatbots, very well-understood iterative workflows (e.g. applying the :merged: reacji to merged PRs as discussed in code-driven workflows), and not-yet-well-understood workflows.
Once we build a code-driven workflow, it has always worked well for us, because by that point we have built a very focused, well-understood solution. Conversely, chatbots are an extremely broad, amorphous problem space, and I think post-run evals will provide a high-quality dataset to improve them iteratively, with a small amount of human-in-the-loop to nudge the evolution of their prompts and skills.
The open question, for us anyway, is how we do a better job of identifying and iterating on the not-yet-well-understood workflows, ideally without requiring a product engineer to understand and implement each of them individually. We’ve not scalably cracked this one yet, and I do think scalably cracking it is the difference between these internal agents being somewhat useful (frequently performed tasks performed by many people eventually get automated) and being truly transformative (a significant percentage of tasks, even infrequent ones performed by a small number of people, get automated).
2026-01-01 01:45:00
Most of the extensions to our internal agent have been the direct result of running into a problem that I couldn’t elegantly solve within our current framework. Evals, compaction, and large-file handling all fit into that category. Subagents, allowing an agent to initiate other agents, are in a different category: I’ve frequently thought that we needed subagents, and then always found an alternative that felt more natural.
Eventually, I decided to implement them anyway, because it seemed like an interesting problem to reason through. Eventually I would need them… right? (Aside: I did, indeed, eventually use subagents to support code-driven workflows invoking LLMs.)
This is part of the Building an internal agent series.
“Subagents” is the name for allowing your agents to invoke other agents, which have their own system prompt, available tools, and context windows. Some of the reasons you’re likely to consider subagents:
Of all these reasons, I think that either the first or the second will be most relevant to the majority of internal workflow developers.
Our implementation for subagents is quite straightforward:
- Subagents are defined in subagents/*.yaml, where each subagent has a prompt, allowed tools (or an option to inherit all tools from the parent agent), and a subset of the configurable fields from our agent configuration (for example, the planning subagent; a sketch of this shape follows below)
- Agents are given a subagent(agent_name, prompt, files) tool, which allows them to decide which virtual files are accessible within the subagent, and also the user prompt passed to the subagent (the subagent already has a default system prompt within its configuration)

This has worked fairly well. For example, it supported the quick addition of planning and think subagents, which the parent agent can use to refine its work.
We further refactored the harness that runs agents to be equivalent to subagents, so that effectively every agent is a subagent, and so forth.
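For illustration only, a subagent definition along these lines might look something like the following; every field name is a guess at the shape described above, not our real schema.

```yaml
# subagents/planning.yaml -- hypothetical field names for illustration
name: planning
prompt: |
  You are a planning assistant. Break the request you are given into a short,
  ordered list of concrete steps, then stop.
tools:
  - load_file
  - grep_file
# or, hypothetically, inherit_tools: true to reuse the parent agent's tool list
max_turns: 5
```

The parent agent would then call something like subagent("planning", prompt="Plan the rollout described in notes.md", files=["notes.md"]) to run it against a chosen slice of its virtual files (the file name here is hypothetical).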
To be totally honest, I just haven’t found subagents to be particularly important to our current workflows. However, user-facing latency is a bit of an invisible feature: it doesn’t matter at all until, at some point, it starts subtly creating undesirable user behaviors (e.g. starting a different task before checking the response), so I believe that long-term, latency is where subagents will be the biggest advantage for us.
Addendum: as alluded to in the introduction, this subagents functionality ended up being extremely useful when we introduced code-driven workflows, as it allows handing off control to the LLM for a very specific determination, before returning control to the code.
2026-01-01 01:30:00
When I started this project, I knew deep in my heart that we could get an LLM plus tool-usage to solve arbitrarily complex workflows. I still believe this is possible, but I’m no longer convinced this is actually a good solution. Some problems are just vastly simpler, cheaper, and faster to solve with software. This post talks about our approach to supporting both code and LLM-driven workflows, and why we decided it was necessary.
This is part of the Building an internal agent series.
When I joined Imprint, we already had a channel where folks would share pull requests for review. It wasn’t required to add pull requests to that channel, but it was often the fastest way to get someone to review it, particularly for cross-team pull requests.
I often start my day by skimming that channel for pull requests that need a review, and I quickly realized that a pull request would often get reviewed and merged without anyone adding the :merged: reacji to the message. This felt inefficient, but also extraordinarily minor, and not the kind of thing I want to complain about.
Instead, I pondered how I could solve it without requiring additional human labor.
So, I added an LLM-powered workflow to solve this. The prompt was straightforward:
add the :merged: reacji to messages where the associated pull request was merged or closed

This worked so well! So, so well. Except, ahh, except that it sometimes decided to add :merged: to pull requests that weren’t merged. Then no one would look at those pull requests.
So, it worked in concept–so much smart tool usage!–but in practice it actually didn’t
solve the problem I was trying to solve: erroneous additions of the reacji meant
folks couldn’t evaluate whether to look at a given pull request in the channel based on the reacji’s presence.
(As an aside, some people really don’t like the term reacji.
Don’t complain to me about it, this is what Slack calls them.)
Our LLM-driven workflows are orchestrated by a software handler. That handler works something like:
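Very roughly, and only as a sketch rather than the actual implementation: read the trigger, call the model with the workflow’s prompt and tools, execute whichever tools the model requests, feed the results back, and repeat until it stops requesting tools. In code, that shape looks something like this (the model name, tool registry, and schema below are placeholders):

```python
# A minimal sketch of a tool-calling handler loop; not the actual
# implementation. The model name, tool registry, and schema are placeholders.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set


def load_file(path: str) -> str:
    with open(path) as f:
        return f.read()


TOOLS = {"load_file": load_file}  # tool name -> callable
TOOL_SCHEMAS = [{
    "name": "load_file",
    "description": "Load a virtual file into the context window.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}]


def run_workflow(system_prompt: str, trigger: str) -> str:
    messages = [{"role": "user", "content": trigger}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model name
            max_tokens=2000,
            system=system_prompt,
            tools=TOOL_SCHEMAS,
            messages=messages,
        )
        # Keep the assistant turn (including any tool_use blocks) in history.
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # No further tool calls: return the final text.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool and hand the results back to the model.
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(TOOLS[block.name](**block.input)),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
```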
We updated our configuration to allow running in one of two modes:
```yaml
# this is default behavior if omitted
coordinator: llm

# this is code-driven workflow
coordinator: script
coordinator_script: scripts/pr_merged.py
```
When the coordinator is set to script, instead of using the handler to determine which tools are called, custom Python is used. That Python code has access to the same tools, trigger data, and virtual files as the LLM-handling code. It can use the subagent tool to invoke an LLM where useful (and that subagent can have full access to tools as well), but LLM control only occurs when explicitly desired.
This means that these scripts–which are being written and checked in by our software engineers, going through code review and so on–have the same permissions and capabilities as the LLM, although since it’s just code, any particular commit could also introduce a new dependency, etc.
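As a concrete but hypothetical illustration, a coordinator script in the spirit of scripts/pr_merged.py might look roughly like this; the tools interface and every tool name are assumptions made for the sketch, not our actual harness API.

```python
# scripts/pr_merged.py -- hypothetical sketch of a code-driven coordinator.
# The `tools` interface and every tool name below are assumptions made for
# illustration, not the harness's actual API.


def extract_pr_url(text: str) -> str | None:
    # Deterministic parsing; a subagent could be invoked here instead if the
    # message format were too messy for simple string matching.
    for token in text.split():
        if "github.com" in token and "/pull/" in token:
            return token.strip("<>|")
    return None


def run(trigger: dict, tools) -> None:
    """Add the :merged: reacji to channel messages whose PR is merged or closed."""
    for message in tools.call("list_recent_channel_messages", channel=trigger["channel"]):
        pr_url = extract_pr_url(message["text"])
        if pr_url is None:
            continue
        pr = tools.call("get_pull_request", url=pr_url)
        if pr["state"] in ("merged", "closed"):
            tools.call(
                "add_reaction",
                channel=trigger["channel"],
                timestamp=message["ts"],
                name="merged",
            )
```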
Altogether, this has worked very well for complex workflows. I would describe it as a “solution of frequent resort”, where we use code-driven workflows as a progressive enhancement for workflows where LLM prompts and tools aren’t reliable or quick enough. We still start all workflows using the LLM, which works for many cases. When we do rewrite, Claude Code can almost always rewrite the prompt into a code-driven workflow in one shot.
Even as models get more powerful, relying on them narrowly in cases where we truly need intelligence, rather than for iterative workflows, seems like a long-term addition to our toolkit.
2026-01-01 01:15:00
Agents are extremely impressive, but they also introduce a lot of non-determinism, and non-determinism means sometimes weird things happen. To combat that, we’ve needed to instrument our workflows to make it possible to debug why things are going wrong.
This is part of the Building an internal agent series.
Whenever an agent does something sub-optimal, folks flag it as a bug. Often, the “bug” is ambiguity in the prompt that led to sub-optimal tool usage. That makes me feel better, but it doesn’t make the folks relying on these tools feel any better: they just expect the tools to work.
This means that debugging unexpected behavior is a significant part of rolling out agents internally, and it’s important to make it easy enough to do frequently. If it takes too much time or effort, or requires too many permissions, then your agents simply won’t get used.
Our agents run in an AWS Lambda, so the very first pass at logging was simply printing to standard out to be captured in the Lambda’s logs. This worked OK for the very first steps, but it also meant that I had to log into AWS every time something went wrong, and many engineers didn’t even know where to find the logs.
The second pass was creating the #ai-logs channel, where every workflow run shared its configuration, the tools used, and a link to where its logs could be found in AWS. This was a step up, but still required a bunch of log spelunking to answer basic questions.
The third pass, and our current implementation, was integrating Datadog’s LLM Observability, which provides an easy-to-use way to view each span within an LLM workflow and makes it straightforward to debug nuanced issues without digging through a bunch of logs. This is a massive improvement.
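For anyone wiring this up themselves, the integration amounts to decorating the spans you care about. Here is a hedged sketch using the ddtrace LLM Observability SDK; the function names and ml_app value are placeholders, and the exact SDK surface is worth checking against Datadog’s current docs.

```python
# Hedged sketch of instrumenting a workflow with Datadog's LLM Observability
# SDK (ddtrace); function names and the ml_app value are placeholders.
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import tool, workflow

LLMObs.enable(ml_app="internal-agent")  # Datadog credentials come from the environment


@tool
def load_file(path: str) -> str:
    # Each decorated call shows up as a span in the LLM Observability trace view.
    with open(path) as f:
        return f.read()


@workflow
def run_pr_workflow(trigger: dict) -> str:
    # ... call the model, run tools, etc. (placeholder body) ...
    result = "done"
    LLMObs.annotate(input_data=trigger, output_data=result)
    return result
```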
It’s also worth noting that the Datadog integration made it easy to introduce dashboarding for our internal efforts, which has been a very helpful and previously missing ingredient in our work.
I’ll be honest: the Datadog LLM observability toolkit is just great. The only problem I have at this point is that we mostly constrain Datadog accounts to folks within the technology organization, so workflow debugging isn’t very accessible to folks outside that team. However, in practice there are very few folks who would be actively debugging these workflows who don’t already have access, so it’s more of a philosophical issue than a practical one.
2026-01-01 01:00:00
Whenever a new pull request is submitted to our agent’s GitHub repository, we run a bunch of CI/CD operations on it. We run an opinionated linter, we run typechecking, and we run a bunch of unit tests. All of these work well, but none of them test entire workflows end-to-end. For that end-to-end testing, we introduced an eval pipeline.
This is part of the Building an internal agent series.
The harnesses that run agents have a lot of interesting nuance, but they’re generally pretty simple: some virtual file management, some tool invocation, and some context window management. However, it’s very easy to create prompts that don’t work well, despite the correctness of all the underlying pieces. Evals are one tool to solve that, exercising your prompts and tools together and grading the results.
I had the good fortune to lead Imprint’s implementation of Sierra for chat and voice support, and I want to acknowledge that their approach has deeply informed my view of what does and doesn’t work well here.
The key components of Sierra’s approach are:
When it came to our own approach, we basically just reimplemented Sierra’s, as it has worked well for us. For example, the following image is the configuration for an eval we run.

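As a rough sketch of the shape these configurations take (field names and values here are invented for illustration, not our actual schema), such an eval might look like:

```yaml
# evals/pr_merged.yaml -- illustrative only; field names are guesses at the
# shape described in this post, not the real configuration.
name: pr-merged-reacji
workflow: pr_merged
trigger:
  channel: "#pull-requests"
  message: "Can someone review https://github.com/example/repo/pull/123?"
mock_tools:
  get_pull_request:
    state: merged
expect:
  tools_called:
    - add_reaction
  tools_not_called:
    - send_message
```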
Then whenever a new PR is opened, these run automatically along with our other automation.

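If you’re reproducing this, the CI hook itself is unremarkable; a hypothetical GitHub Actions wiring (workflow name, entry point, and file names are all assumptions) might look like:

```yaml
# .github/workflows/evals.yaml -- hypothetical CI wiring, not our actual setup
name: evals
on: pull_request
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python -m evals  # hypothetical entry point that runs the eval configs
```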
While we largely followed the map laid out by Sierra’s implementation, we did diverge on the tags concept. For objective evaluation, we rely exclusively on tools that are, or are not, called. Sierra’s tag implementation is more flexible, but since our workflows are predominantly prompt-driven rather than code-driven, it’s not an easy one for us to adopt.
Altogether, following this standard implementation worked well for us.
Ok, this is working well, but not nearly as well as I hoped it would. The core challenge is the non-determinism introduced by these eval tests, where in practice there’s very strong signal when they all fail, and strong signal when they all pass, but most runs are in between those two. A big part of that is sloppy eval prompts and sloppy mock tool results, and I’m pretty confident I could get them passing more reliably with some effort (e.g. I did get our Sierra tests almost always passing by tuning them closely, although even they aren’t perfectly reliable).
The biggest issue is that our reliance on prompt-driven workflows rather than code-driven workflows introduces a lot of non-determinism, which I don’t have a way to solve without the aforementioned prompt and mock tuning.
There are three obvious follow-ups:
Altogether, I’m very glad that we introduced evals: they are an essential mechanism for us to evaluate our workflows, but we’ve found them difficult to consistently operationalize as something we can rely on as a blocking tool rather than as directionally relevant context.