Lenny Rachitsky

The #1 business newsletter on Substack.

RSS preview of Lenny Rachitsky’s blog

🎙️ How I AI: How to write AI agent loops in Claude Code and Codex + How Claude Mythos found a 15-year-old bug in Mozilla Firefox

2026-06-22 23:02:37

How to design AI agent loops: schedules, goals, and subagents in Claude Code and Codex

Listen now on YouTube, Spotify, and Apple Podcasts

Brought to you by:

  • WorkOS—Make your app enterprise-ready today

  • Runway—The creative AI platform for images, video and more

In this hands-on tutorial, Claire explains the difference between heartbeats, crons, hooks, and goal-based loops, then builds real automations in Claude Code and Codex, including a daily PR-review loop and a weekly skills loop that spawns its own subagents. If you’ve heard “loop engineering” and wondered what it actually means, this is the beginner-friendly breakdown.

Biggest takeaways:

  1. A loop is just a prompt that fires itself, nothing more exotic than that. The reason “loops” sound intimidating is that the hype cycle turned a basic automation concept into something mystical. Heartbeats, crons, and webhooks have been around forever. What’s new is pointing them at an AI agent instead of a batch job.

  2. Goals are the most powerful loop type, and the one most people get wrong. A goal loop sets an outcome and runs an agent against it until the outcome is validated or the agent gets stuck. It doesn’t stop on a timer; it stops when the work is actually done. Fuzzy success criteria mean the agent loops forever, burning tokens, so my advice is to let Codex write its own goals, using OpenAI’s goal-writing guide as a starting point.

  3. Think about loops the way you think about onboarding an employee. Define the job: what they check, how often, what output you want, and who to contact when something’s wrong. “Every Friday at 10 a.m., review all merged PRs and identify skills our agents are missing” is a job description. It’s also a loop prompt.

  4. Your agent can have its own agents. This is where loops get truly powerful. The PR-review loop Claire built in Claude Code doesn’t just check PR status; it spins off dedicated subagents to babysit individual PRs until all merge checks are green. The skills loop in Codex identifies gaps and immediately spawns subagents to validate each new skill using a goal loop.

  5. Loops get expensive if you don’t write them carefully. If the success criteria are vague or the validation check is too weak, the agent will keep running and keep charging without meaningful progress. Monitor both cost and output quality from day one.

  6. The morning briefing in Claude Cowork is a perfect loop starter. A scheduled task that fires every morning, checks your calendar and email, and sends a summary to Slack is already a fully functional loop. No code required. From there, scaling up to PR reviews or skills identification in Claude Code or Codex is a natural next step.

  7. The power move is loops that generate their own subagent loops. In the Codex demo, Claire’s weekly automation spawned two named subagents that each ran their own goal loops to validate skills in real time. The ceiling on loop-based automation is basically “how well can you define the job?” not “how complex is the engineering?”
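To make the goal-loop idea in the takeaways above concrete, here is a minimal sketch in Python. It is not how Claude Code or Codex implement goals; `run_goal_loop`, `GoalResult`, and `fake_agent` are hypothetical names, and the agent call is replaced by a stand-in function. The point it illustrates is the stopping rule: the loop ends when a validator passes or when an attempt budget runs out, never on vague criteria alone.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoalResult:
    done: bool       # True only if the validator accepted an output
    attempts: int    # how many attempts were made
    reason: str      # why the loop stopped


def run_goal_loop(attempt: Callable[[int], str],
                  validate: Callable[[str], bool],
                  max_attempts: int = 5) -> GoalResult:
    """Call `attempt` until `validate` passes or the budget is spent."""
    for n in range(1, max_attempts + 1):
        output = attempt(n)
        if validate(output):
            return GoalResult(True, n, "validated")
    # The budget is the guard against looping forever and burning tokens.
    return GoalResult(False, max_attempts, "budget exhausted")


# Stand-in for an agent call (hypothetical): succeeds on the third try.
def fake_agent(n: int) -> str:
    return "all checks green" if n >= 3 else "checks failing"


result = run_goal_loop(fake_agent, lambda out: out == "all checks green")
print(result)
```

A sharp `validate` function is the part worth the most effort: if it cannot say "done" unambiguously, the only thing that stops the loop is the budget.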

Blog and detailed workflow walkthroughs from this episode:

How I AI: Designing AI Agent Loops in Claude Code and Codex: https://www.chatprd.ai/how-i-ai/how-i-ai-designing-ai-agent-loops-in-claude-code-and-codex
↳ Build a Self-Improving AI to Generate Agent Skills in Codex: https://www.chatprd.ai/how-i-ai/workflows/build-a-self-improving-ai-to-generate-agent-skills-in-codex
↳ Automate Daily Pull Request Reviews with a Claude Code Agent: https://www.chatprd.ai/how-i-ai/workflows/automate-daily-pull-request-reviews-with-a-claude-code-agent

How Claude Mythos found a 15-year-old bug in Mozilla Firefox | Brian Grinstead

Listen now on YouTube, Spotify, and Apple Podcasts

Brought to you by:

  • WorkOS—Make your app enterprise-ready today

  • Metaview—The agentic recruiting platform for winning teams

Brian Grinstead, distinguished engineer at Mozilla, breaks down how his team used AI agents to ship 423 Firefox security fixes in one month. He explains why the real unlock wasn’t just a better model, but the custom harness around it: scoring files, running goal loops, verifying bugs with subagents, and keeping humans in the review process. It’s a tactical look at how to point agents at a massive codebase and get fixes you can actually ship.

Biggest takeaways:

  1. The Firefox security bug spike wasn’t just about the model; it was the harness too. While everyone focused on Mythos, the real story is that Firefox built a custom harness that gives AI agents the right tools to find, verify, and fix bugs. Brian says this is simpler than it looks: “It’s actually a reasonably simple wrapper around it. You just need to give it access to the right tools for the job.”

  2. Agents are relentless in a way humans can’t be. Agents will try 14, 15, 20 different approaches to trigger a bug without getting tired or losing focus. Brian found bugs that required the agent to try 14 times before succeeding. As Brian notes, “Cognitive energy declines over time in a way that agents don’t.”

  3. The verification loop is what eliminates false positives. Firefox uses a two-stage verification process: first, the agent must trigger an actual crash in their fuzzing build (a crystal-clear signal), and second, a verifier subagent checks that the bug report makes sense and doesn’t involve test-only configurations. By the time a bug reaches human engineers, there are almost no false positives.

  4. Agents get laser-focused on the specific task and miss the bigger picture. When the patching agent fixed a bug, it would often patch just the one vulnerable location. Human engineers would then look at the fix and say, “This is right, but we should also check three other similar places in the codebase.”

  5. Prioritization is essential when you have millions of lines of code. Firefox built a simple LLM judge that scores each file on two dimensions: likelihood of a memory safety issue, and ease of access from a webpage. Brian says this is “very, very simple” and anyone can replicate it.

  6. The harness can be built in an afternoon using vendor SDKs. Firefox started with Claude’s agent SDK, which is essentially a wrapper around Claude Code CLI that streams JSON and provides programmatic hooks. Brian’s advice: use the vendor-provided harnesses (Claude agent SDK, OpenAI agent SDK) rather than third-party frameworks, because the models are likely post-trained to work best with their own infrastructure.

  7. You should run multiple models and harnesses for security work. Because attackers will use whatever model and technique finds bugs, defenders need to scan with multiple approaches. Different models and harnesses spike on different strengths and will identify different vulnerabilities.

  8. This approach works for more than security—performance, tech debt, and UX are all viable targets. The same pattern applies: score and prioritize areas of your codebase, give the agent a constrained goal with verification criteria, and plug the results into your existing pipeline. Brian says they’re doing active work on performance optimization using the same harness structure.
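The prioritization step in the takeaways above can be sketched in a few lines of Python. This is an illustration, not Mozilla's code: the two dimensions come from the episode, but the 0-to-1 scale, the choice to multiply the two scores, the `FileScore` and `rank` names, and the file paths are all assumptions. In a real pipeline the numbers would come from an LLM judge rather than being typed in by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileScore:
    path: str
    memory_risk: float  # judge's estimate of a memory safety issue (assumed 0-1)
    web_reach: float    # judge's estimate of reachability from a webpage (assumed 0-1)


def priority(score: FileScore) -> float:
    # Assumption: multiply, so a file must rate on both dimensions to rank high.
    return score.memory_risk * score.web_reach


def rank(scores: List[FileScore], top_n: int) -> List[FileScore]:
    """Return the top_n files, highest priority first."""
    return sorted(scores, key=priority, reverse=True)[:top_n]


# Hypothetical judge output for three made-up paths.
scores = [
    FileScore("parser/a.cpp", memory_risk=0.9, web_reach=0.2),
    FileScore("net/b.cpp", memory_risk=0.6, web_reach=0.8),
    FileScore("dom/c.cpp", memory_risk=0.3, web_reach=0.9),
]
top = rank(scores, top_n=2)
print([s.path for s in top])
```

Ranking before spending any agent compute is the design choice that matters here; the exact combining formula is easy to swap out.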

Blog and detailed workflow walkthroughs from this episode:

How Mozilla Fixed 500 Security Bugs with Claude Mythos: https://www.chatprd.ai/how-i-ai/how-mozilla-fixed-500-security-bugs-with-mythos
↳ Create an AI-Powered Patch and Verification Loop for Security Bugs: https://www.chatprd.ai/how-i-ai/workflows/create-an-ai-powered-patch-and-verification-loop-for-security-bugs
↳ Use an LLM as a Security Judge to Prioritize Codebase Analysis: https://www.chatprd.ai/how-i-ai/workflows/use-an-llm-as-a-security-judge-to-prioritize-codebase-analysis
↳ Build an AI Agentic Harness for Automated Security Bug Hunting: https://www.chatprd.ai/how-i-ai/workflows/build-an-ai-agentic-harness-for-automated-security-bug-hunting


If you’re enjoying these episodes, reply and let me know what you’d love to learn more about: AI workflows, hiring, growth, product strategy—anything.

Catch you next week,
Lenny

P.S. Want every new episode delivered the moment it drops? Hit “Follow” on your favorite podcast app.

How Claude Mythos found a 15-year-old bug in Mozilla Firefox | Brian Grinstead

2026-06-22 20:03:06

Brian Grinstead is a distinguished engineer at Mozilla, where he’s worked on Firefox and the web platform since 2013 (he joined to help launch Firefox DevTools). Recently he and his team pointed an agentic bug-finding pipeline at Firefox—a codebase with tens of thousands of files and tens of millions of lines of code—and shipped a record month of security fixes. The viral chart everyone saw gave the credit to Anthropic’s new Mythos model. Brian’s take is that the harness and pipeline did just as much of the work, and he walks through exactly how it runs and how anyone can build a starter version.

Listen or watch on YouTube, Spotify, or Apple Podcasts

What you’ll learn:

  1. How to build a basic bug-finding harness by running Claude Code or Codex with one prompt and the -p flag, no SDK required

  2. Why pointing an agent at a whole codebase fails, and how an LLM judge can score and rank files before you spend any compute

  3. How a verifier subagent kills false positives by catching the agent when it cheats

  4. The goal-loop pattern: give an agent a tightly scoped problem, a clear pass/fail signal, and let it retry far past the point a human would quit

  5. Why teams that already invested in fuzzing, CI, and dev tooling are so far ahead

  6. How to weigh model versus harness, and why Brian splits the credit close to 50-50

  7. How a non-engineer can reuse the same score, verify, and fix loop for design quality, conversion rate, or tech debt

  8. Why AI-generated patches still can’t ship on their own, and where humans stay in the loop
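The first item above, a harness built from one prompt and the -p flag, might look roughly like this in Python. It assumes the `claude` CLI is installed and on the PATH; the prompt wording, the `build_command` and `run_once` helpers, and the example path are hypothetical, and the Codex equivalent is not shown. Nothing here is taken from Mozilla's harness.

```python
import subprocess
from typing import List

# Hypothetical prompt; a real one would describe the build and the crash signal.
PROMPT_TEMPLATE = (
    "Inspect {path} for a memory safety bug. "
    "Report only issues you can trigger with a concrete test case."
)


def build_command(path: str) -> List[str]:
    """Build a non-interactive Claude Code invocation for one file."""
    return ["claude", "-p", PROMPT_TEMPLATE.format(path=path)]


def run_once(path: str, timeout_s: int = 600) -> str:
    """Run the agent on one file and return whatever it printed."""
    done = subprocess.run(
        build_command(path),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,  # a non-zero exit is data for the caller, not a crash
    )
    return done.stdout


if __name__ == "__main__":
    # Printing the command keeps this example safe to run without the CLI.
    print(build_command("src/example.cpp"))
```

Calling `run_once` on each of the top-ranked files, one process per file, is about the smallest harness that still keeps each run tightly scoped.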


Brought to you by:

WorkOS—Make your app enterprise-ready today

Metaview—The agentic recruiting platform for winning teams

In this episode, we cover:

(00:00) Introduction to Brian Grinstead

(02:43) The viral chart: Firefox Security Bug Fixes by Month

(05:32) How the custom harness works

(10:22) Goal loops and guardrails

(14:45) How they built it

(16:55) Real bugs, including a 15-year-old one

(23:00) Open-sourcing it

(26:26) Why humans still review every fix

(32:30) Live demo and prioritizing files

(40:18) Mobilizing the team and recap

(42:33) Lightning round

Tools referenced:

• Claude Code: https://claude.ai/code

• Claude Agent SDK: https://code.claude.com/docs/en/agent-sdk/overview

• Codex: https://openai.com/index/openai-codex/

• OpenAI Agent SDK: https://developers.openai.com/api/docs/guides/agents

• VS Code: https://code.visualstudio.com/

• Docker: https://www.docker.com/

• Firefox: https://www.mozilla.org/firefox/

• Address Sanitizer: https://github.com/google/sanitizers

• RLBox: https://rlbox.dev/

Other references:

• Mozilla Bug Bounty Program: https://www.mozilla.org/security/bug-bounty/

• Mozilla GitHub: https://github.com/mozilla

Where to find Brian Grinstead:

LinkedIn: https://www.linkedin.com/in/bgrins/

GitHub: https://github.com/bgrins

Where to find Claire Vo:

ChatPRD: https://www.chatprd.ai/

Website: https://clairevo.com/

LinkedIn: https://www.linkedin.com/in/clairevo/

X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

What happens after coding is solved? | Fiona Fung (Manager of the Claude Code and Cowork Teams)

2026-06-21 20:31:43

Fiona Fung leads the teams behind Claude Code and Cowork at Anthropic (overseeing Boris Cherny and the entire engineering and PM team). Before Anthropic, she spent 11 years at Microsoft building Visual Studio and TypeScript and then moved to Meta, where she started Facebook Marketplace (now generating over $100 billion in GMV annually), worked on Meta’s first smart glasses and AR glasses, and led infrastructure, growth, integrity, and safety teams at Instagram. She’s been an engineer for over 25 years and has a unique perspective on how the role of building software is changing.

In our in-depth conversation, we discuss:

  1. What she’s learned about running a team that’s shipping 8x more code than before

  2. Which roles AI will transform next

  3. Specific ways her team uses AI

  4. How Claude “routines” have changed how she operates as a manager

  5. The context-switching problem no one has solved yet

  6. The biggest unsolved problem in AI

  7. What keeps her up at night


Brought to you by:

WorkOS—Make your app enterprise-ready, with SSO, SCIM, RBAC, and more

Mercury—Radically different banking, now with Command

Where to find Fiona Fung:

• LinkedIn: linkedin.com/in/fionafung

Referenced:

• Running an AI-native engineering org: https://www.youtube.com/watch?v=igO8iyca2_g

• Head of Claude Code: What happens after coding is solved | Boris Cherny: https://www.lennysnewsletter.com/p/head-of-claude-code-what-happens

• Today, Anthropic engineers on average ship 8x as much code per quarter as they did compared to 2021-2025: https://x.com/AnthropicAI/status/2062568864240836995

• Visual Studio: https://visualstudio.microsoft.com

• Joseph Campbell’s quote: https://www.goodreads.com/quotes/192665-the-cave-you-fear-to-enter-holds-the-treasure-you

• Life-changing Cowork use case: https://x.com/lennysan/status/2059664455001334124

• Introducing Claude for Small Business: https://www.anthropic.com/news/claude-for-small-business

• Conversations with Tyler podcast: https://conversationswithtyler.com

• Sheryl Sandberg on Facebook: https://www.facebook.com/sheryl#

• Amélie on Prime Video: https://www.amazon.com/Amelie-Jean-Pierre-Jeunet/dp/B0DQ4S3N45

• Spirited Away on HBO Max: https://www.hbomax.com/movies/spirited-away/3deab668-d0a4-4a8d-9bc8-0952a0ad836e

• Nausicaä of the Valley of the Wind on HBO Max: https://www.hbomax.com/movies/nausicaa-of-the-valley-of-the-wind/ed66031b-6353-4019-ba54-35488468a4db

• Sweet Sisters Bodycare: https://sweetsistersbodycare.com

• Anthropic events: https://www.anthropic.com/events

• Clare Pooley’s quote: https://www.goodreads.com/quotes/11305360-in-a-world-where-you-can-be-anything-be-kind

Recommended books:

• Margaret Atwood’s books: https://www.amazon.com/stores/author/B000AQTHI0?ccs_id=0027a474-cd59-4a3a-bcd7-9b173c27d530

• Haruki Murakami’s books: https://www.amazon.com/stores/Haruki-Murakami/author/B000AP7AFI

• The Little Prince: https://www.amazon.com/Little-Prince-Antoine-Saint-Exup%C3%A9ry/dp/0156012197

• Nausicaä of the Valley of the Wind: https://www.amazon.com/Nausica%C3%A4-Valley-Wind-Box-Set/dp/1421550644

• High Output Management: https://www.amazon.com/High-Output-Management-Andrew-Grove/dp/0679762884


Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

Lenny may be an investor in the companies discussed.


🧠 Community Wisdom: Fractional CPO compensation, free e-signature tools, why some users pay but never use your product, sharing Claude Code context across a team, and more

2026-06-20 23:53:05

👋 Hello and welcome to this week’s edition of ✨ Community Wisdom ✨ a subscriber-only email, delivered every Saturday, highlighting the most helpful conversations in our members-only Slack community.

How to design AI agent loops: schedules, goals, and subagents in Claude Code and Codex

2026-06-17 20:04:04

I break down every loop type from scratch—what a heartbeat, cron, hook, and goal loop actually are, when each one fits, and the five things any effective loop needs before it touches production. Then I build two live loops: a daily aging-PR reviewer in Claude Code that schedules itself at 10:15 a.m. and spins off its own subagents, and a weekly skills-identification loop in Codex that spawns goal-based subagents to validate its own output in real time.

Listen or watch on YouTube, Spotify, or Apple Podcasts

What you’ll learn:

  1. The plain-English definition of a loop—and why it’s just an automated prompt, not a scary new paradigm

  2. The four loop types (heartbeat, cron, hook, and goal) and when each one actually fits your workflow

  3. How to think about loop design using the “onboarding an employee” mental model

  4. The five things every effective loop needs: work trees, skills, plugins/connectors, subagents, and state tracking

  5. How to build a scheduled PR-review routine in Claude Code that babysits aging PRs and alerts your team

  6. How to set up a weekly skills-identification automation in Codex that spawns its own validating subagents

  7. Why goal-based loops are the hardest to write well—and where most people burn tokens for nothing

  8. The two warning signs that your loop is going to get expensive before it gets useful
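The “onboarding an employee” model and the cron loop type from the list above can be sketched together in Python. This is only an illustration: `LoopSpec`, its field names, and `next_weekly_run` are hypothetical, and the report and escalation values are made up. The schedule uses the Friday 10 a.m. example from the episode; in standard cron syntax that would be `0 10 * * 5`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LoopSpec:
    cadence: str      # how often the job runs, in plain English
    checks: str       # what the agent looks at
    output: str       # what it should hand back
    escalate_to: str  # who to contact when something is wrong

    def to_prompt(self) -> str:
        """Render the job description as a loop prompt."""
        return (f"{self.cadence}, {self.checks}. "
                f"Report: {self.output}. "
                f"If something looks wrong, contact {self.escalate_to}.")


def next_weekly_run(now: datetime, weekday: int, hour: int,
                    minute: int = 0) -> datetime:
    """Next occurrence of weekday (Monday=0) at hour:minute, strictly after now."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


spec = LoopSpec(
    cadence="Every Friday at 10 a.m.",
    checks="review all merged PRs and identify skills our agents are missing",
    output="a short list of missing skills",   # hypothetical
    escalate_to="the on-call engineer",        # hypothetical
)
print(spec.to_prompt())
print(next_weekly_run(datetime(2026, 6, 17, 20, 4), weekday=4, hour=10))
```

Writing the spec first and the schedule second mirrors the advice in the episode: the job description is the hard part, and the timer is a detail.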


Brought to you by:

WorkOS—Make your app enterprise-ready today

Runway—The creative AI platform for images, video, and more

In this episode, we cover:

(00:00) Prompts are out and loops are in

(02:30) Defining a loop

(03:03) The four ways to automate a prompt: heartbeat, cron, hooks, and goals

(06:03) Five things every effective loop needs

(09:26) The “onboarding an employee” framework for designing loops

(11:58) Live build #1: Daily aging PR loop in Claude Code

(17:08) Subagents inside loops

(19:00) Live build #2: Weekly skills identification loop in Codex

(22:57) Watching subagents spin up in real time

(25:28) Warning signals around loops

(27:31) What listeners are doing with loops

Tools referenced:

• Claude Code: https://claude.ai/code

• Codex: https://chatgpt.com/codex

• OpenClaw: https://openclaw.ai/

Other references:

• Claire’s article “Why OpenClaw Feels Alive Even Though It’s Not”: https://x.com/clairevo/article/2017741569521271175

• Addy Osmani’s article on loop engineering: https://addyosmani.com/blog/loop-engineering/

• Using Goals in Codex: https://developers.openai.com/cookbook/examples/codex/using_goals_in_codex

Where to find Claire Vo:

ChatPRD: https://www.chatprd.ai/

Website: https://clairevo.com/

LinkedIn: https://www.linkedin.com/in/clairevo/

X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

🎙️ How I AI: Claude Fable 5 review & How Braintrust uses AI agents, evals, and CI to ship better software

2026-06-15 23:01:32

Claude Fable 5 review: what the new Mythos model gets right (and very wrong)

Listen now on YouTube, Spotify, and Apple Podcasts

Claire puts Claude Fable 5, Anthropic’s first generally available Mythos-class model, through a series of real-world tests: product specs, agent workflows, design tasks, vision tasks, and multi-agent orchestration. She breaks down what Anthropic is claiming, where the model genuinely feels like a leap forward, and where it surprisingly falls short.

Biggest takeaways:

  1. Fable 5 is Anthropic’s first “Mythos-class” model to reach general availability, and it’s crushing benchmarks across the board. It hit 80% on SWE-Bench Pro, significantly outperforming Opus 4.8, GPT-4.5, and Gemini 3.1 Pro. Claire found the model excels in specific areas while falling short in others that matter for everyday product work.

  2. The model is expensive by design: $10 per million input tokens and $50 per million output tokens. That’s a new tier above Opus, and it consumes tokens at roughly twice the rate of other models. You need to be strategic about when to deploy this level of intelligence versus using cheaper models like Sonnet or Opus for simpler tasks.

  3. Fable 5 works like a “seasoned engineer”—which is both its superpower and its Achilles’ heel. It’s thorough, autonomous, and will investigate every corner of a problem to be 120% sure it’s shipping the right thing. Sometimes you need a model that’s a little less thorough, a little “dumber,” to actually ship something useful quickly.

  4. The model is exceptionally good at vision tasks, particularly document formatting and PDF parsing. Claire tested it on creating handwriting worksheets for her 7-year-old and found it dramatically outperformed Opus 4.8—better spacing, clearer layout, appropriate white space. This extends to other vision tasks where you want something to look good or need to parse complex documents.

  5. The writing is nearly unreadable for specs and PRDs. Claire found that Fable 5 produces extremely detailed, technically complete documents that are almost impossible to parse. It gets wrapped around the axle on details, creates big blocks of dense paragraphs with internal references, and makes it hard to see the forest for the trees.

  6. Design output is shockingly bad, at least for one-shot design tasks. When Claire asked Fable to design a skills registry, it produced fundamentally terrible design: gray, black, red, simple outlines. This was a real surprise given the model’s benchmark performance.

  7. The model is conservative on execution and takes “minimal” very literally. When Claire asked it to ship an MVP that would deliver customer value, Fable produced something extremely narrow and not actually that useful. This conservatism may stem from the safety guardrails built into the model.

  8. Fable 5 includes specific safeguards for cybersecurity, biology, chemistry, and distillation tasks. Instead of blocking you entirely, it uses a new “fallback” concept—if you get classified into one of these categories, it gracefully falls back to Opus 4.8. Anthropic reports that 95% of sessions don’t hit a fallback, and they maintain a 30-day retention policy solely to catch misuse.

  9. Multi-agent orchestration is technically possible but not yet reliable. Claire tested the dynamic workflows and subagent capabilities extensively and had some successful multi-agent runs, but also encountered frequent stalls and errors. She walked away from her laptop and came back to find subagents had stalled after about three hours.

  10. The key insight: match model intelligence to task complexity. Claire recommends using it for hard technical problems where extreme detail matters, long-horizon work, and vision tasks. But for front-end work, strategy, specs, and design, other models in the ecosystem will serve you better and cost less.

  11. This is “baby Mythos,” not the full Mythos model. Fable 5 has guardrails that the unrestricted Mythos model (available only to Project Glasswing partners) doesn’t have. The underlying model is the same, but Fable is tuned for safety and general availability.
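The pricing quoted in the takeaways above ($10 per million input tokens, $50 per million output tokens) is easy to turn into a per-session estimate. The sketch below is plain arithmetic in Python; the function name and the token counts in the example are hypothetical, and it does not model the roughly doubled token consumption mentioned in the episode.

```python
INPUT_USD_PER_MILLION = 10   # from the episode
OUTPUT_USD_PER_MILLION = 50  # from the episode


def session_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one session at the quoted per-million-token prices."""
    total_micro_usd = (input_tokens * INPUT_USD_PER_MILLION
                       + output_tokens * OUTPUT_USD_PER_MILLION)
    return total_micro_usd / 1_000_000


# Hypothetical session: 200,000 input tokens and 40,000 output tokens.
print(session_cost_usd(200_000, 40_000))  # 2.00 for input + 2.00 for output
```

Output tokens cost five times as much as input tokens at these rates, so a verbose model can spend as much on a fifth of the volume.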

Blog from this episode:

How I AI: My Honest Review of Claude Fable 5: https://www.chatprd.ai/how-i-ai/claude-fable-5-review

How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

Listen now on YouTube, Spotify, and Apple Podcasts

Brought to you by:

  • Guru—The AI layer of truth

  • Persona—Trusted identity verification for any use case

Claire sits down with Ankur Goyal, the founder and CEO of Braintrust, to unpack how top engineering teams are using AI agents, evals, and CI to ship better software faster. They get into why agents are now capable of tackling hard infrastructure problems, how to decide what work sits “below the agent line,” and why evals are quickly becoming the modern version of a PRD. Ankur’s core message: the best teams won’t just use AI to write more code; they’ll build the feedback loops, benchmarks, and systems that let AI improve the quality of the product itself.

Biggest takeaways:

  1. There’s no staff engineer running as many rigorous benchmarks as someone using an agent. Ankur viscerally disagrees with engineers who say AI can’t handle complicated problems. While models might not be perfect at writing highly concurrent code, they excel at running exhaustive experiments—testing every column store format, every execution engine, every optimization strategy. The baseline of rigor you get from agents is incredible, and there’s simply no excuse anymore to skip benchmarks because they’re tedious.

  2. The agent line keeps going up—and you need to identify what’s below it. Many interactions, decisions, and directions that feel like they need human judgment actually fit “below the agent line.” If you took the information from a meeting and gave it to an agent, would it solve the same problem? Increasingly, the answer is yes. The best teams push this line higher by building smart skills and integrations that expand what agents can handle autonomously.

  3. Practical quality beats theoretical quality every time. In theory, a human engineer with infinite time and focus might produce better code than an AI agent. In practice, humans lose context over days, have decaying attention spans on hard-but-tedious problems, and skip benchmarks they know they should run. AI agents maintain consistent focus, run every test, and can work on problems continuously for days or weeks. The practical quality of AI-assisted engineering is higher because of sustained rigor, not because the code is theoretically better.

  4. You can now bite off much harder technical problems than before. Companies historically avoid major infrastructure changes because the cost of testing alternatives is prohibitively high and the unknown unknowns are risky. With AI agents, you can exhaustively test six different database solutions, run thousands of benchmarks on production-scale data, and make informed decisions about platform shifts that would have been impossible before. The business case for deep technical work becomes much easier when agents do the heavy lifting.

  5. Run four to six foreground agents simultaneously—that’s the human concurrency limit. Ankur runs different agents working on different problems. This matches the personal concurrency limit most people can manage; you can’t effectively context switch between more than that. Some agents run locally, and others run remotely on cloud infrastructure with production-scale data. The key is isolation: each agent has its own environment, ports, and services.

  6. Evals are the modern PRD—they define what success looks like, not how to achieve it. Machine learning shifts programming from defining implementation details to defining success criteria. Just like the best PRDs include user stories and examples, the best evals include concrete test cases and scoring functions. The difference is that evals quantify success in ways that can be automatically measured and improved. This lets you focus on outcomes while AI figures out the implementation.

  7. Build a feedback loop that automatically turns real-world data into evals. For AI product teams, the #1 engineering priority isn’t prompt engineering or picking an agent framework—it’s building a pipeline that summons real-world data and converts it into evals. This is the same principle as investing in CI for traditional software: you’re building the platform that lets agents do the work engineers used to do manually. Without this feedback loop, you’re stuck in whack-a-mole mode, fixing individual cases without systematic improvement.

  8. Quantify your designer’s taste so it scales across your product. Ankur runs hundreds of evals to improve things quantitatively, then asks David (their tastemaker designer) for a vibe check every few days. When David destroys his work, Ankur captures the feedback (“David thinks it’s OK to show both languages as long as . . .”) and improves the scoring functions to encode David’s palate. This doesn’t replace David; it amplifies him. They’re able to apply David’s quality bar to more things than he could ever review manually.

  9. Product building is now carving, not constructing. It’s extremely fast to create something with too many features, too many buttons, and too much code. The hard part is removing stuff. When customers complain, Braintrust removes the thing causing confusion 90% of the time, making the system work better by eliminating complexity. This is the opposite of traditional product development, where you carefully add features one by one.

  10. Invest in CI to earn the ability to move faster—it’s the platform for AI-powered engineering. Every engineer is now building a platform upon which agents do the work engineers used to do manually. For traditional software, that platform is CI. If you feel constrained by velocity, don’t ship crappy stuff faster. Instead, pause and improve CI so you earn the ability to move faster safely. The same principle applies to AI products: build the eval pipeline first, then let agents optimize within that system.

  11. When agents fail, close the session and improve the evals—don’t yell or bribe. Ankur’s back-pocket strategy is remarkably disciplined: he doesn’t try to prompt his way out of problems. He closes the session, improves the evaluation criteria or success metrics, and starts fresh. Sometimes this means hand-writing code to better understand the problem (like when he spent a weekend hand-writing a 3,000-line eval that had become trash through vibe coding). The solution is always better evals, not better prompting.
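The description of evals in the takeaways above, concrete test cases plus scoring functions, can be shown with a very small Python sketch. This is not Braintrust's API; `Case`, `exact_match`, `run_eval`, and the toy task are hypothetical stand-ins, and the scorer is the simplest one imaginable. A real scorer would encode richer judgments, such as a designer's feedback.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the answers match, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(task: Callable[[str], str],
             cases: Sequence[Case],
             scorer: Callable[[str, str], float] = exact_match) -> float:
    """Return the mean score of `task` over all cases."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    return sum(scorer(task(c.prompt), c.expected) for c in cases) / len(cases)


# Toy stand-in for a model call; one answer is deliberately wrong.
ANSWERS = {
    "capital of France": "Paris",
    "capital of Japan": "Tokyo",
    "capital of Peru": "Quito",
}


def toy_task(prompt: str) -> str:
    return ANSWERS.get(prompt, "")


cases = [
    Case("capital of France", "paris"),
    Case("capital of Japan", "Tokyo"),
    Case("capital of Peru", "Lima"),
]
score = run_eval(toy_task, cases)
print(round(score, 3))
```

When an agent keeps failing, the advice in the episode maps onto this shape directly: add cases or tighten the scorer, then rerun, rather than rewording the prompt.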

Blog and detailed workflow walkthroughs from this episode:

Blog: Ankur Goyal’s Playbook for Agent-Driven Benchmarking and AI Evals https://www.chatprd.ai/how-i-ai/ankur-goyals-playbook-for-agent-driven-benchmarking-and-ai-evals

Workflows:

↳ How to Scale Expert Judgment in AI Systems with a Human Feedback Loop: https://www.chatprd.ai/how-i-ai/workflows/how-to-scale-expert-judgment-in-ai-systems-with-a-human-feedback-loop

↳ How to Use AI Coding Agents for Exhaustive Infrastructure Benchmarking: https://www.chatprd.ai/how-i-ai/workflows/how-to-use-ai-coding-agents-for-exhaustive-infrastructure-benchmarking


If you’re enjoying these episodes, reply and let me know what you’d love to learn more about: AI workflows, hiring, growth, product strategy—anything.

Catch you next week,
Lenny

P.S. Want every new episode delivered the moment it drops? Hit “Follow” on your favorite podcast app.