Blog of Lenny Rachitsky

The #1 business newsletter on Substack.

The ultimate guide to AEO: How to get ChatGPT to recommend your product | Ethan Smith (Graphite)

2025-09-14 19:03:23

Jump to the best parts:

  • (08:13) → The biggest shift in search since Google’s Panda algorithm updates

  • (11:51) → How early-stage startups can start appearing in AI answers almost instantly

  • (14:34) → Why traffic from ChatGPT converts dramatically better than from Google search


Brought to you by:

Orkes—The enterprise platform for reliable applications and agentic workflows

Vanta—Automate compliance. Simplify security.

Great Question—Empower everyone to run great research


Ethan Smith is the CEO of Graphite—the leading SEO growth agency—and my go-to expert on SEO. After 18 years of mastering traditional SEO, Ethan has been at the forefront of what is called AEO: answer engine optimization, or, more simply, getting your product to show up in ChatGPT/Claude/Gemini/Perplexity answers. He’s discovered that ChatGPT traffic converts six times better than Google search—and most companies are completely missing this opportunity.

In our conversation, we discuss:

  1. His 7-step playbook to rank #1 in ChatGPT

  2. Why ChatGPT traffic converts 6x better than Google

  3. How early-stage startups can win at AEO immediately (unlike with SEO, which takes years)

  4. The three tactics that actually work: landing pages, YouTube videos, and Reddit comments

  5. Why help-center content can suddenly be your highest-ROI investment

  6. The specific Reddit strategy that works (spoiler: be authentic)

  7. Why AI-generated content doesn’t work

Where to find Ethan Smith:

• Twitter: https://twitter.com/ethan_l_s

• LinkedIn: https://bit.ly/ethans-linkedin

• Graphite: https://graphite.io/

• Graphite Research Papers: https://bit.ly/graphite-five-percent

In this episode, we cover:

(00:00) Welcome back, Ethan

(04:34) The changing landscape of SEO

(06:19) AEO (answer engine optimization) vs. GEO (generative engine optimization)

(08:13) The impact of AEO

(11:51) How early-stage startups can win at AEO

(14:34) The quality of AEO leads

(15:35) On-site vs. off-site traffic

(16:32) Reddit’s role in AEO and avoiding spam

(20:11) How AI models use citations (RAG)

(21:41) Key principles for winning at AEO

(25:00) Avoiding hyper-SEOed content, and the importance of originality

(28:55) Actionable AEO playbook: steps and experiments

(33:35) Tracking, measuring, and share of voice

(38:34) Adapting AEO for B2B, commerce, and early-stage companies

(41:11) Is letting AI index your content good?

(43:06) Experimentation, control groups, and measuring results

(46:15) The future of AEO, SEO, and search channels

(51:35) AI-generated content: what works and what doesn’t

(55:25) The dangers of infinite AI derivatives

(58:44) The future: convergence of LLMs and search

(01:00:40) Help-center optimization and the long tail

(01:03:18) Lightning round and final thoughts

Referenced:

• AEO Tools: https://bit.ly/graphite-aeo-tool

• The ultimate guide to SEO | Ethan Smith (Graphite): https://www.lennysnewsletter.com/p/the-ultimate-guide-to-seo-ethan-smith

• Inside ChatGPT: The fastest-growing product in history | Nick Turley (Head of ChatGPT at OpenAI): https://www.lennysnewsletter.com/p/inside-chatgpt-nick-turley

• Webflow: https://webflow.com/

• YouTube: https://youtube.com/

• Reddit: https://www.reddit.com/

• Quora: https://www.quora.com/

• An inside look at Deel’s unprecedented growth | Meltem Kuran Berkowitz (Head of Growth): https://www.lennysnewsletter.com/p/an-inside-look-at-deels-unprecedented

• Dotdash Meredith: https://www.people.inc/

• Forbes: https://www.forbes.com/

• Vimeo: https://vimeo.com/

• Perplexity: https://www.perplexity.ai/

• Gemini: https://gemini.google.com/

• ChatGPT: https://chatgpt.com/

• Claude: https://claude.ai/

• TechRadar: https://www.techradar.com/

• Yelp: https://www.yelp.com

• Tripadvisor: https://www.tripadvisor.com/

• TikTok: https://www.tiktok.com/

• Why ChatGPT will be the next big growth channel (and how to capitalize on it) | Brian Balfour (Reforge): https://www.lennysnewsletter.com/p/why-chatgpt-will-be-the-next-big-growth-channel-brian-balfour

• AI SEO: Latest Market Intelligence & Landscape Analysis—Summer Sessions: https://www.youtube.com/watch?v=cm589PmhIOY

• AI Content Study: https://bit.ly/ai-content-white-paper

• Free AI detector: https://surferseo.com/ai-content-detector/

• Common Crawl: https://commoncrawl.org/

• AI models collapse when trained on recursively generated data: https://www.nature.com/articles/s41586-024-07566-y

• Looker: https://cloud.google.com/looker/

• Otter: https://otter.ai/

• Zapier: https://zapier.com/

• Zendesk: https://www.zendesk.com/

• Intercom: https://www.intercom.com/

• The Last Dance on Netflix: https://www.netflix.com/title/80203144

• 30 for 30: Lance on Netflix: https://www.netflix.com/title/81753677

• Free Solo on National Geographic: https://films.nationalgeographic.com/free-solo

• Sony mirrorless cameras: https://electronics.sony.com/compact-mirrorless-cameras

• What Is Butter Lettuce? Plus, a Simple Butter Lettuce Salad Recipe: https://www.masterclass.com/articles/what-is-butter-lettuce-plus-a-simple-butter-lettuce-salad-recipe

• Bryan Johnson on X: https://x.com/bryan_johnson

Recommended books:

Emotional Intelligence: Why It Can Matter More Than IQ: https://www.amazon.com/Emotional-Intelligence-Matter-More-Than/dp/055338371X

Influence: The Psychology of Persuasion: https://www.amazon.com/Influence-Psychology-Persuasion-Robert-Cialdini/dp/006124189X

Thinking, Fast and Slow: https://www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374533555

How to Measure Anything: Finding the Value of Intangibles in Business: https://www.amazon.com/How-Measure-Anything-Intangibles-Business/dp/1118539273

Outliers: The Story of Success: https://www.amazon.com/Outliers-Story-Success-Malcolm-Gladwell/dp/0316017930


Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

Lenny may be an investor in the companies discussed.


My biggest takeaways from this conversation:

Read more

🧠 Community Wisdom: When leadership asks you to build an AI feature, giving feedback to people outside your team, transitioning from PM to BizOps, building an email verification flow, and more

2025-09-13 23:55:50

👋 Hello and welcome to this week’s edition of ✨ Community Wisdom ✨, a subscriber-only email, delivered every Saturday, highlighting the most helpful conversations in our members-only Slack community.

Read more

$46B of hard truths from Ben Horowitz: Why founders fail and why you need to run toward fear (a16z co-founder)

2025-09-11 19:03:17

Jump to the best parts:

  • (04:09) → Success comes from many small decisions, not one big moment: Progress comes from stacking difficult but correct choices, even when each feels insignificant on its own.

  • (10:15) → Great leaders run toward fear, not away: The muscle CEOs need most is acting decisively in difficult situations, rather than hesitating in the face of equally bad options.

  • (24:54) → CEOs don’t make people great—they find people who make them great: Surround yourself with world-class talent that pushes the company forward; don’t try to “fix” low performers.


Brought to you by:

DX—The developer intelligence platform designed by leading researchers

Basecamp—The famously straightforward project management system from 37signals

Miro—A collaborative visual platform where your best work comes to life


Ben Horowitz is the co-founder of Andreessen Horowitz, Silicon Valley’s largest and most influential venture capital firm, with over $46B in committed capital across multiple funds. He took Loudcloud public with just $2 million in revenue (dubbed “the IPO from hell”), sold it for $1.6 billion, and has backed companies from Facebook to Stripe to Airbnb to OpenAI to Databricks (now worth more than $100 billion). His management philosophy—forged through near-death experiences and refined through coaching hundreds of CEOs—contradicts most conventional startup wisdom.

In our conversation, Ben shares:

  1. Why “founder mode” is half right and half dangerously wrong

  2. The story behind “Good Product Manager/Bad Product Manager” and why it went viral despite being written in anger

  3. Where the biggest AI startup opportunities remain

  4. Why you need to run toward fear, never away

  5. The one trait that predicts that a founder will fail as CEO

  6. Inside Paid in Full, Ben’s nonprofit awarding pensions to pioneering hip-hop artists

Where to find Ben Horowitz:

• X: https://x.com/bhorowitz

• LinkedIn: https://www.linkedin.com/in/behorowitz/

• Andreessen Horowitz’s website: https://a16z.com/

In this episode, we cover:

(00:00) Introduction to Ben Horowitz

(04:09) Important leadership lessons from Shaka Senghor

(10:15) Running toward fear and why hesitation kills companies

(19:35) Who shouldn’t start a company

(22:36) The Databricks story: thinking bigger

(24:54) Managerial leverage and CEO psychology

(28:06) When founders should be replaced as CEOs

(31:20) Normalizing failure for CEOs

(37:57) Counterintuitive lessons about building companies

(42:31) “Good Product Manager/Bad Product Manager”

(48:21) Product managers as leaders

(51:16) Why a16z invested in Adam Neumann after WeWork

(56:23) Is AI in a bubble?

(01:02:43) The biggest opportunities in AI

(01:12:51) Why U.S. leadership in AI matters

(01:18:53) The Paid in Full Foundation for hip-hop pioneers

(01:23:18) Lightning round: book recommendations, products, and life mottos

Referenced:

• Shaka Senghor on The Joe Rogan Experience: https://open.spotify.com/episode/79neOSawKbrxY6Tl2wV1Kx

• 1999 Martha’s Vineyard plane crash: https://en.wikipedia.org/wiki/1999_Martha%27s_Vineyard_plane_crash

• John Reed: https://en.wikipedia.org/wiki/John_S._Reed#

• Loudcloud: https://en.wikipedia.org/wiki/Loudcloud

• Marc Andreessen on X: https://x.com/pmarca

• Ali Ghodsi on LinkedIn: https://www.linkedin.com/in/alighodsi/

• Databricks: https://www.databricks.com/

• Ion Stoica on LinkedIn: https://www.linkedin.com/in/ionstoica/

• Hadoop: https://hadoop.apache.org/

• The Sad Truth About Developing Executives: https://a16z.com/the-sad-truth-about-developing-executives/

• Mark Zuckerberg on Facebook: https://www.facebook.com/zuck/

• Sam Altman on X: https://x.com/sama

• Brian Chesky’s new playbook: https://www.lennysnewsletter.com/p/brian-cheskys-contrarian-approach

• Brian Chesky—Founder Mode & The Art of Hiring: https://www.youtube.com/watch?v=aFOGlNL39xs

• Bob Iger: https://en.wikipedia.org/wiki/Bob_Iger

• Larry Page: https://en.wikipedia.org/wiki/Larry_Page

• Kanye West: https://en.wikipedia.org/wiki/Kanye_West

• Diddy: https://en.wikipedia.org/wiki/Sean_Combs

• Arsalan Tavakoli on LinkedIn: https://www.linkedin.com/in/arsalantavakoli/

• Good Product Manager/Bad Product Manager: https://a16z.com/good-product-manager-bad-product-manager/

• Netscape: https://en.wikipedia.org/wiki/Netscape

• Jensen Huang on LinkedIn: https://www.linkedin.com/in/jenhsunhuang/

• David Weiden on LinkedIn: https://www.linkedin.com/in/davidweiden/

• Raghu Raghuram on LinkedIn: https://www.linkedin.com/in/raghuraghuram/

• Adam Neumann: https://en.wikipedia.org/wiki/Adam_Neumann

• WeWork: https://www.wework.com/

• Cluely: https://cluely.com/

• The Next Bubble—Don’t Get Fooled Again: https://steveblank.com/2011/06/15/the-next-bubble-dont-get-fooled-again/

• Evite: https://www.evite.com/

• Paul Krugman: https://en.wikipedia.org/wiki/Paul_Krugman

• Stargate: https://openai.com/index/announcing-the-stargate-project/

• Elon Musk on X: https://x.com/elonmusk

• Salesforce: https://www.salesforce.com/v3/

• Behind the founder: Marc Benioff: https://www.lennysnewsletter.com/p/behind-the-founder-marc-benioff

• Cursor: https://cursor.com/

• The rise of Cursor: The $300M ARR AI tool that engineers can’t stop using | Michael Truell (co-founder and CEO): https://www.lennysnewsletter.com/p/the-rise-of-cursor-michael-truell

• Anthropic: https://www.anthropic.com/

• Sebastian Thrun: https://en.wikipedia.org/wiki/Sebastian_Thrun

• Waymo: https://waymo.com/

• Product lessons from Waymo | Shweta Shrivastava (Waymo, Amazon, Cisco): https://www.lennysnewsletter.com/p/product-lessons-from-waymo-shweta

• Mao Zedong: https://en.wikipedia.org/wiki/Mao_Zedong

• Pol Pot: https://en.wikipedia.org/wiki/Pol_Pot

• Nicolae Ceaușescu: https://en.wikipedia.org/wiki/Nicolae_Ceau%C8%99escu

• Joseph Stalin: https://en.wikipedia.org/wiki/Joseph_Stalin

• The Declaration of Independence: https://www.archives.gov/founding-docs/declaration-transcript

• The Constitution of the United States: https://www.archives.gov/founding-docs/constitution-transcript

• The Paid in Full Foundation: https://paidinfullfoundation.org/

• Eric B. & Rakim—“Paid in Full”: https://www.youtube.com/watch?v=E7t8eoA_1jQ

• Rakim Allah on X: https://x.com/officialrakim

• Scarface: https://en.wikipedia.org/wiki/Scarface_(rapper)

• Roxanne Shante on Instagram: https://www.instagram.com/imroxanneshante/

• Grandmaster Caz: https://en.wikipedia.org/wiki/Grandmaster_Caz

• Kool Moe Dee on Instagram: https://www.instagram.com/koolmoedeenyc/

• George Clinton’s website: https://georgeclinton.com/

• Kool G Rap’s website: https://koolgrap.com/

• Grand Puba on X: https://x.com/iamgrandpuba

• Jalil Hutchins on Instagram: https://www.instagram.com/jalil.whodini/

• The Legend of the Blind MC: https://a16z.com/the-legend-of-the-blind-mc/

• Erik Torenberg on LinkedIn: https://www.linkedin.com/in/eriktorenberg/

• Slow Horses on AppleTV+: https://tv.apple.com/us/show/slow-horses/umc.cmc.2szz3fdt71tl1ulnbp8utgq5o

• Sinners: https://www.imdb.com/title/tt31193180/

• Technivorm Moccamaster: https://www.amazon.com/Technivorm-Moccamaster-Coffee-Brewer-Polished/dp/B002S4DI2S

• Follow the Leader: https://en.wikipedia.org/wiki/Follow_the_Leader_(Eric_B._%26_Rakim_album)

• Stillmatic: https://en.wikipedia.org/wiki/Stillmatic

• One Nation Under a Groove: https://en.wikipedia.org/wiki/One_Nation_Under_a_Groove

Recommended books:

The Hard Thing About Hard Things: Building a Business when There Are No Easy Answers: https://www.amazon.com/Hard-Thing-About-Things-Building-ebook/dp/B00DQ845EA

How to Be Free: A Proven Guide to Escaping Life’s Hidden Prisons: https://www.amazon.com/How-Be-Free-Escaping-Prisons/dp/B0DXD5KKBF/

The WEIRDest People in the World: How the West Became Psychologically Peculiar and Particularly Prosperous: https://www.amazon.com/WEIRDest-People-World-Psychologically-Particularly/dp/0374173222

Writing My Wrongs: Life, Death, and Redemption in an American Prison: https://www.amazon.com/Writing-My-Wrongs-Redemption-American/dp/1101907312


Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

Lenny may be an investor in the companies discussed.


My biggest takeaways from this conversation:

Read more

Building eval systems that improve your AI product

2025-09-09 22:01:43

If you’re a premium subscriber

Add the private feed to your podcast app at add.lennysreads.com

In this episode, we dive into the fast-emerging discipline of AI evaluation with Hamel Husain and Shreya Shankar, creators of AI Evals for Engineers & PMs, the #1 highest-grossing course on Maven.

After training 2,000+ PMs and engineers across 500+ companies, Hamel and Shreya reveal the complete playbook for building evaluations that actually improve your AI product: moving beyond vanity dashboards to a system that drives continuous improvement.

Subscribe now

Listen now: YouTube | Apple | Spotify

In this episode, you’ll learn:

  • Why most AI eval dashboards fail to deliver real product improvements

  • How to use error analysis to uncover your product’s most critical failure modes

  • The role of a “principal domain expert” in setting a consistent quality bar

  • Techniques for transforming messy error notes into a clean taxonomy of failures

  • When to use code-based checks vs. LLM-as-a-judge evaluators

  • How to build trust in your eva…

Read more

Building eval systems that improve your AI product

2025-09-09 21:03:34

👋 Each week, I tackle reader questions about building product, driving growth, and accelerating your career. Annual subscribers get a free year of 15+ premium products: Lovable, Replit, Bolt, n8n, Wispr Flow, Descript, Linear, Gamma, Superhuman, Granola, Warp, Perplexity, Raycast, Magic Patterns, Mobbin, and ChatPRD (while supplies last).

For more: Lennybot | Lenny’s Podcast | How I AI | Lenny’s Reads | Courses

Subscribe now


Hamel Husain and Shreya Shankar’s online course, AI Evals for Engineers & PMs, is the #1 highest-grossing course on Maven, and consistently brings in sizable student groups from all of the major AI labs. This is because they teach something crucial: how to build evaluations that actually improve your product, not just generate vanity dashboards.

Over the past two years, Hamel and Shreya have played a major role in shifting evals from being an obscure, confusing subject to one of the most necessary skills for AI product builders.

After training more than 2,000 PMs and engineers, and leaders at over 500 companies, they’re now sharing their complete playbook—the same methodology taught at OpenAI, Anthropic, and other leading labs. You’ll learn how to leverage error analysis to understand where your AI product breaks, build robust evals you can trust, and create a continuous improvement flywheel that catches regressions before they ship.

In honor of this post, Hamel and Shreya are also offering an exclusive discount on their course: 35% off. This is the largest discount they’ve ever offered. Use this link to register (or use code LENNYSLIST when checking out). You’ve got three days left to enroll.

Thank you for sharing this gold with us, Hamel and Shreya 🙏

P.S. You can listen to this post in convenient podcast form: Spotify / Apple / YouTube.


Aman Khan’s post on evals perfectly captured why evaluation is becoming a core, make-or-break skill for product managers. This article provides the next step: a playbook for building an evaluation system to drive real product improvements. Many teams build eval dashboards that look useful but are ultimately ignored and don’t lead to better products, because the metrics these evals report are disconnected from real user problems.

This guide provides a process to bridge that trust gap. We will cover three phases: discovering what to measure through rigorous error analysis, building a reliable evaluation suite, and operationalizing that suite to create a flywheel of continuous improvement.

Phase 1: Ground your evals in reality, with error analysis

Before you can improve your AI product, you must understand how it fails. The surface area for what you could evaluate is infinite. The most common mistake is to start by measuring ready-made, fashionable metrics like “hallucination” or “toxicity.” This approach often leads to tracking scores that don’t correlate with the actual problems your users face with your product. You cannot know what to measure until you systematically find out how your product fails in specific contexts. The process that tells you where to focus is referred to as “error analysis” and should result in a clean and prioritized list of your product’s most common failure modes.

The process begins not with metrics but with data and a single human expert. For most small to medium-size companies, the most effective approach is to designate a single principal domain expert as the arbiter of quality. This person—a psychologist for a mental health chatbot, a lawyer for legal document analysis—becomes the definitive voice on quality. Appointing a single expert, sometimes called a “benevolent dictator,” provides a consistent and deeply informed signal, which eliminates annotation conflicts and prevents the paralysis that can come from having too many cooks in the kitchen. In many situations, the product manager is the principal domain expert. Larger organizations, or products that span multiple complex domains with different cultural contexts, may require multiple annotators. In those cases, you must implement a more structured process to ensure that judgments are consistent, which involves measuring their agreement.

Your next step is to arm this expert with a representative set of around 100 user interactions. As you get more sophisticated, you can use data analysis to sample interactions that are more likely to yield insights: for example, traces with negative user feedback, outliers in conversation length or number of tool calls, and traces with high latency. In the beginning, however, start with random sampling to develop your intuition.
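If it helps to make this concrete, here is a minimal sketch of assembling that review set, assuming your observability tool can export traces with columns like user feedback, turn count, and latency (the file names and column names here are hypothetical):

```python
# A minimal sketch of building a ~100-trace review set for open coding.
import pandas as pd

traces = pd.read_json("traces_export.jsonl", lines=True)  # hypothetical export

# Start simple: a random sample to build intuition.
review_set = traces.sample(n=100, random_state=42)

# Later, bias the sample toward traces more likely to contain failures:
# negative feedback, unusually long conversations, or high latency.
suspicious = traces[
    (traces["user_feedback"] == "thumbs_down")
    | (traces["n_turns"] > traces["n_turns"].quantile(0.95))
    | (traces["latency_ms"] > traces["latency_ms"].quantile(0.95))
]
review_set = pd.concat([suspicious.head(50), traces.sample(n=50, random_state=42)])
review_set.to_csv("open_coding_queue.csv", index=False)
```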

With a dataset ready, the analysis begins with open coding. This is essentially like journaling, but with a bit of structure. The domain expert reviews each user interaction, writes a free-form critique of anything that seems wrong or undesirable, and gives a pass/fail judgment on the AI’s performance.

  • For passes, we explain why the AI succeeded in meeting the user’s primary need, even if there were critical aspects that could be improved. We highlight these areas for enhancement while justifying the overall passing judgment.

  • For fails, we identify the critical elements that led to the failure, explaining why the AI did not meet the user’s main objective or compromised important factors, like user experience or security.

Here is a screenshot of open coding in action for an apartment leasing assistant. In the interface, we can see that the AI has hallucinated a virtual tour, even though that isn’t something that is offered:

The tool pictured above is the open source tool Arize Phoenix, but you can use any LLM observability tool you want. Other popular tools in this category include LangSmith and Braintrust.

As a heuristic, the critique should be detailed enough for a brand-new employee at your company to understand it, or, if this framing is more helpful, detailed enough to use in a few-shot prompt for an LLM judge. Being too terse is a common mistake. Here are some good examples of open coding in action:

Note that the example user interactions with the AI are simplified for brevity—but you might need to give the domain expert more context to make a judgment. More on that later.

This lightly constrained process is crucial for discovering problems you didn’t know you had. It’s also where teams often discover what they truly want from their AI system. Research shows that people are not very good at specifying their complete requirements for an AI up front. It is through the process of reviewing outputs and articulating what feels “wrong” that the true criteria for success emerge.

After collecting notes on dozens of traces, the next step is axial coding, or pattern-finding. The expert reads through all the open-ended critiques and starts grouping them (examples below). This process transforms a chaotic list of observations into a clean, prioritized taxonomy of concrete failure modes. It is part art and part science: group errors in a way that is manageable and sensible for your domain. Here is how you might apply axial coding to the failures above:

This grouping process often happens in a spreadsheet or a dedicated annotation tool where you can tag or label each critique. When I was working on this apartment leasing assistant in a real-life scenario, here are the categories that emerged:

  • Conversation flow issues (missing context, awkward responses)

  • Handoff failures (not recognizing when to transfer to humans)

  • Rescheduling problems (struggling with date handling)

You can accelerate this process by using an LLM to perform a first-pass categorization of the critiques. However, a common trap is to over-automate: always have the human expert review and validate the LLM’s suggestions, because an LLM might miss the nuance that distinguishes a conversation flow issue from a handoff failure. Another trap is creating too many categories; aim for a manageable set of under 10 primary failure modes that capture the most significant problems. The goal is a useful taxonomy that you can analyze, not an exhaustive list.
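Here is a minimal sketch of what that LLM-assisted first pass might look like, assuming the OpenAI Python client; the category names and prompt wording are illustrative, and every suggested label still goes back to the expert for review:

```python
# A minimal sketch of first-pass categorization of open-coding critiques.
from openai import OpenAI

client = OpenAI()
CATEGORIES = ["conversation_flow", "handoff", "rescheduling", "other"]  # from axial coding

def suggest_category(critique: str) -> str:
    prompt = (
        "Assign exactly one category to this failure critique.\n"
        f"Categories: {', '.join(CATEGORIES)}\n"
        f"Critique: {critique}\n"
        "Answer with the category name only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()
```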

The final product of this phase is to simply count the categories so you can get a sense of how to invest your time. Here is how this count looks for the apartment leasing assistant, which I calculated with a pivot table in a spreadsheet:

As you can see, the most frequently occurring errors were conversation flow, handoff (to a human), and rescheduling appointments. This data gives us concrete problems specific to our product to focus on as we build evals.
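If your labels live in a spreadsheet export rather than a pivot table, the count itself is a one-liner; a minimal sketch, with a hypothetical file and column name (the example output is illustrative):

```python
# A minimal sketch of counting failure modes, one labeled row per failing trace.
import pandas as pd

labels = pd.read_csv("axial_coding_labels.csv")  # hypothetical export
counts = labels["failure_mode"].value_counts()
print(counts)
# Illustrative output:
# conversation_flow    14
# handoff               9
# rescheduling          7
```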

A warning about off-the-shelf metrics

While off-the-shelf metrics like hallucination and toxicity are not worth paying attention to directly, they can be used in creative ways. Instead of reporting a hallucination or toxicity score on a dashboard, calculate the scores on your traces and sort them by high/low score. Reviewing the highest and lowest scoring examples can reveal surprising failure modes or unexpected successes, which in turn helps you build custom evaluators for the patterns you discover. This is one of the only appropriate uses for off-the-shelf metrics. Note that this is an advanced technique and should be done only after you master the basic approach.

Phase 2: Build out your evaluation suite

After error analysis, you will have a prioritized list of your product’s most common failures. The next step is to build a suite of automated evaluators to track them. The goal is to create a system that is reliable, cost-effective, and trusted by your team. This requires choosing the right tool for each failure mode.

Your choice of tools comes down to one question for each prioritized failure mode on your list: Is this failure objective and rule-based (for example: “Does the output contain a user ID?”), or does it require subjective judgment (for example: “Was the tone appropriate for the persona?”)?

For objective failures, use code-based evaluators. These are simple checks written as code, like assertions in a unit test. They are fast, cheap, and deterministic, making them perfect for verifying things like whether an output is valid JSON, contains a required keyword, or executes without error. Use them whenever a clear rule can define success or failure.
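For example, here is a minimal sketch of two code-based evaluators; the specific checks (valid JSON, a required user ID field) are illustrative:

```python
# A minimal sketch of code-based evaluators: plain assertions over an output string.
import json

def eval_is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def eval_contains_user_id(output: str) -> bool:
    payload = json.loads(output)
    return bool(payload.get("user_id"))

def run_code_evals(output: str) -> dict[str, bool]:
    results = {"valid_json": eval_is_valid_json(output)}
    # Only check for the user ID if the output parsed at all.
    results["has_user_id"] = results["valid_json"] and eval_contains_user_id(output)
    return results
```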

For subjective failures, you’ll need to build an LLM-as-a-judge to reliably assess qualities that code cannot easily handle, like tone, relevance, or reasoning quality. This can be a rigorous process—as is training any LLM—but it is the only way to scale nuanced and subjective evaluations and ultimately improve your product. The good news is that there is a scientific approach to making sure the judge is sufficiently aligned with your product vision and success criteria.

The LLM-as-a-judge playbook

This is not about writing a clever prompt. It is about a systematic process of grounding an LLM’s judgment in your specific quality bar. The output is an LLM judge that gives you a binary pass/fail metric for a specific error (or set of errors). Just as important, you need to be able to trust that metric. The way to establish trust is by measuring the judge against a human-labeled dataset you create. There are two steps involved. The first is to create a dataset that establishes ground truth:

1. Establish ground truth

Your evaluation system is only as good as its source of truth. For most teams, the most effective approach is to leverage the principal domain expert we mentioned earlier. While larger organizations operating across multiple domains may require multiple annotators and processes to measure inter-annotator agreement, starting with a single expert accelerates the process.

The expert’s task is to provide two things for every user interaction with your AI, grouped by session: a binary pass/fail judgment and a detailed critique. Many teams are tempted to use a 1-to-5 Likert scale, believing it captures more nuance. This is a trap. The distinction between a “3” and a “4” is subjective and inconsistent. Binary decisions force clarity. An output either meets the quality bar or it does not. The nuance is not lost; it is captured in the critique, which explains why a judgment was made. These critiques are the secret ingredient for building a high-fidelity judge. For example, consider this interaction from earlier:

Reasonable people may disagree on whether or not this is “good enough.” However, it is important that you strive toward making a judgment call on what is good and bad for your product. In this case, we decided that this interaction was a failure.

2. Build and validate the judge

After you have collected the ground-truth data labeled by your domain expert, you are ready to build and validate the judge. Do not use your entire dataset to build and test your judge. This leads to overfitting, where you iterate toward performing well on examples you observe but fail on new, unseen data.

Instead, split the ground-truth data into three distinct sets:

  • Train set (10%–20%): A small set of clear examples, including the expert’s critiques, to use in the judge’s prompt

  • Dev set (40%–45%): A larger set used to iteratively test and refine the judge’s prompt

  • Test set (40%–45%): A held-out set, untouched during development, for a final, unbiased measurement of the judge’s performance

This process of refining your judge’s prompt on the dev set is a meta-evaluation task. You are evaluating your evaluator. It is also where you will discover the nuances of your own quality bar. As research on “criteria drift” has shown, the process of reviewing LLM outputs and aligning a judge helps you articulate and refine your own standards.
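To make the mechanics concrete, here is a minimal sketch of a judge whose prompt embeds a few train-set examples (with their critiques) and whose agreement is then measured on the dev set. The model name, record fields, and prompt wording are assumptions, not a prescription:

```python
# A minimal sketch of building and checking an LLM-as-a-judge.
from openai import OpenAI

client = OpenAI()

def build_judge_prompt(train_examples: list[dict], trace: str) -> str:
    # Few-shot examples come from the train split and include the expert's critiques.
    shots = "\n\n".join(
        f"Interaction: {ex['trace']}\nVerdict: {ex['label']}\nCritique: {ex['critique']}"
        for ex in train_examples
    )
    return (
        "You are judging an apartment-leasing assistant. Answer PASS or FAIL, "
        "then give a one-paragraph critique.\n\n"
        f"Labeled examples:\n{shots}\n\nInteraction: {trace}\nVerdict:"
    )

def judge(train_examples: list[dict], trace: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": build_judge_prompt(train_examples, trace)}],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return "pass" if verdict.startswith("PASS") else "fail"

def dev_agreement(train_examples: list[dict], dev_set: list[dict]) -> float:
    # Fraction of dev-set examples where the judge matches the expert's label.
    hits = sum(judge(train_examples, ex["trace"]) == ex["label"] for ex in dev_set)
    return hits / len(dev_set)
```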

Below is a visualization of this LLM-as-a-judge alignment process at a high level.

3. Measure what matters: TPR/TNR over accuracy

A common impulse is to measure a judge’s performance with a single accuracy score, but this can be dangerously misleading. Imagine an AI system that succeeds 99% of the time. A judge that always predicts “pass” will be 99% accurate, but it will never catch a single failure. This is a common issue with imbalanced datasets, where one outcome is far more frequent than the other.

Instead of accuracy, the true positive rate (TPR) and true negative rate (TNR) measured together will tell you precisely how your judge is likely to make mistakes. In plain language:

  • TPR: Of all the examples that should pass, what percentage did the judge correctly label as “pass”?

  • TNR: Of all the examples that should fail, what percentage did the judge correctly label as “fail”?

A judge with a high TPR but low TNR is good at recognizing success but lets failures slip through. The acceptable tradeoff depends on your product. For an AI providing medical advice, a missed failure (labeling a harmful suggestion as a pass) is far more costly than a false alarm. For a creative writing assistant, a false alarm (flagging a good response as a fail) might be worse, as it could stifle creativity.

Once you know your judge’s TPR and TNR, you can even statistically correct its raw scores to get a more accurate estimate of your system’s true failure rate. For example, if your judge reports a 95% pass rate on 1,000 new examples but you know it has a 10% chance of mislabeling a failure as a pass, you can adjust that 95% score to reflect the judge’s known error rate. (The mathematical details for this correction will be in an appendix).
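Here is a minimal sketch of both calculations: measuring TPR/TNR against the held-out test set, then applying the correction to an observed pass rate. The field names are assumptions; the correction follows directly from observed = p·TPR + (1 − p)·(1 − TNR), solved for the true pass rate p:

```python
# A minimal sketch of judge TPR/TNR and pass-rate correction.
def tpr_tnr(judge_labels: list[str], true_labels: list[str]) -> tuple[float, float]:
    pairs = list(zip(judge_labels, true_labels))
    should_pass = [j for j, t in pairs if t == "pass"]
    should_fail = [j for j, t in pairs if t == "fail"]
    tpr = sum(j == "pass" for j in should_pass) / len(should_pass)
    tnr = sum(j == "fail" for j in should_fail) / len(should_fail)
    return tpr, tnr

def corrected_pass_rate(observed: float, tpr: float, tnr: float) -> float:
    # observed = p * TPR + (1 - p) * (1 - TNR); solve for the true pass rate p.
    return (observed + tnr - 1) / (tpr + tnr - 1)

# Illustrative numbers: a judge with TPR = 0.97 and TNR = 0.90 that reports a
# 95% pass rate implies a true pass rate of roughly 97.7%.
# corrected_pass_rate(0.95, 0.97, 0.90)  -> ~0.977
```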

This rigorous, human-led validation process is the only way to build an evaluation system your team can rely on. When you present a dashboard showing a 5% failure rate for a critical feature, your stakeholders need to believe that number reflects reality. This process is how you build that trust.

Considerations for specific architectures

There are special considerations and strategies to keep in mind when designing evals for multi-turn conversations, RAG pipelines, and agentic workflows. We address each of them below.

Multi-turn conversations

Many AI products are conversational, which introduces the challenge of maintaining context over time. When evaluating conversations, start at the highest level: Did the entire session achieve the user’s goal? This session-level pass/fail judgment is the most important measure of success.

When a conversation fails, the next step is to isolate the root cause. A common mistake is to assume the failure is due to the complexities of dialogue. Before you dive into multi-turn analysis, try to reproduce the failure in a single turn. For example, if a shopping bot gives the wrong return policy on the fourth turn, first ask it directly: “What is the return policy for product X1000?” If it still fails, the problem is likely a simple knowledge or retrieval issue. If it succeeds, you have confirmed the failure is conversational—the bot is losing context or misinterpreting information from earlier in the dialogue. This diagnostic step saves significant time by distinguishing simple knowledge gaps from true conversational memory failures.
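A minimal sketch of that reproduction check, where run_bot is a stand-in for however you invoke your assistant (both the function and the conversation fields are assumptions):

```python
# A minimal sketch of the single-turn reproduction diagnostic.
def reproduce_in_single_turn(run_bot, failed_conversation: list[dict], failing_turn: int) -> str:
    # Replay only the user message from the turn that went wrong, with no history.
    user_message = failed_conversation[failing_turn]["user"]
    return run_bot(messages=[{"role": "user", "content": user_message}])

# If the standalone answer is also wrong, suspect a knowledge/retrieval gap;
# if it is right, the failure is conversational (lost or misread context).
```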

Retrieval-augmented generation (RAG)

A RAG system is a two-part machine: a retriever finds information, and a generator writes an answer using that information. These two parts can fail independently, and an end-to-end correctness score will not tell you which one is broken. You must evaluate them separately.

First, evaluate the retriever. Treat it as a search problem. To do this, you need a dataset of queries paired with their known correct documents. The most critical metric for RAG is often recall@k. This measures what percentage of all the truly relevant documents are captured within the top k results your system retrieves. Recall is paramount because if the correct information is not retrieved, the generator has no chance of producing a correct answer. Modern LLMs are surprisingly adept at ignoring irrelevant noise in their context, but they cannot invent facts from missing information.

The value of k is a critical tuning parameter that depends on your task. For a simple query requiring a single fact, like “What are the property taxes for 123 Main St.?,” a small k (e.g. 3 to 5) is often sufficient. The main goal is to ensure that the one correct document is retrieved. However, for a complex query that requires synthesizing information from multiple sources, such as “Summarize recent market trends for 3-bedroom houses in downtown,” you'll need a larger k (e.g. 10 to 20) to provide the generator with enough context to build a comprehensive answer. While recall is the priority for the initial retrieval stage, precision@k (the fraction of retrieved documents that are relevant) becomes important in systems with a second, re-ranking stage designed to select the best few documents to pass to the LLM.
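Here is a minimal sketch of recall@k and precision@k, assuming a labeled evaluation set where each query maps to the IDs of its known relevant documents (the retriever interface is an assumption):

```python
# A minimal sketch of retriever metrics over a labeled query set.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    # Share of all truly relevant documents captured in the top k results.
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    # Share of the top k results that are actually relevant.
    top_k = retrieved_ids[:k]
    return sum(doc_id in relevant_ids for doc_id in top_k) / k

def average_recall_at_k(eval_set: list[dict], retriever, k: int = 5) -> float:
    scores = [
        recall_at_k(retriever(ex["query"]), set(ex["relevant_ids"]), k)
        for ex in eval_set
    ]
    return sum(scores) / len(scores)
```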

Once your retriever is performing well on a diverse set of queries, you can evaluate the generator. Here you are primarily measuring two things. First is faithfulness: does the generated answer stick to the facts provided in the retrieved context, or is it hallucinating? Second is answer relevance: does the answer directly address the user’s original question? An answer can be perfectly faithful to the source documents but still fail to be relevant to the user’s intent.

Fix your retriever first. Only when you are confident that the right information is consistently being fed to your generator should you focus heavily on improving the generation step. It should be noted that RAG is a very nascent topic, and there is still much to be explored in terms of evaluating and optimizing it. See this series for an exploration of advanced RAG topics.

Agentic workflows

Agents—which can execute a sequence of actions, like tool calls, to accomplish a goal—are the most complex systems to evaluate. A single pass/fail judgment on the final outcome is a good start, but it is not diagnostic. When an agent fails, you need to know which step in the chain of reasoning broke.

For this, a transition failure matrix is an invaluable tool. Think of an agent’s workflow as a series of states or steps, like an assembly line. The agent moves from one state (e.g. generating_sql) to the next (e.g. executing_sql). A transition failure matrix is a chart that shows you exactly where the assembly line breaks down most often.

The rows of the matrix represent the last successful step, and the columns represent the step where the failure occurred. By analyzing traces of failed agent interactions and mapping them onto this matrix, you can quickly spot hotspots. Instead of guessing, you can see with data that, for example, your agent fails most often when trying to execute the SQL it just generated, or when it misinterprets the output from a tool call. This transforms the overwhelming task of debugging a complex agent into a focused, data-driven investigation.
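Here is a minimal sketch of building that matrix from reviewed failure traces; the state names and trace fields are illustrative:

```python
# A minimal sketch of a transition failure matrix: rows are the last successful
# step, columns the step where the failure occurred.
from collections import Counter
import pandas as pd

STATES = ["plan", "generate_sql", "execute_sql", "summarize"]

def failure_matrix(failed_traces: list[dict]) -> pd.DataFrame:
    # Each reviewed failure records ("last_ok", "failed_at"), tagged during error analysis.
    counts = Counter((t["last_ok"], t["failed_at"]) for t in failed_traces)
    matrix = pd.DataFrame(0, index=STATES, columns=STATES)
    for (last_ok, failed_at), n in counts.items():
        matrix.loc[last_ok, failed_at] = n
    return matrix  # the largest cells are the hotspots where the "assembly line" breaks
```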

With these targeted evaluation strategies for complex systems, you are now ready to operationalize your full evaluation suite.

Phase 3: Operationalizing evals for continuous improvement

Read more

How Devin replaces your junior engineers with infinite AI interns that never sleep | Scott Wu (Cognition CEO)

2025-09-08 19:03:38

Watch or listen now:
YouTube // Spotify // Apple

Brought to you by:

Google Gemini—Your everyday AI assistant

Vanta—Automate compliance. Simplify security.


Scott Wu is the co-founder and CEO of Cognition Labs, the creators of Devin, an AI agent designed to function as a junior engineer on software development teams. In this conversation, Scott demonstrates how his team uses their own product to accelerate development workflows, reduce engineering toil, and handle routine tasks asynchronously. Scott walks us through real examples of how Devin integrates into Cognition’s daily operations—from researching and implementing new features to responding to crashes and handling frontend fixes. He explains how Devin differs from traditional AI coding assistants by functioning more like a team member than a tool, allowing engineers to delegate well-scoped tasks while focusing on higher-level problems.

What you’ll learn:

1. How to use DeepWiki to research your codebase and generate better prompts for AI engineering tasks

2. A workflow for treating AI agents as asynchronous junior engineers who can handle multiple tasks while you attend meetings

3. Why public channels create better learning environments for both humans and AI when implementing engineering solutions

4. The top five engineering tasks AI excels at: frontend fixes, version upgrades, documentation, incident response, and testing

5. How to implement a “first line of defense” system where AI agents analyze crashes before humans need to intervene

6. A technique for bringing voice AI into meetings as an additional participant to answer questions without disrupting flow

Where to find Scott Wu:

• X: https://x.com/ScottWu46

• LinkedIn: https://www.linkedin.com/in/scott-wu-8b94ab96/

Where to find Claire Vo:

• ChatPRD: https://www.chatprd.ai/

• Website: https://clairevo.com/

• LinkedIn: https://www.linkedin.com/in/clairevo/

• X: https://x.com/clairevo

In this episode, we cover:

(00:00) Introduction to Scott Wu and Devin

(03:53) Where Devin excels

(06:08) Using DeepWiki to research codebases and create better prompts

(10:27) Prompting tips

(11:24) The asynchronous nature of working with Devin

(13:38) Multithreading tasks

(14:43) Using Devin to implement an MCP server integration

(18:38) Setting up workflows in Slack for first-line responses

(23:22) Encouraging AI adoption in public Slack channels

(25:50) Top five engineering tasks for Devin

(32:17) Using ChatGPT voice as a meeting participant

(35:57) Lightning round

Tools referenced:

• Devin: https://devin.ai/

• DeepWiki: https://deepwiki.org/

• ChatGPT: https://chat.openai.com/

• Windsurf: https://windsurf.ai/

• Slack: https://slack.com/

• Linear: https://linear.app/

• GitHub: https://github.com/

Other references:

• MCP (model context protocol): https://www.anthropic.com/news/model-context-protocol

• TanStack Router: https://tanstack.com/router/

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].