Lenny Rachitsky

The #1 business newsletter on Substack.

“Engineers are becoming sorcerers” | The future of software development with OpenAI’s Sherwin Wu

2026-02-12 21:31:43

Sherwin Wu leads engineering for OpenAI’s API platform, where roughly 95% of engineers use Codex, often working with fleets of 10 to 20 parallel AI agents.

We discuss:

  1. What OpenAI did to cut code review times from 10-15 minutes to 2-3 minutes

  2. How AI is changing the role of managers

  3. Why the productivity gap between AI power users and everyone else is widening

  4. Why “models will eat your scaffolding for breakfast”

  5. Why the next 12 to 24 months are a rare window where engineers can leap ahead before the role fully transforms


Brought to you by:

DX—The developer intelligence platform designed by leading researchers

Sentry—Code breaks, fix it faster

Datadog—Now home to Eppo, the leading experimentation and feature flagging platform

Where to find Sherwin Wu:

• X: https://x.com/sherwinwu

• LinkedIn: https://www.linkedin.com/in/sherwinwu1

Referenced:

• Codex: https://openai.com/codex

• OpenAI’s CPO on how AI changes must-have skills, moats, coding, startup playbooks, more | Kevin Weil (CPO at OpenAI, ex-Instagram, Twitter): https://www.lennysnewsletter.com/p/kevin-weil-open-ai

• OpenClaw: https://openclaw.ai

• The creator of Clawd: “I ship code I don’t read”: https://newsletter.pragmaticengineer.com/p/the-creator-of-clawd-i-ship-code

• The Sorcerer’s Apprentice: https://en.wikipedia.org/wiki/The_Sorcerer%27s_Apprentice_(Dukas)

• Quora: https://www.quora.com

• Marc Andreessen: The real AI boom hasn’t even started yet: https://www.lennysnewsletter.com/p/marc-andreessen-the-real-ai-boom

• Sarah Friar on LinkedIn: https://www.linkedin.com/in/sarah-friar

• Sam Altman on X: https://x.com/sama

• Nicolas Bustamante’s “LLMs Eat Scaffolding for Breakfast” post on X: https://x.com/nicbstme/status/2015795605524901957

• The Bitter Lesson: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

• Overton window: https://en.wikipedia.org/wiki/Overton_window

• Developers can now submit apps to ChatGPT: https://openai.com/index/developers-can-now-submit-apps-to-chatgpt

• Responses: https://platform.openai.com/docs/api-reference/responses

• Agents SDK: https://platform.openai.com/docs/guides/agents-sdk

• AgentKit: https://openai.com/index/introducing-agentkit

• Ubiquiti: https://ui.com

• Jujutsu Kaisen on Crunchyroll: https://www.crunchyroll.com/series/GRDV0019R/jujutsu-kaisen?srsltid=AfmBOoqvfzKQ6SZOgzyJwNQ43eceaJTQA2nUxTQfjA1Ko4OxlpUoBNRB

• eero: https://eero.com

• Opendoor: https://www.opendoor.com

Recommended books:

• Structure and Interpretation of Computer Programs: https://www.amazon.com/Structure-Interpretation-Computer-Programs-Engineering/dp/0262510871

• The Mythical Man-Month: Essays on Software Engineering: https://www.amazon.com/Mythical-Man-Month-Software-Engineering-Anniversary/dp/0201835959

• There Is No Antimemetics Division: A Novel: https://www.amazon.com/There-No-Antimemetics-Division-Novel/dp/0593983750

• Breakneck: China’s Quest to Engineer the Future: https://www.amazon.com/Breakneck-Chinas-Quest-Engineer-Future/dp/1324106034

• Apple in China: The Capture of the World’s Greatest Company: https://www.amazon.com/Apple-China-Capture-Greatest-Company/dp/1668053373


Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

Lenny may be an investor in the companies discussed.


My biggest takeaways from this conversation:

Read more

Claude Opus 4.6 vs. GPT-5.3 Codex: How I shipped 93,000 lines of code in 5 days

2026-02-11 21:02:52

I put the newest AI coding models from OpenAI and Anthropic head-to-head, testing them on real engineering work I’m actually doing. I compare GPT-5.3 Codex with Opus 4.6 (and Opus 4.6 Fast) by asking them to redesign my marketing website and refactor some genuinely gnarly components. Through side-by-side experiments, I break down where each model shines—creative development versus code review—and share how I’m thinking about combining them to build a more effective AI engineering stack.

What you’ll learn:

  1. The strengths and weaknesses of OpenAI’s Codex vs. Anthropic’s Opus for different coding tasks

  2. How I shipped 44 PRs containing 98 commits across 1,088 files in just five days using these models

  3. Why Codex excels at code review but struggles with creative, greenfield work

  4. The surprising way Opus and Codex complement each other in a real-world engineering workflow

  5. How to use Git concepts like work trees to maximize productivity with AI coding assistants

  6. Why Opus 4.6 Fast might be worth the 6x price increase (but be careful with your token budget)


Brought to you by:

WorkOS—Make your app enterprise-ready today

In this episode, we cover:

(00:00) Introduction to new AI coding models

(02:13) My test methodology for comparing models

(03:30) Codex’s unique features: Git primitives, skills, and automations

(09:05) Testing GPT-5.3 Codex on a website redesign task

(10:40) Challenges with Codex’s literal interpretation of prompts

(15:00) Comparing the before and after with Codex

(16:23) Testing Opus 4.6 on the same website redesign task

(20:56) Comparing the visual results of both models

(21:30) Real-world engineering impact: 44 PRs in five days

(23:03) Refactoring components with Opus 4.6

(24:30) Using Codex for code review and architectural analysis

(26:55) Cost considerations for Opus 4.6 Fast

(28:52) Conclusion

Tools referenced:

• OpenAI’s GPT-5.3 Codex: https://openai.com/index/introducing-gpt-5-3-codex/

• Anthropic’s Claude Opus 4.6: https://www.anthropic.com/news/claude-opus-4-6

• Cursor: https://cursor.sh/

• GitHub: https://github.com/

Other references:

• Tailwind CSS: https://tailwindcss.com/

• Git: https://git-scm.com/

• Bugbot: https://cursor.com/bugbot

Where to find Claire Vo:

• ChatPRD: https://www.chatprd.ai/

• Website: https://clairevo.com/

• LinkedIn: https://www.linkedin.com/in/clairevo/

• X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

Building AI product sense, part 2

2026-02-10 22:31:00

👋 Hey there, I’m Lenny. Each week, I answer reader questions about building product, driving growth, and accelerating your career. For more: Lenny’s Podcast | How I AI | Lennybot | My favorite AI/PM courses, public speaking course, and interview prep copilot.

Subscribe now

P.S. Subscribers get a free year of Lovable, Manus, Replit, Gamma, n8n, Canva, ElevenLabs, Amp, Factory, Devin, Bolt, Wispr Flow, Linear, PostHog, Framer, Railway, Granola, Warp, Perplexity, Magic Patterns, Mobbin, ChatPRD, and Stripe Atlas. Yes, this is for real.


In part two of our in-depth series on building AI product sense (don’t miss part one), Dr. Marily Nika—a longtime AI PM at Google and Meta, and an OG AI educator—shares a simple weekly ritual that you can implement today that will rapidly build your AI product sense. Let’s get into it.

For more from Marily, check out her AI Product Management Bootcamp & Certification course (which is also available for private corporate sessions) and her recently launched AI Product Sense and AI PM Interview prep course (both courses are 15% off using these links). You can also watch her free Lightning Lesson on how to excel as a senior IC PM in the AI era, and subscribe to her newsletter.

P.S. You can listen to this post in convenient podcast form: Spotify / Apple / YouTube.


Meta recently added a new PM interview, the first major change to its PM loop in over five years. It’s called “Product Sense with AI,” and candidates are asked to work through a product problem with the help of AI, in real time.

In this interview, candidates aren’t judged on clever prompts, model trivia, or even flashy demos. They are evaluated on how they work with uncertainty: how they notice when the model is guessing, ask the right follow-up questions, and make clear product decisions despite imperfect information.

That shift reflects something bigger. AI product sense—understanding what a model can do and where it fails, and working within those constraints to build a product that people love—is becoming the new core skill of product management.

Over the past year, I’ve watched the same pattern repeat across different teams at work and in my training sessions: the AI works beautifully in a controlled flow . . . and then it breaks in production because of a handful of predictable failure modes. The uncomfortable truth is that the hardest part of AI product development comes when real users arrive with messy inputs, unclear intent, and zero patience. A customer support agent, for example, can feel incredible in a demo and then, after launch, quietly lose user trust by confidently answering ambiguous or underspecified questions (“Is this good?”) instead of stopping to ask for clarification.

Over 10 years of shipping speech and identity features for conversational platforms and personalized experiences (on-device assistants and diverse hardware portfolios), I started using a simple, repeatable workflow to uncover issues that would otherwise show up weeks later, building this AI product sense for myself first and then with teams and students. It’s not a theory or a framework but a practice that gives you early feedback on model behavior, failure modes, and tradeoffs—forcing you to see whether an AI product can survive contact with reality before your users teach you the hard way. When I run this process, two things happen quickly: I stop being surprised by model behavior, because I’ve already experienced the weird cases myself, and I get clarity on what’s a product problem vs. what’s a model limitation.

In this post, I’ll walk through my three steps for building AI product sense:

1. Map the failure modes (and the intended behavior)
2. Define the minimum viable quality (MVQ)
3. Design guardrails where behavior breaks


Once that AI product sense muscle develops, you should be able to evaluate a product across a few concrete dimensions: how the model behaves under ambiguity, how users experience failures, where trust is earned or lost, and how costs change at scale. It’s about understanding and predicting how the system will respond to different circumstances.

In other words, the work expands from “Is this a good product idea?” to “How will this product behave in the real world?”

Let’s start building AI product sense.

Map the failure modes (and the intended behavior)

Every AI feature has a failure signature: the pattern of breakdowns it reliably falls into when the world gets messy. And the fastest way to build AI product sense is to deliberately push the model into those failure modes before your users ever do.

I run the following rituals once a week, usually Wednesday mornings before my first meeting, on whatever AI workflow I’m currently building. Together, they take under 15 minutes and are worth every second. The results consistently surface issues that would otherwise show up much later in production.

Ritual 1: Ask a model to do something obviously wrong (2 min.)

Goal: Understand the model’s tendency to force structure onto chaos

Take the kind of chaotic, half-formed, emotionally inconsistent data every PM deals with daily—think Slack threads, meeting notes, Jira comments—and ask the model to extract “strategic decisions” from it. This is where generative models reveal their most dangerous pattern:

When confronted with mess, they confidently invent structure.

Here’s an example messy Slack thread:

Alice: “Stripe failing for EU users again?”

Ben: “no idea, might be webhook?”
Sara: “lol can we not rename the onboarding modal again?”
Kyle: “Still haven’t figured out what to do with dark mode”
Alice: “We need onboarding out by Thursday”
Ben: “Wait, is the banner still broken on mobile???”
Sara: “I can fix the copy later”

I asked the model to extract “strategic product decisions” from this thread, and it confidently hallucinated a roadmap, assigned the wrong owners, and turned offhand comments into commitments. This is the kind of failure signature every AI PM must design around:

It looks authoritative, clean, structured. And it’s completely wrong.

Now that you have the obviously wrong results, you’ll need to generate the “ideal” response and compare the two responses to understand what signals the model needs to behave correctly.

Here’s exactly what to do:

1. Re-run the same Slack thread through the model

Use the same messy context that caused the hallucination.

Example (you paste the Slack thread):

Based on this Slack discussion, draft our Q4 roadmap.

Let’s say the model invents features you never discussed. Great, you’ve found a failure mode.

2. Now tell the model what good looks like and run it again

Add one short line explaining the expected behavior. For example:

Try again, but only include items explicitly mentioned in the thread. If something is missing, say “Not enough information.”

Run that prompt against the exact same Slack thread. A correct, trustworthy response acknowledges the lack of clear decisions, asks clarifying questions, and surfaces useful structure (“key themes”) without inventing facts. It avoids assigning owners unless explicitly stated and highlights uncertainties instead of hiding them.

3. Compare the two outputs—and the inputs that led to them—side by side

This contrast of the two outputs above—confident hallucination vs. humble clarity—is what teaches you how the model behaves today, and what you need to design toward. And that contrast is where AI product sense sharpens fastest.

You’re looking for:

  • What changed?

  • What guardrail fixed the hallucination?

  • What does the model need to behave reliably? (Explicit constraints? Better context? Tighter scoping?)

  • Does the “good” version feel shippable or still brittle?

  • What would the user experience in each version?

4. Capture the gaps—this becomes a product requirement

When you see a failure mode repeat, it usually points to a specific kind of product gap (and specific kind of fix).

Now you know where the product fails and what its intended behavior should be. Later in this guide, I’ll show concrete examples of what prompt guardrails, design guardrails, and retrieval look like in practice, and how to decide when to add them.
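If you want to make this ritual repeatable, the wrong-vs.-guardrailed comparison is easy to script. Below is a minimal sketch using the OpenAI Python SDK; the model name, the pasted thread, and the exact guardrail wording are illustrative assumptions, not a prescribed setup.

# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

SLACK_THREAD = """
Alice: Stripe failing for EU users again?
Ben: no idea, might be webhook?
Sara: lol can we not rename the onboarding modal again?
Kyle: Still haven't figured out what to do with dark mode
Alice: We need onboarding out by Thursday
Ben: Wait, is the banner still broken on mobile???
Sara: I can fix the copy later
"""

BASE_PROMPT = "Based on this Slack discussion, draft our Q4 roadmap."
GUARDRAIL = (
    "Only include items explicitly mentioned in the thread. "
    'If something is missing, say "Not enough information."'
)

def run(instruction: str) -> str:
    """Send the messy thread plus an instruction and return the model's answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "user", "content": f"{instruction}\n\nSlack thread:\n{SLACK_THREAD}"},
        ],
    )
    return response.choices[0].message.content

# Same thread, two runs: one bare, one with the guardrail line appended.
wrong = run(BASE_PROMPT)
guarded = run(f"{BASE_PROMPT}\n{GUARDRAIL}")

print("--- Without guardrail ---\n", wrong)
print("--- With guardrail ---\n", guarded)

Reading the two outputs side by side each week, against the same messy input, is what builds the instinct for which constraints the model actually needs.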

Ritual 2: Ask a model to do something ambiguous (3 min.)

Goal: Understand the model’s semantic fragility

Ambiguity is kryptonite for probabilistic systems: if a model doesn’t fully understand the user’s intent, it fills the gaps with its best guess (i.e., hallucinations, bad ideas). That’s when user trust starts to crack. Try, for example, feeding a PRD into NotebookLM and asking it to “Summarize this PRD for the VP of Product.”

How to try this in 2 minutes (NotebookLM):

  1. Open NotebookLM → create a new notebook

  2. Upload a PRD (Google Doc/PDF works well)

  3. Ask: “Summarize this for execs and list the top 5 risks and open questions.”

Does it:

  • over-summarize?

  • latch onto one irrelevant detail?

  • ignore caveats?

  • assume the wrong audience?

The model’s failures reveal its semantic fragility—the ways in which it technically understands your words but completely misses your intent. Other examples: you ask for a summary for leaders and it gives you a bullet list of emojis and jokes from the thread, or you ask for UX problems and it confidently proposes a new pricing model.

What you’re learning here is where the model gets confused, which is exactly where your product should step in and do the work to reduce ambiguity. That could mean asking the user to choose a goal (“Summarize for who?”), giving the model more context, or constraining the action so the model can’t go off-track. You’re not trying to “trick” the model; you’re trying to understand where communication breaks so you can prevent misunderstanding through design.
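One way to do that in the product, rather than in the prompt alone, is to make the missing context a required input instead of letting the model guess. A minimal sketch; the audience options, prompt wording, and model choice are assumptions for illustration:

# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# The product, not the model, resolves "summarize for who?"
AUDIENCES = {
    "exec": "a VP of Product; focus on decisions, risks, and asks, in at most 5 bullets",
    "eng": "the engineering team; focus on scope, dependencies, and open questions",
    "support": "customer support; focus on user-facing changes and known issues",
}

def summarize_prd(prd_text: str, audience: str) -> str:
    """Refuse to call the model until the user has picked an audience."""
    if audience not in AUDIENCES:
        raise ValueError(f"Summarize for who? Choose one of: {', '.join(AUDIENCES)}")
    prompt = (
        f"Summarize the following PRD for {AUDIENCES[audience]}. "
        "Only use information in the document; list anything unclear as an open question.\n\n"
        f"{prd_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

The constraint is doing the product work here: the UI forces a choice of audience up front, so the model never has to guess whose summary it is writing.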

Ambiguous prompts: what to test, what breaks, what to do

Here are a few ambiguous prompts to try, along with the different interpretations you should explicitly test:

Now you have another batch of design work to guide the AI product toward predictable and trustworthy results.

Ritual 3: Ask a model to do something unexpectedly difficult (3 min.)

Goal: Understand the model’s first point of failure

Pick one task that feels simple to a human PM but stresses a model’s reasoning, context, or judgment.

You’re not trying to exhaustively test the model. You’re trying to see where it breaks first, so you know where the product needs organizing structure. Where it starts to go wrong is exactly where you need to design guardrails, narrow inputs, or split the task into smaller steps.

Note: This isn’t the final solution yet; it’s the intended behavior. In the guardrails section later, I’ll show how to turn this into an explicit rule in the product (prompt + UX + fallback behavior).

Example 1: “Group these 40 bugs into themes and propose a roadmap.”

Example 2: “Summarize this PRD and flag risks for leadership.”
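When a task like Example 1 breaks partway through, the usual fix is the one described above: split it at the first point of failure so each call has one job. A minimal sketch of that decomposition, again using the OpenAI Python SDK with illustrative prompts and model choice:

# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def group_bugs(bugs: list[str]) -> str:
    """Step 1: themes only. One job per call makes the first point of failure visible."""
    return ask(
        "Group the following bug reports into 3-6 themes. "
        "Only use the text provided; do not invent bugs or owners.\n\n"
        + "\n".join(f"- {bug}" for bug in bugs)
    )

def propose_roadmap(themes: str) -> str:
    """Step 2: prioritize from the reviewed themes, not from the raw pile of 40 bugs."""
    return ask(
        "Given these bug themes, propose a rough prioritization. "
        'If the themes are not enough to prioritize, say "Not enough information."\n\n'
        + themes
    )

The same pattern applies to Example 2: summarize the PRD first, review the summary, then ask for risks from the reviewed summary.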

With results from all three rituals, you now have a complete list of product design work that needs to happen to get the results you and users can use and trust.

Over time, this kind of work also starts to surface second-order effects—moments where a small AI feature quietly reshapes workflows, defaults, or expectations. System-level insights come later, once the foundations are solid. The first goal is to understand behavior.

Define a minimum viable quality (MVQ)

Even when you understand a model’s failure modes and have designed around them, it’s nearly impossible to predict exactly how an AI feature will behave once it hits the real world. Performance almost always drops once it leaves the controlled development environment. Since you don’t know how or by how much it will drop, one of the best ways to keep the bar high from the start is to define a minimum viable quality (MVQ) and check your product against it throughout development.

A strong MVQ explicitly defines three thresholds:

  1. Acceptable bar: where it’s good enough for real users

  2. Delight bar: where the feature feels magical

  3. Do-not-ship bar: the unacceptable failure rates that will break trust

Also important in MVQ is the product’s cost envelope: the rough range of what this feature will cost to run at scale for your users.

A concrete example of MVQ comes from my firsthand experience. I spent years working in speech recognition and speaker identification, a domain where the gap between lab accuracy and real-world accuracy is painfully visible.

I still remember demos where the model hit over 90% accuracy in controlled tests and then completely fell apart the first time we tried it in a real home. A barking dog, a running dishwasher, someone speaking from across the room, and suddenly the “great” model felt broken. And from the user’s perspective, it was broken.

For speaker identification on a smart speaker (the ability to identify who is speaking), the MVQ would look like this:

Acceptable bar

  • Correctly identifies the speaker x% of the time in typical home conditions

  • Recovers gracefully when unsure (“I’m not sure who’s speaking—should I use your profile or continue as a guest?”)

Delight bar

You don’t need a precise percentage to know you’ve hit the delight bar; instead, look for behavioral signals like:

  • Users stop repeating themselves or rephrasing commands

  • “No, I meant . . .” corrections drop sharply

Rule of thumb: If 8 or 9 out of 10 attempts work without a retry in realistic conditions, it feels magical. If 1 in 5 needs a retry, trust erodes fast. MVQ also depends on the phase you’re in. In a closed beta, users often tolerate rough edges because they expect iteration. In a broad launch, the same failure modes feel broken.

For the speech recognition feature, here are some examples for assessing delight:

  • Background chaos test: Play a video in the background while two people talk over each other and see if the assistant still responds correctly without asking, “Sorry, can you repeat that?”

  • 6 p.m. kitchen test: Dishwasher running, kids talking, dog barking—and the smart speaker still recognizes you and gives a personalized response without an “I couldn’t recognize your voice” interruption.

  • Mid-command correction test: You say “Set a timer for 10 minutes . . . actually, make it 5,” and it updates correctly instead of sticking to the original instruction.

Do-not-ship bar

  • Misidentifies the speaker more than y% of the time in critical flows (purchases, messages, personalized actions)

  • Forces users to repeat themselves multiple times just to be recognized

You may have noticed I didn’t actually assign values to each bar. That’s because the specific thresholds for MVQ (your “acceptable,” “delight,” and “do-not-ship” bars) aren’t fixed. They depend heavily on your strategic context.
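One lightweight way to keep these bars honest during development is to write them down as explicit thresholds and check your offline eval numbers against them as you go. A minimal sketch; the metric names and values below are placeholders to be replaced once you’ve weighed the strategic factors in the next section:

from dataclasses import dataclass

@dataclass
class MVQ:
    """Explicit quality bars for one AI feature. All values here are placeholders."""
    acceptable_accuracy: float      # good enough for real users
    delight_accuracy: float         # where the feature feels magical
    do_not_ship_error_rate: float   # failure rate in critical flows that breaks trust

speaker_id_mvq = MVQ(
    acceptable_accuracy=0.90,
    delight_accuracy=0.97,
    do_not_ship_error_rate=0.05,
)

def gate(accuracy: float, critical_error_rate: float, mvq: MVQ) -> str:
    """Turn eval numbers into a ship/iterate/stop conversation instead of a surprise."""
    if critical_error_rate > mvq.do_not_ship_error_rate:
        return "do not ship: trust-breaking failure rate in critical flows"
    if accuracy >= mvq.delight_accuracy:
        return "delight bar met"
    if accuracy >= mvq.acceptable_accuracy:
        return "acceptable: shippable, keep iterating"
    return "below the acceptable bar: not ready for real users"

print(gate(accuracy=0.93, critical_error_rate=0.03, mvq=speaker_id_mvq))
# -> "acceptable: shippable, keep iterating"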

Five strategic context factors that raise or lower your MVQ bar

Here are the five factors that most often determine where that bar should be set, and how they change your product decision:

Estimating the cost envelope

One of the most common mistakes new AI PMs make is falling in love with a magical AI demo without checking whether it’s financially viable. That’s why it’s important to estimate the AI product or feature’s cost envelope early.

Cost envelope = the rough range of what this feature will cost to run at scale for your users

You don’t need perfect numbers, but you need a ballpark. Start with:

  • What’s the model cost per call (roughly)?

  • How often will users trigger it per day/month?

  • What’s the worst-case scenario (power users, edge cases)?

  • Can caching, smaller models, or distillation bring this down?

  • If usage 10x’s, does the math still work?

Example: AI meeting notes again

  • Per-call cost: ~$0.02 to process a 30-minute transcript

  • Average usage: 20 meetings/user/month → ~$0.40/month/user

  • Heavy users: 100 meetings/month → ~$2.00/month/user

  • With caching and a smaller model for “low-stakes” meetings, maybe you bring this to ~$0.25–$0.30/month/user on average
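A quick back-of-the-envelope script makes it easy to rerun this math whenever an assumption changes. A minimal sketch using the illustrative figures above:

def monthly_cost(cost_per_call: float, calls_per_month: float) -> float:
    """Rough per-user, per-month cost for one AI feature."""
    return cost_per_call * calls_per_month

COST_PER_CALL = 0.02  # ~$0.02 to process a 30-minute transcript

average = monthly_cost(COST_PER_CALL, 20)    # 20 meetings/month  -> $0.40
heavy = monthly_cost(COST_PER_CALL, 100)     # 100 meetings/month -> $2.00

# Caching plus a smaller model for low-stakes meetings: assume the effective
# per-call cost drops by ~35% (an illustrative figure).
optimized = monthly_cost(COST_PER_CALL * 0.65, 20)  # ~$0.26

# Does the math still work if usage 10x's?
at_10x = monthly_cost(COST_PER_CALL, 20 * 10)       # $4.00

print(f"average ${average:.2f}, heavy ${heavy:.2f}, "
      f"optimized ${optimized:.2f}, at 10x ${at_10x:.2f} per user/month")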

Now you can have a real conversation:

  • A feature that effectively costs $0.30/user/month and drives retention is a no-brainer.

  • A feature that ends up at $5/user/month with unclear impact is a business problem.

This is a core part of AI product sense: Does what you’re proposing actually make sense for the business?

Design guardrails where behavior breaks

Now that you better understand where a model’s behavior breaks and what you’re looking for to greenlight a launch, it’s time to codify some guardrails and design them into the product. A good guardrail determines what the product should do when the model hits its limits, so that users don’t get confused, misled, or lose trust. In practice, guardrails protect users from experiencing a model’s failure modes.

At a startup I’ve been collaborating with, we built an AI productivity feature that summarized long Slack threads into “decisions and action items.” In testing, it worked well—until it started assigning owners for action items when no one had actually agreed to anything yet. Sometimes it even picked the wrong person.

Because my team had developed our AI product sense, we figured out that the fix was a new guardrail in the product, not a different underlying model.

So we added one simple rule to the system prompt (in this case, just a line of additional instruction):

Only assign an owner if someone explicitly volunteers or is directly asked and confirms. Otherwise, surface themes and ask the user what to do next.

That single constraint eliminated the biggest trust issue almost immediately.
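In code, that guardrail is literally one extra line in the system prompt. A minimal sketch of how it might be wired in; the SDK, model name, and surrounding wording are illustrative assumptions, and the one-line rule is the constraint quoted above:

# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

SYSTEM_PROMPT = (
    "You summarize long Slack threads into decisions and action items. "
    # The guardrail: one explicit rule at the exact point where behavior broke.
    "Only assign an owner if someone explicitly volunteers or is directly asked "
    "and confirms. Otherwise, surface themes and ask the user what to do next."
)

def summarize_thread(thread_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": thread_text},
        ],
    )
    return response.choices[0].message.content

Because the rule lives in the product, not in users’ heads, every summary inherits it by default.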

What good guardrails look like in practice

Read more

Building AI product sense, part 2

2026-02-10 18:02:26

If you’re a premium subscriber

Add the private feed to your podcast app at add.lennysreads.com

Dr. Marily Nika, longtime AI PM at Google and Meta, shares a simple weekly ritual that rapidly builds AI product sense – the ability to translate probabilistic model behavior into products people can trust. In this episode, Marily walks through the framework for uncovering failure modes before users do.

Subscribe now

Listen now: YouTube | Apple | Spotify

In this episode, you’ll learn:

  • Why Meta added “Product Sense with AI” to its PM interview loop

  • The rituals that surface hidden failure modes

  • Why generative models confidently invent structure when confronted with mess

  • What minimum viable quality (MVQ) means and how to define three critical thresholds

  • Five strategic context factors that raise or lower your quality bar

  • Why you need to estimate your AI feature’s cost envelope early

  • How to design guardrails that protect users from model shortcomings

  • Four patterns that cover most real-world failure cases

Referenced:

Read more

🎙️ This week on How I AI: How to build your own AI developer tools with Claude Code

2026-02-10 00:02:35

Every Monday, host Claire Vo shares a 30- to 45-minute episode with a new guest demoing a practical, impactful way they’ve learned to use AI in their work or life. No pontificating—just specific and actionable advice.

How to build your own AI developer tools with Claude Code | CJ Hess (Tenex)

Brought to you by:

  • Orkes—The enterprise platform for reliable applications and agentic workflows

  • Rovo—AI that knows your business

CJ Hess, an engineer at Tenex, walks through how he’s built a custom AI development workflow that lets models handle over 90% of his front-end coding. In the episode, CJ demos Flowy, a tool he built to turn Claude’s ASCII plans into interactive flowcharts and UI mockups, and explains why visual planning dramatically reduces cognitive load compared with text. He shares why he prefers Claude Code for intent-heavy work, how custom “skills” make AI tools compound over time, and why pairing Claude for generation with GPT-5.2 Codex for review produces better code than either model alone.

Detailed workflow walkthroughs from this episode:

• How I AI: CJ Hess on Building Custom Dev Tools and Model-vs-Model Code Reviews: https://www.chatprd.ai/how-i-ai/cj-hess-tenex-custom-dev-tools-and-model-vs-model-code-reviews

• Implement Model-vs-Model AI Code Reviews for Quality Control: https://www.chatprd.ai/how-i-ai/workflows/implement-model-vs-model-ai-code-reviews-for-quality-control

• Develop Features with AI Using Custom Visual Planning Tools: https://www.chatprd.ai/how-i-ai/workflows/develop-features-with-ai-using-custom-visual-planning-tools

Biggest takeaways:

  1. Claude Code excels at “intent understanding” compared with other models. While CJ acknowledges that GPT-5.2 might be “smarter,” he finds Claude more “steerable” and better at understanding his intentions. This makes Claude particularly valuable for deep dives into complex coding tasks where nuanced understanding matters more than raw intelligence.

  2. Skills are the secret to making Claude work with your custom tools. CJ created specific skills that teach Claude how to generate proper JSON for Flowy, with separate skills for flowcharts and UI mockups. These skills evolve alongside his tools, creating a continuously improving ecosystem that makes Claude more powerful for his specific needs.

  3. Use model-to-model comparison to improve code quality. CJ uses both Claude (for generation) and Codex (for review) in his workflow. While Claude excels at building features quickly, Codex is better at identifying code smells, inconsistencies, and potential refactoring opportunities. This dual-model approach creates better code than either model could produce alone.

  4. Visual planning reduces cognitive overhead compared with text. Even when Claude’s ASCII diagrams contain the same information as Flowy visualizations, CJ finds it much easier to evaluate and approve visual mockups. This highlights how AI tools should adapt to human cognitive preferences rather than forcing humans to adapt to AI output formats.

  5. AI can handle more than 90% of front-end coding tasks. CJ says he “hasn’t written a single line of JavaScript or HTML in three months,” instead managing “teams of AI” to write code.

  6. “Living dangerously” with AI permissions is increasingly viable. CJ uses an alias named “Kevin” for Claude with bypass permissions, noting that with proper Git safeguards, the risks are manageable.

▶️ Listen now on YouTube | Spotify | Apple Podcasts


If you’re enjoying these episodes, reply and let me know what you’d love to learn more about: AI workflows, hiring, growth, product strategy—anything.

Catch you next week,
Lenny

P.S. Want every new episode delivered the moment it drops? Hit “Follow” on your favorite podcast app.

How to build your own AI developer tools with Claude Code | CJ Hess (Tenex)

2026-02-09 21:03:17

CJ Hess is a software engineer at Tenex who has built some of the most useful tools and workflows for being a “real AI engineer.” In this episode, CJ demonstrates his custom-built tool, Flowy, that transforms Claude’s ASCII diagrams into interactive visual mockups and flowcharts. He also shares his process for using model-to-model comparison to ensure that his AI-generated code is high-quality, and why he believes we’re just at the beginning of a revolution in how developers interact with AI.

Listen or watch on YouTube, Spotify, or Apple Podcasts

What you’ll learn:

  1. How CJ built Flowy, a custom visual planning tool that converts JSON files into interactive mockups and flowcharts

  2. Why visual planning tools are more effective than ASCII diagrams for complex UI and animation workflows

  3. How to create and use Claude Code skills to extend your development environment

  4. Using model-to-model comparison (Claude + Codex) to improve code quality

  5. How to build your own ecosystem of tools around Claude Code

  6. The value of bypassing permissions in controlled environments to speed up development


Brought to you by:

Orkes—The enterprise platform for reliable applications and agentic workflows

Rovo—AI that knows your business

In this episode, we cover:

(00:00) Introduction to CJ Hess

(02:48) Why CJ prefers Claude Code for development

(04:46) The evolution of developer environments with AI

(06:50) Planning workflows and the limitations of ASCII diagrams

(08:23) Introduction to Flowy, CJ’s custom visualization tool

(11:54) How Flowy compares to Mermaid diagrams

(15:25) Demo: Using Flowy

(19:30) Examining Flowy’s skill structure

(23:27) Reviewing the generated flowcharts and diagrams

(28:34) The cognitive benefits of visual planning vs. text-based planning

(31:38) Generating UI mockups with Flowy

(33:30) Building the feature directly from flowcharts and mockups

(35:40) Quick recap

(36:51) Using model-to-model review with Codex (Carl)

(41:52) The benefits of using AI for code review

(45:13) Lightning round and final thoughts

Tools referenced:

• Claude Code: https://claude.ai/code

• Claude Opus 4.5: https://www.anthropic.com/news/claude-opus-4-5

• Cursor: https://cursor.sh/

• Obsidian: https://obsidian.md/

• GPT-5.2 Codex: https://openai.com/index/introducing-gpt-5-2-codex/

• Google’s Project Genie: https://labs.google/projectgenie

Other references:

• Mermaid diagrams: https://mermaid.js.org/

• Figma: https://www.figma.com/

• Excalidraw: https://excalidraw.com/

• TypeScript: https://www.typescriptlang.org/

Where to find CJ Hess:

• LinkedIn: https://www.linkedin.com/in/cj-hess-connexwork/

• X: https://x.com/seejayhess

Where to find Claire Vo:

• ChatPRD: https://www.chatprd.ai/

• Website: https://clairevo.com/

• LinkedIn: https://www.linkedin.com/in/clairevo/

• X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].