Irrational Exuberance

By Will Larson. CTO at Carta, writes about software engineering and has authored several books including "An Elegant Puzzle."

How to get better at strategy?

2025-04-10 20:00:00

One of the most memorable quotes in Arthur Miller’s Death of a Salesman comes from Uncle Ben, who describes his path to becoming wealthy as, “When I was seventeen, I walked into the jungle, and when I was twenty-one I walked out. And by God I was rich.” I wish I could describe the path to learning engineering strategy in similar terms, but by all accounts it’s a much slower one. Two decades in, I am still learning from each project I work on. This book has aimed to accelerate your learning, but in my experience a great deal remains to learn after its final page.

This final chapter is focused on the remaining advice I have to give on how you can continue to improve at strategy long after reading this book’s final page. Inescapably, this chapter has become advice on writing your own strategy for improving at strategy. You are already familiar with my general suggestions on creating strategy, so this chapter provides focused advice on creating your own plan to get better at strategy.

It covers:

  • Exploring strategy creation to find strategies you can learn from via public and private resources, and through creating learning communities
  • How to diagnose the strategies you’ve found, to ensure you learn the right lessons from each one
  • Policies that will help you find ways to perform and practice strategy within your organization, whether or not you have organizational authority
  • Operational mechanisms to hold yourself accountable to developing a strategy practice
  • My final benediction to you as a strategy practitioner who has finished reading this book

With that preamble, let’s write this book’s final strategy: your personal strategy for developing your strategy practice.

This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Exploring strategy creation

Ideally, we’d begin improving our engineering strategy skills by broadly reading publicly available examples. Unfortunately, there simply aren’t many easily available works to learn from others’ experience. Nonetheless, resources do exist, and we’ll discuss the three categories that I’ve found most useful:

  1. Public resources on engineering strategy, such as companies’ engineering blogs
  2. Private and undocumented strategies available through your professional network
  3. Learning communities that you build together, including ongoing learning circles

Each of these is explored in its own section below.

Public resources

While there aren’t as many public engineering strategy resources as I’d like, I’ve found that there are still a reasonable number available. This book collects a number of such resources in the appendix of engineering strategy resources. That appendix also includes some individuals’ blog posts that are adjacent to this topic. You can go a long way by searching and prompting your way into these resources.

As you read them, it’s important to recognize that public strategies are often misleading, as discussed previously in evaluating strategies. Everyone writing in public has an agenda, and that agenda often means they’ll omit important details to make themselves, or their company, look good. Make sure you read between the lines rather than taking everything at face value.

Private resources

Ironically, while public resources are hard to find, I’ve found it much easier to access privately held strategy resources. While private recollections are still prone to inaccuracies, the incentives to massage the truth are less pronounced.

The most useful sources I’ve found are:

  • peers’ stories – strategies are often oral histories, and they are shared freely among peers within and across companies. As you build out your professional network, you can usually get access to any company’s engineering strategy on any topic by just asking.

    There are brief exceptions. Even a close peer won’t share a sensitive strategy before its existence becomes obvious externally, but they’ll be glad to after it does. People tend to overestimate how much information companies can keep private anyway. Even reading recent job postings can usually expose a surprising amount about a company.

  • internal strategy archaeologists – while surprisingly few companies formally collect their strategies into a repository, the stories are informally collected by the tenured members of the organization. These folks are the company’s strategy archaeologists, and you can learn a great deal by explicitly consulting them.

  • becoming a strategy archaeologist yourself – whether or not you’re a tenured member of your company, you can learn a tremendous amount by starting to build your own strategy repository. As you start collecting them, you’ll interest others in contributing their strategies as well.

    As discussed in Staff Engineer’s section on the Write five then synthesize approach to strategy, over time you can foster a culture of documentation where one didn’t exist before. Even better, building that culture doesn’t require any explicit authority, just an ongoing show of excitement.

There are other sources as well, ranging from attending the hallway track in conferences to organizing dinners where stories are shared with a commitment to privacy.

Working in community

My final suggestion for seeing how others work on strategy is to form a learning circle. I formed a learning circle when I first moved into an executive role, and at this point have been running it for more than five years. What’s surprised me the most is how much I’ve learned from it.

There are a few reasons why ongoing learning circles are exceptional for sharing strategy:

  1. Bi-directional discussion allows so much more learning and understanding than mono-directional communication like conference talks or documents.
  2. Groups allow you to learn from others’ experiences and others’ questions, rather than having to guide the entire learning yourself.
  3. Continuity allows you to see the strategy at inception, during the rollout, and after it’s been in practice for some time.
  4. Trust is built slowly, and you only get the full details about a problem when you’ve already successfully held trust about smaller things. An ongoing group makes this sort of sharing feasible where a transient group does not.

Although putting one of these communities together requires a commitment, they are the best mechanism I’ve found. As a final secret, many people get stuck on how they can get invited to an existing learning circle, but that’s almost always the wrong question to be asking. If you want to join a learning circle, make one. That’s how I got invited to mine.

Diagnosing your prior and current strategy work

Collecting strategies to learn from is a valuable part of improving, but it’s only the first step. You also have to determine what to take away from each strategy. For example, you have to determine whether Calm’s approach to resourcing Engineering-driven projects is something to copy or something to avoid.

What I’ve found effective is to apply the strategy rubric we developed in the “Is this strategy any good?” chapter to each of the strategies you’ve collected. Even by splitting a strategy into its various phases, you’ll learn a lot. Applying the rubric to each phase will teach you more. Each time you do this to another strategy, you’ll get a bit faster at applying the rubric, and you’ll start to see interesting, recurring patterns.

As you dig into a strategy that you’ve split into phases and applied the evaluation rubric to, here are a handful of questions that I’ve found interesting to ask myself:

  • How long did it take to determine a strategy’s initial phase could be improved? How high was the cost to fund that initial phase’s discovery?
  • Why did the strategy reach its final stage and get repealed or replaced? How long did that take to get there?
  • If you had to pick only one, did this strategy fail in its approach to exploration, diagnosis, policy or operations?
  • To what extent did the strategy outlive the tenure of its primary author? Did it get repealed quickly after their departure, did it endure, or was it perhaps replaced during their tenure?
  • Would you generally repeat this strategy, or would you strive to avoid repeating it? If you did repeat it, what conditions seem necessary to make it a success?
  • How might you apply this strategy to your current opportunities and challenges?

It’s not necessary to work through all of these questions for every strategy you’re learning from. I often try to pick the two that I think might be most interesting for a given strategy.

Policy for improving at strategy

At a high level, there are two key policies to consider for improving your strategic abilities: implementing strategy, and practicing implementing strategy. While those are indeed the starting points, there are a few more detailed options worth considering:

  • If your company has existing strategies that are not working, debug one and work to fix it. If you lack the authority to work at the company scope, then decrease altitude until you find an altitude you can work at. Perhaps setting Engineering organizational strategies is beyond your circumstances, but strategy for your team is entirely accessible.

  • If your company has no documented strategies, document one to make it debuggable. Again, if operating at a high altitude isn’t attainable for some reason, operate at a lower altitude that is within reach.

  • If your company’s or team’s strategies are effective but have low adoption, see if you can iterate on operational mechanisms to increase adoption. Many such mechanisms require no authority at all, such as low-noise nudges or the model-document-share approach.

  • If existing strategies are effective and have high adoption, see if you can build excitement for a new strategy. Start by mining for which problems Staff-plus engineers and senior managers believe are important. Once you find one, you have a valuable strategy vein to start mining.

  • If you don’t feel comfortable sharing your work internally, then try writing proposals while only sharing them to a few trusted peers.

    You can even go further to only share proposals with trusted external peers, perhaps within a learning circle that you create or join.

Trying all of these at once would be overwhelming, so I recommend picking one in any given phase. If you aren’t able to gain traction, diagnose why things aren’t working and try another approach. Perhaps you simply don’t have the sponsorship you need to enforce strategy, in which case switch toward suggesting strategies instead. Keep iterating until something works.

What if you’re not allowed to do strategy?

If you’re looking to find one, you’ll always unearth a reason why it’s not possible to do strategy in your current environment.

If you believe your current role prevents you from engaging in strategy work, I’ve found two useful approaches:

  1. Lower your altitude – there’s always a scale where you can perform strategy, even if it’s just your team or even just yourself.

    Only you can forbid yourself from developing personal strategies.

  2. Practice rather than perform – organizations can only absorb so much strategy development at a given time, so sometimes they won’t be open to you doing more strategy. In that case, you should focus on practicing strategy work rather than directly performing it.

    Only you can stop yourself from practice.

Don’t believe the hype: you can always do strategy work.

Operating your strategy improvement policies

As the refrain goes, even the best policies don’t accomplish much if they aren’t paired with operational mechanisms to ensure the policies actually happen, and debug why they aren’t happening. It’s tempting to overlook operations for personal habits, but that would be a mistake. These habits profoundly impact us in the long term, yet they’re easiest to neglect because others rarely inquire about them.

The mechanisms I’d recommend:

  • Clearly track the strategies you’ve implemented, refined, documented, or read. Maintain these in a document, spreadsheet, or folder that makes it easy to monitor your progress.

  • Review your tracked strategies every quarter: are you working on the expected number and in the expected way? If not, why not?

    Ideally, your review should be done in community with a peer or a learning circle. It’s too easy to deceive yourself; it’s much harder to trick someone else.

  • If your periodic review ever reveals that you’re simply not doing the work you expected, sit down for an hour with someone you trust, ideally someone at least as experienced as you, and debug what’s going wrong. Commit to doing this before your next periodic review.

Tracking your personal habits can feel a bit odd, but it’s something I highly recommend. I’ve been setting and tracking personal goals for some time now—for example, in my 2024 year in review—and have benefited greatly from it.

Too busy for strategy

Many companies convince themselves that they’re too rushed to make good decisions. I’ve certainly gotten stuck in this view at times myself, although at this point in my career I recognize that I have a number of tools to create time for strategy, and an obligation to do strategy rather than inflict poor decisions on the organizations I work in. Here’s my advice for creating time:

  • If you’re not tracking how often you’re creating strategies, then start there.
  • If you’ve not worked on a single strategy in the past six months, then start with one.
  • If implementing a strategy has been prohibitively time consuming, then focus on practicing a strategy instead.

If you do try all those things and still aren’t making progress, then accept your reality: you don’t view doing strategy as particularly important. Spend some time thinking about why that is, and if you’re comfortable with your answer, then maybe this is a practice you should come back to later.

Final words

At this point, you’ve read everything I have to offer on drafting engineering strategy. I hope this has refined your view on what strategy can be in your organization, and has given you the tools to draft a more thoughtful future for your corner of the software engineering industry.

What I’d never ask is for you to wholly agree with my ideas here. They are my best thinking on this topic, but strategy is a topic where I’m certain Hegel’s world view is the correct one: even the best ideas here are wrong in interesting ways, and will be surpassed by better ones.

Wardley mapping the service orchestration ecosystem (2014).

2025-04-10 19:00:00

In Uber’s 2014 service migration strategy, we explore how to navigate the move from a Python monolith to a services-oriented architecture while also scaling with user traffic that doubled every six months.

This Wardley map explores how orchestration frameworks were evolving during that period to be used as an input into determining the most effective path forward for Uber’s Infrastructure Engineering team.

This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Reading this map

To quickly understand this Wardley Map, read from top to bottom. If you want to review how this map was written, then you should read section by section from the bottom up, starting with Users, then Value Chains, and so on.

More detail on this structure in Refining strategy with Wardley Mapping.

How things work today

There are three primary internal teams involved in service provisioning. The Service Provisioning Team abstracts applications developed by Product Engineering from servers managed by the Server Operations Team. As more servers are added to support application scaling, this is invisible to the applications themselves, freeing Product Engineers to focus on what the company values the most: developing more application functionality.

Wardley map for service orchestration

The challenges within the current value chain are cost-efficient scaling, reliable deployment, and fast deployment. All three of those problems anchor on the same underlying problem of resource scheduling. We want to make a significant investment into improving our resource scheduling, and believe that understanding the industry’s trend for resource scheduling underpins making an effective choice.

Transition to future state

Most interesting cluster orchestration problems are anchored in cluster metadata and resource scheduling. Request routing, whether through DNS entries or allocated ports, depends on cluster metadata. Mapping services to a fleet of servers depends on resource scheduling managing cluster metadata. Deployment and autoscaling both depend on cluster metadata.

Pipeline showing progression of service orchestration over time

This is also an area where we see significant changes occurring in 2014.

Uber initially solved this problem using Clusto, an open-source tool released by Digg with goals similar to Hashicorp’s Consul but with limited adoption. We also used Puppet for configuring servers, alongside custom scripting. This has worked, but has required custom, ongoing support for scheduling. The key question we’re confronted with is whether to build our own scheduling algorithms (e.g. bin packing) or adopt a different approach. It seems clear that the industry intends to directly solve this problem via two paths: relying on Cloud providers for orchestration (Amazon Web Services, Google Cloud Platform, etc) and through open-source scheduling frameworks such as Mesos and Kubernetes.
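To make the build-vs-adopt tradeoff concrete, here is a toy first-fit-decreasing bin packer, the kind of scheduling heuristic a team ends up writing and maintaining themselves when it builds its own orchestration. This is purely illustrative, not Uber’s actual code:

```python
def first_fit_decreasing(capacity, jobs):
    """Assign each job (a resource demand) to the first server with
    spare capacity, opening a new server when none fits. Returns the
    number of servers used. Real schedulers must also handle
    multi-dimensional resources, affinity, and failures."""
    servers = []  # remaining capacity per server
    for job in sorted(jobs, reverse=True):  # largest jobs first
        for i, free in enumerate(servers):
            if job <= free:
                servers[i] -= job
                break
        else:
            servers.append(capacity - job)
    return len(servers)
```

Even this simple heuristic accumulates edge cases over time, which is exactly the ongoing support cost that made continuing with custom scheduling unattractive.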

Industry peers with more than five years of infrastructure experience are almost unanimously adopting open-source scheduling frameworks to better support their physical infrastructure. This will give them a tool to perform a bridged migration from physical infrastructure to cloud infrastructure.

Newer companies with less existing infrastructure are moving directly to the cloud, and avoiding the orchestration problem entirely. The only companies not adopting one of these two approaches are extraordinarily large and complex (think Google or Microsoft) or allergic to making any technical change at all.

From this analysis, it’s clear that continuing our reliance on Clusto and Puppet is going to be an expensive investment that’s not particularly aligned with the industry’s evolution.

User & Value Chains

This map focuses on the orchestration ecosystem within a single company, with attention to what did, and did not, stay the same from roughly 2008 to 2014. In particular, it considers three users:

  1. Product Engineers are focused on provisioning new services, and then deploying new versions of that service as they make changes. They are wholly focused on their own service, and entirely unaware of anything beneath the orchestration layer (including any servers).
  2. Service Provisioning Team focuses on provisioning new services, orchestrating resources for those services, and routing traffic to those services. This team acts as the bridge between the Product Engineers and the Server Operations Team.
  3. Server Operations Team is focused on adding server capacity to be used for orchestration. They work closely with the Service Provisioning Team, and have no contact with the Product Engineers.

It’s worth acknowledging that, in practice, these are artificial aggregates of multiple underlying teams. For example, routing traffic between services and servers is typically handled by a dedicated Traffic or Service Networking team. However, these simplifications are intended to clarify the distinctions relevant to the evolution of orchestration tooling.

Script for consistent linking within book.

2025-04-06 19:00:00

As part of my work on #eng-strategy-book, I’ve been editing a bunch of stuff. This morning I wanted to work on two editing problems. First, I wanted to ensure I was referencing strategies evenly across chapters (and not relying too heavily on any given strategy). Second, I wanted to make sure I was making references to other chapters in a consistent, standardized way.

Both of these tasks involve collecting Markdown links from files, grouping those links by either file or URL, and then outputting the grouped content in a useful way. I decided to experiment with writing a one-shot prompt to generate the script for me rather than writing it myself. The prompt and output (from ChatGPT 4.5) are available in this gist.

That worked correctly! The output was a bit ugly, so I tweaked the output slightly by hand, and also adjusted the regular expression to capture less preceding content, which resulted in this script. Although I did it by hand, I’m sure it would have been faster to just ask ChatGPT to fix the script itself, but either way these are very minor tweaks.
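The core of such a script is simple enough to sketch. The version below is my own illustrative guess at the logic, not the actual links.py from the gist:

```python
import glob
import re
from collections import defaultdict

# Matches Markdown links of the form [text](url)
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")

def collect_links(pattern):
    """Return (filename, link_text, url) tuples for every Markdown
    link in files matching the glob pattern."""
    links = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for text, url in LINK_RE.findall(f.read()):
                links.append((path, text, url))
    return links

def group_links(links, by_url=False):
    """Group links by source file, or by URL when by_url is True
    (corresponding to the post's --grouped mode)."""
    groups = defaultdict(list)
    for path, text, url in links:
        groups[url if by_url else path].append((path, text, url))
    return dict(groups)
```

Grouping by URL is what surfaces over-referenced strategies: any URL whose group is much larger than the others is a candidate for rebalancing.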

Now I can call the script in either standard or --grouped mode. Example of ./scripts/links.py "content/posts/strategy-book/*.md" output:

Output of script extracting links from chapters and representing them cleanly

Example of ./scripts/links.py "content/posts/strategy-book/*.md" --grouped output:

Second format of output from script extracting links, this time grouping by link instead of file

Altogether, this is a super simple script that I could have written in thirty minutes or so, but this allowed me to write it in less than ten minutes, and get back to actually editing with the remaining twenty.

It’s also quite helpful for solving the intended problem of imbalanced references to strategies. Here you can see I initially had 17 references to the Uber migration strategy, which was one of the first strategies I documented for the book.

17 references to the Uber service migration strategy

On the other hand, the strategy for Stripe’s Sorbet only had three links because it was one of the last two chapters I finished writing.

3 references to the Stripe Sorbet strategy

It’s natural that I referenced existing strategies more frequently than unwritten strategies over the course of drafting chapters, but it makes the book feel a bit lopsided when read, and this script has helped me address the imbalance. This is something I didn’t do in Staff Engineer, but wish I had, as I ended up leaning a bit too heavily on early stories and mentioned later stories less frequently.

Making images consistent for book.

2025-04-06 19:00:00

TODO: fix TODOs below

After working on diversifying the strategies I linked as examples in #eng-strategy-book, the next problem I wanted to tackle was achieving a consistent visual appearance across all images included in the book. There are quite a few images, so I wanted to start by creating a tool that generates a static HTML page of all included images, to facilitate reviewing them at once.

To write the script, I wrote a short prompt describing the problem, pasted in the script I’d previously written for consistent linking, and saw what I’d get.

This worked on the first try, after which I made a few tweaks to include more information. That culminates in images.py which allowed me to review all images in the book.
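The gallery logic is easy to sketch. This is an illustrative guess at how such a script might work, not the actual images.py:

```python
import glob
import html
import re

# Matches Markdown image references of the form ![alt](src)
IMG_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")

def build_gallery(pattern):
    """Render every Markdown image reference in files matching the
    glob pattern into one static HTML page, captioning each image
    with its alt text and source file for side-by-side review."""
    figures = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for alt, src in IMG_RE.findall(f.read()):
                figures.append(
                    "<figure><img src='{}' width='400'>"
                    "<figcaption>{} ({})</figcaption></figure>".format(
                        html.escape(src), html.escape(alt), html.escape(path)
                    )
                )
    return "<html><body>\n" + "\n".join(figures) + "\n</body></html>"
```

Writing the result to a file and opening it in a browser puts every image from the book on one scrollable page, which is what makes stylistic inconsistencies jump out.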

This screenshot gives a sense of the general problem.

Screenshot of various imagines in my new book that I need to make more visually consistent

Reviewing the full set of images, I identified two categories of problems. First, I had one model image that was done via Figma instead of Excalidraw, and consequently looked very different.

Inconsistent screenshot example

Then the question was whether to standardize on that style or on the Excalidraw style.

Inconsistent screenshot example

There was only one sequence diagram in Figma style, so ultimately it was the easier choice to make the Figma one follow the Excalidraw style.

TODO: add image of updated image using Excalidraw style

The second problem was deciding how to represent Wardley maps consistently. My starting point was two very inconsistent varieties of Wardley maps, neither of which was ideal for including in a book.

The first was the output from Mapkeep, which is quite good overall but not optimized for printing (too much empty whitespace).

Inconsistent screenshot example

Then I had Figma versions I’d made as well.

Inconsistent screenshot example

In the Figma versions that I’d made, I had tried to make better use of whitespace, and I think I succeeded. That said, they looked pretty bad altogether. In this case I was pretty unhappy with both options, so I decided to spend some time thinking about it.

For inspiration, I decided to review how maps were represented in two printed books. First in Simon Wardley’s book.

TODO: example from the wardley mapping book and

Then in TODO: remember name…

TODO: example from other mapping book

Reflecting on both of those.. TODO: finish

TODO: actually finish making them consistent, lol

TODO: conclusion about this somehow

Finally, this is another obvious script that I should have written for Staff Engineer. Then again, that is a significantly less image heavy book, so it probably wouldn’t have mattered too much.

How to resource Engineering-driven projects at Calm? (2020)

2025-04-03 20:00:00

One of the recurring challenges in any organization is how to split your attention across long-term and short-term problems. Your software might be struggling to scale with ramping user load while you also know that a series of meaningful security vulnerabilities need to be closed sooner rather than later. How do you balance across them?

These sorts of balance questions occur at every level of an organization. A particularly frequent format is the debate between Product and Engineering about how much time goes towards developing new functionality versus improving what’s already been implemented. In 2020, Calm was growing rapidly as we navigated the COVID-19 pandemic, and the team was struggling to make improvements, as they felt saturated by incoming new requests. This strategy for resourcing Engineering-driven projects was our attempt to solve that problem.

This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Reading this document

To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore.

More detail on this structure in Making a readable Engineering Strategy document.

Policy & Operation

Our policies for resourcing Engineering-driven projects are:

  • We will protect one Eng-driven project per product engineering team, per quarter. These projects should represent a maximum of 20% of the team’s bandwidth. Each project must advance a measurable metric, and execution must be designed to show progress on that metric within 4 weeks.
  • These projects must adhere to Calm’s existing Engineering strategies.
  • We resource these projects first in the team’s planning, rather than last. However, only concrete projects are resourced. If there are no concrete proposals, then the team won’t have time budgeted for Engineering-driven work.
  • Each team’s engineering manager is responsible for deciding on the project, ensuring the project is valuable, and pushing back on attempts to defund it.
  • Project selection does not require CTO approval, but you should escalate to the CTO if there’s friction or disagreement.
  • CTO will review Engineering-driven projects each quarter to summarize their impact and provide feedback to teams’ engineering managers on project selection and execution. They will also review teams that did not perform a project to understand why not.

As we’ve communicated this strategy, we’ve frequently gotten conceptual alignment that this sounds reasonable, coupled with uncertainty about what sort of projects should actually be selected. At some level, this ambiguity is an acknowledgment that we believe teams will identify the best opportunities bottom-up. However, we also wanted to give two concrete examples of projects we’re greenlighting in the first batch:

  • Code-free media release: historically, we’ve needed to make a number of pull requests to add, organize, and release new pieces of media. This is high urgency work, but Engineering doesn’t exercise much judgment while doing it, and manual steps often create errors. We aim to track and eliminate these pull requests, while also increasing the number of releases that can be facilitated without scaling the content release team.

  • Machine-learning content placement: developing new pieces of media is often a multi-week or month process. After content is ready to release, there’s generally a debate on where to place the content. This matters for the company, as this drives engagement with our users, but it matters even more to the content creator, who is generally evaluated in terms of their content’s performance.

    This often leads to Product and Engineering getting caught up in debates about how to surface particular pieces of content. This project aims to improve user engagement by surfacing the best content for their interests, while also giving the Content team several explicit positions to highlight content without Product and Engineering involvement.

Although these projects are similar, it’s not intended that all Engineering-driven projects are of this variety. Instead it’s happenstance based on what the teams view as their biggest opportunities today.

Diagnosis

Our assessment of the current situation at Calm is:

  • We are spending a high percentage of our time on urgent but low engineering value tasks. Most significantly, about one-third of our time is going into launching, debugging, and changing content that we release into our product. Engineering is involved due to implementation limitations, not because our involvement adds inherent value. (We mostly just make releases slowly and inadvertently introduce bugs of our own.)

  • We have a bunch of fairly clear ideas around improving the platform to empower the Content team to speed up releases, and to eliminate the Engineering involvement. However, we’ve struggled to find time to implement them, or to validate that these ideas will work.

  • If we don’t find a way to prioritize, and succeed at implementing, a project to reduce Engineering involvement in Content releases, we will struggle to support our goals to release more content and to develop more product functionality this year.

  • Our Infrastructure team has been able to plan and make these kinds of investments stick. However, when we attempt these projects within our Product Engineering teams, things don’t go that well. We are good at getting them onto the initial roadmap, but then they get deprioritized due to pressure to complete other projects.

  • Our Engineering team of 20 engineers is not very fungible, largely due to specialization across roles like iOS, Android, Backend, Frontend, Infrastructure, and QA. We would like to staff these kinds of projects onto the Infrastructure team, but in practice that team does not have the product development experience to implement this kind of project.

  • We’ve discussed spinning up a Platform team, or moving product engineers onto Infrastructure, but that would either (1) break our goal to maintain joint pairs between Product Managers and Engineering Managers, or (2) be indistinguishable from prioritizing within the existing team because it would still have the same Product Manager and Engineering Manager pair.

  • Company planning is organic, occurring across many discussions with limited structured process. If we make a decision to invest in one project, it’s easy for that project to get deprioritized in a side discussion missing context on why the project is important.

    These reprioritization discussions happen both in executive forums and in team-specific forums. There’s imperfect awareness across these two sorts of forums.

Explore

Prioritization is a deep topic with a wide variety of popular solutions. For example, many software companies rely on “RICE” scoring, calculating priority as (Reach × Impact × Confidence) / Effort. At the other extreme are complex methodologies like the Scaled Agile Framework.
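
As an illustration of how lightweight the RICE calculation is, here is a short sketch in Python; the backlog items and scores below are hypothetical, chosen only to show the mechanics.

```python
def rice_score(reach, impact, confidence, effort):
    """Priority = (Reach * Impact * Confidence) / Effort."""
    return (reach * impact * confidence) / effort

# Hypothetical backlog items: reach is users affected per quarter,
# impact is scored 0.25-3, confidence is 0-1, effort is person-months.
backlog = {
    "content-release-tooling": rice_score(8000, 2, 0.8, 3),
    "new-onboarding-flow": rice_score(2000, 3, 0.5, 4),
}

# Rank items from highest to lowest priority score.
ranked = sorted(backlog, key=backlog.get, reverse=True)
```

The arithmetic is trivial; as the surrounding examples suggest, the hard part is getting an organization to consistently honor the resulting ranking.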

In addition to generalized planning solutions, many companies carve out special mechanisms to solve for particular prioritization gaps. Google historically offered 20% time to allow individuals to work on experimental projects that didn’t align directly with top-down priorities. Stripe’s Foundation Engineering organization developed the concept of Foundational Initiatives to prioritize cross-pillar projects with long-term implications, which otherwise struggled to get prioritized within the team-led planning process.

All these methods have clear examples of succeeding, and equally clear examples of struggling. Where these initiatives have succeeded, they had an engaged executive sponsoring the practice’s rollout, including triaging escalations when the rollout inconvenienced supporters of the prior method. Where they lacked a sponsor, or were misaligned with the company’s culture, these methods have consistently failed despite having previously succeeded elsewhere.

Systems model of API deprecation

2025-04-01 20:00:00

In How should Stripe deprecate APIs?, the diagnosis depends on the claim that deprecating APIs is a significant cause of customer churn. While there is internal data that can be used to correlate deprecation with churn, it’s also valuable to build a model to help us decide if we believe that correlation and causation are aligned in this case.

In this chapter, we’ll cover:

  1. What we learn from modeling API deprecation’s impact on user retention
  2. Developing a system model using the lethain/systems package on GitHub. That model is available in the lethain/eng-strategy-models repository
  3. Exercising that model to learn from it

Time to investigate whether it’s reasonable to believe that API deprecation is a major influence on user retention and churn.

This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Learnings

In an initial model that has a 10% baseline for customer churn per round, reducing the share of customers experiencing API deprecation from 50% to 10% per round only increases the steady state of integrated customers by about 5%.

Impact of 10% and 50% API deprecation on integrated customers

However, if we eliminate the baseline for customer churn entirely, then we see a massive difference between a 10% and 50% rate of API deprecation.

Impact of rates of API deprecation with zero baseline churn

The biggest takeaway from this model is that eliminating API-deprecation churn alone won’t significantly increase the number of integrated customers. However, we also can’t fully benefit from reducing baseline churn without simultaneously reducing API deprecations. Meaningfully increasing the number of integrated customers requires lowering both sorts of churn in tandem.

Sketch

We’ll start by sketching the model’s happiest path: potential customers flowing into engaged customers and then becoming integrated customers. This represents a customer who decides to integrate with Stripe’s APIs, and successfully completes that integration process.

Happiest path for Stripe API integration

Business would be good if that were the entire problem space. Unfortunately, customers do occasionally churn. This churn is represented in two ways:

  1. baseline churn where integrated customers leave Stripe for any number of reasons, including things like dissolution of their company
  2. experience deprecation followed by deprecation-influenced churn, which represent the scenario where a customer decides to leave after an API they use is deprecated

There is also a flow for reintegration, where a customer impacted by API deprecation can choose to update their integration to comply with the API changes.

Pulling things together, the final sketch shows five stocks and six flows.

Final version of systems model for API deprecation

You could imagine modeling additional dynamics, such as recovery of churned customers, but it seems unlikely that would significantly influence our understanding of how API deprecation impacts churn.

Reason

In terms of acquiring customers, the most important flows are customer acquisition and initial integration with the API. Optimizing those flows will increase the number of existing integrations.

The flows driving churn are baseline churn, and the combination of API deprecation and deprecation-influenced churn. It’s difficult to move baseline churn for a payments API, as many churning customers leave due to company dissolution. From a revenue-weighted perspective, baseline churn is largely driven by non-technical factors, primarily pricing. In either case, it’s challenging to impact this flow without significantly lowering margin.

Engineering decisions, on the other hand, have a significant impact on both the number of API deprecations, and on the ease of reintegration after a migration. Because the same work to support reintegration also supports the initial integration experience, that’s a promising opportunity for investment.

Model

You can find the full implementation of this model on GitHub if you want to see the complete code rather than the emphasized snippets below.

Now that we have identified the most interesting avenues for experimentation, it’s time to develop the model to evaluate which flows are most impactful.

Our initial model specification is:

# User Acquisition Flow
[PotentialCustomers] > EngagedCustomers @ 100
# Initial Integration Flow
EngagedCustomers > IntegratedCustomers @ Leak(0.5)
# Baseline Churn Flow
IntegratedCustomers > ChurnedCustomers @ Leak(0.1)
# Experience Deprecation Flow
IntegratedCustomers > DeprecationImpactedCustomers @ Leak(0.5)
# Reintegrated Flow
DeprecationImpactedCustomers > IntegratedCustomers @ Leak(0.9)
# Deprecation-Influenced Churn
DeprecationImpactedCustomers > ChurnedCustomers @ Leak(0.1)

Whether these are reasonable values depends largely on how we think about the length of each round. If a round were a month, then assuming half of integrated customers would experience an API deprecation would be quite extreme. If we assumed it were a year, it would still be high, but there are certainly some API providers that routinely deprecate at that rate. (From my personal experience, I can say with confidence that Facebook’s Ads API deprecated at least one important field on a quarterly basis in the 2012-2014 period.)

Admittedly, for a payments API this would be a high rate, and is intended primarily as a contrast with more reasonable values in the exercise section below.
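
To experiment with the spec outside the systems package, here is a minimal hand-rolled Python translation of the same stocks and flows. It is a sketch: it computes each round’s flows from start-of-round stock values, which only approximates the package’s evaluation order, so exact numbers will differ from the charts in this chapter even though the qualitative dynamics match.

```python
def simulate(rounds=200, baseline_churn=0.1, deprecation=0.5):
    """Simulate the API deprecation model's stocks over a number of rounds.

    PotentialCustomers is treated as an infinite source, matching the
    [PotentialCustomers] notation in the spec.
    """
    engaged = integrated = impacted = churned = 0.0
    for _ in range(rounds):
        # Compute all flows from start-of-round values, then apply them.
        acquired = 100                             # [PotentialCustomers] > EngagedCustomers @ 100
        integrating = 0.5 * engaged                # EngagedCustomers > IntegratedCustomers @ Leak(0.5)
        base_churn = baseline_churn * integrated   # IntegratedCustomers > ChurnedCustomers
        deprecated = deprecation * integrated      # IntegratedCustomers > DeprecationImpactedCustomers
        reintegrated = 0.9 * impacted              # DeprecationImpactedCustomers > IntegratedCustomers @ Leak(0.9)
        dep_churn = 0.1 * impacted                 # DeprecationImpactedCustomers > ChurnedCustomers @ Leak(0.1)

        engaged += acquired - integrating
        integrated += integrating + reintegrated - base_churn - deprecated
        impacted += deprecated - reintegrated - dep_churn
        churned += base_churn + dep_churn
    return {"integrated": integrated, "impacted": impacted, "churned": churned}

# The four scenarios exercised in this chapter.
baseline = simulate()
low_dep = simulate(deprecation=0.1)
no_churn = simulate(baseline_churn=0.0)
no_churn_low_dep = simulate(baseline_churn=0.0, deprecation=0.1)
```

Under these assumptions the scenarios order the same way the charts do: cutting deprecation alone helps modestly, removing baseline churn helps more, and doing both helps dramatically.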

Exercise

Our goal with exercising this model is to understand how much API deprecation impacts customer churn. We’ll start by charting the initial baseline, then move to compare it with a variety of scenarios until we build an intuition for how the lines move.

Initial model stabilizing integrated customers around 1,000 customers

The initial chart stabilizes in about forty rounds, maintaining about 1,000 integrated customers and 400 customers dealing with deprecated APIs. Now let’s change the experience deprecation flow to impact significantly fewer customers:

# Initial setting with 50% experiencing deprecation per round
IntegratedCustomers > DeprecationImpactedCustomers @ Leak(0.5)
# Less deprecation, only 10% experiencing per round
IntegratedCustomers > DeprecationImpactedCustomers @ Leak(0.1)

After those changes, we can compare the two scenarios.

Impact of 10% and 50% API deprecation on integrated customers

Lowering the deprecation rate significantly reduces the number of companies dealing with deprecations at any given time, but it has a relatively small impact on increasing the steady state for integrated customers. This must mean that another flow is significantly impacting the size of the integrated customers stock.

Since there’s only one other flow impacting that stock, baseline churn, that’s the one to exercise next. Let’s set the baseline churn flow to zero to compare that with the initial model:

# Initial Baseline Churn Flow
IntegratedCustomers > ChurnedCustomers @ Leak(0.1)
# Zeroed out Baseline Churn Flow
IntegratedCustomers > ChurnedCustomers @ Leak(0.0)

These results make a compelling case that baseline churn is dominating the impact of deprecation. With no baseline churn, the number of integrated customers stabilizes at around 1,750, as opposed to around 1,000 for the initial model.

Impact of eliminating baseline churn from model

Next, let’s compare two scenarios without baseline churn, where one has high API deprecation (50%) and the other has low API deprecation (10%).

Impact of rates of API deprecation with zero baseline churn

Comparing the two scenarios without baseline churn, an API deprecation rate of 10% leads to about 6,000 integrated customers, as opposed to 1,750 for a 50% rate of API deprecation. More importantly, in the 10% scenario, the integrated customers line shows no sign of flattening, and continues to grow over time rather than stabilizing.

The takeaway here is that significantly reducing either baseline churn or API deprecation magnifies the benefits of reducing the other. These results also reinforce the value of treating churn reduction as a system-level optimization, not merely a collection of discrete improvements.
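
That interaction effect can be checked numerically with a small 2×2 sweep over both rates, using a compact re-implementation of the model (flows are computed from start-of-round values, so the exact figures are illustrative rather than matching the charts):

```python
def steady_integrated(churn, dep, rounds=300):
    # Compact stock-and-flow loop: engaged -> integrated -> (churned | impacted),
    # with 90% of impacted customers reintegrating and 10% churning each round.
    e = i = d = 0.0
    for _ in range(rounds):
        e, i, d = (e + 100 - 0.5 * e,                       # acquisition and integration
                   i + 0.5 * e + 0.9 * d - (churn + dep) * i,  # churn, deprecation, reintegration
                   dep * i)                                 # newly deprecation-impacted
    return i

# Sweep baseline churn against deprecation rate.
grid = {(c, dp): steady_integrated(c, dp) for c in (0.1, 0.0) for dp in (0.5, 0.1)}

# Relative gain from cutting deprecation 50% -> 10%, with and without baseline churn.
gain_with_churn = grid[(0.1, 0.1)] / grid[(0.1, 0.5)]
gain_without_churn = grid[(0.0, 0.1)] / grid[(0.0, 0.5)]
```

The gain from reducing deprecation comes out several times larger once baseline churn is gone, which is exactly the system-level interaction the takeaway describes.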