2025-04-24 21:00:00
While Stripe is a widely admired company for things like its creation of the Sorbet type checker, I personally think that Stripe’s most interesting strategy work is also among its most subtle: its willingness to significantly prioritize API stability.
This strategy is almost invisible externally. Internally, discussions around it were frequent and detailed, but mostly confined to dedicated API design conversations. API stability isn’t just a technical design quirk; in an API-driven business it’s a foundational decision, and I believe it is one of the unsung heroes of Stripe’s business success.
This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.
To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore.
More detail on this structure in Making a readable Engineering Strategy document.
Our policies for managing API changes are:
Design for long API lifetime. APIs are not inherently durable. Instead we have to design thoughtfully to ensure they can support change. When designing a new API, build a test application that doesn’t use this API, then migrate to the new API. Consider how integrations might evolve as applications change. Perform these migrations yourself to understand potential friction with your API. Then think about the future changes that we might want to implement on our end. How would those changes impact the API, and how would they impact the application you’ve developed?
At this point, take your API to API Review for initial approval as described below. Following that approval, identify a handful of early adopter companies who can place additional pressure on your API design, and test with them before releasing the final, stable API.
All new and modified APIs must be approved by API Review.
API changes may not be enabled for customers prior to API Review approval.
Change requests should be sent to the api-review email group. For examples of prior art, review the api-review archive for prior requests and the feedback they received.
All requests must include a written proposal. Most requests will be approved asynchronously by a member of API Review. Complex or controversial proposals will require live discussions to ensure API Review members have sufficient context before making a decision.
We never deprecate APIs without an unavoidable requirement to do so. Even if it’s technically expensive to maintain support, we incur that support cost. To be explicit, we define API deprecation as any change that would require customers to modify an existing integration.
If such a change were to be approved as an exception to this policy, it must first be approved by the API Review, followed by our CEO. One example where we granted an exception was the deprecation of TLS 1.2 support due to PCI compliance obligations.
When significant new functionality is required, we add a new API.
For example, we created /v1/subscriptions to support those workflows rather than extending /v1/charges to add subscriptions support.
With the benefit of hindsight, a good example of this policy in action was the introduction of the Payment Intents APIs to maintain compliance with Europe’s Strong Customer Authentication requirements. Even in that case, the charge API continued to work as it did previously, albeit only for non-European Union payments.
We manage this policy’s implied technical debt via an API translation layer. We release changed APIs into versions, tracked in our API version changelog. However, we only maintain one implementation internally, which is the implementation of the latest version of the API. On top of that implementation, a series of version transformations are maintained, which allow us to support prior versions without maintaining them directly. While this approach doesn’t eliminate the overhead of supporting multiple API versions, it significantly reduces complexity by enabling us to maintain just a single, modern implementation internally.
All API modifications must also update the version transformation layers to allow the new version to coexist peacefully with prior versions.
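To make the mechanics concrete, here is a minimal sketch of such a translation layer in Python, assuming a chain of per-version downgrade transforms applied on top of the single modern implementation. The versions, fields, and function names are hypothetical illustrations, not Stripe’s actual internals.

# Hypothetical sketch: each entry undoes one versioned change, letting a
# single modern implementation serve every pinned API version.
Response = dict

def drop_risk_score(response: Response) -> Response:
    """The 2025-01-15 version added risk_score; hide it from older versions."""
    return {k: v for k, v in response.items() if k != "risk_score"}

def restore_amount(response: Response) -> Response:
    """The 2024-06-20 version renamed amount to amount_subtotal; undo it."""
    response = dict(response)
    response["amount"] = response.pop("amount_subtotal")
    return response

DOWNGRADES = {"2025-01-15": drop_risk_score, "2024-06-20": restore_amount}

def render(response: Response, requested_version: str) -> Response:
    """Serve the latest implementation, then walk transforms back to the
    version the caller has pinned."""
    for version in sorted(DOWNGRADES, reverse=True):
        if version > requested_version:
            response = DOWNGRADES[version](response)
    return response

latest = {"amount_subtotal": 1000, "risk_score": 0.2, "currency": "usd"}
print(render(latest, "2024-01-01"))  # {'currency': 'usd', 'amount': 1000}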
In the future, SDKs may allow us to soften this policy. While a significant number of our customers have direct integrations with our APIs, that number has dropped significantly over time. Instead, most new integrations are performed via one of our official API SDKs.
We believe that in the future, it may be possible for us to make more backwards incompatible changes because we can absorb the complexity of migrations into the SDKs we provide. That is certainly not the case yet today.
Our diagnosis of the impact of API changes and deprecations on our business is:
If you are a small startup composed of mostly engineers, integrating a new payments API seems easy. However, for a small business without dedicated engineers—or a larger enterprise involving numerous stakeholders—handling external API changes can be particularly challenging.
Even if this is only marginally true, we’ve modeled the impact of minimizing API changes on long-term revenue growth, and it has a significant impact, unlocking our ability to benefit from other churn reduction work.
While we believe API instability directly creates churn, we also believe that API stability directly retains customers by increasing the overhead of migrating to another provider. We believe that hypergrowth customers in particular are unlikely to change payments API providers without a concrete motivation forcing them to revisit their integration, such as an API deprecation or a pricing change.
We are aware of relatively few companies that provide long-term API stability in general, and particularly few for complex, dynamic areas like payments APIs. We can’t assume that companies that make API changes are ill-informed. Rather it appears that they experience a meaningful technical debt tradeoff between the API provider and API consumers, and aren’t willing to consistently absorb that technical debt internally.
Future compliance or security requirements—along the lines of our upgrade from TLS 1.2 to TLS 1.3 for PCI—may necessitate API changes. There may also be new tradeoffs exposed as we enter new markets with their own compliance regimes. However, we have limited ability to predict these changes at this point.
2025-04-24 20:00:00
In How should Stripe deprecate APIs?, the diagnosis depends on the claim that deprecating APIs is a significant cause of customer churn. While there is internal data that can be used to correlate deprecation with churn, it’s also valuable to build a model to help us decide if we believe that correlation and causation are aligned in this case.
In this chapter, we’ll investigate whether it’s reasonable to believe that API deprecation is a major influence on user retention and churn.
This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.
In an initial model that has 10% baseline for customer churn per round, reducing customers experiencing API deprecation from 50% to 10% per round only increases the steady state of integrated customers by about 5%.
However, if we eliminate the baseline for customer churn entirely, then we see a massive difference between a 10% and 50% rate of API deprecation.
The biggest takeaway from this model is that eliminating API-deprecation churn alone won’t significantly increase the number of integrated customers. However, we also can’t fully benefit from reducing baseline churn without simultaneously reducing API deprecations. Meaningfully increasing the number of integrated customers requires lowering both sorts of churn in tandem.
We’ll start by sketching the model’s happiest path: potential customers flowing into engaged customers and then becoming integrated customers. This represents a customer who decides to integrate with Stripe’s APIs, and successfully completes that integration process.
Business would be good if that were the entire problem space. Unfortunately, customers do occasionally churn. This churn is represented in two ways: baseline churn, where integrated customers leave Stripe for any number of reasons, including things like dissolution of their company; and experience deprecation followed by deprecation-influenced churn, which represents the scenario where a customer decides to leave after an API they use is deprecated. There is also a flow for reintegration, where a customer impacted by API deprecation can choose to update their integration to comply with the API changes.
Pulling things together, the final sketch shows five stocks and six flows.
You could imagine modeling additional dynamics, such as recovery of churned customers, but it seems unlikely that would significantly influence our understanding of how API deprecation impacts churn.
In terms of acquiring customers, the most important flows are customer acquisition and initial integration with the API. Optimizing those flows will increase the number of existing integrations.
The flows driving churn are baseline churn, and the combination of API deprecation and deprecation-influenced churn. It’s difficult to move baseline churn for a payments API, as many churning customers leave due to company dissolution. From a revenue-weighted perspective, baseline churn is largely driven by non-technical factors, primarily pricing. In either case, it’s challenging to impact this flow without significantly lowering margin.
Engineering decisions, on the other hand, have a significant impact on both the number of API deprecations, and on the ease of reintegration after a migration. Because the same work to support reintegration also supports the initial integration experience, that’s a promising opportunity for investment.
You can find the full implementation of this model on GitHub if you want to see the full model rather than these emphasized snippets.
Now that we have identified the most interesting avenues for experimentation, it’s time to develop the model to evaluate which flows are most impactful.
Our initial model specification is:
# User Acquisition Flow
[PotentialCustomers] > EngagedCustomers @ 100
# Initial Integration Flow
EngagedCustomers > IntegratedCustomers @ Leak(0.5)
# Baseline Churn Flow
IntegratedCustomers > ChurnedCustomers @ Leak(0.1)
# Experience Deprecation Flow
IntegratedCustomers > DeprecationImpactedCustomers @ Leak(0.5)
# Reintegrated Flow
DeprecationImpactedCustomers > IntegratedCustomers @ Leak(0.9)
# Deprecation-Influenced Churn
DeprecationImpactedCustomers > ChurnedCustomers @ Leak(0.1)
Whether these are reasonable values depends largely on how we think about the length of each round. If a round were a month, then assuming half of integrated customers would experience an API deprecation would be quite extreme. If a round were a year, it would still be high, but there are certainly some API providers that routinely deprecate at that rate. (From my personal experience, I can say with confidence that Facebook’s Ads API deprecated at least one important field on a quarterly basis in the 2012-2014 period.)
Admittedly, for a payments API this would be a high rate, and is intended primarily as a contrast with more reasonable values in the exercise section below.
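To build intuition for those Leak semantics, here is a minimal, self-contained Python sketch of the same stocks and flows. It is a simplification rather than the linked implementation, and update ordering differs between simulators, so its steady-state numbers will only approximate the charts below.

# Self-contained approximation of the five-stock model above. Leak(p)
# moves fraction p of the source stock per round; all flows are computed
# from start-of-round values, then applied simultaneously.
def simulate(rounds=40, deprecation_rate=0.5, baseline_churn=0.1):
    engaged = integrated = impacted = churned = 0.0
    for _ in range(rounds):
        acquired = 100.0                            # User Acquisition
        integrating = engaged * 0.5                 # Initial Integration
        churning = integrated * baseline_churn      # Baseline Churn
        deprecated = integrated * deprecation_rate  # Experience Deprecation
        reintegrating = impacted * 0.9              # Reintegrated
        dep_churn = impacted * 0.1                  # Deprecation-Influenced Churn
        engaged += acquired - integrating
        integrated += integrating + reintegrating - churning - deprecated
        impacted += deprecated - reintegrating - dep_churn
        churned += churning + dep_churn
    return round(integrated), round(impacted)

print(simulate())                      # initial scenario
print(simulate(deprecation_rate=0.1))  # less deprecation
print(simulate(baseline_churn=0.0))    # no baseline churn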
Our goal with exercising this model is to understand how much API deprecation impacts customer churn. We’ll start by charting the initial baseline, then move to compare it with a variety of scenarios until we build an intuition for how the lines move.
The initial chart stabilizes in about forty rounds, maintaining about 1,000 integrated customers and 400 customers dealing with deprecated APIs. Now let’s change the experience deprecation flow to impact significantly fewer customers:
# Initial setting with 50% experiencing deprecation per round
IntegratedCustomers > DeprecationImpactedCustomers @ Leak(0.5)
# Less deprecation, only 10% experiencing per round
IntegratedCustomers > DeprecationImpactedCustomers @ Leak(0.1)
After those changes, we can compare the two scenarios.
Lowering the deprecation rate significantly reduces the number of companies dealing with deprecations at any given time, but it has a relatively small impact on increasing the steady state for integrated customers. This must mean that another flow is significantly impacting the size of the integrated customers stock.
Since there’s only one other flow impacting that stock, baseline churn, that’s the one to exercise next. Let’s set the baseline churn flow to zero to compare that with the initial model:
# Initial Baseline Churn Flow
IntegratedCustomers > ChurnedCustomers @ Leak(0.1)
# Zeroed out Baseline Churn Flow
IntegratedCustomers > ChurnedCustomers @ Leak(0.0)
These results make a compelling case that baseline churn is dominating the impact of deprecation. With no baseline churn, the number of integrated customers stabilizes at around 1,750, as opposed to around 1,000 for the initial model.
Next, let’s compare two scenarios without baseline churn, where one has high API deprecation (50%) and the other has low API deprecation (10%).
In the case of two scenarios without baseline churn, we can see having an API deprecation rate of 10% leads to about 6,000 integrated customers, as opposed to 1,750 for a 50% rate of API deprecation. More importantly, in the 10% scenario, the integrated customers line shows no sign of flattening, and continues to grow over time rather than stabilizing.
The takeaway here is that significantly reducing either baseline churn or API deprecation magnifies the benefits of reducing the other. These results also reinforce the value of treating churn reduction as a system-level optimization, not merely a collection of discrete improvements.
2025-04-22 19:00:00
At work, we’ve been building agentic workflows to support our internal Delivery team on various accounting, cash reconciliation, and operational tasks. To better guide that project, I wrote my own simple workflow tool as a learning project in January. Since then, the Model Context Protocol (MCP) has become a prominent solution for writing tools for agents, and I decided to spend some time writing an MCP server over the weekend to build a better intuition.
The output of that project is library-mcp, a simple MCP server that you can use locally with tools like Claude Desktop to explore Markdown knowledge bases. I’m increasingly enamored with the idea of “datapacks” that I load into context windows with relevant work, and I am currently working to release my upcoming book in a “datapack” format that’s optimized for usage with LLMs. library-mcp allows any author to dynamically build datapacks relevant to their current question, as long as they have access to their content in Markdown files.
A few screenshots tell the story. First, here’s a list of the tools provided by this server. These tools give a variety of ways to search through content and pull that content into your context window.
Each time you access a tool for the first time in a chat, Claude Desktop prompts you to verify you want that tool to operate. This is a nice feature, and I think it’s particularly important that approval is done at the application layer, not at the agent layer. If agents approve their own usage, well, security is going to be quite a mess.
Here’s an example of retrieving all the tags to figure out what I’ve written about. You could do a follow-up like, “Get me posts I’ve written about ‘python’” after seeing the tags. The interesting thing here is you can combine retrieval and intelligence. For example, you could ask “Get me all the tags I’ve written, and find those that seem related to software architecture” and it does a good job of filtering.
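For a sense of what one of these tools looks like in code, here is a hedged sketch of a tag-listing tool built with the official Python MCP SDK’s FastMCP interface. The tool name, front matter format, and behavior are illustrative assumptions rather than library-mcp’s actual implementation.

# Illustrative MCP server exposing one tool over stdio; names are
# hypothetical, not library-mcp's real interface.
import re
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("markdown-library")

@mcp.tool()
def list_tags(content_dir: str) -> list[str]:
    """Collect unique tags from 'tags: [a, b]' front matter in Markdown files."""
    tags: set[str] = set()
    for path in Path(content_dir).rglob("*.md"):
        match = re.search(r"^tags:\s*\[(.*?)\]", path.read_text(), re.MULTILINE)
        if match:
            tags.update(t.strip().strip('"') for t in match.group(1).split(","))
    return sorted(tags)

if __name__ == "__main__":
    mcp.run()  # stdio transport, for clients like Claude Desktop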
Finally, here’s an example of actually using a datapack to answer a question. In this case, it’s evaluating how my writing has changed between 2018 and 2025.
More practically, I’ve already experimented with friends writing their CTO onboarding plans with Your first 90 days as CTO as a datapack in the context window, and you can imagine the right datapacks allowing you to go much further. Writing a company policy with all the existing policies in a datapack, along with a document about how to write policies effectively, for example, would improve consistency and be likely to identify conflicting policies.
Altogether, I am currently enamored with the vision of useful datapacks facilitating creation, and hope that library-mcp is a useful tool for folks as we experiment our way towards this idea.
2025-04-20 19:00:00
Ahead of announcing the title and publisher of my thus-far-untitled book on engineering strategy in the next week or two, I put together a website for its content. That site is pretty much the same format as this blog, but with some improvements, like better mobile rendering on / than this blog has historically had.
After finishing that work, I ported the improvements back to lethain.com, but also decided to bring them to staffeng.com. That was slightly trickier because, unlike this blog, StaffEng was historically a Gatsby app. (Why a Gatsby app? Because Calm was using Gatsby for our web frontend and I wanted to get some experience with it.) Over the weekend, I took some time to migrate it to Hugo and apply the same enhancements, which you can now see in the lethain:staff-eng repository or on staffeng.com.
Here’s a screenshot of the old version.
Then here’s a screenshot of the updated version.
Overall, I think it’s slightly easier to read, and I took it as a chance to update the various links. For example, I removed the newsletter link and pointed that to this blog’s newsletter instead, given that one’s mailing list went quiet a long time ago.
Speaking of going quiet, I also brought these updates to infraeng.dev, which is the very-stuck-in-time site for the book I may-or-may-not one day write about infrastructure engineering. That means that I now have four essentially equivalent Hugo sites running different content websites: this blog, staffeng.com, infraeng.dev, and the site for the upcoming book. All of these build and deploy automatically onto GitHub Pages, which has been an extremely easy, reliable workflow for me.
While I was working on this, someone asked me why I don’t just write my own blog server to host my blogs. The answer here is pretty straightforward. I’ve written three blog servers for my blog over the years. The first two were in Python, and the last one was in Go. They all worked well enough, but maintaining them eventually became a pain point: they required a real build pipeline and depended on libraries that could develop security issues. Even in the best case, the containers they ran in would get end-of-lifed periodically as Ubuntu versions were deprecated.
What I’ve slowly learned from that is that, as a frequent writer, you really want your content to live somewhere that can work properly for decades. Even small maintenance costs can be prohibitive over time, and I’ve seen some good blogs disappear rather than, say, figure out a WordPress upgrade. Individually, these are all minor, but over decades they can really add up. This is also my argument against using hosted providers: I’m sure Substack will be around in five years, but I have no idea whether it will be around in twenty. I know that I’ll still be writing then, and will want my previous writing to still be accessible.
2025-04-17 21:00:00
Many hypergrowth companies of the 2010s battled increasing complexity in their codebase by decomposing their monoliths. Stripe was somewhat of an exception, largely delaying decomposition until it had grown beyond three thousand engineers and had accumulated a decade of development in its core Ruby monolith. Even now, significant portions of their product are maintained in the monolithic repository, and it’s safe to say this was only possible because of Sorbet’s impact.
Sorbet is a custom static type checker for Ruby that was initially designed and implemented by Stripe engineers on their Product Infrastructure team. Stripe’s Product Infrastructure had similar goals to other companies’ Developer Experience or Developer Productivity teams, but it focused on improving productivity through changes in the internal architecture of the codebase itself, rather than relying solely on external tooling or processes.
This strategy explains why Stripe chose to delay decomposition for so long, and how the Product Infrastructure team invested in developer productivity to deal with the challenges of a large Ruby codebase managed by a large software engineering team with low average tenure caused by rapid hiring.
Before wrapping this introduction, I want to explicitly acknowledge that this strategy was spearheaded by Stripe’s Product Infrastructure team, not by me. Although I ultimately became responsible for that team, I can’t take credit for this strategy’s thinking. Rather, I was initially skeptical, preferring an incremental migration to an existing strongly-typed programming language, either Java for library coverage or Golang for Stripe’s existing familiarity. Despite my initial doubts, the Sorbet project eventually won me over with its indisputable results.
This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.
To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore.
More detail on this structure in Making a readable Engineering Strategy document.
The Product Infrastructure team is investing in Stripe’s developer experience by:
Every six months, Product Infrastructure will select its three highest priority areas to focus on, and invest a significant majority of its energy into those. We will provide minimal support for other areas.
We commit to refreshing our priorities every half after running the developer productivity survey. We will further share our results, and priorities, in each Quarterly Business Review.
Our three highest priority areas for this half are:
Static typing is not a typical solution to developer productivity, so it requires some explanation when we say this is our highest priority area for investment. Doubly so when we acknowledge that it will take us 12-24 months of much of the team’s time to get our type checker to an effective place.
Our type checker, which we plan to name Sorbet, will allow us to continue developing within our existing Ruby codebase. It will further allow our product engineers to remain focused on developing new functionality rather than migrating existing functionality to new services or programming languages. Instead, our Product Infrastructure team will centrally absorb both the development of the type checker and the initial rollout to our codebase.
It’s possible for Product Infrastructure to take on both, despite its fixed size. We’ll rely on a hybrid approach of deep-dives to add typing to particularly complex areas, and scripts to rewrite our code’s Abstract Syntax Trees (AST) for less complex portions. In the relatively unlikely event that this approach fails, the cost to Stripe is of a small, known size: approximately six months of half the Product Infrastructure team, which is what we anticipate requiring to determine if this approach is viable.
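To make the codemod half of that hybrid approach concrete, here is a minimal sketch of an AST rewrite. Stripe’s actual scripts operated on Ruby ASTs; this illustration uses Python’s ast module instead, and the placeholder annotation it inserts is purely hypothetical.

# Language-agnostic illustration of an AST-rewriting script: find
# functions without a return annotation and insert a placeholder that a
# human (or a later pass) can tighten.
import ast

class AnnotateUntyped(ast.NodeTransformer):
    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        self.generic_visit(node)
        if node.returns is None:
            # Mark the return type as dynamic, to be tightened later.
            node.returns = ast.Name(id="object", ctx=ast.Load())
        return node

source = "def charge(amount, currency):\n    return amount\n"
tree = AnnotateUntyped().visit(ast.parse(source))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))  # def charge(amount, currency) -> object: ...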
Based on our knowledge of Facebook’s Hack project, we believe we can build a static type checker that runs locally and significantly faster than our test suite. It’s hard to make a precise guess now, but we think less than 30 seconds to type check our entire codebase, despite it being quite large. This will allow for a highly productive local development experience, even if we are not able to speed up local testing. Even if we do speed up local testing, typing would help us eliminate a category of error that testing has been unable to catch: unexpected types passed across code paths that were tested for expected scenarios but not for entirely unexpected ones.
Once the type checker has been validated, we can incrementally prioritize adding typing to the highest value places across the codebase. We do not need to wholly type our codebase before we can start getting meaningful value.
In support of these static typing efforts, we will advocate for product engineers at Stripe to begin development using the Command Query Responsibility Segregation (CQRS) design pattern, which we believe will provide high-leverage interfaces for incrementally introducing static typing into our codebase.
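As a sketch of why CQRS creates useful seams for typing, consider this minimal Python example, where all writes flow through one narrow, explicitly typed command handler and all reads return immutable projections. The names and in-memory store are illustrative assumptions, not Stripe’s design.

# Commands mutate state through one typed entry point; queries return
# read-only projections. Both seams are easy targets for a type checker.
from dataclasses import dataclass

@dataclass(frozen=True)
class CreateCharge:  # command: an intent to mutate state
    amount: int
    currency: str

@dataclass(frozen=True)
class ChargeView:  # query result: a read-only projection
    id: str
    amount: int

store: dict[str, ChargeView] = {}

def handle_create(cmd: CreateCharge) -> str:
    """Command handler: the only code path that writes to the store."""
    charge_id = f"ch_{len(store) + 1}"
    store[charge_id] = ChargeView(charge_id, cmd.amount)
    return charge_id

def get_charge(charge_id: str) -> ChargeView:
    """Query handler: reads never mutate, and return fully typed values."""
    return store[charge_id]

cid = handle_create(CreateCharge(amount=1000, currency="usd"))
print(get_charge(cid))  # ChargeView(id='ch_1', amount=1000)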
Selective test execution will allow developers to quickly run appropriate tests locally. This will allow engineers to stay in a tight local development loop, speeding up development of high quality code.
Given that our codebase is not currently statically typed, inferring which tests to run is rather challenging. With our very high test coverage, and the fact that all tests will still be run before deployment to the production environment, we believe that we can rely on statistically inferring which tests are likely to fail when a given file is modified.
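A minimal sketch of that statistical inference might look like the following, where tests are ranked by how often they historically failed alongside changes to the modified files. The history format and threshold are assumptions for illustration.

# Select tests whose historical co-failure rate with the modified files
# exceeds a threshold; everything still runs before production deploys.
from collections import defaultdict

# Each record: (files modified, tests that failed) from one past CI run.
history = [
    ({"billing.rb", "charge.rb"}, {"test_charge"}),
    ({"charge.rb"}, {"test_charge", "test_refund"}),
    ({"invoice.rb"}, set()),
]

def select_tests(modified_files, history, threshold=0.3):
    failures = defaultdict(int)  # test name -> co-failure count
    exposure = 0                 # past runs touching any modified file
    for files, failed in history:
        if files & modified_files:
            exposure += 1
            for test in failed:
                failures[test] += 1
    return {t for t, n in failures.items() if exposure and n / exposure >= threshold}

print(select_tests({"charge.rb"}, history))  # {'test_charge', 'test_refund'}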
Instrumenting test failures is our third, and lowest priority, project for this half. Our focus this half is purely on annotating errors for which we have high conviction about their source, whether infrastructure or test issues.
For escalations and issues, reach out in the #product-infra channel.
In 2017, Stripe is a company of about 1,000 people, including 400 software engineers. We aim to grow our organization by about 70% year-over-year to meet increasing demand for a broader product portfolio and to scale our existing products and infrastructure to accommodate user growth. As our production stability has improved over the past several years, we have now turned our focus towards improving developer productivity.
Our current diagnosis of our developer productivity is:
We primarily fund developer productivity for our Ruby-authoring software engineers via our Product Infrastructure team. The Ruby-focused portion of that team has about ten engineers on it today, and is unlikely to significantly grow in the future. (If we do expand, we are likely to staff non-Ruby ecosystems like Scala or Golang.)
We have two primary mechanisms for understanding our engineers’ developer experience. The first is standard productivity metrics around deploy time, deploy stability, test coverage, test time, test flakiness, and so on. The second is a twice annual developer productivity survey.
Looking at our productivity metrics, our test coverage remains extremely high, above 99% of lines, but tests are quite slow to run locally. They run quickly in our infrastructure because they are multiplexed across a large fleet of test runners.
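For intuition, that multiplexing can be as simple as deterministic sharding, where each runner independently selects a disjoint slice of the suite. A small sketch, with the hashing scheme and shard count as illustrative assumptions:

# Deterministic sharding: every runner computes the same assignment, so
# no coordinator is needed to split the suite across the fleet.
import zlib

def shard_for(test_name: str, total_shards: int) -> int:
    return zlib.crc32(test_name.encode()) % total_shards

tests = ["test_charge", "test_refund", "test_invoice", "test_payout"]
SHARDS = 2
for shard in range(SHARDS):
    mine = [t for t in tests if shard_for(t, SHARDS) == shard]
    print(f"runner {shard} runs {mine}")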
Local test runs have become slow enough that an increasing number of developers run an overly narrow subset of tests, or skip running tests entirely until after pushing their changes. They instead rely on our test servers to run against their pull request’s branch, which works well enough, but significantly slows down developer iteration because the merge, build, and test cycle takes twenty to thirty minutes to complete. By the time that cycle completes, they’ve lost their focus, and it may take several hours for them to return to addressing the results.
There is significant disagreement about whether tests are becoming flakier due to test infrastructure issues, or due to quality issues of the tests themselves. At this point, there is no trustworthy dataset that allows us to attribute between those two causes.
Feedback from the twice annual developer productivity survey supports the above diagnosis, and adds some additional nuance. Most concerning, although long-tenured Stripe engineers find themselves highly productive in our codebase, we increasingly hear in the survey that newly hired engineers with long tenures at other companies find themselves unproductive in our codebase. Specifically, they find it very difficult to determine how to safely make changes in our codebase.
Our product codebase is entirely implemented in a single Ruby monolith. There is one narrow exception, a Golang service handling payment tokenization, which we consider out of scope for two reasons. First, it is kept intentionally narrow in order to absorb our SOC1 compliance obligations. Second, developers in that environment have not raised concerns about their productivity.
Our data infrastructure is implemented in Scala. While these developers have concerns–primarily slow build times–they manage their build and deployment infrastructure independently, and the group remains relatively small.
Ruby is not a highly performant programming language, but we’ve found it sufficiently efficient for our needs. While other languages are more cost-efficient from a compute perspective, a significant majority of our spend is on real-time storage and batch computation rather than serving compute, so the savings would be modest. Performance and cost alone would not lead us to consider replacing Ruby as our core programming language.
Our Product Infrastructure team is about ten engineers, supporting about 250 product engineers. We anticipate this group growing modestly over time, but certainly sublinearly to the overall growth of product engineers.
Developers working in Golang and Scala routinely ask for more centralized support, but it’s challenging to prioritize those requests as we’re forced to consider the return on improving the experience for 240 product engineers working in Ruby vs 10 in Golang or 40 data engineers in Scala.
If we introduced more programming languages, this prioritization problem would become even more difficult, given that we are already failing to adequately support the languages we have beyond Ruby.
2025-04-10 20:00:00
One of the most memorable quotes in Arthur Miller’s Death of a Salesman comes from Uncle Ben, who describes his path to becoming wealthy as, “When I was seventeen, I walked into the jungle, and when I was twenty-one I walked out. And by God I was rich.” I wish I could describe the path to learning engineering strategy in similar terms, but by all accounts it’s a much slower path. Two decades in, I am still learning more from each project I work on. This book has aimed to accelerate your learning, but there remains a great deal that only experience can teach.
This final chapter is focused on the remaining advice I have to give on how you can continue to improve at strategy long after reading this book’s final page. Inescapably, this chapter has become advice on writing your own strategy for improving at strategy. You are already familiar with my general suggestions on creating strategy, so this chapter provides focused advice on creating your own plan to get better at strategy.
With that preamble, let’s write this book’s final strategy: your personal strategy for developing your strategy practice.
This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.
Ideally, we’d begin improving our engineering strategy skills by broadly reading publicly available examples. Unfortunately, there simply aren’t many public examples available to learn from. Nonetheless, resources do exist, and we’ll discuss the three categories that I’ve found most useful: public resources, privately shared strategies, and learning circles.
Each of these is explored in its own section below.
While there aren’t as many public engineering strategy resources as I’d like, I’ve found that there are still a reasonable number available. This book collects a number of such resources in the appendix of engineering strategy resources. That appendix also includes some individuals’ blog posts that are adjacent to this topic. You can go a long way by searching and prompting your way into these resources.
As you read them, it’s important to recognize that public strategies are often misleading, as discussed previously in evaluating strategies. Everyone writing in public has an agenda, and that agenda often means that they’ll omit important details to make themselves, or their company, come off well. Make sure you read between the lines rather than taking things too literally.
Ironically, while public resources are hard to find, I’ve found it much easier to access privately held strategy resources. While private recollections are still prone to inaccuracies, the incentives to massage the truth are less pronounced.
The most useful sources I’ve found are:
peers’ stories – strategies are often oral histories, and they are shared freely among peers within and across companies. As you build out your professional network, you can usually get access to any company’s engineering strategy on any topic by just asking.
There are brief exceptions. Even a close peer won’t share a sensitive strategy before its existence becomes obvious externally, but they’ll be glad to after it does. People tend to overestimate how much information companies can keep private anyway. Even reading recent job postings can usually expose a surprising amount about a company.
internal strategy archaeologists – while surprisingly few companies formally collect their strategies into a repository, the stories are informally collected by the tenured members of the organization. These folks are the company’s strategy archaeologists, and you can learn a great deal by explicitly consulting them.
becoming a strategy archaeologist yourself – whether or not you’re a tenured member of your company, you can learn a tremendous amount by starting to build your own strategy repository. As you start collecting them, you’ll interest others in contributing their strategies as well.
As discussed in Staff Engineer’s section on the Write five then synthesize approach to strategy, over time you can foster a culture of documentation where one didn’t exist before. Even better, building that culture doesn’t require any explicit authority, just an ongoing show of excitement.
There are other sources as well, ranging from attending the hallway track in conferences to organizing dinners where stories are shared with a commitment to privacy.
My final suggestion for seeing how others work on strategy is to form a learning circle. I formed a learning circle when I first moved into an executive role, and at this point have been running it for more than five years. What’s surprised me the most is how much I’ve learned from it.
There are a few reasons why ongoing learning circles are exceptional for sharing strategy:
Although putting one of these communities together requires a commitment, they are the best mechanism I’ve found. As a final secret, many people get stuck on how they can get invited to an existing learning circle, but that’s almost always the wrong question to be asking. If you want to join a learning circle, make one. That’s how I got invited to mine.
Collecting strategies to learn from is a valuable part of improving, but it’s only the first step. You also have to determine what to take away from each strategy. For example, you have to determine whether Calm’s approach to resourcing Engineering-driven projects is something to copy or something to avoid.
What I’ve found effective is to apply the strategy rubric we developed in the “Is this strategy any good?” chapter to each of the strategies you’ve collected. Even by splitting a strategy into its various phases, you’ll learn a lot. Applying the rubric to each phase will teach you more. Each time you do this to another strategy, you’ll get a bit faster at applying the rubric, and you’ll start to see interesting, recurring patterns.
As you dig into a strategy that you’ve split into phases and applied the evaluation rubric to, here are a handful of questions that I’ve found interesting to ask myself:
It’s not necessary to work through all of these questions for every strategy you’re learning from. I often try to pick the two that I think might be most interesting for a given strategy.
At a high level, there are just a few key policies to consider for improving your strategic abilities. The first is implementing strategy, and the second is practicing implementing strategy. While those are indeed the starting points, there are a few more detailed options worth consideration:
If your company has existing strategies that are not working, debug one and work to fix it. If you lack the authority to work at the company scope, then decrease altitude until you find an altitude you can work at. Perhaps setting Engineering organizational strategies is beyond your circumstances, but strategy for your team is entirely accessible.
If your company has no documented strategies, document one to make it debuggable. Again, if operating at a high altitude isn’t attainable for some reason, operate at a lower altitude that is within reach.
If your company’s or team’s strategies are effective but have low adoption, see if you can iterate on operational mechanisms to increase adoption. Many such mechanisms require no authority at all, such as low-noise nudges or the model-document-share approach.
If existing strategies are effective and have high adoption, see if you can build excitement for a new strategy. Start by mining for which problems Staff-plus engineers and senior managers believe are important. Once you find one, you have a valuable strategy vein to start mining.
If you don’t feel comfortable sharing your work internally, then try writing proposals while only sharing them with a few trusted peers.
You can even go further to only share proposals with trusted external peers, perhaps within a learning circle that you create or join.
Trying all of these at once would be overwhelming, so I recommend picking one in any given phase. If you aren’t able to gain traction, then try another approach until something works. It’s particularly important to recognize in your diagnosis where things are not working–perhaps you simply don’t have the sponsorship you need to enforce strategy, and need to switch towards suggesting strategies instead–and to keep iterating until you find something that does.
If you’re looking for an excuse, you’ll always unearth a reason why it’s not possible to do strategy in your current environment.
If you believe your current role prevents you from engaging in strategy work, I’ve found two useful approaches:
Lower your altitude – there’s always a scale where you can perform strategy, even if it’s just your team or even just yourself.
Only you can forbid yourself from developing personal strategies.
Practice rather than perform – organizations can only absorb so much strategy development at a given time, so sometimes they won’t be open to you doing more strategy. In that case, you should focus on practicing strategy work rather than directly performing it.
Only you can stop yourself from practice.
Don’t believe the hype: you can always do strategy work.
As the refrain goes, even the best policies don’t accomplish much if they aren’t paired with operational mechanisms to ensure the policies actually happen, and debug why they aren’t happening. It’s tempting to overlook operations for personal habits, but that would be a mistake. These habits profoundly impact us in the long term, yet they’re easiest to neglect because others rarely inquire about them.
The mechanisms I’d recommend:
Clearly track the strategies you’ve implemented, refined, documented, or read. Maintain these in a document, spreadsheet, or folder that makes it easy to monitor your progress.
Review your tracked strategies every quarter: are you working on the expected number and in the expected way? If not, why not?
Ideally, your review should be done in community with a peer or a learning circle. It’s too easy to deceive yourself; it’s much harder to trick someone else.
If your periodic review ever discovers that you’re simply not doing the work you expected, sit down for an hour with someone that you trust–ideally someone equally or more experienced than you–and debug what’s going wrong. Commit to doing this before your next periodic review.
Tracking your personal habits can feel a bit odd, but it’s something I highly recommend. I’ve been setting and tracking personal goals for some time now—for example, in my 2024 year in review—and have benefited greatly from it.
Many companies convince themselves that they’re too much in a rush to make good decisions. I’ve certainly gotten stuck in this view at times myself, although at this point in my career I find it hard to ignore that I have a number of tools to create time for strategy, and an obligation to do strategy rather than inflict poor decisions on the organizations I work in. Here’s my advice for creating time:
If you do try all those things and still aren’t making progress, then accept your reality: you don’t view doing strategy as particularly important. Spend some time thinking about why that is, and if you’re comfortable with your answer, then maybe this is a practice you should come back to later.
At this point, you’ve read everything I have to offer on drafting engineering strategy. I hope this has refined your view on what strategy can be in your organization, and has given you the tools to draft a more thoughtful future for your corner of the software engineering industry.
What I’d never ask is for you to wholly agree with my ideas here. They are my best thinking on this topic, but strategy is a topic where I’m certain Hegel’s world view is the correct one: even the best ideas here are wrong in interesting ways, and will be surpassed by better ones.