
Figma Design to Code, Code to Design: Clearly Explained

2026-04-14 23:31:25

Are your AI investments paying off? (Sponsored)

AI budgets are under the microscope, and when asked whether the investment is working, most engineering teams can only cite time savings from code generation.

The real impact is in production, where teams spend 70% of engineering time investigating incidents, jumping between tools, and losing time that could go toward shipping product.

That operational load only grows with every line of AI-generated code that hits prod.

Learn how engineering teams at Coinbase, Zscaler, and Salesforce are seeing AI impact across the full engineering lifecycle. Plus, get a practical worksheet for modeling AI ROI with your own operational data.

Get the free playbook →


Turning a design into working code is one of the most common tasks in frontend development, and one of the hardest to automate. The design lives in Figma. The code lives in a repository. Bridging the two has traditionally required a developer to manually interpret layouts, colors, spacing, and component structure from a visual reference. AI coding agents promise to close that gap, but the naive approaches fall short in important ways.

Figma launched its MCP server in June 2025 to bring design context into code. This year, they released two new workflows: the ability to generate designs from coding tools like Claude Code and Codex, and the ability for agents to write directly to Figma designs.

We spoke with Emil Sjölander, Aditya Muttur, and Shannon Toliver from the Figma team behind these releases to understand the details and engineering challenges. This article covers how Figma’s design-to-code and code-to-design workflows actually work, starting with why the obvious approaches fail, how MCP solves them, and the engineering challenges that remain.

The Gap Between Design and Code

Before diving into how Figma’s MCP server works, it helps to understand the approaches that came before it, and why each one hits a wall. There are two natural ways to give an LLM access to a design: show it a picture, or hand it the raw data. Both have fundamental limitations that motivated a different approach.

Approach 1: Screenshot the design

The most obvious way to turn a design into code with an LLM is to take a screenshot of your Figma file and paste it into a coding agent. The LLM sees the image, interprets the layout, and generates code.

This works for simple UIs. But it breaks down for anything complex. The LLM has to guess values based on pixels. It doesn’t know the exact color or that the spacing between cards is 24px, not 20px. The output may look close, but not identical.

Figure 1: The LLM guesses pixel values from a screenshot.

So screenshots give the LLM a visual reference but no precise values. The next natural step is to go in the opposite direction: give it all the data.

Approach 2: Get Design JSON via Figma’s API

Figma exposes a REST API that returns a file’s entire structure as JSON. Every node, property, and style is included. Now the LLM has real data instead of pixels.

Figure 2: The REST API returns the full file structure as JSON

But having all the data introduces its own problem: there is far too much of it. A single Figma page can produce thousands of lines of JSON, filled with pixel coordinates, visual effects, internal layout rules, and other metadata that are not useful for code generation. Dumping all of this into a prompt can exceed the context window. Even when it fits, the LLM has to wade through pixel coordinates, blend modes, export settings, and other visual metadata that have nothing to do with building a UI, which degrades the output quality.

Figure 3: Raw JSON exceeds the context window and degrades output quality

Neither approach works on its own. Screenshots lack precision. Raw API data has precision but drowns the LLM in noise. What you actually need is something in between: structured design data that preserves exact values like colors, spacing, and component names, but strips out the noise that is not needed for code generation.

The middle ground: Figma’s MCP server

That is what Figma’s MCP server does. MCP stands for Model Context Protocol, a standard that defines how AI agents discover and call external tools. Figma’s MCP server takes the raw design data from Figma’s REST API, filters out the noise, and transforms what remains into a clean, structured representation. Pixel positions become layout relationships like “centered inside its parent.” Raw hex colors become design token references. Deeply nested layers get flattened to match what a developer would actually build. The result is a compact, token-efficient context that an LLM can act on directly.
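The transformation step can be sketched in a few lines. This is a hypothetical illustration, not Figma's actual schema: the field names (`fill`, `layoutMode`, `exportSettings`) and the `condense` helper are made up to show the idea of keeping exact values while stripping pixel coordinates and visual metadata, and mapping raw colors to design tokens.

```python
# Hypothetical sketch: condensing raw design JSON into a compact,
# LLM-friendly representation. Field names are illustrative only,
# not Figma's real API schema.

def condense(node, tokens):
    """Keep only what a developer needs; map raw colors to token names."""
    out = {"name": node["name"], "type": node["type"]}
    fill = node.get("fill")
    if fill is not None:
        # Prefer a design-token reference over a raw hex value.
        out["fill"] = tokens.get(fill, fill)
    if node.get("layoutMode"):
        # Auto-layout direction survives as a layout relationship.
        out["layout"] = node["layoutMode"]
    # Pixel coordinates, blend modes, export settings, etc. are dropped.
    children = [condense(c, tokens) for c in node.get("children", [])]
    if children:
        out["children"] = children
    return out

raw = {
    "name": "Card", "type": "FRAME", "x": 412.5, "y": 96.0,
    "blendMode": "NORMAL", "fill": "#0B5FFF", "layoutMode": "VERTICAL",
    "children": [
        {"name": "Title", "type": "TEXT", "x": 428.5, "y": 112.0,
         "fill": "#111111", "exportSettings": []},
    ],
}
tokens = {"#0B5FFF": "brand-blue"}
print(condense(raw, tokens))
```

The raw hex fill comes back as `brand-blue`, and the coordinates never reach the prompt, which is the whole point of the middle ground.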

With that context, let’s look at how the two main workflows, design to code and code to design, actually work under the hood.

Design to Code

The design-to-code workflow starts when a developer selects a frame in Figma, copies its URL, and pastes it into a coding agent like Claude Code or Codex with a prompt like “Implement this design.” The agent then produces working code that matches the design. Here is what happens behind the scenes.

Figure 4: Design to code workflow

The coding agent and Figma’s MCP server work together through four steps. The first two are generic MCP mechanics: tool discovery and tool calling. The last two are where Figma’s engineering makes the difference.

Step 1. The agent discovers available tools

When you first connect the Figma MCP server, the agent receives a list of available tools. These include get_design_context, get_screenshot, get_metadata, and more. Each tool comes with a name, description, and parameter schema.
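A tool descriptor looks roughly like the following. The structure (name, description, JSON-Schema-typed `inputSchema`) follows the general MCP convention; the description text and parameter names here are paraphrased for illustration, not copied from Figma's server.

```python
# Illustrative shape of an MCP tool descriptor as an agent might see
# it during discovery. Descriptions are paraphrased, not Figma's own.
get_design_context = {
    "name": "get_design_context",
    "description": "Return a structured, token-efficient representation "
                   "of a Figma node for code generation.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "fileKey": {"type": "string"},
            "nodeId": {"type": "string"},
        },
        "required": ["fileKey", "nodeId"],
    },
}
print(get_design_context["name"], get_design_context["inputSchema"]["required"])
```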

Figure 5: Each MCP tool has a name, description, and parameter schema

The agent does not know how Figma works internally. It reads these descriptions the same way a developer reads API documentation, then decides which tool to call based on the user’s prompt.

Figure 6: The agent picks the right tool by matching the user’s intent to tool descriptions.

Step 2. The agent prepares the arguments and calls the tool

The agent prepares the arguments to call the selected tool. In this case, since the selected tool is get_design_context, it needs a file key and a node ID. So it parses both from the Figma URL you pasted and calls the tool.
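The parsing step is mechanical. A minimal sketch, assuming the common Figma URL shape `https://www.figma.com/design/<fileKey>/<title>?node-id=<id>`; real agents likely handle more URL variants (older `/file/` links, missing node IDs) than this does.

```python
# Minimal sketch of extracting a file key and node ID from a Figma URL.
# Assumes the common /design/<fileKey>/<title>?node-id=<id> shape.
from urllib.parse import urlparse, parse_qs

def parse_figma_url(url):
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    # segments: ["design", "<fileKey>", "<title>"]
    file_key = segments[1]
    node_id = parse_qs(parts.query).get("node-id", [None])[0]
    return file_key, node_id

url = "https://www.figma.com/design/AbC123xyz/Checkout?node-id=12-345"
print(parse_figma_url(url))  # ('AbC123xyz', '12-345')
```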

Figure 7: The agent calls the get_design_context tool with the parsed arguments

Step 3. The request hits Figma’s backend

The tool call is sent over the network to Figma’s MCP server at mcp.figma.com/mcp over Streamable HTTP. The MCP server handles authentication, then calls Figma’s internal services to read the design data such as node trees, component properties, styles, and variable definitions.

Step 4. Transform raw design data into LLM-friendly context

This is where the most important engineering happens. The MCP server transforms the raw JSON from Figma’s REST API into a structured representation that maps to how a developer thinks about building a UI. Pixel positions become layout relationships like “this element is centered inside its parent.” Color values become references to design tokens like brand-blue instead of raw color codes. Deeply nested layers get simplified to reflect what the user actually sees. And components get enriched with code mappings. For example, when a Figma button component is mapped to src/components/ui/Button.tsx through Code Connect, that reference appears in the output. The LLM reuses the existing component instead of recreating it from scratch.

Figure 8: The MCP server transforms raw Figma JSON into a structured representation

The output defaults to a React + Tailwind framing because that is the most common frontend stack. But it is a structured representation of the design, not generated code. The LLM takes this representation and generates actual code in whatever framework the developer specifies.

Figure 9: The LLM uses the representation to generate actual code

Code to Design

Design to code is only half the story. In practice, the code often evolves faster than the design files. A developer ships a feature, tweaks the layout based on user feedback, adds a new section, and now the Figma file no longer matches what is actually running in production. Code to design closes that gap. A developer opens Claude Code, types “send this to Figma,” and a few seconds later the live UI appears in Figma as fully editable layers. Not a flat screenshot, but real frames with auto-layout, editable text, and separate components.

Figure 10: Figma’s MCP server enables a bidirectional loop.

This is powered by one key tool in the MCP server: generate_figma_design. Here is what happens under the hood.

Step 1: The Figma tool launches the capture tool

When the developer prompts “send this to Figma,” the agent calls the MCP server’s generate_figma_design tool.

Figure 11: The coding agent picks generate_figma_design and calls it

The tool opens the target URL in a browser and injects a JavaScript capture script. For a local dev server, it connects directly. For production or staging URLs, it uses a browser automation tool like Playwright to open the page and inject the script programmatically.

When the browser window opens, two things appear: the running UI and a capture toolbar overlay. An initial capture happens automatically when the page loads. From there, the developer can capture the entire screen or select specific elements.

Figure 12: A capture toolbar overlays the running UI

Step 2: The script reads the DOM

When the user selects the desired UI from the live capture, the injected script does not take a screenshot. It reads the live DOM.

It walks the DOM tree and extracts computed styles, layout properties, text content, and image sources for every visible element. It also preserves the parent-child hierarchy. A flex container with three children stays structured as a container with three children, not a flat collection of boxes.
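The walk can be sketched like this. The production script runs as injected JavaScript against the live browser DOM; here the DOM is modeled as a nested Python dict, and the set of styles kept is a made-up subset for illustration.

```python
# Sketch of the DOM walk over a nested dict standing in for the live
# DOM. Keeps a few computed styles, text, and the parent-child
# hierarchy; everything else is dropped.

KEEP = {"display", "color", "gap", "fontSize"}

def walk(el):
    node = {
        "tag": el["tag"],
        "styles": {k: v for k, v in el.get("styles", {}).items() if k in KEEP},
    }
    if "text" in el:
        node["text"] = el["text"]
    kids = [walk(c) for c in el.get("children", [])]
    if kids:
        node["children"] = kids  # structure survives, unlike a screenshot
    return node

dom = {"tag": "div",
       "styles": {"display": "flex", "gap": "24px", "zIndex": "3"},
       "children": [{"tag": "span", "styles": {"color": "#111"},
                     "text": "Checkout"}]}
print(walk(dom))
```

The flex container stays a container with a child span, which is what makes the eventual Figma output editable rather than a flat image.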

Figure 13: The injected script walks the live DOM tree and extracts selected properties

This is what makes the output editable in Figma. A screenshot captures pixels. The DOM walk captures structure and relationships between elements.

Step 3: DOM data becomes Figma layers

The captured DOM data gets sent to Figma’s backend, where it is reconstructed as native Figma design layers. Each HTML element maps to a Figma frame or shape. CSS flexbox and grid layouts become Figma auto-layout groups. Text nodes become editable Figma text layers with the correct font, size, weight, and color. Images get extracted and embedded as image fills.

Figure 14: Each HTML element maps to a native Figma layer

That covers the two core workflows. But making them work reliably in production, across millions of Figma files, multiple coding agents, and real design systems, introduces a different set of problems.

Engineering Challenges

Here are some of the most important challenges Figma’s team faced, and how they addressed them.

Challenge 1: Context window limits

LLMs have fixed context windows, so token count is a hard constraint. The design data for a complex Figma page can be enormous, far more than a coding agent can handle in a single call. Claude Code, for example, defaults to a 25,000-token limit for MCP tool responses. If you call get_design_context on an entire page instead of a specific node, the response can easily exceed that limit and get truncated. This challenge is not unique to Figma. Any MCP server that exposes large structured data (a codebase, a document store, a design file) has to solve the same problem: how to give the LLM enough context to be useful without exceeding what it can process.

Figure 15: First scan the outline with get_metadata, then zoom into specific nodes.

To mitigate this, Figma developed the get_metadata tool. Instead of the full styled representation, it returns a sparse XML outline. A developer can call get_metadata on an entire page to see the structure, identify the specific nodes they need, and then call get_design_context only on those nodes. It is a two-step pattern: scan first, then zoom in.
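The two-step pattern looks roughly like this from the agent's side. `call_tool` is a stand-in for a real MCP client call, and the outline shape is invented; only the tool names and the 25,000-token default come from the article.

```python
# Scan-then-zoom sketch. `call_tool` and the outline shape are
# hypothetical stand-ins for a real MCP client and response.

def fetch_design(call_tool, file_key, page_id, wanted_names):
    # Step 1: cheap, sparse outline of the whole page.
    outline = call_tool("get_metadata", fileKey=file_key, nodeId=page_id)
    # Step 2: full styled context only for the nodes we actually need.
    picked = [n["id"] for n in outline if n["name"] in wanted_names]
    return [call_tool("get_design_context", fileKey=file_key, nodeId=i)
            for i in picked]

def fake_call_tool(tool, **args):
    # Fake server responses for demonstration.
    if tool == "get_metadata":
        return [{"id": "1:2", "name": "Hero"}, {"id": "1:3", "name": "Footer"}]
    return {"node": args["nodeId"], "context": "..."}

print(fetch_design(fake_call_tool, "AbC123", "0:1", {"Hero"}))
```

Only the Hero node's full context is fetched, keeping the response well under a tool-response cap like Claude Code's 25,000 tokens.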

Challenge 2: Component mapping

By default, the coding agent has no way to know which Figma components map to which code components. Without that mapping, the agent will spend time searching the codebase to find the right component. If it does not find a match, it will create a new one from scratch. Multiply that across every reusable component in a design system, and the generated code diverges from the codebase fast.

Figma mitigates this with Code Connect, which lets teams create explicit mappings between Figma node IDs and code file paths. Once set up, the MCP server includes these mappings in its response, and the agent reuses the actual component instead of guessing.

Figure 16: Code Connect creates explicit mappings between Figma components and code files

Setting up Code Connect requires manual effort. Someone has to create and maintain those mappings. Figma has been working to reduce this friction with tools like get_code_connect_suggestions, which automatically detects and proposes mappings. But the quality of the generated code is still directly tied to how much the team has invested in connecting their design system to their codebase.

Challenge 3: The lossy roundtrip

The bidirectional loop sounds seamless, but each handoff loses information. When a design goes from Figma to code, the structured context captures layout, styles, and component references, but not business logic, event handlers, state management, or API calls. The agent fills those in when generating code.

When that code gets captured back to Figma through generate_figma_design, the DOM walk captures visual structure and styles but strips out everything that is not visible: the React state, the API integration, the route handling.

Figure 17: The design ↔ code roundtrip is not lossless. Each handoff strips some information

The result is that each roundtrip requires re-inference. When a designer modifies a captured UI in Figma and a developer pulls it back into code with get_design_context, the agent is translating visual decisions into implementation from scratch. It does not have access to the previous version of the code. Code Connect mappings help here by preserving the link between design components and their code implementations across roundtrips, but the non-visual logic still has to be re-added each time.

Challenge 4: Serving multiple agents with different capabilities

Figma’s MCP server does not serve a single client. It serves Claude Code, Cursor, Codex, and any other MCP-compatible tool. Each agent has different context window sizes, different tool-calling behaviors, and different levels of sophistication in how it sequences multiple tool calls. A workflow that works well in one agent may not work the same way in another.

Figure 18: Different agents have different context limits and tool-calling capabilities.

The generate_figma_design tool, for instance, is available in a growing but still limited set of coding tools, including Claude Code and Codex, because code-to-design requires tighter integration with the browser (script injection, the capture toolbar, multi-screen state) than most agents currently support.

Building an MCP server that works well across a growing ecosystem of agents with varying capabilities is one of the harder ongoing challenges in this space.

The recent opening of the Figma canvas to agents marks an important evolution in this workflow. Agents can now not only read and understand design context, but actively modify and create designs using the use_figma MCP tool. This tool complements the design-to-code workflow by enabling agents to edit designs directly on the Figma canvas and create new assets using your components and variables.

What’s Next?

The hardest part of building an MCP server is not implementing the protocol. It is making the design decisions that Figma’s team had to work through: what context to include, what to leave out, how to structure it so LLMs can reason about it, and how to stay within token budgets. Those decisions are what separate a useful MCP server from one that just wraps an existing API.

Figma’s server is a useful reference point not because of the design-tool specifics, but because the design decisions behind it (what to include, how to structure it, and how to handle token budgets) are well documented and applicable to anyone building an MCP server for a complex domain.

How LinkedIn Feed Uses LLMs to Serve 1.3 Billion Users

2026-04-13 23:31:28

How to stop babysitting your agents (Sponsored)

Agents can generate code. Getting it right for your system, team conventions, and past decisions is the hard part. You end up babysitting the agent and watch the token costs climb.

More MCPs, rules, and bigger context windows give agents access to information, but not understanding. The teams pulling ahead have a context engine to give agents only what they need for the task at hand.

Join us for a FREE webinar on April 23 to see:

  • Where teams get stuck on the AI maturity curve and why common fixes fall short

  • How a context engine solves for quality, efficiency, and cost

  • Live demo: the same coding task with and without a context engine

If you want to maximize the value you get from AI agents, this one is worth your time.

Register now


LinkedIn used to run five separate systems just to decide which posts to show you. One tracked trending content. Another did collaborative filtering. A third handled embedding-based retrieval.

Each had its own infrastructure, its own dedicated team, and its own optimization logic. The setup worked, but when the Feed team wanted to improve one part, they’d break another. So the team made a radical bet: rip out all five systems and replace them with a single LLM-powered retrieval model. That solved the complexity problem, but it raised new questions, such as:

  • How do you teach an LLM to understand structured profile data?

  • How do you make a transformer serve predictions in under 50 milliseconds for 1.3 billion users?

  • How do you train the model when most of the data is noise?

In this article, we will look at how the LinkedIn engineering team rebuilt the Feed and the challenges they faced.

Disclaimer: This post is based on publicly shared details from the LinkedIn Engineering Team. Please comment if you notice any inaccuracies.

Five Librarians, One Library

For years, LinkedIn’s Feed retrieval relied on what engineers call a heterogeneous architecture. When you opened the Feed, content came from multiple specialized sources running in parallel.

  • A chronological index of network activity.

  • Trending posts by geography.

  • Collaborative filtering based on similar members.

  • Industry-specific pipelines.

  • Several embedding-based retrieval systems.

Each maintained its own infrastructure, index structure, and optimization strategy.

See the diagram below:

This architecture surfaced diverse, relevant content. But optimizing one retrieval source could degrade another, and no team could tune across all sources simultaneously. Holistic improvement was nearly impossible.

So the Feed team asked a simple question. What if they replaced all of these sources with a single system powered by LLM-generated embeddings?

Under the hood, this works through a dual encoder architecture. A shared LLM converts both members and posts into vectors in the same mathematical space. The training process pushes member and post representations close together when there’s genuine engagement, and pulls them apart when there isn’t. When you open your Feed, the system fetches your member embedding and runs a nearest-neighbor search against an index of post embeddings, retrieving the most relevant candidates in under 50 milliseconds.
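The retrieval mechanics reduce to a similarity search. A toy sketch, assuming the dual encoder has already produced embeddings; production systems use an approximate nearest-neighbor index rather than the brute-force dot products shown here, and the vectors below are made up.

```python
# Toy nearest-neighbor retrieval over pre-computed embeddings.
# Brute force is enough to show the mechanics; real systems use an
# ANN index over hundreds of millions of post vectors.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

member = [0.9, 0.1, 0.4]                # member embedding (made up)
posts = {
    "post_a": [0.8, 0.0, 0.5],          # close in the shared space
    "post_b": [-0.2, 0.9, 0.1],         # far: trained apart (no engagement)
    "post_c": [0.7, 0.2, 0.3],
}

def top_k(member_vec, post_vecs, k):
    ranked = sorted(post_vecs, key=lambda p: dot(member_vec, post_vecs[p]),
                    reverse=True)
    return ranked[:k]

print(top_k(member, posts, 2))  # ['post_a', 'post_c']
```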

However, the real power comes from what the LLM brings to those embeddings. Traditional keyword-based systems rely on surface-level text overlap. If your profile says “electrical engineering” and a post is about “small modular reactors,” a keyword system misses the connection.

An LLM-based system understands that these topics are related because the model carries world knowledge from pretraining. It knows that electrical engineers often work on power grid optimization and nuclear infrastructure. This is especially powerful for cold-start scenarios, when a new member joins with just a profile headline. The LLM can infer likely interests without waiting for engagement history to accumulate.

The benefits compounded downstream. Instead of receiving candidates from disparate sources with different biases, the ranking layer now receives a coherent candidate set selected through the same semantic similarity measure. Ranking became easier, and each optimization to the ranking model became more effective.

But replacing five systems with one LLM created a new problem. LLMs expect text, and recommendation systems run on structured data and numbers.

The Model Is Only As Good As Its Input

To feed structured data into an LLM, LinkedIn built a “prompt library” that transforms structured features into templated text sequences. For posts, it includes author information, engagement counts, and post text. For members, it incorporates profile information, skills, work history, and a chronologically ordered sequence of posts they’ve previously engaged with. Think of it as prompt engineering for recommendation systems.

The most striking example is what happened with numerical features. Initially, LinkedIn passed raw engagement counts directly into prompts. For example, a post with 12,345 views would appear as “views:12345” in the text. The model treated those digits like any other text tokens. When the team measured the correlation between item popularity counts and embedding similarity scores, they found it was essentially zero (-0.004). Popularity is one of the strongest relevance signals in recommendation. And the model was completely ignoring it.

The problem is fundamental. LLMs don’t understand magnitude. They process “12345” as a sequence of digit tokens, not as a quantity.

The fix was quite simple. Instead of passing raw counts, LinkedIn converted them into percentile buckets wrapped in special tokens. This meant that “Views:12345” became <view_percentile>71</view_percentile>, indicating this post sits in the 71st percentile of view counts. Most values between 1 and 100 get processed by the LLM as a single unit rather than a multi-digit sequence, giving the model a stable, learnable vocabulary for quantity. The model can learn that anything above 90 means “very popular” without trying to parse arbitrary digit sequences.
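The bucketing itself is a one-liner over a sorted population. A minimal sketch: the special-token format mirrors the article's `<view_percentile>` example, but the view-count population below is invented for illustration.

```python
# Percentile-bucketing sketch: raw counts become small, learnable
# percentile tokens instead of arbitrary digit sequences.
import bisect

def percentile_token(value, sorted_population, name):
    rank = bisect.bisect_right(sorted_population, value)
    pct = round(100 * rank / len(sorted_population))
    return f"<{name}_percentile>{pct}</{name}_percentile>"

# Made-up population of view counts across posts.
view_counts = sorted([10, 50, 120, 800, 3_000, 9_000, 12_000,
                      20_000, 60_000, 250_000])
print(percentile_token(12_345, view_counts, "view"))
# -> <view_percentile>70</view_percentile>
```

The model now sees a bounded vocabulary of roughly 100 values and can learn that anything above 90 means "very popular" without parsing digit strings.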

The correlation between popularity features and embedding similarity jumped 30x. Recall@10, which measures whether the top 10 retrieved posts are actually relevant, improved by 15%. LinkedIn applied the same strategy to engagement rates, recency signals, and affinity scores.

Less Data, Better Model

When building the member’s interaction history for training, LinkedIn initially included everything. Every post that was shown to a member went into the sequence, whether they engaged with it or scrolled past. The idea was that more data should mean a better model.

However, this didn’t turn out to be the case. Including scrolled-past posts not only made model performance worse, but it also made training significantly more expensive. GPU compute for transformer models scales quadratically with context length.

When the team filtered to include only positively-engaged posts, the results improved across every dimension.

  • Memory footprint per sequence dropped by 37%.

  • The system could process 40% more training sequences per batch.

  • Training iterations ran 2.6x faster

The reason comes down to signal clarity. A scrolled-past post is ambiguous. Maybe the post was irrelevant. Maybe the member was busy. Maybe the headline was mildly interesting, but not enough to stop for. Posts you actively chose to engage with are a much cleaner learning target.

The gains from this change compounded. Better signal quality meant faster training. Faster training meant more experimentation. More experimentation meant better hyperparameter tuning. When a single change improves both quality and efficiency, the benefits multiply through the entire development cycle.

The training strategy had one more clever element. LinkedIn used two types of negative examples:

  • Easy negatives were randomly sampled posts never shown to a member, providing a broad contrastive signal.

  • Hard negatives were posts actually shown but not engaged with: the almost-right cases where the model must learn the nuanced distinction between merely relevant and genuinely valuable content.

The difficulty of the negative examples shapes what the model learns. Easy negatives teach broad distinction, whereas hard negatives teach the fine-grained ones. Using both together is a common and effective pattern across retrieval systems, and at LinkedIn, adding just two hard negatives per member improved recall by 3.6%.
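The effect of hard negatives is visible even in a toy contrastive loss. This is an InfoNCE-style sketch with invented 2-d embeddings, not LinkedIn's actual training objective: the hard negatives sit close to the positive, so including them makes the loss (and therefore the learning signal) larger.

```python
# Toy InfoNCE-style contrastive loss with easy and hard negatives.
# Embeddings are made-up 2-d vectors for illustration.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(member, positive, negatives):
    scores = [dot(member, positive)] + [dot(member, n) for n in negatives]
    exps = [math.exp(s) for s in scores]
    return -math.log(exps[0] / sum(exps))  # lower = positive wins easily

member   = [0.9, 0.1]
positive = [0.8, 0.2]                      # genuinely engaged post
easy_neg = [[-0.5, 0.1], [0.0, -0.9]]      # random posts, clearly wrong
hard_neg = [[0.7, 0.1], [0.6, 0.3]]        # shown but skipped: almost right

print(info_nce(member, positive, easy_neg))             # easier task
print(info_nce(member, positive, easy_neg + hard_neg))  # harder task
```

The second loss is strictly larger, which is exactly why hard negatives teach fine-grained distinctions the easy ones cannot.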

With retrieval producing high-quality candidates, the next question was how to rank them. LinkedIn’s answer was to stop treating each post as an isolated decision.

The Feed Is a Story, Not a Snapshot

Traditional ranking models evaluate each member-post pair independently. This works, but it misses something fundamental about how professionals consume content.

LinkedIn built a Generative Recommender (GR) model that treats your Feed interaction history as a sequence. Instead of scoring each post in isolation, GR processes more than a thousand of a user’s historical interactions to understand temporal patterns and long-term interests.

The practical difference matters. If the user engages with machine learning content on Monday, distributed systems on Tuesday, and opens LinkedIn again on Wednesday, a sequential model understands these aren’t random events. They’re the continuation of a learning trajectory. A traditional pointwise model sees three independent decisions, whereas the sequential model sees the story.

The GR model uses a transformer with causal attention, meaning each position in the history can only attend to previous positions, mirroring how you actually experienced content over time. Recent posts might matter more for predicting immediate interests, but a post from two weeks ago might suddenly become relevant if recent activity suggests renewed interest.
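The causal constraint is just a lower-triangular attention mask. A minimal pure-Python sketch; real models build this mask as a tensor on the GPU.

```python
# Causal attention mask: position i may attend only to positions <= i,
# mirroring the order in which the member actually saw the posts.

def causal_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
# Row 0 (oldest interaction) sees only itself;
# row 3 (newest) sees everything before it.
```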

See the diagram below that shows the transformer architecture:

One of the most practical architectural decisions is what LinkedIn calls late fusion. Not every feature benefits from full self-attention. Count features and affinity signals carry a strong independent signal, and running them through the transformer would inflate computational cost quadratically without clear benefit. Instead, these features are concatenated with the transformer output after sequence processing. This results in rich sequential understanding from the transformer, plus contextual signals that drive relevance, without the cost of processing them through self-attention.
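Late fusion amounts to concatenating after the expensive part. A sketch under stated assumptions: the "transformer" is a mean-pooling stub, and the feature values are invented; the point is only where the concatenation happens.

```python
# Late-fusion sketch: sequence features pass through the (stand-in)
# transformer; count/affinity features are concatenated afterwards,
# skipping the quadratic attention cost entirely.

def transformer_stub(sequence):
    # Stand-in for the expensive sequential encoder: mean-pool here.
    dim = len(sequence[0])
    return [sum(v[d] for v in sequence) / len(sequence) for d in range(dim)]

def late_fusion(interaction_seq, count_features):
    seq_repr = transformer_stub(interaction_seq)  # O(n^2) part
    return seq_repr + count_features              # cheap concat, O(1)

seq = [[0.25, 0.5], [0.75, 0.0]]  # made-up interaction embeddings
counts = [0.71, 0.05]             # e.g. percentile-bucketed signals
print(late_fusion(seq, counts))   # [0.5, 0.25, 0.71, 0.05]
```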

The serving challenge is equally important. Processing 1,000+ historical interactions through multiple transformer layers for every ranking request is expensive. LinkedIn’s solution is shared context batching. The system computes the user’s history representation once, then scores all candidates in parallel using custom attention masks.

On top of the transformer, a Multi-gate Mixture-of-Experts (MMoE) prediction head routes different engagement predictions like clicks, likes, comments, and shares through specialized gates while sharing the same sequential representations underneath.

See the diagram below that shows a typical Mixture-of-Experts architecture.

This lets the model handle multiple prediction tasks without duplicating the expensive transformer computation. Together, shared context batching and the MMoE head are what make the sequential model viable at production scale.

Making It All Work at Scale

Even the best model is useless without the infrastructure to serve it. LinkedIn’s historical ranking models ran on CPUs. Transformers are fundamentally different, with self-attention scaling quadratically with sequence length and massive parameter counts requiring GPU memory. At LinkedIn’s scale, cost-per-inference determines whether sophisticated AI models can serve every member, or only high-engagement users.

The team invested heavily in custom infrastructure on both sides. For training, a custom C++ data loader eliminates Python’s multiprocessing overhead, custom GPU routines reduced metric computation from a bottleneck to negligible overhead, and parallelized evaluation across all checkpoints cut pipeline time substantially. For serving, a disaggregated architecture separates CPU-bound feature processing from GPU-heavy model inference, and a custom Flash Attention variant called GRMIS delivered an additional 2x speedup over PyTorch’s standard implementation.

See the diagram below that shows the GR Infrastructure Stack

Freshness required its own solution.

Three continuously running background pipelines keep the system current, capturing platform activity, generating updated embeddings through LLM inference servers, and ingesting them into a GPU-accelerated index.

Each pipeline optimizes independently, while the end-to-end system stays fresh within minutes. LinkedIn’s models are also regularly audited to ensure posts from different creators compete on an equal footing, with ranking relying on professional signals and engagement patterns, never demographic attributes.

Conclusion

There are some takeaways:

  • Replacing five retrieval systems with one trades resilience for simplicity.

  • LLM-based embeddings are richer but more expensive than lightweight alternatives.

  • The bottleneck is rarely the model architecture. It’s everything around it.

The infrastructure investment represents an effort most teams can’t replicate. And this approach leans on LinkedIn’s rich text data. For primarily visual platforms, the calculus would be different.

The next time you open LinkedIn and see a post from someone you don’t follow, on a topic you didn’t search for, but it’s exactly what you needed to read, that’s all of this working together under the hood.


EP210: Monolithic vs Microservices vs Serverless

2026-04-11 23:30:48

✂️ Cut your QA cycles down to minutes with QA Wolf (Sponsored)

If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.

QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.

QA Wolf takes testing off your plate. They can get you:

  • Unlimited parallel test runs for mobile and web apps

  • 24-hour maintenance and on-demand test creation

  • Human-verified bug reports sent directly to your team

  • Zero flakes guarantee

The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.

With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.

Schedule a demo to learn more


This week’s system design refresher:

  • Monolithic vs Microservices vs Serverless

  • CLI vs MCP

  • Comparing 5 Major Coding Agents

  • Essential AWS Services Every Engineer Should Know

  • JWT Visualized


Monolithic vs Microservices vs Serverless

A monolith is usually one codebase, one database, and one deployment. For a small team, that’s often the simplest way to build and ship quickly. The problem arises when the codebase grows. A tiny fix in the cart code requires redeploying the whole app, and one bad release can take down everything with it.

Microservices try to solve that by breaking the system into separate services. Product, Cart, and Order run on their own, scale separately, and often manage their own data. That means you can ship changes to Cart without affecting the rest of the system.

But now you are dealing with multiple moving parts. You generally need service discovery, distributed tracing, and request routing between services.

Serverless is a different model. Instead of managing servers, you write functions that run when something triggers them, and the cloud provider handles the scaling. In many cases, you only pay when those functions actually run.

However, in serverless, cold starts can add latency, debugging across lots of stateless functions can get messy, and the more you build around one cloud’s runtime, the harder it gets to switch later.

Most production systems don't use just one approach. There's usually a monolith at the core, and over time teams spin up a few services where they need independent scaling or faster deploys. Serverless tends to show up later for things like notifications or background jobs.


CLI vs MCP

AI agents need to talk to external tools, but should they use CLI or MCP?

Both call the same APIs under the hood. The difference is how the agent invokes them.

Here's a side-by-side comparison across 6 dimensions:

  1. Token Cost: MCP loads the full JSON schema (tool names, descriptions, field types) into the context window before any work begins. CLI needs no schema, so it leaves more of the context window free.

  2. Native Knowledge: LLMs were trained on billions of CLI examples. MCP schemas are custom JSON the model encounters for the first time at runtime.

  3. Composability: CLI tools chain with Unix pipes. Something like gh | jq | grep runs in a single LLM call. MCP has no native chaining. The agent must orchestrate each tool call separately.

  4. Multi-User Auth: CLI agents inherit a single shared token. You can't revoke one user without rotating everyone's key. MCP supports per-user OAuth.

  5. Stateful Sessions: CLI spawns a new process and TCP connection per command. MCP keeps a persistent server with connection pooling.

  6. Enterprise Governance: CLI's only audit trail is ~/.bash_history. MCP provides structured audit logs, access revocation, and monitoring built into the protocol.

Over to you: For which use cases do you prefer CLI over MCP, or vice versa?


Comparing 5 Major Coding Agents

The diagram below compares the 5 leading agents across interface, model, context window, autonomy, and more.

Here's what the landscape tells us:

  1. The terminal is the new IDE. Most coding agents now live in your terminal, not inside an editor. The command line is back.

  2. Context windows are getting massive. We've gone from 8K tokens to 1M in just two years. Agents can now reason over entire codebases in a single prompt.

  3. Autonomy is a spectrum. Some agents run fully async in the background. Others keep you in the loop on every edit. Teams are still figuring out how much to delegate.

  4. Open source is gaining ground. The open-source coding agent ecosystem is maturing fast, giving teams full control over their toolchain.

  5. Pricing varies wildly. From completely free (Gemini CLI, Deep Agents) to $15 per 1M output tokens. Check the cost row before you commit.

There is no single winner. The best agent depends on your workflow, budget, and how much autonomy you're comfortable with.

Over to you: Which coding agent is your daily driver in 2026?


Essential AWS Services Every Engineer Should Know

AWS has 200+ services, but most production systems only use a small subset. In many setups, a request ends up going through API Gateway, then an ALB, executes on Lambda or ECS, reads from DynamoDB, and gets cached in ElastiCache.

Each service on its own is straightforward. Deciding where it actually fits is where things get tricky.

EC2 and S3 are usually the starting point for a lot of people. But when things break, the focus shifts to services that didn’t get much attention early on, like CloudWatch for observability, IAM for access control, and KMS for encryption.

Networking tends to be where things get confusing. VPC, subnets, security groups, Route 53, and CloudFront run behind everything. When something is off, the errors don’t always help much.

Database choices are not easy to reverse later. RDS, DynamoDB, and Aurora solve different problems, and changing direction means redesigning a lot of what you've already built. It’s similar with the integration layer. SQS, SNS, and EventBridge each handle a different pattern (queuing vs fan-out vs event routing), and choosing the wrong one causes problems you notice when the system is under load.

SageMaker and Bedrock are newer services, but they're already part of the stack at many companies. SageMaker is for training and hosting models, and Bedrock is for calling foundation models directly.

CloudFormation lets you define infrastructure as code, and CodePipeline handles CI/CD. Once set up, deployments run without manual steps.


JWT Visualized

Imagine you have a special box called a JWT. Inside this box, there are three parts: a header, a payload, and a signature.

The header is like the label on the outside of the box. It tells us what type of box it is and how it's secured. It's usually written in a format called JSON, which is just a way to organize information using curly braces { } and colons :.

The payload is like the actual message or information you want to send. It could be your name, age, or any other data you want to share. It's also written in JSON format, so it's easy to understand and work with.

Now, the signature is what makes the JWT secure. It's like a special seal that only the sender knows how to create. The signature is created using a secret code, kind of like a password. This signature ensures that nobody can tamper with the contents of the JWT without the sender knowing about it.

When you want to send the JWT to a server, you put the header, payload, and signature inside the box. Then you send it over to the server. The server can easily read the header and payload to understand who you are and what you want to do.
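To make the three parts concrete, here is a minimal sketch of signing and verifying an HS256 JWT using only the Python standard library. The payload fields are illustrative, and a real application would use a vetted library such as PyJWT rather than rolling its own:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 with the "=" padding stripped
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload, secret):
    header = {"alg": "HS256", "typ": "JWT"}  # the label on the box
    signing_input = (b64url(json.dumps(header).encode()) + "." +
                     b64url(json.dumps(payload).encode()))
    # The seal: an HMAC over header.payload using a secret only the sender knows
    signature = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)

def verify_jwt(token, secret):
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        return None  # seal doesn't match: tampered token or wrong secret
    payload_b64 = signing_input.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

token = sign_jwt({"name": "Alice", "role": "reader"}, "my-secret")
assert verify_jwt(token, "my-secret") == {"name": "Alice", "role": "reader"}
assert verify_jwt(token, "wrong-secret") is None  # tampering is detected
```

Note that the payload is only encoded, not encrypted: anyone can read it, but no one can change it without breaking the signature.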

Over to you: When should we use JWT for authentication? What are some other authentication methods?

Must-Know Cross-Cutting Concerns in API Development

2026-04-09 23:30:31

What do authentication, logging, rate limiting, and input validation have in common?

The obvious answer is that they’re all important parts of an API. But the real answer runs deeper: none of them belong to any single endpoint, and none of them show up in typical product requirements. For all practical purposes, they are invisible to users when they work and catastrophic when they’re missing. And the hardest part about all of them is making sure they’re applied uniformly across every single route an API exposes.

This family of problems has a name. They’re called cross-cutting concerns, and they’re the invisible layer that separates a collection of API endpoints from a production-ready system.

In this article, we will learn about these key concerns and their trade-offs in detail.

What Makes a Concern “Cross-Cutting”

Read more

How Spotify Ships to 675 Million Users Every Week Without Breaking Things

2026-04-08 23:30:20

Unlock access to the data your product needs (Sponsored)

Most tools are still locked to their own database, blind to everything users already have in Slack, GitHub, Salesforce, Google Drive, and dozens of other apps. That's the ceiling on what you can build.

WorkOS Pipes removes it. One API call connects your product to the apps your users live in. Pull context from their tools, cross-reference data across silos, power AI agents that act across services. All with fresh, managed credentials you never have to think about.

Turn data to insight →


Every Friday morning, a team at Spotify takes hundreds of code changes written by dozens of engineering teams and begins packaging them into a single app update. That update will eventually reach more than 675 million users on Android, iOS, and Desktop. They do this every single week. And somehow, more than 95% of those releases ship to every user without a hitch.

The natural assumption is that they’re either incredibly careful and therefore slow, or incredibly fast and therefore reckless. The truth is neither.

How do you ship to 675 million users every week, with hundreds of changes from dozens of teams running on thousands of device configurations, without breaking things?

The answer is not to test really hard. Spotify built a release architecture where speed and safety reinforce each other. In this article, we will take a look at this process in detail and attempt to derive learnings.

Disclaimer: This post is based on publicly shared details from the Spotify Engineering Team. Please comment if you notice any inaccuracies.

The Two-Week Journey of a Spotify Release

To see how this works, let us follow a single release from code merge to production.

Spotify practices trunk-based development, which means that all developers merge their code into a single main branch as soon as it’s tested and reviewed. There are no long-lived feature branches where code sits in isolation for weeks. Everyone pushes to the same branch continuously, which keeps integration problems small but requires discipline and solid automated testing.

See the diagram below that shows the concept of trunk-based development:

Each release cycle starts on a Friday morning. The version number gets bumped on the main branch. From that point, nightly builds start going out to Spotify employees and a group of external alpha testers. During this first week, teams develop and merge new code freely. Bug reports flow in from internal and alpha users. Crash rates and other quality metrics are tracked for each build, both automatically and by human review. When a crash or issue crosses a predefined severity threshold, a bug ticket gets created automatically. When something looks suspicious but falls below that threshold, the Release Manager can create one manually.

On the Friday of the second week, the release branch gets cut, meaning a separate copy of the codebase is created specifically for this release. This is the key moment in the release cycle. From this point, only critical bug fixes are allowed on the release branch. Meanwhile, the main branch keeps moving. New features and non-critical fixes continue to merge there, destined for next week’s release. This separation is the mechanism that lets Spotify develop at full speed while simultaneously stabilizing what’s about to ship.

Teams then perform regression testing, checking that existing features still work correctly after the new changes, and report their results. Teams with high confidence in their automated tests and pre-merge routines can opt out of manual testing entirely. Beta testers receive builds from the more stable release branch, providing additional real-world runtime over the weekend.

By Monday, the goal is to submit the app to the stores. By Tuesday, if the app store review passes and quality metrics look good, Spotify rolls it out to 1% of users. By Wednesday, if nothing alarming surfaces, they roll out to 100%.

The flow below shows all the steps in a typical release process:

As an example, for version 8.9.2, which carried the Audiobooks feature launch in new markets, this timeline played out almost exactly as planned. What made that possible was everything happening behind the timeline.

Rings of Exposure: Catching Bugs Where They’re Cheapest to Fix

The code doesn’t go from a developer’s laptop to 675 million users in one jump. It passes through concentric rings of users, and each ring exists to catch a specific category of failure.

  • The first ring is Spotify’s own employees. They run nightly builds from the main branch, using the app the way real users do. This catches obvious functional bugs early. Even a crash that only affects a small number of employees gets investigated, because a bug that appears minor internally could signal a much larger problem once it hits millions of devices.

  • The second ring is external alpha testers. These users introduce more device diversity and real-world usage patterns that the internal team may not have anticipated. They’re running builds that are still being actively developed, so rough edges are expected, but the data they generate is invaluable.

  • The third ring is beta testers, who receive builds from the release branch rather than the main branch. These builds are expected to be more stable. Beta users provide additional runtime over weekends and evenings, and their feedback either builds confidence that the release is solid or surfaces issues that slipped through the first two rings.

  • The fourth ring is the 1% production rollout. Real users, real devices, real conditions. Spotify’s user base is large enough that even 1% provides statistically meaningful data. If a severe issue appears during this phase, the rollout is paused immediately, and the responsible team starts working on a fix.

  • The fifth and final ring is the 100% rollout. Only after the 1% rollout looks clean does the release go out to everyone.

For reference, the Audiobooks launch in version 8.9.2 shows how this system works at an even more granular level.

The Audiobooks feature didn’t just pass through these five rings of app release. It had its own layered rollout on top of that. The feature code had been sitting in the app for multiple releases already, hidden behind a backend feature flag. It was turned on for most employees first. The team watched for any crash, no matter how small, that might indicate trouble. Only after the app release itself reached a sufficient user base did the Audiobooks team begin gradually enabling the feature for real users in specific markets, using the same backend flag to control the percentage.

See the diagram below that shows the concept of a feature flag:

This separation between deploying code and activating a feature is a powerful pattern in the Spotify release process. It allows code to sit in the app, baking in production conditions invisibly, and get turned on later. If something goes wrong after activation, the feature can be turned off without shipping a new release. At Spotify’s scale, feature flags are a core safety mechanism, though managing hundreds of them across a large organization, each with per-market and per-user-percentage controls, is its own engineering challenge.
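The pattern is easy to sketch. Below is a minimal, hypothetical percentage-rollout flag in Python; the flag name, markets, and hashing scheme are illustrative, not Spotify's actual implementation. Hashing the user ID keeps each user stably in or out of the rollout as the percentage grows:

```python
import hashlib

class FeatureFlag:
    """Deterministic percentage rollout per market (illustrative sketch)."""

    def __init__(self, name, rollout_percent, markets):
        self.name = name
        self.rollout_percent = rollout_percent  # 0..100, tunable at runtime
        self.markets = markets

    def is_enabled(self, user_id, market):
        if market not in self.markets:
            return False
        # Hash flag+user into a stable bucket 0..99; the same user always
        # lands in the same bucket, so raising the percentage only adds users.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < self.rollout_percent

flag = FeatureFlag("audiobooks", rollout_percent=10, markets={"US", "UK"})
flag.is_enabled("user-42", "US")   # stable True/False for this user
flag.is_enabled("user-42", "DE")   # False: market not in the rollout
```

The key property: turning the feature off is a config change (`rollout_percent = 0`), not a new app release.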

The Release Manager also made a deliberate coordination decision for 8.9.2. Since the Audiobooks feature was a high-stakes launch with marketing events already scheduled, another major feature that had been planned for the same release was rescheduled to the following week. Fewer variables in a single release means easier diagnosis if something goes wrong. That kind of judgment call is one of the things that separates release management from pure automation.

From Jira to a Release Command Center

The multi-ring system generates a lot of data, such as crash rates, bug tickets, sign-off statuses, build verification results, and app store review progress. Someone has to make sense of all of it, and this wasn’t an easy task.

Before the Release Manager Dashboard existed, everything lived in Jira. The Release Manager had to jump between tickets, check statuses across multiple views, and verify conditions manually, all while answering questions from teams on Slack. It was easy to miss a small detail, and a missed detail could mean extra work or a bug slipping through.

So the Release team built a dedicated command center with clear goals:

  • Optimize for the Release Manager’s workflow

  • Minimize context switching

  • Reduce cognitive load

  • Enable fast and accurate decisions

The result was the Release Manager Dashboard, built as a plugin on Backstage, Spotify’s internal developer portal.

It pulls and aggregates data from around 10 different backend systems into a single view. For each platform (Android, iOS, Desktop), the dashboard shows blocking bugs, the latest build status, automated test results, crash rates normalized against actual usage (so a crash rate is meaningful whether 1,000 or 1,000,000 people are using the build), team sign-off progress, and rollout state. Everything is color-coded:

  • Green means ready to advance

  • Yellow means something needs attention

  • Red means there’s a problem requiring action

Here’s an example of how the dashboard appears:

The dashboard also surfaces release criteria as a visible checklist:

  • All commits on the release branch are included in the latest build and passing tests

  • No open blocking bug tickets

  • All teams signed off

  • Crash rates below defined thresholds

  • Sufficient real-world usage of the build

When everything goes green, the release is ready to advance.
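The traffic-light logic can be sketched as a simple function. The specific criteria grouping and threshold names below are assumptions for illustration, not the dashboard's actual rules:

```python
def release_color(blocking_bugs, crash_rate, crash_threshold,
                  tests_passing, teams_signed_off, usage_hours, min_usage_hours):
    """Collapse the release checklist into the dashboard's color (sketch)."""
    # Red: a problem requiring action before anything else
    if blocking_bugs > 0 or crash_rate > crash_threshold or not tests_passing:
        return "red"
    # Yellow: nothing is broken, but the release isn't ready to advance yet
    if not teams_signed_off or usage_hours < min_usage_hours:
        return "yellow"
    return "green"  # ready to advance

release_color(blocking_bugs=0, crash_rate=0.1, crash_threshold=0.5,
              tests_passing=True, teams_signed_off=True,
              usage_hours=48, min_usage_hours=24)   # "green"
```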

The dashboard got off to a rocky start, however. The first version was slow and expensive. Every page reload triggered queries to all 10 of the source systems it depended on, causing long load times and high costs. The Spotify engineering team noted that each reload cost about as much as a decent lunch in Stockholm. After switching to caching and pre-aggregating data every five minutes, load time dropped to eight seconds, and the cost became negligible.

The Robot: Automating the Predictable, Keeping Humans for the Ambiguous

The dashboard gave the Release Manager the information to make fast decisions.

However, by analyzing the time-series data the dashboard generated, the team noticed that a lot of the time in the release cycle wasn’t spent on hard decisions, but waiting.

The biggest time sinks were testing and fixing bugs (unavoidable), waiting for app store approval (outside Spotify’s control), and delays from manually advancing a release when a step was completed outside working hours. That last one alone could cost up to 12 hours. If the app store approved a build at 11 PM, the release just sat there until someone woke up and clicked “next.”

Therefore, the team built what they called “the Robot.”

It’s a backend service that models the release process as a state machine, a set of defined stages with specific conditions that must be met before moving to the next one. The Robot tracks seven states. The five states on the normal path forward are release branched, final release candidate (the build that will actually ship), submitted for app store review, rolled out to 1%, and rolled out to 100%. Two additional states handle problems, which means either the rollout gets paused or the release gets cancelled entirely.

See the diagram below:

The Robot continuously checks whether the conditions for advancing to the next state are met. If manual testing is signed off, no blocking bugs are open, and automated tests pass on the latest commit on the release branch, the Robot automatically submits the build for app store review without human intervention. If the app store approves the build at 3 AM, the Robot initiates the 1% rollout immediately instead of waiting for someone to show up at the office.
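A minimal sketch of the Robot's happy path as a state machine follows. The state names loosely mirror the article; the exact conditions attached to each transition are assumptions:

```python
STATES = ["release_branched", "final_release_candidate",
          "submitted_for_review", "rolled_out_1_percent", "rolled_out_100_percent"]

# Conditions that must all hold to leave each state (illustrative mapping)
TRANSITIONS = {
    "release_branched":        ["tests_pass_on_release_branch"],
    "final_release_candidate": ["manual_testing_signed_off", "no_blocking_bugs"],
    "submitted_for_review":    ["app_store_approved"],
    "rolled_out_1_percent":    ["crash_rates_healthy", "sufficient_usage"],
}

class ReleaseRobot:
    def __init__(self):
        self.state = STATES[0]

    def tick(self, checks):
        """Advance one state whenever every condition for the transition holds.
        Running this continuously means a 3 AM app store approval advances the
        release without a human clicking 'next'."""
        required = TRANSITIONS.get(self.state, [])
        if required and all(checks.get(c, False) for c in required):
            self.state = STATES[STATES.index(self.state) + 1]
        return self.state

robot = ReleaseRobot()
robot.tick({"tests_pass_on_release_branch": True})  # -> "final_release_candidate"
```

The paused and cancelled states from the article are omitted here; in the real system they branch off every stage.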

The result was an average reduction of about eight hours per release cycle.

However, the Robot doesn’t make the hard calls. It doesn’t decide whether a crash affecting users in a specific region is severe enough to block a release. It doesn’t decide whether a bug in a new feature like Audiobooks, with marketing events already scheduled, should delay the entire release or just the feature rollout. It doesn’t negotiate with feature teams about timing. Those decisions require judgment, context, and sometimes difficult conversations. The Release Manager handles all of them.

This split is deliberate. Predictable transitions that depend on rule-checks get automated. Ambiguous decisions that require coordination and judgment stay with humans.

Conclusion

Spotify ships weekly to 675 million users through a strong release architecture. Layered exposure catches bugs where they’re cheapest to fix and centralized tooling turns scattered data into fast decisions. Automation handles the predictable so humans can focus on the ambiguous.

The key lesson here is that speed and safety aren’t opposites. At Spotify, each one enables the other. A weekly cadence means each release carries fewer changes. Fewer changes mean less risk per release. Less risk means shipping with confidence.

Since a cancelled release only costs one week, not a month or a quarter, teams are more willing to kill a bad release rather than push it through and hope for the best.


Nextdoor’s Database Evolution: A Scaling Ladder

2026-04-07 23:32:00

New Year, New Metrics: Evaluating AI Search in the Agentic Era (Sponsored)

Most teams pick a search provider by running a few test queries and hoping for the best – a recipe for hallucinations and unpredictable failures. This technical guide from You.com gives you access to an exact framework to evaluate AI search and retrieval.

What you’ll get:

  • A four-phase framework for evaluating AI search

  • How to build a golden set of queries that predicts real-world performance

  • Metrics and code for measuring accuracy

Go from “looks good” to proven quality.

Learn how to run an eval


Nextdoor operates as a hyper-local social networking service that connects neighbors based on their geographic location.

The platform allows people to share local news, recommend local businesses, and organize neighborhood events. Since the platform relies on high-trust interactions within specific communities, the data must be both highly available and extremely accurate.

However, as the service scaled to millions of users across thousands of global neighborhoods, the underlying database architecture had to evolve from a simple setup into a sophisticated distributed system.

This engineering journey at Nextdoor highlights a fundamental rule of system design.

Every performance gain introduces a new requirement for data integrity. The team followed a predictable progression, moving from a single database instance to a complex hierarchy of connection poolers, read replicas, versioned caches, and background reconcilers. In this article, we will look at how the Nextdoor engineering team handled this evolution and the challenges they faced.

Disclaimer: This post is based on publicly shared details from the Nextdoor Engineering Team. Please comment if you notice any inaccuracies.

The Limits of the “Big Box”

In the early days, Nextdoor relied on a single PostgreSQL instance to handle every post, comment, and neighborhood update.

For many growing platforms, this is the most logical starting point. It is simple to manage, and PostgreSQL provides a robust engine capable of handling significant workloads. However, as more neighbors joined and the volume of simultaneous interactions grew, the team hit a wall that had less to do with the total amount of data stored and more to do with the connection limit.

PostgreSQL uses a process-per-connection model. In other words, every time an application worker wants to talk to the database, the server creates a completely new process to handle that request. If an application has five thousand web workers trying to access the database at the same time, the server must manage five thousand separate processes. Each process consumes a dedicated slice of memory and CPU cycles just to exist.

Managing thousands of processes creates a massive overhead for the operating system. The server eventually spends more time switching between these processes than it does running the actual queries that power the neighborhood feed. This is often the point where vertical scaling, or buying a larger server with more cores, starts to show diminishing returns. The overhead of the “process-per-connection” model remains a bottleneck regardless of how much hardware is thrown at the problem.

To solve this, Nextdoor introduced a layer of middleware called PgBouncer. This is a connection pooler that sits between the application and the database. Instead of every application worker maintaining its own dedicated line to the database, they all talk to PgBouncer.

  • The Request Phase: A web worker requests a connection from PgBouncer to execute a quick query.

  • The Assignment Phase: PgBouncer assigns an idle connection from its pre-established pool rather than forcing the database to create a new process.

  • The Execution Phase: The query runs against the database using that shared connection.

  • The Release Phase: The worker finishes its task, and the connection returns to the pool immediately for the next worker to use.

This allows thousands of application workers to share a few hundred “warm” database connections. This effectively removed the connection bottleneck and allowed the primary database to focus entirely on data processing.
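The four-phase cycle above can be sketched in a few lines. This toy pool hands out placeholder strings rather than real PostgreSQL connections, but the request/assign/execute/release loop is the same one PgBouncer runs:

```python
import queue

class ConnectionPool:
    """Toy connection pooler sketch in the style of PgBouncer."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for i in range(size):
            self._idle.put(f"conn-{i}")  # stand-ins for warm DB connections

    def run(self, sql):
        conn = self._idle.get()      # request phase: block until a connection is idle
        try:
            return f"{conn}: {sql}"  # execution phase: run on the shared connection
        finally:
            self._idle.put(conn)     # release phase: immediately reusable

pool = ConnectionPool(size=2)
# Thousands of workers can call run(); the database only ever sees 2 connections.
pool.run("SELECT 1")
```

The database's process count now tracks the pool size, not the worker count, which is exactly what removes the process-per-connection overhead.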

Dividing the Labor and the “Lag” Problem

Once connection management was stable, the next bottleneck appeared in the form of read traffic.

In a social network like Nextdoor, the ratio of people reading the feed compared to people writing a post is heavily skewed. For every one person who saves a new neighborhood update, hundreds of others might view it. A single database server must handle both the “Writes” and the “Reads” at the same time. This creates resource contention where heavy read queries can slow down the ability of the system to save new data.

The solution was to move to a Primary-Replica architecture. In this setup, one database server is designated as the Primary. It is the only server allowed to modify or change data. Several other servers, known as Read Replicas, maintain copies of the data from the Primary. All the “Read” traffic from the application is routed to these replicas, while only the “Write” traffic goes to the Primary.

See the diagram below:

This separation of labor allows for massive horizontal scaling of reads. However, this introduces the challenge of Asynchronous Replication. The Primary database sends its changes to the replicas using a stream of logs. It takes time for a new post saved on the Primary to travel across the network and appear on the replicas. This delay is known as replication lag.

See the diagram below that shows the difference between synchronous and asynchronous replication:

To solve the issue of a neighbor making a post and then seeing it disappear upon a refresh, Nextdoor uses Time-Based Dynamic Routing. This is a smart routing logic that ensures users always see the results of their own actions. Here’s how it works:

  • The Write Marker: When a user performs a write action, like posting a comment, the application notes the exact timestamp of that event.

  • The Protected Window: For a specific period of time after that write, often a few seconds, the system treats that specific user as sensitive.

  • Dynamic Routing: During this window, all read requests from that user are dynamically routed to the Primary database instead of a replica.

  • The Handover: Once the time window expires and the system is confident the replicas have caught up with the Primary, the user’s traffic is routed back to the replicas to save resources.

This ensures that while the general neighborhood sees eventually consistent data, the person who made the change always sees strongly consistent data.
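A sketch of the routing decision, assuming a fixed safety window (the real window length and bookkeeping are Nextdoor's internals):

```python
import time

SAFETY_WINDOW_SECONDS = 5.0  # illustrative; long enough for replicas to catch up

class ReadRouter:
    def __init__(self):
        self._last_write = {}  # user_id -> monotonic timestamp of last write

    def record_write(self, user_id):
        self._last_write[user_id] = time.monotonic()

    def choose_target(self, user_id):
        """Route a user's reads to the primary while their own write may still
        be replicating; otherwise use the cheaper replicas."""
        wrote_at = self._last_write.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < SAFETY_WINDOW_SECONDS:
            return "primary"   # read-your-own-writes guaranteed
        return "replica"       # eventual consistency is fine here

router = ReadRouter()
router.choose_target("neighbor-1")   # "replica": no recent write
router.record_write("neighbor-1")
router.choose_target("neighbor-1")   # "primary" for the next few seconds
```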


Why writing code isn’t the hard part anymore (Sponsored)

Coding is no longer the bottleneck, it’s prod.

With the rise in AI coding tools, teams are shipping code faster than they can operate it. And production work still means jumping between fragmented tools, piecing together context from systems that don’t talk to each other, and relying on the few engineers who know how everything connects.

Leading teams like Salesforce, Coinbase, and Zscaler cut investigation time by over 80% with Resolve AI, using multi-agent investigation that works across code, infrastructure, and telemetry.

Learn how AI-native engineering teams are implementing AI in their production systems

Get the free AI for Prod ebook ➝


The High-Speed Library

Even with multiple replicas, hitting a database for every single page load is an expensive operation.

Databases must read data from a disk or a large memory pool and often perform complex joins between different tables to assemble a single record. To provide the millisecond response times neighbors expect, Nextdoor implemented a caching layer using Valkey. This is an open-source high-performance data store that holds information in RAM for near-instant access.

The team uses a Look-aside Cache pattern. When the application needs data, it follows a specific sequence:

  • The Cache Check: The application looks for the data in Valkey using a unique key.

  • The Cache Hit: If the data is found, it is returned instantly to the user without touching the database.

  • The Cache Miss: If the data is missing, the application queries the PostgreSQL database to find the truth.

  • The Population Step: The application takes the database result, saves a copy in Valkey for future requests, and then returns it to the user.
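The four steps above map directly to code. Here is a minimal look-aside cache, with plain dicts standing in for Valkey and PostgreSQL:

```python
class LookAsideCache:
    def __init__(self, database):
        self.database = database  # stand-in for PostgreSQL (the source of truth)
        self.cache = {}           # stand-in for Valkey (RAM, near-instant)

    def get(self, key):
        if key in self.cache:            # 1. cache check
            return self.cache[key]       # 2. cache hit: database never touched
        value = self.database.get(key)   # 3. cache miss: ask the database
        if value is not None:
            self.cache[key] = value      # 4. populate for future requests
        return value

store = LookAsideCache({"post:1": "Garage sale on Saturday"})
store.get("post:1")  # miss: reads the database, fills the cache
store.get("post:1")  # hit: served straight from memory
```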

Efficiency is vital when managing a cache at this scale. RAM is much more expensive than disk storage, so the data must be as small as possible.

Nextdoor uses a binary serialization format called MessagePack. In other words, instead of storing data as a bulky text format like JSON, they convert it into a highly compressed binary format that is much faster for the computer to parse.

MessagePack is particularly useful for Nextdoor because it supports schema evolution. If the engineering team adds a new field to a neighbor’s profile, the older cached data can still be read without crashing the application. For even larger pieces of data, they use Zstd compression. By combining these two tools, Nextdoor reduces the memory footprint of its cache servers.

Versioning and Atomic Updates

Caching creates a serious problem the moment the cache starts lying: if the database is updated but the cache is not refreshed, users see old, incorrect information. Most simple caching strategies rely on a “Time to Live” or TTL. This is a timer that tells the cache to delete an entry after a few minutes. For a real-time social network, waiting several minutes for a post to update is not an acceptable solution.

Nextdoor built a sophisticated versioning engine to ensure the cache stays up to date. They added a special column called system_version to their database tables and used PostgreSQL Triggers to manage this number. For reference, a trigger is a small script that runs automatically inside the database whenever a row is touched. Every time a post is updated, the trigger increments the version number. This ensures that the database remains the ultimate source of truth regarding which version of a post is the newest.

When the application tries to update the cache, it does not just overwrite the old data. It uses a Lua script executed inside Valkey. This script performs an atomic compare-and-set operation that works as follows:

  • The Metadata Fetch: The script retrieves the version number currently stored in the cache entry.

  • The Version Comparison: It compares the version to the version number of the new update being sent by the application.

  • The Conditional Write: If the new version is strictly greater than the cached version, the update is saved.

  • The Rejection: If the cached version is already equal to or higher than the new update, the script rejects the change entirely.

This prevents “race conditions.” Imagine two different servers trying to update the same post at the same time. Without this logic, an older update could arrive a millisecond later and overwrite a newer update. This would leave the cache permanently out of sync with the database. By using Lua, the entire process of checking the version and updating the data happens as a single, unbreakable step that cannot be interrupted.
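The compare-and-set logic itself is small. Here it is sketched as a plain Python function; in production it runs as a Lua script inside Valkey precisely so the fetch-compare-write sequence is atomic, which a client-side version like this one is not:

```python
def versioned_set(cache, key, value, version):
    """Write only if `version` is strictly newer than what's cached (sketch).
    NOTE: the real Lua script makes this atomic; this function is only safe
    if nothing else mutates `cache` between the check and the write."""
    entry = cache.get(key)                              # metadata fetch
    if entry is not None and entry["version"] >= version:
        return False                                    # rejection: stale update
    cache[key] = {"value": value, "version": version}   # conditional write
    return True

cache = {}
versioned_set(cache, "post:9", "edited text", version=2)   # True: accepted
versioned_set(cache, "post:9", "old text", version=1)      # False: late writer loses
```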

CDC and Reconciliation

Even with versioning and Lua scripts, errors can occur.

A network partition might prevent a cache update from reaching Valkey, or an application process might crash before it can finish the population step. Nextdoor needed a final safety net to catch these discrepancies. They implemented Change Data Capture, also known as CDC, using a tool called Debezium.

See the diagram below:

CDC works by “listening” to the internal logs of the PostgreSQL database. Specifically, it watches the Write-Ahead Log (WAL), where every change is recorded before it is applied to the data files. Every time a change happens in the database, Debezium captures that event and turns it into a message in a data stream. A background service known as the Reconciler watches this stream.

The reconciliation flow provides a “self-healing” mechanism for the entire setup:

  • The Database Update: A user updates their neighborhood bio in the Primary PostgreSQL database.

  • The Log Capture: Debezium detects the new log entry and publishes a change event message.

  • The Reconciler Action: The background service receives this message and identifies which cache key needs to be corrected.

  • The Invalidation: The service tells the cache to delete the old entry. The next time a neighbor requests that bio, the application will experience a “Cache Miss” and fetch the perfectly fresh data from the database.

This process provides eventual consistency. While the primary cache update might fail for a fraction of a second, the CDC Reconciler will eventually detect the change and fix the cache. It acts like a detective that constantly audits the system to ensure the fast truth in the cache matches the real truth in the database.
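The reconciliation loop can be sketched as a consumer that maps each change event to a cache key and deletes it. The event shape and the `table:pk` key scheme here are assumptions for illustration; the real pipeline consumes Debezium messages from a stream.

```python
def reconcile(events, cache):
    """Consume CDC change events and invalidate the matching cache keys.
    Each event is assumed to carry the table name and primary key."""
    for event in events:
        cache_key = f'{event["table"]}:{event["pk"]}'  # hypothetical key scheme
        cache.pop(cache_key, None)  # delete so the next read is a cache miss
```

Deleting rather than rewriting the entry is deliberate: the next reader misses the cache and repopulates it straight from the database, which is guaranteed to be fresh.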

Sharding

There comes a point where even the most optimized single Primary database cannot handle the volume of incoming writes. When a platform processes billions of rows, the hardware itself reaches physical limits. This is when Nextdoor moves to the final rung of the ladder: sharding.

Sharding is the process of breaking a single, massive table into smaller pieces and spreading them across entirely different database clusters. Nextdoor typically shards data by a unique identifier such as a Neighborhood ID.

  • The Cluster Split: All data for Neighborhoods 1 through 500 might live on Cluster A, while Neighborhoods 501 through 1,000 live on Cluster B.

  • The Shard Key: The application uses the neighborhood_id to know exactly which database cluster to talk to for any given request.

Sharding allows for much greater scale because more clusters can be added as the platform grows. However, it comes at a high cost in complexity. Once a database is sharded, we can no longer easily perform a “Join” between data that lives on two different shards.
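Routing by shard key can be sketched as a simple range lookup. The 500-neighborhood ranges mirror the example above; the cluster-naming scheme is hypothetical.

```python
def cluster_for(neighborhood_id, shard_size=500):
    """Map a neighborhood_id to its database cluster by range.
    Neighborhoods 1-500 -> cluster-A, 501-1000 -> cluster-B, and so on."""
    shard_index = (neighborhood_id - 1) // shard_size
    return f"cluster-{chr(ord('A') + shard_index)}"
```

Real systems often prefer hash-based or directory-based routing over fixed ranges, since ranges can create hot spots when one neighborhood block grows much faster than the others.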

Conclusion

The journey of Nextdoor’s database shows that great engineering is rarely about choosing the most complex tool first. It is about a disciplined progression.

They started with a single server and added connection pooling when the lines got too long. They added replicas when the read traffic became too heavy. Finally, they built a world-class versioned caching system to provide the speed neighbors expect without sacrificing the accuracy they require.

The takeaway is that complexity must be earned. Each layer of the scaling ladder solves one problem while introducing a new challenge in data consistency. By building robust safety nets such as versioning and reconciliation, the Nextdoor engineering team ensured that its system could grow without losing the trust of the communities it serves.

References: