
How Linear Implemented Multi-Region Support For Customers

2025-09-15 23:30:45

Streamfest 2025: Streaming Meets AI (Sponsored)

Join us November 5–6 for Redpanda Streamfest, a two-day online event dedicated to streaming data technologies for agentic and data-intensive applications. Learn how to build scalable, reliable, and secure data pipelines through technical sessions, live demos, and hands-on workshops. Sessions include keynotes from industry leaders, real-world case studies, and tutorials on next-gen connectors for AI use cases. Discover why a streaming data foundation is essential for LLM-powered applications, how to simplify architectures, and new approaches to cost-effective storage. Connect with experts, sharpen your skills, and get ready to unlock the full potential of AI with streaming.

Sign Up Now


Disclaimer: The details in this post have been derived from the official documentation shared online by the Linear Engineering Team. All credit for the technical details goes to the Linear Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Linear represents a new generation of project management platforms specifically designed for modern software teams.

Founded in 2019, the company has built its reputation on delivering exceptional speed and developer experience, distinguishing itself in a crowded market dominated by established players like Jira and Asana.

What sets Linear apart is its focus on performance, achieving sub-50ms interactions that make the application feel instantaneous. The platform embraces a keyboard-first design philosophy, allowing developers to navigate and manage their work without reaching for the mouse, a feature that resonates strongly with its technical user base.

The modern SaaS landscape presents a fundamental challenge that every growing platform must eventually face: how to serve a global customer base while respecting regional data requirements and maintaining optimal performance.

In this article, we look at how Linear implemented multi-region support for its customers. We will explore the architecture they built, along with the technical implementation details.

The Need for Multi-Region Support

The decision to implement multi-region support at Linear wasn't made in a vacuum but emerged from concrete business pressures and technical foresight. There were a few key reasons:

  • Compliance: The most immediate driver came from the European market, where larger enterprises expressed clear preferences for hosting their data within Europe. This was due to the need for GDPR compliance and internal data governance policies.

  • Technical: The primary technical concern centered on the eventual scaling limits of their PostgreSQL infrastructure. While their single-region deployment in Google Cloud's US East region was serving them well, the team understood that continuing to scale vertically by simply adding more resources to a single database instance would eventually hit hard limits. By implementing multi-region support early, they created a horizontal scaling path that would allow them to distribute workspaces across multiple independent deployments, each with its own database infrastructure.

  • Competitive Advantage: The implementation of multi-region support also positioned Linear more favorably in the competitive project management space. By offering European data hosting, Linear could compete more effectively for enterprise contracts against established players who might not offer similar regional options.

The Architecture Design Principles

The multi-region architecture Linear implemented follows four strict requirements that shaped every technical decision in the system.

  • Invisible to Users: This meant maintaining single domains (linear.app and api.linear.app) regardless of data location. This constraint eliminated the simpler approach of region-specific subdomains like eu.linear.app, which would have pushed complexity onto users and broken existing integrations. Instead, the routing logic lives entirely within the infrastructure layer.

  • Developer Simplicity: This meant that engineers writing application features shouldn’t need to consider multi-region logic in their code. This constraint influenced numerous implementation details, from the choice to replicate entire deployments rather than shard databases to the decision to handle all synchronization through background tasks rather than synchronous cross-region calls.

  • Feature Parity: Every Linear feature, integration, and API endpoint must function identically regardless of which region hosts a workspace. This eliminated the possibility of region-specific feature flags or degraded functionality, which would have simplified the implementation but compromised the user experience.

  • Full Regional Isolation: This meant that each region operates independently. A database failure, deployment issue, or traffic spike in one region cannot affect the other. This isolation provides both reliability benefits and operational flexibility. Each region can be scaled, deployed, and maintained independently based on its specific requirements.

The following architecture diagram reveals a three-tier structure.

User-facing clients (API users, the Linear web client, and OAuth applications) all connect to a central proxy layer. This proxy communicates with an authentication service to determine request routing, then forwards traffic to one of two regional deployments. Each region contains a complete Linear stack: API servers, sync engine, background task processors, and databases.

The proxy layer, implemented using Cloudflare Workers, serves as the routing brain of the system. When a request arrives, the proxy extracts authentication information, queries the auth service for the workspace's region, and obtains a signed JWT, then forwards the request to the appropriate regional deployment. This happens on every request, though caching mechanisms reduce the overhead for frequent requests from the same client.
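
To make that per-request flow concrete, here is a minimal sketch of the routing logic in Python. The endpoint URLs, header names, and response fields are illustrative assumptions, not Linear's actual implementation (their proxy is a Cloudflare Worker, so the real code runs in a JavaScript/TypeScript runtime).

```python
import requests

# Hypothetical endpoints; Linear's real URLs and payloads are not public.
AUTH_SERVICE_URL = "https://auth.internal.example/route"
REGION_ORIGINS = {
    "us": "https://us.internal.example",
    "eu": "https://eu.internal.example",
}

def route_request(path: str, headers: dict, body: bytes | None = None) -> requests.Response:
    """Decide which regional deployment should serve a request and forward it."""
    # 1. Extract whatever credential the client sent (cookie, header, or API token).
    credential = headers.get("Authorization") or headers.get("Cookie", "")

    # 2. Ask the authentication service which region owns this workspace
    #    and get back a signed JWT asserting the authenticated context.
    auth = requests.post(
        AUTH_SERVICE_URL, json={"credential": credential}, timeout=2
    ).json()
    region, signed_jwt = auth["region"], auth["jwt"]

    # 3. Forward the original request to the regional deployment with the
    #    pre-signed JWT attached, so the region can skip its own auth lookup.
    forwarded_headers = {**headers, "X-Signed-Context": signed_jwt}
    return requests.request(
        "POST" if body else "GET",
        f"{REGION_ORIGINS[region]}{path}",
        headers=forwarded_headers,
        data=body,
        timeout=10,
    )
```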

The resulting architecture trades implementation complexity for operational benefits and user experience. Rather than distributing complexity across the application or pushing it onto users, Linear concentrated it within well-defined infrastructure components—primarily the proxy and authentication service.

Technical Implementation Phases

There were three main phases to the technical implementation:

1 - Infrastructure as Code Transformation

Before implementing multi-region support, Linear's infrastructure existed as manually configured resources in Google Cloud Platform. While functional for a single-region deployment, this approach wouldn't scale to managing multiple regional deployments. The manual configuration would have required duplicating every setup step for each new region, creating opportunities for configuration drift and human error.

The transformation began with Google Cloud's Terraform export tooling, which generated Terraform configurations from the existing infrastructure. This automated export provided a comprehensive snapshot of their us-east-1 deployment, but the raw export required significant cleanup. The team removed resources that weren't essential for the main application, particularly global resources that wouldn't need regional replication and resources that hadn't been manually created originally.

The critical work involved refactoring these Terraform resources into reusable modules. Each module was designed to accept region as a variable parameter, along with region-specific configurations for credentials and secrets. This modular approach transformed infrastructure deployment from a manual process into a parameterized, repeatable operation. Spinning up a new region became a matter of instantiating these modules with appropriate regional values rather than recreating infrastructure from scratch.

The team also built a staging environment using these Terraform modules, which served multiple purposes.

  • Validated that the infrastructure-as-code accurately replicated their production environment.

  • Provided a safe space for testing infrastructure changes before production deployment.

  • Most importantly, provided an environment for testing the proxy's routing logic that would direct traffic between regions.

2 - Authentication Service Architecture

The authentication service extraction represented the most complex phase of Linear's multi-region implementation, touching large portions of their codebase.

The extraction followed a gradual approach designed to minimize risk. Initially, while still operating in a single region, the new authentication service shared a database with the main backend service in its US region. This co-location allowed the team to develop and test the extraction logic without immediately dealing with network latency.

Once the extraction logic was functionally complete, they implemented strict separation at the database level. Tables were split into distinct schemas with database-level permission boundaries—the authentication service couldn't read or write regional data, and regional services couldn't directly access authentication tables. This hard boundary, enforced by PostgreSQL permissions rather than application code, guaranteed that the architectural separation couldn't be accidentally violated by a coding error.

Some tables required splitting between the two services. For example, workspace configuration contained both authentication-relevant settings and application-specific data. The solution involved maintaining parallel tables with a one-to-one relationship for shared fields, requiring careful synchronization to maintain consistency.

Linear adopted a one-way data flow pattern that simplified the overall architecture. Regional services could call the authentication service directly through a GraphQL API, but the authentication service never made synchronous calls to regional services. When the authentication service needed regional actions, it scheduled background tasks using Google Pub/Sub's one-to-many pattern, broadcasting tasks to all regions.

This design choice meant the authentication service only handled HTTP requests without needing its own background task runner, simplifying deployment and operations. The choice of GraphQL for internal service communication leveraged Linear's existing investment in GraphQL tooling from their public API. Using Zeus to generate type-safe clients eliminated many potential integration errors and accelerated development by reusing familiar patterns and tools.
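
A rough sketch of the one-to-many broadcast using the google-cloud-pubsub client is shown below. The project name, topic name, and task payload are placeholders; the actual task names and payloads Linear uses are not public.

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Placeholder project/topic names; each region runs its own subscription on this
# topic, so one published message fans out to every regional deployment.
topic_path = publisher.topic_path("example-project", "regional-tasks")

def broadcast_task(task_name: str, payload: dict) -> None:
    """Schedule a background task in every region (fire-and-forget broadcast)."""
    future = publisher.publish(
        topic_path,
        data=json.dumps(payload).encode("utf-8"),
        task=task_name,  # delivered as a message attribute
    )
    future.result(timeout=30)  # block until the broker acknowledges the publish

# Example: the auth service asks all regions to clean up a deleted workspace.
broadcast_task("workspace.deleted", {"workspaceId": "ws_123"})
```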

Three distinct patterns emerged for maintaining data consistency between services:

  • Record creation always began in the authentication service to ensure global uniqueness constraints (like workspace URL keys) were enforced before creating regional records. The authentication service would return an ID that regional services used to create corresponding records, as shown in the sketch after this list.

  • Updates used Linear's existing sync engine infrastructure. When a regional service updated a shared record, it would asynchronously propagate changes to the authentication service. This approach kept the update path simple for developers. They just updated the records normally in the regional service.

  • Deletion worked similarly to creation, with additional complexity from foreign key cascades. PostgreSQL triggers created audit logs of deleted records, capturing deletions that occurred due to cascading from related tables.
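
Here is a minimal sketch of the creation pattern, assuming a hypothetical GraphQL mutation on the authentication service and a generic database handle for the regional insert; the real mutation names, fields, and table layout are not public.

```python
import requests

AUTH_GRAPHQL_URL = "https://auth.internal.example/graphql"  # hypothetical endpoint

CREATE_WORKSPACE_MUTATION = """
mutation CreateWorkspace($urlKey: String!) {
  workspaceCreate(urlKey: $urlKey) { id }
}
"""

def create_workspace(url_key: str, region_db) -> str:
    # 1. Create the record in the authentication service first, so global
    #    uniqueness constraints (like the workspace URL key) are enforced centrally.
    resp = requests.post(
        AUTH_GRAPHQL_URL,
        json={"query": CREATE_WORKSPACE_MUTATION, "variables": {"urlKey": url_key}},
        timeout=5,
    ).json()
    workspace_id = resp["data"]["workspaceCreate"]["id"]

    # 2. Reuse the returned ID when creating the corresponding regional record,
    #    keeping the two tables in a one-to-one relationship.
    region_db.execute(
        "INSERT INTO workspace (id, url_key) VALUES (%s, %s)", (workspace_id, url_key)
    )
    return workspace_id
```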

3 - Request Routing Layer

The final piece of Linear's multi-region architecture involved implementing a proxy layer to route requests to the appropriate regional deployment. Since Linear already used Cloudflare Workers extensively, extending their use for request routing was a natural choice.

See the diagram below:

The proxy worker's core responsibility is straightforward but critical.

For each incoming request, it extracts authentication information (from cookies, headers, or API tokens), calls the authentication service to determine the target region and obtain a signed JWT, then forwards the request to the appropriate regional deployment with the pre-signed header attached.

The JWT signing mechanism serves dual purposes. It validates that requests have been properly authenticated and authorized by the central authentication service, while also carrying metadata about the user and workspace context. This eliminates the need for regional services to make their own authentication calls, reducing latency and system complexity.
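
Conceptually, the regional service only needs to verify the proxy's signature and read the claims. A sketch with PyJWT, using a shared-secret HS256 signature and invented claim names (Linear's actual signing algorithm and claims are not documented publicly):

```python
import jwt  # PyJWT

SHARED_SECRET = "replace-with-a-real-secret"  # assumption: symmetric signing key

def context_from_request(signed_token: str) -> dict:
    """Trust the proxy's pre-signed JWT instead of re-authenticating the request."""
    claims = jwt.decode(signed_token, SHARED_SECRET, algorithms=["HS256"])
    # Hypothetical claim names carrying user and workspace context.
    return {"user_id": claims["sub"], "workspace_id": claims["workspace"]}

# The proxy side would mint the token roughly like this:
token = jwt.encode(
    {"sub": "user_42", "workspace": "ws_123", "exp": 1_900_000_000},
    SHARED_SECRET,
    algorithm="HS256",
)
print(context_from_request(token))
```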

To optimize performance, the Cloudflare Worker implements sophisticated caching of authentication signatures.

When the same client makes frequent requests, the worker can serve cached authentication tokens without making repeated round-trips to the authentication service. The caching strategy had to balance performance with security. Tokens are cached for limited periods and include enough context to prevent cache poisoning attacks while still providing meaningful performance benefits for active users making multiple rapid requests.
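
The caching idea amounts to a small time-bounded map keyed by the client credential. A simplified in-process sketch follows; the real worker uses Cloudflare's caching primitives, and the TTL value here is an assumption.

```python
import time
import hashlib

TTL_SECONDS = 60  # assumed short lifetime to limit the impact of a stale token
_cache: dict[str, tuple[float, dict]] = {}

def cached_route_lookup(credential: str, lookup) -> dict:
    """Return {'region': ..., 'jwt': ...}, hitting the auth service only on cache misses."""
    key = hashlib.sha256(credential.encode()).hexdigest()  # never cache raw credentials
    entry = _cache.get(key)
    if entry and time.monotonic() - entry[0] < TTL_SECONDS:
        return entry[1]
    result = lookup(credential)  # round-trip to the auth service
    _cache[key] = (time.monotonic(), result)
    return result
```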

Linear's real-time sync functionality relies heavily on WebSockets, making their efficient handling crucial for the multi-region architecture. The team leveraged an important Cloudflare Workers optimization: when a worker returns a fetch request without modifying the response body, Cloudflare automatically hands off to a more efficient code path.

This optimization is particularly valuable for long-lived connections. Rather than keeping a worker instance active for the entire duration of a WebSocket connection, the handoff mechanism allows the worker to complete its routing decision and then step aside, letting Cloudflare's infrastructure efficiently proxy the established connection to the regional deployment.

During the implementation, the team kept fallback mechanisms in place. While building out the proxy layer, the API service could still authenticate requests directly if they weren't pre-signed by the proxy.

Conclusion

Linear's multi-region implementation demonstrates that supporting geographical data distribution doesn't require sacrificing simplicity or performance.

By concentrating complexity within well-defined infrastructure components (the proxy layer and authentication service) the architecture shields both users and developers from the underlying regional complexity. The extensive behind-the-scenes work touched sensitive authentication logic throughout the codebase, yet careful planning meant most bugs remained invisible to users through strategic use of fallbacks and gradual rollouts.

The implementation now enables workspace creation in Linear's European region with full feature parity, automatically selecting the default region based on the user's timezone while preserving choice. The architecture positions Linear for future expansion, with the framework in place to add additional regions as needed. The team plans to extend this capability further by supporting workspace migration between regions, allowing existing customers to relocate their data as requirements change.

References:


Help us Make ByteByteGo Newsletter Better

TL;DR: Take this 2-minute survey so I can learn more about who you are, what you do, and how I can improve ByteByteGo.

Take the ByteByteGo Survey


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

EP180: Python vs Java

2025-09-13 23:31:25

Kiss bugs goodbye with fully automated end-to-end test coverage (Sponsored)

Bugs sneak out when less than 80% of user flows are tested before shipping. However, getting that kind of coverage (and staying there) is hard and pricey for any team.

QA Wolf’s AI-native service provides high-volume, high-speed test coverage for web and mobile apps, reducing your organization's QA cycle to less than 15 minutes.

With QA Wolf, engineering teams move faster, releases stay on track, and testing happens automatically, so developers can focus on building, not debugging.

The result? Drata achieved 4x more test cases and 86% faster QA cycles.

⭐ Rated 4.8/5 on G2

Schedule a demo to learn more


This week’s system design refresher:

  • Python vs Java

  • Help us Make ByteByteGo Newsletter Better

  • Design Patterns Cheat Sheet

  • CI/CD Simplified Visual Guide

  • How Apache Kafka Works

  • Load Balancers vs API Gateways vs Reverse Proxies

  • SPONSOR US


Python vs Java

Ever wondered what happens behind the scenes when you run a Python script or a Java program? Let’s find out:

Python (CPython Runtime):

  • Python source code (.py) is compiled into bytecode automatically in memory.

  • Bytecode can also be cached in .pyc files, making re-runs faster by using the cached version.

  • The Import System loads modules and dependencies.

  • The Python Virtual Machine (PVM) interprets the bytecode one instruction at a time, making Python flexible but relatively slower.
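
You can watch the first two bullets happen with the standard library's dis module, which prints the bytecode CPython compiled a function into (the cached .pyc files under __pycache__ contain this same bytecode):

```python
import dis

def greet(name):
    return f"Hello, {name}!"

# Disassemble the in-memory bytecode the PVM will interpret.
dis.dis(greet)
```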

Java (JVM Runtime):

  • Java source code (.java) is compiled into .class bytecode using javac.

  • The Class Loader loads the .class bytecode into the Java Virtual Machine (JVM).

  • Bytecode is verified and executed.

  • The JVM uses both an interpreter and a JIT compiler: frequently executed code (hot paths) is compiled into native machine code, making Java faster.

Over to you: Do you prefer the flexibility of Python or the performance consistency of Java?


Help us Make ByteByteGo Newsletter Better

TL;DR: Take this 2-minute survey so I can learn more about who you are, what you do, and how I can improve ByteByteGo.

Take the ByteByteGo Survey


Engineering impact lost in translation? (Sponsored)

Outcomes speak louder than outputs.

DevStats gives engineering leaders the shared language they need to align with business goals and prove impact without the endless back-and-forth.

✅ Show how dev work connects to business outcomes

✅ Translate engineering metrics into exec-friendly insights

✅ Spot bottlenecks early and keep delivery flowing

✅ Prove your team’s value with every release

Finally, a way to make the business understand engineering, and make engineering impossible to ignore.

👉 Try DevStats for free


Design Patterns Cheat Sheet

The cheat sheet briefly explains each pattern and how to use it.

What's included?

  • Factory

  • Builder

  • Prototype

  • Singleton

  • Chain of Responsibility

  • And many more!
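
As a taste of the cheat sheet, here is a minimal Factory in Python: callers ask for a notifier by name and never touch the concrete classes directly. The class names are illustrative, not taken from the cheat sheet itself.

```python
class EmailNotifier:
    def send(self, msg: str) -> None:
        print(f"[email] {msg}")

class SmsNotifier:
    def send(self, msg: str) -> None:
        print(f"[sms] {msg}")

def notifier_factory(channel: str):
    """Factory: centralizes object creation so callers depend only on the interface."""
    registry = {"email": EmailNotifier, "sms": SmsNotifier}
    return registry[channel]()

notifier_factory("email").send("Build finished")
```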


CI/CD Simplified Visual Guide

Whether you're a developer, a DevOps specialist, a tester, or involved in any modern IT role, CI/CD pipelines have become an integral part of the software development process.

Continuous Integration (CI) is a practice where code changes are frequently combined into a shared repository. This process includes automatic checks to ensure the new code works well with the existing code.

Continuous Deployment (CD) takes care of automatically putting these code changes into real-world use. It makes sure that the process of moving new code to production is smooth and reliable.

This visual guide is designed to help you grasp and enhance your methods for creating and delivering software more effectively.

Over to you: Which tools or strategies do you find most effective in implementing CI/CD in your projects?


How Apache Kafka Works

Apache Kafka is a distributed event streaming platform that lets producers publish data and consumers subscribe to it in real-time. Here’s how it works:

  1. A producer application creates data, like website clicks or payment events.

  2. The data is converted by a serializer into bytes so Kafka can handle it.

  3. A partitioner decides which topic partition the message should go to.

  4. The message is published into a Kafka cluster made of multiple brokers.

  5. Each broker stores partitions of topics and replicates them to others for safety.

  6. Messages inside partitions are stored in order and available for reading.

  7. A consumer group subscribes to the topic and takes responsibility for processing data.

  8. Each consumer in the group reads from different partitions to balance the work.

  9. Consumers process the data in real-time, such as updating dashboards or triggering actions.
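
The nine steps above map almost one-to-one onto a minimal producer/consumer pair written with the kafka-python client. The broker address, topic name, and payload are assumptions for illustration; a broker must be running at the configured address for the code to connect.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: serialize events to bytes; the key drives partition assignment.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # assumed local broker
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clicks", key="user_42", value={"page": "/pricing", "ts": 1726000000})
producer.flush()

# Consumer side: join a consumer group; partitions are balanced across members.
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    group_id="dashboard-updaters",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```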

Over to you: Have you used Apache Kafka?


Load Balancers vs API Gateways vs Reverse Proxies: How Can They Work Together?

While load balancers, API gateways, and reverse proxies often overlap in functionality, each also has specific roles that can be leveraged in a project.

  1. A client request first hits the edge load balancer, which provides a single entry point and distributes traffic to the API Gateway.

  2. The API Gateway takes over and performs initial parameter validations to ensure the request is well-formed.

  3. Next, the whitelist verification checks if the source is trusted and allowed to access the APIs.

  4. Authentication and authorization validate the identity of the requester and their permissions.

  5. Rate limiting ensures the client is not overwhelming the system with too many requests.

  6. Finally, the reverse proxy inside the gateway forwards the request to the correct service endpoint.

  7. At the service layer, another load balancer distributes the request across multiple instances of the target microservice.

  8. The chosen service instance processes the request and returns the response to the client.
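
A toy sketch of steps 2 through 6 as a chain of checks inside a gateway, with an in-memory whitelist and a simple fixed-window rate limiter. Real gateways use distributed counters and pluggable policies; every constant here is an assumption.

```python
import time

WHITELIST = {"10.0.0.5", "10.0.0.6"}          # assumed trusted sources
RATE_LIMIT, WINDOW_SECONDS = 100, 60
_request_counts: dict[str, list[float]] = {}

def handle(request: dict) -> str:
    # 2. Parameter validation: the request must be well-formed.
    if "path" not in request or "client_ip" not in request:
        return "400 Bad Request"
    # 3. Whitelist verification.
    if request["client_ip"] not in WHITELIST:
        return "403 Forbidden"
    # 4. Authentication and authorization (stubbed here).
    if request.get("token") != "valid-token":
        return "401 Unauthorized"
    # 5. Rate limiting: keep only timestamps inside the window, then count them.
    now = time.monotonic()
    hits = [t for t in _request_counts.get(request["client_ip"], []) if now - t < WINDOW_SECONDS]
    if len(hits) >= RATE_LIMIT:
        return "429 Too Many Requests"
    _request_counts[request["client_ip"]] = hits + [now]
    # 6. Reverse proxy: forward to a service instance behind an internal load balancer.
    return f"forwarded {request['path']} to an upstream service instance"

print(handle({"path": "/orders", "client_ip": "10.0.0.5", "token": "valid-token"}))
```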

Over to you: Have you used load balancers, API gateways, and reverse proxies in your applications?


Help us Make ByteByteGo Newsletter Better

TL;DR: Take this 2-minute survey so I can learn more about who you are, what you do, and how I can improve ByteByteGo.

Take the ByteByteGo Survey


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

Start Learning AI — Our New YouTube Channel is Live

2025-09-12 23:30:39

BIG announcement: We’ve launched a new YouTube channel to make learning AI easier — ByteByteAI!

Subscribe Here

Our first video is already live, and we plan to post a new one every week.

Here's a sneak peek into some titles you might see in the future:

- How Are Reasoning LLMs Like “GPT-5” Built?

- How to Build a Coding Agent?

- How LLMs See the World

- The $250M Paper - Molmo

- What Is Prompt and Context Engineering?

- How Does YouTube Recommend Videos?

- How Does Netflix Recommend Shows?

- How Does Google Translate Work?

- How to Build a Text-to-Image System?

- Are Small Language Models the future of agentic AI?

- How do LLMs remember things?

- Hacking AI with Words: Prompt Injection Explained

- And many more…

Subscribe Here

Consistent Hashing 101: How Modern Systems Handle Growth and Failure

2025-09-11 23:30:46

In the world of distributed systems, one of the hardest problems isn’t just storing or retrieving data. It’s figuring out where that data should live when we have dozens, hundreds, or even thousands of servers. 

Imagine running a large-scale web service where user profiles, cached web pages, or product catalogs need to be spread across multiple machines. 

Consistent hashing emerged as a clever solution to this problem and quickly became one of the foundational ideas for scaling distributed systems. 

Instead of scattering keys randomly and having to reshuffle them every time the cluster size changes, consistent hashing ensures that only a small, predictable portion of keys needs to move when servers are added or removed. This property, often described as “minimal disruption,” is what makes the technique so powerful.
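
A compact sketch of the idea in Python: each server is hashed onto a ring at several positions (the virtual nodes discussed later in the article), and a key belongs to the first server clockwise from its own hash. Adding or removing a server only remaps the keys in the affected arcs.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, replicas: int = 100):
        self.replicas = replicas            # virtual nodes per physical server
        self._ring: list[int] = []          # sorted hash positions on the ring
        self._owners: dict[int, str] = {}   # position -> server name

    def add(self, server: str) -> None:
        for i in range(self.replicas):
            pos = _hash(f"{server}#{i}")
            bisect.insort(self._ring, pos)
            self._owners[pos] = server

    def remove(self, server: str) -> None:
        for i in range(self.replicas):
            pos = _hash(f"{server}#{i}")
            self._ring.remove(pos)
            del self._owners[pos]

    def owner(self, key: str) -> str:
        # First ring position clockwise from the key's hash (wrapping around).
        pos = self._ring[bisect.bisect(self._ring, _hash(key)) % len(self._ring)]
        return self._owners[pos]

ring = ConsistentHashRing()
for s in ("cache-a", "cache-b", "cache-c"):
    ring.add(s)
print(ring.owner("user:42"))   # stays stable for most keys as servers come and go
```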

Over the years, consistent hashing has been adopted by some of the largest companies in technology. It underpins distributed caching systems like memcached, powers databases like Apache Cassandra and Riak, and is at the heart of large-scale architectures such as Amazon Dynamo. When browsing a social media feed, streaming a video, or shopping online, chances are that consistent hashing is working quietly in the background to keep the experience smooth and fast.

In this article, we will look at consistent hashing in detail. We will also understand the improvements to consistent hashing using virtual nodes and how it helps scale systems.

The Problem with Traditional Hashing

Read more

How DoorDash uses AI Models to Understand Restaurant Menus

2025-09-10 23:30:09

Make tribal knowledge self-serve (Sponsored)

Cut onboarding time, reduce interruptions, and ship faster by surfacing the knowledge locked across GitHub, Slack, Jira, and Confluence (and more). You get:

  • Instant answers to questions about your architecture, past workarounds, and current projects.

  • An MCP Server that supercharges Claude and Cursor with your team knowledge so they generate code that makes sense in your codebase.

  • An agent that posts root cause and fix suggestions for CI failures directly in your Pull Request.

  • A virtual member of your team that automates internal support without extra overhead.

Check out Unblocked


Disclaimer: The details in this post have been derived from the official documentation shared online by the DoorDash Engineering Team. All credit for the technical details goes to the DoorDash Engineering Team.  The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

When we order food online, the last thing we want is an out-of-date or inaccurate menu. 

However, for delivery platforms, keeping menus fresh is a never-ending challenge. Restaurants constantly update items, prices, and specials, and doing all of this manually at scale is costly and slow.

DoorDash tackled this problem by applying large language models (LLMs) to automate the process of turning restaurant menu photos into structured, usable data. The technical goal of their project was clear: achieve accurate transcription of menu photos into structured menu data while keeping latency and cost low enough for production at scale.

On the surface, the idea is straightforward: take a photo, run it through AI, and get back a clean digital menu. In practice, though, the messy reality of real-world images (cropped photos, poor lighting, cluttered layouts) quickly exposes the limitations of LLMs on their own.

But the key insight was that LLMs, while strong at summarization and organization, break down when faced with noisy or incomplete inputs. To overcome this, DoorDash designed a system with guardrails. These are mechanisms that decide when automation is reliable enough to use and when a human needs to step in.

In this article, we will look at how DoorDash designed such a system and the challenges they faced.

Baseline MVP

The first step was to prove whether menus could be digitized at all in an automated way. 

The engineering team started with a simple pipeline: OCR to LLM. The OCR system extracted raw text from menu photos, and then a large language model was tasked with converting that text into a structured schema of categories, items, and attributes.
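
In outline, the MVP looked something like the sketch below. The run_ocr and call_llm helpers are stand-ins for whatever OCR engine and LLM endpoint were actually used; the prompt and output schema are illustrative, not DoorDash's.

```python
import json

def run_ocr(image_path: str) -> str:
    """Placeholder for an OCR engine; returns the raw text found in the photo."""
    return "Appetizers\nSpring Rolls  5.99\nMains\nPad Thai  12.50"

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; returns structured JSON as a string."""
    return json.dumps({"categories": [
        {"name": "Appetizers", "items": [{"name": "Spring Rolls", "price": 5.99}]},
        {"name": "Mains", "items": [{"name": "Pad Thai", "price": 12.50}]},
    ]})

def transcribe_menu(image_path: str) -> dict:
    raw_text = run_ocr(image_path)
    prompt = (
        "Convert this raw menu text into JSON with categories, items, and prices:\n"
        + raw_text
    )
    return json.loads(call_llm(prompt))

print(transcribe_menu("menu_photo.jpg"))
```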

This approach worked well enough as a prototype. 

It showed that a machine could, in principle, take a photo of a menu and output something resembling a digital menu. But once the system was tested at scale, cracks began to appear. Accuracy suffered in ways that were too consistent to ignore.

The main reasons were as follows:

  • Inconsistent menu structures: Real-world menus are not neatly ordered lists. Some are multi-column, others use mixed fonts, and many scatter categories and items in unpredictable ways. OCR tools often scramble the reading order, which means the LLM ends up pairing items with the wrong attributes or misplacing categories entirely.

  • Incomplete menus: Photos are often cropped or partial, capturing only sections of a menu. When the LLM receives attributes without their parent items, or items without their descriptions, it makes guesses. These guesses lead to mismatches and incorrect entries in the structured output.

  • Low photographic quality: Many menu photos are taken in dim lighting, with glare from glass frames or clutter in the background. Small fonts and angled shots add to the noise. Poor image quality reduces OCR accuracy, and the errors cascade into the LLM stage, degrading the final transcription.

Through human evaluation, the team found that nearly all transcription failures could be traced back to one of these three buckets. 


The Gold standard for AI news (Sponsored)

AI is the most essential technical skill of this decade.

CEOs of GitHub, Box, and others are prioritising hiring engineers with AI skills.

Engineers, devs, and technical leaders at Fortune 1000s + leading Silicon Valley startups read Superhuman AI to stay ahead of the curve and future-proof their skills.

Join 1M+ pros


LLM Guardrail Model

To solve the accuracy problem, the engineering team introduced what they call a guardrail model. 

At its core, this is a classifier that predicts whether the transcription produced from a given menu photo will meet the accuracy bar required for production. The logic is straightforward:

  • If the guardrail predicts that the output is good enough, the structured menu data is automatically published.

  • If the guardrail predicts a likely failure, the photo is routed to a human for transcription.

Building the guardrail meant more than just looking at the image. 

The team realized the model needed to understand how the photo, the OCR system, and the LLM all interacted with each other. So they engineered features from three different sources:

  • Image-level features: These capture the quality of the photo itself, whether it is dark, blurry, has glare, or is cluttered with background objects.

  • OCR-derived features: These measure the reliability of the text extraction, such as how orderly the tokens are, whether confidence scores are high, or if the system has produced fragments and junk text.

  • LLM-output features: These reflect the quality of the structured transcription, such as how internally consistent the categories and attributes are, or whether the coverage looks incomplete.

This multi-view approach directly targets the three failure modes identified earlier: inconsistent menu structure, incomplete menus, and poor photographic quality. 

By combining signals from the image, the OCR process, and the LLM itself, the guardrail learns to separate high-confidence transcriptions from those that are likely to go wrong.

Guardrail Model Training and Performance

Designing the guardrail model opened up the question of which architecture would actually work best in practice. 

The team experimented with a three-component neural network design that looked like this:

  • Image encoding: The raw menu photo was passed through a pretrained vision backbone. They tried CNN-based models like VGG16 and ResNet, as well as transformer-based models such as ViT (Vision Transformer) and DiT (Document Image Transformer).

  • Tabular features: Alongside the image encoding, the network ingested features derived from the OCR output and the LLM transcription.

  • Fusion and classification: These inputs were combined through fully connected layers, ending in a classifier head that predicted whether a transcription was accurate enough.

The diagram below illustrates this design: an image model on one side, OCR/LLM tabular features on the other, both feeding into dense layers and then merging into a final classifier. It’s a standard multimodal fusion approach designed to capture signals from multiple sources simultaneously.

The results, however, were surprising. 

Despite the sophistication of the neural network, the simplest model (LightGBM: a gradient-boosted decision tree) outperformed all the deep learning variants. 

LightGBM not only achieved higher accuracy but also ran faster, which made it far more suitable for production deployment. Among the neural network variants, ResNet-based encoding came closest, while ViT-based models performed worst. The main reason was data: limited labeled samples made it difficult for the more complex architectures to shine.
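
A stripped-down version of the winning approach, assuming a tabular feature matrix built from image, OCR, and LLM signals. The feature names, synthetic labels, and decision threshold below are invented for illustration.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Synthetic stand-ins for the engineered features:
# [photo_blur, ocr_mean_confidence, ocr_token_order_score, llm_schema_consistency]
X = rng.random((1000, 4))
# Label: 1 if a human reviewer judged the transcription accurate enough to publish.
y = (0.3 * X[:, 1] + 0.4 * X[:, 3] + 0.1 * rng.random(1000) > 0.4).astype(int)

guardrail = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
guardrail.fit(X, y)

def route(features: np.ndarray, threshold: float = 0.9) -> str:
    """Publish automatically only when the predicted accuracy probability is high."""
    p_accurate = guardrail.predict_proba(features.reshape(1, -1))[0, 1]
    return "publish" if p_accurate >= threshold else "human_review"

print(route(np.array([0.2, 0.95, 0.9, 0.92])))
```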

Production Pipeline

Once the guardrail model was in place, the team built a full production pipeline that balanced automation with human review. It works step by step:

  • Photo validation: Every submitted menu photo goes through basic checks to ensure the file is usable.

  • Transcription stage: The candidate model (initially the OCR + LLM pipeline) generates a structured transcription from the photo.

  • Guardrail inference: Features from the photo, OCR output, and LLM summary are fed into the guardrail model, which predicts whether the transcription meets accuracy requirements.

  • Routing decisions: If the guardrail predicts the transcription is accurate, the structured data is published automatically. If the guardrail predicts likely errors, the photo is escalated to human transcription.
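
Put together, the serving flow is roughly the following sketch, reusing the transcription and guardrail routing functions from the earlier examples; build_features is a hypothetical helper, and none of the names are DoorDash's.

```python
def process_menu_photo(photo_path: str) -> dict:
    # 1. Photo validation: basic checks that the file is usable.
    if not photo_path.endswith((".jpg", ".png")):
        return {"status": "rejected", "reason": "unsupported file"}

    # 2. Transcription stage: the candidate model produces a structured menu.
    transcription = transcribe_menu(photo_path)          # OCR + LLM sketch above

    # 3. Guardrail inference: predict whether this transcription meets the bar.
    features = build_features(photo_path, transcription) # hypothetical feature builder
    decision = route(features)                           # "publish" or "human_review"

    # 4. Routing: publish automatically or escalate to a human operator.
    if decision == "publish":
        return {"status": "published", "menu": transcription}
    return {"status": "queued_for_human", "menu_draft": transcription}
```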

The diagram below shows this pipeline as a flow: menu photos enter, pass through the transcription model, then are evaluated by the guardrail. From there, accurate cases flow directly into the system, while uncertain ones branch off toward human operators.

This setup immediately raised efficiency. Machines handled the straightforward cases quickly, while humans focused their effort on the difficult menus. The result was a balanced process: automation sped up operations and cut costs without lowering the quality of the final menu data.

Rapid Evolution with Multimodal GenAI

The pace of AI research did not stand still. In the months after the first guardrail model went live, multimodal LLMs (models that could process both images and text directly) became practical enough to try in production. Instead of relying only on OCR to extract text, these models could look at the raw photo and infer structure directly.

The DoorDash engineering team integrated these multimodal models alongside the existing OCR + LLM pipeline. Each approach had clear strengths and weaknesses:

  • Multimodal LLMs proved excellent at understanding context and layout. They could better interpret menus with unusual designs, multi-column layouts, or visual cues that OCR often scrambled. However, they were also more brittle when the photo itself was of poor quality, with dark lighting, glare, or partial cropping.

  • OCR and LLM models were more stable across noisy or degraded inputs, but they struggled with nuanced layout interpretation, often mislinking categories and attributes.

The diagram below shows how the two pipelines now coexist under the same guardrail system.

Both models attempt transcription, and their outputs are evaluated. The guardrail then decides which transcriptions meet the accuracy bar and which need human review.

This hybrid setup led to the best of both worlds. By letting the guardrail arbitrate quality between multimodal and OCR-based models, the system boosted automation rates while still preserving the high accuracy required for production.

Conclusion

Automating the transcription of restaurant menus from photos is a deceptively complex problem. What began as a simple OCR-to-LLM pipeline quickly revealed its limits when confronted with messy, real-world inputs: inconsistent structures, incomplete menus, and poor image quality. 

The engineering team’s solution was not just to push harder on the models themselves, but to rethink the system architecture. The introduction of a guardrail classifier allowed automation to scale responsibly, ensuring that customers and restaurants always saw accurate menus while machines handled the simpler cases.

As the field of generative AI evolved, the system evolved with it. 

By combining OCR and LLM models with newer multimodal approaches under the same guardrail framework, DoorDash was able to harness the strengths of both families of models without being trapped by their weaknesses.

Looking ahead, several opportunities remain open:

  • Domain fine-tuning: The growing dataset of human-verified transcriptions can be used to fine-tune LLMs and multimodal models for the specific quirks of restaurant menus.

  • Upstream quality controls: Investing in photo preprocessing with techniques like de-glare, de-noising, de-skewing, and crop detection will lift the accuracy of both OCR-based and multimodal systems.

  • Guardrail refinement: As models continue to improve, so can the guardrail. Expanding its feature set, retraining LightGBM, or even exploring hybrid architectures will push safe automation further.

References:


ByteByteGo Technical Interview Prep Kit

Launching the all-in-one interview prep kit. We’re making all the books available on the ByteByteGo website.

What's included:

  • System Design Interview

  • Coding Interview Patterns

  • Object-Oriented Design Interview

  • How to Write a Good Resume

  • Behavioral Interview (coming soon)

  • Machine Learning System Design Interview

  • Generative AI System Design Interview

  • Mobile System Design Interview

  • And more to come

Launch sale: 50% off


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

How Netflix Tudum Supports 20 Million Users With CQRS

2025-09-09 23:30:44

Build MCP servers that perform with Gram by Speakeasy (Sponsored)

AI agents get confused by MCP servers that include too many tools, lack crucial context, or are simple API mirrors. Without additional work, your carefully designed APIs cause agent headaches.

Gram fixes this. It's an open source platform where you can curate tools: add context, design multi-step tools, and deploy your MCP server in minutes.

Transform your APIs into agent-ready infrastructure that is ready to scale with OAuth 2.1 support, centralized management, and hosted infrastructure.

Start building your MCP server →


Disclaimer: The details in this post have been derived from the official documentation shared online by the Netflix Engineering Team. All credit for the technical details goes to the Netflix Engineering Team.  The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

When Netflix launched Tudum as its official home for behind-the-scenes stories, fan interviews, and interactive experiences, the engineering challenge was clear: deliver fresh, richly formatted content to millions of viewers at high speed, while giving editors a seamless way to preview updates in real time. 

The initial architecture followed a classic CQRS (Command Query Responsibility Segregation) pattern, separating the “write path” for editorial tools from the “read path” for visitors. Kafka connected these paths, pushing read-optimized data into backend services for page construction. 

The approach was scalable and reliable, but not without trade-offs.

As Tudum grew, editors noticed a frustrating delay between saving an update and seeing it live in previews. The culprit was a chain of sequential processes and cache refresh cycles that, while suitable for production visitors, slowed down the creative workflow. 

To solve this, Netflix engineers replaced the read-path’s external key-value store and per-request I/O with RAW Hollow: a compressed, distributed, in-memory object store embedded directly in the application. 

The result was near-instant editorial preview, simpler infrastructure, and a major drop in page construction time for end users. In this article, we will look at the evolution of this design decision and how Netflix went about implementing it.

Early Design

Netflix’s Tudum platform had to support two fundamentally different workflows:

  • Write path: This is where content editors create and update rich, media-heavy stories in a content management system (CMS).

  • Read path: This is where millions of global visitors consume those stories in a format optimized for fast rendering and delivery.

To keep these workflows independent and allow each to scale according to its needs, Netflix adopted a CQRS (Command Query Responsibility Segregation) architecture. 

See the diagram below for a general overview of CQRS.

The write store contains the raw editorial data (internal CMS objects with IDs, metadata, and references), while the read store contains a fully “render-ready” version of the same data, such as resolved movie titles instead of IDs, CDN-ready image URLs instead of internal asset references, and precomputed layout elements.

As mentioned, Kafka served as the bridge between the two paths. When an editor made a change, the CMS emitted an event to Tudum’s ingestion layer. This ingestion pipeline performed the following steps:

  • Pulled the content from the CMS.

  • Applied templates and business rules to ensure formatting consistency.

  • Validated the data for required fields and constraints.

  • Transformed placeholders into production-ready assets (for example, movie title lookups, CDN URL resolution).

The processed content was published to a Kafka topic.
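
A condensed sketch of that ingestion step in Python: a raw CMS object is hydrated into a render-ready record and published to a Kafka topic. The topic name, record fields, broker address, and lookup table are assumptions for illustration.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                     # assumed broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

MOVIE_TITLES = {"mv_001": "Example Title"}                  # stand-in for a title lookup service

def to_read_optimized(cms_element: dict) -> dict:
    """Resolve IDs and asset references into values the page can render directly."""
    return {
        "elementId": cms_element["id"],
        "title": MOVIE_TITLES[cms_element["movieRef"]],      # movie ID -> resolved title
        "imageUrl": f"https://cdn.example.com/{cms_element['assetRef']}.jpg",
        "layout": "hero",
    }

element = {"id": "el_1", "movieRef": "mv_001", "assetRef": "asset_9"}
producer.send("tudum-page-elements", value=to_read_optimized(element))
producer.flush()
```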

A Data Service Consumer subscribed to this topic, reading each new or updated page element. It wrote this data into a Cassandra-backed read store, structured for fast retrieval. Finally, an API layer exposed these read-optimized entities to downstream consumers such as the Page Construction Service (which assembles full pages for rendering), personalization services, and other internal tools.

See the diagram below:

This event-driven design ensured that editorial changes would eventually appear on the Tudum website without impacting write-side performance, while also allowing Netflix to scale the read and write paths independently.

The Pain of Eventual Consistency

While the CQRS-with-Kafka design was robust and scalable, it introduced a workflow bottleneck that became increasingly visible as Tudum’s editorial output grew.

Every time an editor made a change in the CMS, that change had to travel through a long chain before it appeared in a preview environment or on the live site. Here is a quick look at the various steps involved:

  • Save in CMS: The updated content is stored in the write database.

  • Webhook to Tudum ingestion: The CMS notifies Tudum’s ingestion layer of the change.

  • Re-read and process: Ingestion fetches the relevant sections, applies templates, validates data, and performs asset resolution.

  • Publish to Kafka: The processed content is sent to the designated Kafka topic.

  • Consume and store: The Data Service Consumer ingests the message, writes it into Cassandra as a read-optimized record.

  • Cache refresh: The Page Data Service, sitting in front of Cassandra, maintains a near-cache that must refresh for the new content to be visible.

This near-cache was a key contributor to the delay. Technically speaking, the near-cache is a small, per-instance, in-memory layer that sits in front of the read store. However, rather than refreshing instantly for every update, it operated on a scheduled per-key refresh policy. Each key has a timer. When the timer fires, the instance refreshes that key from the backing store. While this approach was designed for production traffic efficiency, it meant that fresh edits often waited for the next scheduled refresh cycle before appearing.

As content volume and the number of page elements increased, these refresh cycles stretched longer. A page is assembled from multiple fragments, each with its key and timer. They do not refresh together. This meant that the more elements a page had, the more staggered the refresh completion became, leading to inconsistent preview states. In other words, some elements got updated, but others remained stale.

The result was that editors had to sometimes wait minutes to see their changes reflected in a preview, even though the system had already processed and stored the update. 

For a platform like Tudum, where timing is critical for publishing stories tied to new releases and events, this delay disrupted editorial flow and complicated collaboration between writers, editors, and designers.

The Solution: RAW Hollow

To eliminate the bottlenecks in Tudum’s read path, Netflix engineers turned to RAW Hollow: a compressed, distributed, in-memory object store designed for scenarios where datasets are small-to-medium in size, change infrequently, and must be served with extremely low latency.

Unlike the earlier setup, where read services fetched data from an external Cassandra-backed key-value store (with network calls, cache layers, and refresh cycles), RAW Hollow keeps the entire dataset loaded directly into the memory of every application instance that needs it. This means all lookups happen in-process, avoiding the I/O and cache-invalidation complexities of the old approach.

The key characteristics of RAW Hollow in the Tudum context are as follows:

  • Distributed and co-located: Each service instance holds the full dataset in memory. Updates are propagated so all instances stay synchronized without having to query an external store on demand.

  • Compression for scalability: Data is stored in a compressed binary form, significantly reducing RAM usage. In Tudum’s case, three years of unhydrated data fit into ~130 MB, which is about one-quarter the size of the same data in Apache Iceberg.

  • Consistency control per request: By default, RAW Hollow favors eventual consistency for high availability, but services can opt in to strong read-after-write consistency on a per-request basis. This is especially useful for editor preview flows, ensuring that a just-published update is immediately visible without waiting for the standard sync interval.

  • Built for in-memory computation: Because the dataset is always in RAM, services like page construction, search, and personalization can retrieve data in O(1) time, eliminating network round-trips.

For Tudum, adopting RAW Hollow meant removing the Page Data Service, its near-cache, the external key-value store, and even Kafka from the read path. Instead, the Hollow client was embedded directly inside each microservice that needed content. This collapsed the number of sequential operations, tightened the feedback loop for editors, and simplified the architecture by removing multiple moving parts.
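
RAW Hollow itself is a Java library, but its behavioral contract can be sketched in a few lines of Python: every instance holds the whole dataset in local memory, reads are plain dictionary lookups, and a caller can optionally insist on seeing at least a given write version (read-after-write) instead of whatever snapshot has propagated so far. This is a conceptual simulation, not Hollow's API.

```python
import threading

class InMemoryDataset:
    """Conceptual stand-in for a co-located, versioned in-memory read store."""

    def __init__(self):
        self._data: dict[str, dict] = {}
        self._version = 0
        self._version_changed = threading.Condition()

    def apply_update(self, key: str, record: dict) -> int:
        """Called by the replication mechanism when a new data snapshot arrives."""
        with self._version_changed:
            self._data[key] = record
            self._version += 1
            self._version_changed.notify_all()
            return self._version

    def get(self, key: str, min_version: int | None = None, timeout: float = 2.0):
        """Eventual consistency by default; pass min_version for read-after-write."""
        if min_version is not None:
            with self._version_changed:
                self._version_changed.wait_for(
                    lambda: self._version >= min_version, timeout=timeout
                )
        return self._data.get(key)   # O(1) local lookup, no network hop

store = InMemoryDataset()
v = store.apply_update("page:home", {"hero": "Featured story"})
print(store.get("page:home", min_version=v))  # editor preview: guaranteed fresh
```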

The result was a big shift: instead of “store to fetch to cache to refresh,” the system now operates on “load once into memory to serve instantly to propagate changes.” 

The New TUDUM Design

After adopting RAW Hollow, Netflix rebuilt Tudum’s read path to remove the layers that were slowing down editorial previews and adding unnecessary complexity. 

The new design still follows the CQRS principle (separating the editorial content creation from the visitor-facing content), but the way data moves through the read side is now radically simplified.

See the diagram below:

Here’s what changed in the architecture:

  • No more Page Data Service and near-cache: The old facade over the key-value store, which managed its own scheduled cache refresh cycles, was removed entirely. Services no longer wait for caches to update: the data they need is already in memory.

  • No external key-value store for read path: Cassandra (and the I/O it required) is no longer queried during page rendering or personalization. Instead, the entire read-optimized dataset is stored in RAM on every service instance via RAW Hollow.

  • No Kafka in the read path: While Kafka still has use cases elsewhere, the read side no longer depends on Kafka to propagate updates for serving live traffic. RAW Hollow handles data distribution and synchronization internally.

  • Embedded Hollow client in every microservice: Services like Page Construction, Search, and Personalization now run their own embedded Hollow clients. This gives them direct, O(1)-time access to the latest read-optimized content without network calls.

The new flow works as follows:

  • When content is updated in the write path, it’s processed into a read-optimized format

  • RAW Hollow distributes that update to all Hollow clients across service instances. This is known as the Hollow State, which is an in-memory dataset that each service process holds locally.

  • Because each instance has the full dataset in memory, any request (whether from an editor preview or a live user) is served immediately without cache checks or datastore queries.

  • For editorial preview, services can request strong read-after-write consistency, ensuring that the very latest update is displayed instantly.

This re-architecture shifted Tudum’s read path from a multi-hop network-bound pipeline to a memory-local lookup model. In essence, Netflix took the scalability and separation of CQRS but stripped away the read path’s I/O-heavy plumbing, replacing it with a memory-first, embedded data model.

Conclusion

The shift from a Kafka-and-Cassandra read path fronted by caches to a RAW Hollow in-memory model produced immediate and measurable improvements for Tudum. Some of the key benefits were as follows:

  • Massive latency reduction: In controlled tests (factoring out TLS, authentication, WAF, and logging overhead), home page construction time dropped from roughly 1.4 seconds to about 0.4 seconds once all read-path services consumed Hollow in-memory state. This speed gain directly impacts both editorial previews and live visitor requests.

  • Near-instant editorial preview: Editors now see updates reflected in seconds instead of waiting minutes for cache refresh cycles. Strong read-after-write consistency on preview requests ensures changes are visible immediately after saving in the CMS.

  • Small memory footprint: Compression allows the entire read dataset to fit comfortably in RAM. Three years’ worth of unhydrated content weighs in at about 130 MB, roughly 25% of its size in the previous Apache Iceberg format. This makes it viable to keep the full dataset co-located with every service instance without expensive scaling.

  • Operational simplicity: Removing Kafka, the external key-value store, and near-cache layers from the read path reduced moving parts and failure points, while eliminating cache-invalidation headaches.

Netflix’s re-architecture of Tudum’s read path shows how rethinking data access patterns can yield outsized gains in performance, simplicity, and developer experience.

By combining the scalability of CQRS with the speed of an in-memory, compressed object store like RAW Hollow, they created a system that serves both editorial agility and end-user responsiveness.

The lessons here are broadly applicable:

  • Minimize sequential operations in latency-critical flows.

  • Keep frequently accessed datasets as close to the application as possible.

  • Use consistency controls selectively to balance freshness with availability.

References:


ByteByteGo Technical Interview Prep Kit

Launching the all-in-one interview prep kit. We’re making all the books available on the ByteByteGo website.

What's included:

  • System Design Interview

  • Coding Interview Patterns

  • Object-Oriented Design Interview

  • How to Write a Good Resume

  • Behavioral Interview (coming soon)

  • Machine Learning System Design Interview

  • Generative AI System Design Interview

  • Mobile System Design Interview

  • And more to come

Launch sale: 50% off


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].