Blog of ByteByteGo

System design and interviewing experts and authors of best-selling books, offering newsletters and courses.

Start Learning AI — Our New YouTube Channel is Live

2025-09-12 23:30:39

BIG announcement: We’ve launched a new YouTube channel to make learning AI easier — ByteByteAI!

Subscribe Here

Our first video is already live, and we plan to post a new one every week.

Here's a sneak peak into some titles you might see in the future:

- How Are Reasoning LLMs Like “GPT-5” Built?

- How to Build a Coding Agent?

- How LLMs See the World

- The $250M Paper - Molmo

- What Is Prompt and Context Engineering?

- How Does YouTube Recommend Videos?

- How Does Netflix Recommend Shows?

- How Does Google Translate Work?

- How to Build a Text-to-Image System?

- Are Small Language Models the future of agentic AI?

- How do LLMs remember things?

- Hacking AI with Words: Prompt Injection Explained

- And many more…

Subscribe Here

Consistent Hashing 101: How Modern Systems Handle Growth and Failure

2025-09-11 23:30:46

In the world of distributed systems, one of the hardest problems isn’t just storing or retrieving data. It’s figuring out where that data should live when we have dozens, hundreds, or even thousands of servers. 

Imagine running a large-scale web service where user profiles, cached web pages, or product catalogs need to be spread across multiple machines. 

Consistent hashing emerged as a clever solution to this problem and quickly became one of the foundational ideas for scaling distributed systems. 

Instead of scattering keys randomly and having to reshuffle them every time the cluster size changes, consistent hashing ensures that only a small, predictable portion of keys needs to move when servers are created or destroyed. This property, often described as “minimal disruption,” is what makes the technique so powerful. 

Over the years, consistent hashing has been adopted by some of the largest companies in technology. It underpins distributed caching systems like memcached, powers databases like Apache Cassandra and Riak, and is at the heart of large-scale architectures such as Amazon Dynamo. When browsing a social media feed, streaming a video, or shopping online, chances are that consistent hashing is working quietly in the background to keep the experience smooth and fast.

In this article, we will look at consistent hashing in detail. We will also see how virtual nodes improve on basic consistent hashing and help systems scale.
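To make the core idea concrete before diving deeper, here is a minimal Python sketch of a consistent hash ring (an illustrative toy, not a production implementation): both nodes and keys are hashed onto the same numeric ring, and each key is owned by the first node encountered clockwise from its hash.

```python
import bisect
import hashlib

def hash_value(key: str) -> int:
    # Map any string onto a large integer ring using MD5.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes=None):
        self.ring = []       # sorted list of node hash positions
        self.node_at = {}    # hash position -> node name
        for node in nodes or []:
            self.add_node(node)

    def add_node(self, node: str):
        h = hash_value(node)
        bisect.insort(self.ring, h)
        self.node_at[h] = node

    def remove_node(self, node: str):
        h = hash_value(node)
        self.ring.remove(h)
        del self.node_at[h]

    def get_node(self, key: str) -> str:
        # The key is owned by the first node clockwise from its hash position.
        h = hash_value(key)
        idx = bisect.bisect(self.ring, h) % len(self.ring)
        return self.node_at[self.ring[idx]]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_node("user:42"))
```

When a node joins or leaves, only the keys that fall between that node and its predecessor on the ring change owners, which is exactly the minimal-disruption property described above. Virtual nodes, discussed later, place each physical node at many positions on the ring to spread load more evenly.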

The Problem with Traditional Hashing

Read more

How DoorDash uses AI Models to Understand Restaurant Menus

2025-09-10 23:30:09

Make tribal knowledge self-serve (Sponsored)

Cut onboarding time, reduce interruptions, and ship faster by surfacing the knowledge locked across GitHub, Slack, Jira, and Confluence (and more). You get:

  • Instant answers to questions about your architecture, past workarounds, and current projects.

  • An MCP Server that supercharges Claude and Cursor with your team knowledge so they generate code that makes sense in your codebase.

  • An agent that posts root cause and fix suggestions for CI failures directly in your pull request.

  • A virtual member of your team that automates internal support without extra overhead.

Check out Unblocked


Disclaimer: The details in this post have been derived from the official documentation shared online by the DoorDash Engineering Team. All credit for the technical details goes to the DoorDash Engineering Team.  The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

When we order food online, the last thing we want is an out-of-date or inaccurate menu. 

However, for delivery platforms, keeping menus fresh is a never-ending challenge. Restaurants constantly update items, prices, and specials, and doing all of this manually at scale is costly and slow.

DoorDash tackled this problem by applying large language models (LLMs) to automate the process of turning restaurant menu photos into structured, usable data. The technical goal of their project was clear: achieve accurate transcription of menu photos into structured menu data while keeping latency and cost low enough for production at scale.

On the surface, the idea is straightforward: take a photo, run it through AI, and get back a clean digital menu. In practice, though, the messy reality of real-world images (cropped photos, poor lighting, cluttered layouts) quickly exposes the limitations of LLMs on their own.

The key insight was that LLMs, while strong at summarization and organization, break down when faced with noisy or incomplete inputs. To overcome this, DoorDash designed a system with guardrails: mechanisms that decide when automation is reliable enough to use and when a human needs to step in.

In this article, we will look at how DoorDash designed such a system and the challenges they faced.

Baseline MVP

The first step was to prove whether menus could be digitized at all in an automated way. 

The engineering team started with a simple pipeline: OCR to LLM. The OCR system extracted raw text from menu photos, and then a large language model was tasked with converting that text into a structured schema of categories, items, and attributes.
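The original post does not share code, but the baseline can be sketched roughly as below. The function names, prompt, and schema here are hypothetical placeholders standing in for whatever OCR engine and LLM API DoorDash actually used.

```python
import json

def extract_text_with_ocr(image_path: str) -> str:
    # Placeholder: call an OCR engine and return raw text in reading order.
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    # Placeholder: send the prompt to an LLM and return its text response.
    raise NotImplementedError

MENU_SCHEMA_PROMPT = """You are given raw text extracted from a restaurant menu photo.
Return JSON with this shape:
{"categories": [{"name": str, "items": [{"name": str, "price": str, "description": str}]}]}
Menu text:
"""

def transcribe_menu(image_path: str) -> dict:
    raw_text = extract_text_with_ocr(image_path)        # step 1: OCR
    response = call_llm(MENU_SCHEMA_PROMPT + raw_text)   # step 2: LLM structuring
    return json.loads(response)                          # structured menu data
```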

This approach worked well enough as a prototype. 

It showed that a machine could, in principle, take a photo of a menu and output something resembling a digital menu. But once the system was tested at scale, cracks began to appear. Accuracy suffered in ways that were too consistent to ignore.

The main reasons were as follows:

  • Inconsistent menu structures: Real-world menus are not neatly ordered lists. Some are multi-column, others use mixed fonts, and many scatter categories and items in unpredictable ways. OCR tools often scramble the reading order, which means the LLM ends up pairing items with the wrong attributes or misplacing categories entirely.

  • Incomplete menus: Photos are often cropped or partial, capturing only sections of a menu. When the LLM receives attributes without their parent items, or items without their descriptions, it makes guesses. These guesses lead to mismatches and incorrect entries in the structured output.

  • Low photographic quality: Many menu photos are taken in dim lighting, with glare from glass frames or clutter in the background. Small fonts and angled shots add to the noise. Poor image quality reduces OCR accuracy, and the errors cascade into the LLM stage, degrading the final transcription.

Through human evaluation, the team found that nearly all transcription failures could be traced back to one of these three buckets. 


The Gold standard for AI news (Sponsored)

AI is the most essential technical skill of this decade.

CEOs of GitHub, Box, and others are prioritising hiring engineers with AI skills.

Engineers, devs, and technical leaders at Fortune 1000s + leading Silicon Valley startups read Superhuman AI to stay ahead of the curve and future-proof their skills.

Join 1M+ pros


LLM Guardrail Model

To solve the accuracy problem, the engineering team introduced what they call a guardrail model. 

At its core, this is a classifier that predicts whether the transcription produced from a given menu photo will meet the accuracy bar required for production. The logic is straightforward:

  • If the guardrail predicts that the output is good enough, the structured menu data is automatically published.

  • If the guardrail predicts a likely failure, the photo is routed to a human for transcription.

Building the guardrail meant more than just looking at the image. 

The team realized the model needed to understand how the photo, the OCR system, and the LLM all interacted with each other. So they engineered features from three different sources:

  • Image-level features: These capture the quality of the photo itself, whether it is dark, blurry, has glare, or is cluttered with background objects.

  • OCR-derived features: These measure the reliability of the text extraction, such as how orderly the tokens are, whether confidence scores are high, or if the system has produced fragments and junk text.

  • LLM-output features: These reflect the quality of the structured transcription, such as how internally consistent the categories and attributes are, or whether the coverage looks incomplete.

This multi-view approach directly targets the three failure modes identified earlier: inconsistent menu structure, incomplete menus, and poor photographic quality. 

By combining signals from the image, the OCR process, and the LLM itself, the guardrail learns to separate high-confidence transcriptions from those that are likely to go wrong.

Guardrail Model Training and Performance

Designing the guardrail model opened up the question of which architecture would actually work best in practice. 

The team experimented with a three-component neural network design that looked like this:

  • Image encoding: The raw menu photo was passed through a pretrained vision backbone. They tried CNN-based models like VGG16 and ResNet, as well as transformer-based models such as ViT (Vision Transformer) and DiT (Document Image Transformer).

  • Tabular features: Alongside the image encoding, the network ingested features derived from the OCR output and the LLM transcription.

  • Fusion and classification: These inputs were combined through fully connected layers, ending in a classifier head that predicted whether a transcription was accurate enough.

The diagram below illustrates this design: an image model on one side, OCR/LLM tabular features on the other, both feeding into dense layers and then merging into a final classifier. It’s a standard multimodal fusion approach designed to capture signals from multiple sources simultaneously.

The results, however, were surprising. 

Despite the sophistication of the neural network, the simplest model (LightGBM: a gradient-boosted decision tree) outperformed all the deep learning variants. 

LightGBM not only achieved higher accuracy but also ran faster, which made it far more suitable for production deployment. Among the neural network variants, ResNet-based encoding came closest, while ViT-based models performed worst. The main reason was data: limited labeled samples made it difficult for the more complex architectures to shine.
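As a rough illustration of the winning approach (the feature count, hyperparameters, and synthetic data below are assumptions, not DoorDash's actual setup), a LightGBM guardrail classifier can be trained on tabular features like this:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Stand-in data: each row holds engineered features from the image, the OCR output,
# and the LLM transcription; the label marks whether human review judged the
# transcription accurate enough to publish.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 24))                      # 24 hypothetical features
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

guardrail = LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=31)
guardrail.fit(X_train, y_train)

val_scores = guardrail.predict_proba(X_val)[:, 1]    # probability the transcription is publishable
print("Validation AUC:", roc_auc_score(y_val, val_scores))
```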

Production Pipeline

Once the guardrail model was in place, the team built a full production pipeline that balanced automation with human review. It works step by step:

  • Photo validation: Every submitted menu photo goes through basic checks to ensure the file is usable.

  • Transcription stage: The candidate model (initially the OCR + LLM pipeline) generates a structured transcription from the photo.

  • Guardrail inference: Features from the photo, OCR output, and LLM summary are fed into the guardrail model, which predicts whether the transcription meets accuracy requirements.

  • Routing decisions: If the guardrail predicts the transcription is accurate, the structured data is published automatically. If the guardrail predicts likely errors, the photo is escalated to human transcription.

The diagram below shows this pipeline as a flow: menu photos enter, pass through the transcription model, then are evaluated by the guardrail. From there, accurate cases flow directly into the system, while uncertain ones branch off toward human operators.

This setup immediately raised efficiency. Machines handled the straightforward cases quickly, while humans focused their effort on the difficult menus. The result was a balanced process: automation sped up operations and cut costs without lowering the quality of the final menu data.
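The routing decision itself is conceptually simple. The sketch below is a simplified illustration (the threshold value and the transcribe, publish, and human-queue functions are hypothetical), but it mirrors the four steps described above:

```python
ACCURACY_THRESHOLD = 0.9   # assumed value; tuned against business accuracy targets

def process_menu_photo(photo, transcribe, guardrail, build_features,
                       publish, send_to_human_queue):
    # 1. Basic validation of the submitted photo.
    if not photo.is_readable():
        return send_to_human_queue(photo, reason="unusable photo")

    # 2. Candidate transcription (OCR + LLM, or a multimodal model).
    transcription = transcribe(photo)

    # 3. Guardrail inference on image, OCR, and LLM-output features.
    features = build_features(photo, transcription)
    p_accurate = guardrail.predict_proba([features])[0][1]

    # 4. Routing decision.
    if p_accurate >= ACCURACY_THRESHOLD:
        return publish(transcription)                # automated path
    return send_to_human_queue(photo, reason="low guardrail confidence")
```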

Rapid Evolution with Multimodal GenAI

The pace of AI research did not stand still. In the months after the first guardrail model went live, multimodal LLMs (models that could process both images and text directly) became practical enough to try in production. Instead of relying only on OCR to extract text, these models could look at the raw photo and infer structure directly.

The DoorDash engineering team integrated these multimodal models alongside the existing OCR + LLM pipeline. Each approach had clear strengths and weaknesses:

  • Multimodal LLMs proved excellent at understanding context and layout. They could better interpret menus with unusual designs, multi-column layouts, or visual cues that OCR often scrambled. However, they were also more brittle when the photo itself was of poor quality, with dark lighting, glare, or partial cropping.

  • OCR and LLM models were more stable across noisy or degraded inputs, but they struggled with nuanced layout interpretation, often mislinking categories and attributes.

The diagram below shows how the two pipelines now coexist under the same guardrail system.

Both models attempt transcription, and their outputs are evaluated. The guardrail then decides which transcriptions meet the accuracy bar and which need human review.

This hybrid setup led to the best of both worlds. By letting the guardrail arbitrate quality between multimodal and OCR-based models, the system boosted automation rates while still preserving the high accuracy required for production.

Conclusion

Automating the transcription of restaurant menus from photos is a deceptively complex problem. What began as a simple OCR-to-LLM pipeline quickly revealed its limits when confronted with messy, real-world inputs: inconsistent structures, incomplete menus, and poor image quality. 

The engineering team’s solution was not just to push harder on the models themselves, but to rethink the system architecture. The introduction of a guardrail classifier allowed automation to scale responsibly, ensuring that customers and restaurants always saw accurate menus while machines handled the simpler cases.

As the field of generative AI evolved, the system evolved with it. 

By combining OCR and LLM models with newer multimodal approaches under the same guardrail framework, DoorDash was able to harness the strengths of both families of models without being trapped by their weaknesses.

Looking ahead, several opportunities remain open:

  • Domain fine-tuning: The growing dataset of human-verified transcriptions can be used to fine-tune LLMs and multimodal models for the specific quirks of restaurant menus.

  • Upstream quality controls: Investing in photo preprocessing with techniques like de-glare, de-noising, de-skewing, and crop detection will lift the accuracy of both OCR-based and multimodal systems.

  • Guardrail refinement: As models continue to improve, so can the guardrail. Expanding its feature set, retraining LightGBM, or even exploring hybrid architectures will push safe automation further.

References:


ByteByteGo Technical Interview Prep Kit

Launching the All-in-one interview prep. We’re making all the books available on the ByteByteGo website.

What's included:

  • System Design Interview

  • Coding Interview Patterns

  • Object-Oriented Design Interview

  • How to Write a Good Resume

  • Behavioral Interview (coming soon)

  • Machine Learning System Design Interview

  • Generative AI System Design Interview

  • Mobile System Design Interview

  • And more to come

Launch sale: 50% off


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

How Netflix Tudum Supports 20 Million Users With CQRS

2025-09-09 23:30:44

Build MCP servers that perform with Gram by Speakeasy (Sponsored)

AI agents get confused by MCP servers that include too many tools, lack crucial context, and are simple API mirrors. Without further development, your carefully designed APIs cause agent headaches.

Gram fixes this. It's an open source platform where you can curate tools: add context, design multi-step tools, and deploy your MCP server in minutes.

Transform your APIs into agent-ready infrastructure that is ready to scale with OAuth 2.1 support, centralized management, and hosted infrastructure.

Start building your MCP server →


Disclaimer: The details in this post have been derived from the official documentation shared online by the Netflix Engineering Team. All credit for the technical details goes to the Netflix Engineering Team.  The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

When Netflix launched Tudum as its official home for behind-the-scenes stories, fan interviews, and interactive experiences, the engineering challenge was clear: deliver fresh, richly formatted content to millions of viewers at high speed, while giving editors a seamless way to preview updates in real time. 

The initial architecture followed a classic CQRS (Command Query Responsibility Segregation) pattern, separating the “write path” for editorial tools from the “read path” for visitors. Kafka connected these paths, pushing read-optimized data into backend services for page construction. 

The approach was scalable and reliable, but not without trade-offs.

As Tudum grew, editors noticed a frustrating delay between saving an update and seeing it live in previews. The culprit was a chain of sequential processes and cache refresh cycles that, while suitable for production visitors, slowed down the creative workflow. 

To solve this, Netflix engineers replaced the read-path’s external key-value store and per-request I/O with RAW Hollow: a compressed, distributed, in-memory object store embedded directly in the application. 

The result was near-instant editorial preview, simpler infrastructure, and a major drop in page construction time for end users. In this article, we will look at the evolution of this design decision and how Netflix went about implementing it.

Early Design

Netflix’s Tudum platform had to support two fundamentally different workflows:

  • Write path: This is where content editors create and update rich, media-heavy stories in a content management system (CMS).

  • Read path: This is where millions of global visitors consume those stories in a format optimized for fast rendering and delivery.

To keep these workflows independent and allow each to scale according to its needs, Netflix adopted a CQRS (Command Query Responsibility Segregation) architecture. 

See the diagram below for a general overview of CQRS.

The write store contains the raw editorial data (internal CMS objects with IDs, metadata, and references), while the read store contains a fully “render-ready” version of the same data, such as resolved movie titles instead of IDs, CDN-ready image URLs instead of internal asset references, and precomputed layout elements.

As mentioned, Kafka served as the bridge between the two paths. When an editor made a change, the CMS emitted an event to Tudum’s ingestion layer. This ingestion pipeline performed the following steps:

  • Pulled the content from the CMS.

  • Applied templates and business rules to ensure formatting consistency.

  • Validated the data for required fields and constraints.

  • Transformed placeholders into production-ready assets (for example, movie title lookups, CDN URL resolution).

The processed content was published to a Kafka topic.

A Data Service Consumer subscribed to this topic, reading each new or updated page element. It wrote this data into a Cassandra-backed read store, structured for fast retrieval. Finally, an API layer exposed these read-optimized entities to downstream consumers such as the Page Construction Service (which assembles full pages for rendering), personalization services, and other internal tools.
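As a minimal sketch of this event-driven handoff (the topic name, serialization, and record shape are assumptions, and the real ingestion pipeline does far more), the write side publishes a read-optimized record to Kafka and the Data Service Consumer persists it into the read store:

```python
import json
from confluent_kafka import Producer, Consumer

TOPIC = "tudum-read-optimized-content"   # hypothetical topic name

# Write path: ingestion publishes a render-ready page element.
# Assumes a locally reachable Kafka broker for illustration.
producer = Producer({"bootstrap.servers": "localhost:9092"})
element = {"page_id": "tudum-home", "type": "hero", "title": "New on Netflix"}
producer.produce(TOPIC, key=element["page_id"], value=json.dumps(element))
producer.flush()

# Read path: the Data Service Consumer writes each element into the read store.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "data-service-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
msg = consumer.poll(5.0)
if msg is not None and msg.error() is None:
    record = json.loads(msg.value())
    # The real system would upsert this into the Cassandra-backed read store here.
    print("Storing read-optimized record:", record["page_id"])
consumer.close()
```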

See the diagram below:

This event-driven design ensured that editorial changes would eventually appear on the Tudum website without impacting write-side performance, while also allowing Netflix to scale the read and write paths independently.

The Pain of Eventual Consistency

While the CQRS-with-Kafka design was robust and scalable, it introduced a workflow bottleneck that became increasingly visible as Tudum’s editorial output grew.

Every time an editor made a change in the CMS, that change had to travel through a long chain before it appeared in a preview environment or on the live site. Here is a quick look at the various steps involved:

  • Save in CMS: The updated content is stored in the write database.

  • Webhook to Tudum ingestion: The CMS notifies Tudum’s ingestion layer of the change.

  • Re-read and process: Ingestion fetches the relevant sections, applies templates, validates data, and performs asset resolution.

  • Publish to Kafka: The processed content is sent to the designated Kafka topic.

  • Consume and store: The Data Service Consumer ingests the message, writes it into Cassandra as a read-optimized record.

  • Cache refresh: The Page Data Service, sitting in front of Cassandra, maintains a near-cache that must refresh for the new content to be visible.

This near-cache was a key contributor to the delay. Technically speaking, the near-cache is a small, per-instance memory layer that sits in front of the read store. Rather than refreshing instantly for every update, it operates on a scheduled per-key refresh policy: each key has a timer, and when the timer fires, the instance refreshes that key from the backing store. While this approach was designed for production traffic efficiency, it meant that fresh edits often waited for the next scheduled refresh cycle before appearing.

As content volume and the number of page elements increased, these refresh cycles stretched longer. A page is assembled from multiple fragments, each with its own key and timer, and they do not refresh together. The more elements a page had, the more staggered the refresh completion became, leading to inconsistent preview states: some elements were updated while others remained stale.
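A toy model of this behavior (not Netflix's implementation; it approximates the per-key timers with a lazy TTL check) shows why previews looked inconsistent: each key refreshes on its own schedule, so a page assembled from several fragments mixes fresh and stale values until every key has been refreshed.

```python
import time

class NearCache:
    """Toy per-instance near-cache: each key is refreshed on its own schedule."""

    def __init__(self, backing_store: dict, refresh_interval_s: float):
        self.backing_store = backing_store            # stands in for the Cassandra-backed read store
        self.refresh_interval_s = refresh_interval_s
        self.entries = {}                             # key -> (value, last_refresh_time)

    def get(self, key):
        now = time.time()
        value, refreshed_at = self.entries.get(key, (None, 0.0))
        if key not in self.entries or now - refreshed_at >= self.refresh_interval_s:
            value = self.backing_store[key]           # scheduled refresh from the backing store
            self.entries[key] = (value, now)
        return value                                  # otherwise serve the possibly stale copy

store = {"hero": "v1", "carousel": "v1"}
cache = NearCache(store, refresh_interval_s=60.0)
cache.get("hero"); cache.get("carousel")              # both fragments cached as v1
store["hero"] = "v2"                                  # editor publishes an update
print(cache.get("hero"))                              # still "v1" until the hero key refreshes
```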

The result was that editors had to sometimes wait minutes to see their changes reflected in a preview, even though the system had already processed and stored the update. 

For a platform like Tudum, where timing is critical for publishing stories tied to new releases and events, this delay disrupted editorial flow and complicated collaboration between writers, editors, and designers.

The Solution: RAW Hollow

To eliminate the bottlenecks in Tudum’s read path, Netflix engineers turned to RAW Hollow: a compressed, distributed, in-memory object store designed for scenarios where datasets are small-to-medium in size, change infrequently, and must be served with extremely low latency.

Unlike the earlier setup, where read services fetched data from an external Cassandra-backed key-value store (with network calls, cache layers, and refresh cycles), RAW Hollow keeps the entire dataset loaded directly into the memory of every application instance that needs it. This means all lookups happen in-process, avoiding the I/O and cache-invalidation complexities of the old approach.

The key characteristics of RAW Hollow in the Tudum context are as follows:

  • Distributed and co-located: Each service instance holds the full dataset in memory. Updates are propagated so all instances stay synchronized without having to query an external store on demand.

  • Compression for scalability: Data is stored in a compressed binary form, significantly reducing RAM usage. In Tudum’s case, three years of unhydrated data fit into ~130 MB, which is about one-quarter the size of the same data in Apache Iceberg.

  • Consistency control per request: By default, RAW Hollow favors eventual consistency for high availability, but services can opt in to strong read-after-write consistency on a per-request basis. This is especially useful for editor preview flows, ensuring that a just-published update is immediately visible without waiting for the standard sync interval.

  • Built for in-memory computation: Because the dataset is always in RAM, services like page construction, search, and personalization can retrieve data in O(1) time, eliminating network round-trips.
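The sketch below is a conceptual stand-in only, not the RAW Hollow API: every instance holds a local in-memory snapshot of the dataset, reads are plain dictionary lookups, and a request can opt into read-after-write consistency by forcing a snapshot refresh before reading.

```python
class EmbeddedReadStore:
    """Conceptual stand-in for an embedded in-memory read store (not the RAW Hollow API)."""

    def __init__(self, fetch_latest_snapshot):
        self.fetch_latest_snapshot = fetch_latest_snapshot
        self.snapshot = fetch_latest_snapshot()       # full dataset held in RAM on this instance

    def refresh(self):
        # In the real system, new data versions are propagated to every instance.
        self.snapshot = self.fetch_latest_snapshot()

    def get(self, key, strong_consistency: bool = False):
        if strong_consistency:
            self.refresh()                            # e.g., editorial preview: read-after-write
        return self.snapshot.get(key)                 # O(1) in-process lookup, no network call
```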

For Tudum, adopting RAW Hollow meant removing the Page Data Service, its near-cache, the external key-value store, and even Kafka from the read path. Instead, the Hollow client was embedded directly inside each microservice that needed content. This collapsed the number of sequential operations, tightened the feedback loop for editors, and simplified the architecture by removing multiple moving parts.

The result was a big shift: instead of “store, then fetch, then cache, then refresh,” the system now operates on “load once into memory, serve instantly, and propagate changes.”

The New TUDUM Design

After adopting RAW Hollow, Netflix rebuilt Tudum’s read path to remove the layers that were slowing down editorial previews and adding unnecessary complexity. 

The new design still follows the CQRS principle (separating the editorial content creation from the visitor-facing content), but the way data moves through the read side is now radically simplified.

See the diagram below:

Here’s what changed in the architecture:

  • No more Page Data Service and near-cache: The old facade over the key-value store, which managed its own scheduled cache refresh cycles, was removed entirely. Services no longer wait for caches to update: the data they need is already in memory.

  • No external key-value store for read path: Cassandra (and the I/O it required) is no longer queried during page rendering or personalization. Instead, the entire read-optimized dataset is stored in RAM on every service instance via RAW Hollow.

  • No Kafka in the read path: While Kafka still has use cases elsewhere, the read side no longer depends on Kafka to propagate updates for serving live traffic. RAW Hollow handles data distribution and synchronization internally.

  • Embedded Hollow client in every microservice: Services like Page Construction, Search, and Personalization now run their own embedded Hollow clients. This gives them direct, O(1)-time access to the latest read-optimized content without network calls.

The new flow works as follows:

  • When content is updated in the write path, it’s processed into a read-optimized format

  • RAW Hollow distributes that update to all Hollow clients across service instances. This is known as the Hollow State, which is an in-memory dataset that each service process holds locally.

  • Because each instance has the full dataset in memory, any request (whether from an editor preview or a live user) is served immediately without cache checks or datastore queries.

  • For editorial preview, services can request strong read-after-write consistency, ensuring that the very latest update is displayed instantly.

This re-architecture shifted Tudum’s read path from a multi-hop network-bound pipeline to a memory-local lookup model. In essence, Netflix took the scalability and separation of CQRS but stripped away the read path’s I/O-heavy plumbing, replacing it with a memory-first, embedded data model.

Conclusion

The shift from a Kafka-and-Cassandra read path with a near-cache to a RAW Hollow in-memory model produced immediate and measurable improvements for Tudum. Some of the key benefits were as follows:

  • Massive latency reduction: In controlled tests (factoring out TLS, authentication, WAF, and logging overhead), home page construction time dropped from roughly 1.4 seconds to about 0.4 seconds once all read-path services consumed Hollow in-memory state. This speed gain directly impacts both editorial previews and live visitor requests.

  • Near-instant editorial preview: Editors now see updates reflected in seconds instead of waiting minutes for cache refresh cycles. Strong read-after-write consistency on preview requests ensures changes are visible immediately after saving in the CMS.

  • Small memory footprint: Compression allows the entire read dataset to fit comfortably in RAM. Three years’ worth of unhydrated content weighs in at about 130 MB, roughly 25% of its size in the previous Apache Iceberg format. This makes it viable to keep the full dataset co-located with every service instance without expensive scaling.

  • Operational simplicity: Removing Kafka, the external key-value store, and near-cache layers from the read path reduced moving parts and failure points, while eliminating cache-invalidation headaches.

Netflix’s re-architecture of Tudum’s read path shows how rethinking data access patterns can yield outsized gains in performance, simplicity, and developer experience.

By combining the scalability of CQRS with the speed of an in-memory, compressed object store like RAW Hollow, they created a system that serves both editorial agility and end-user responsiveness.

The lessons here are broadly applicable:

  • Minimize sequential operations in latency-critical flows.

  • Keep frequently accessed datasets as close to the application as possible.

  • Use consistency controls selectively to balance freshness with availability.

References:



How Grab Built An Authentication System for 180+ Million Users

2025-09-08 23:31:30

Build Real-Time IoT with AWS & Redpanda (Sponsored)

IoT devices are producing massive amounts of data every second—but how do you capture, process, and act on that data in real time?

Join AWS and Redpanda experts for a live Tech Talk: Building Real-Time IoT Architectures with Redpanda and AWS.

We’ll walk through practical patterns for streaming data from thousands of devices, building low-latency pipelines, and enabling smarter decision-making at the edge and in the cloud.

Whether you’re working on industrial IoT, smart devices, or connected infrastructure, you’ll leave with a clear blueprint for scaling IoT securely and efficiently.

Register Now


Disclaimer: The details in this post have been derived from the official documentation shared online by the Grab Engineering Team. All credit for the technical details goes to the Grab Engineering Team.  The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Grab is a superapp, operating in more than 800 cities across eight countries. Through a single app, it offers ride-hailing, food and grocery delivery, payments, financial services, and logistics, making it a core part of everyday life for millions of people in the region. To support all these services, Grab also builds many internal applications used by its employees and business teams.

To make these apps secure, two things are important:

  • Authentication: making sure the person is really who they say they are.

  • Authorization: deciding what that person is allowed to do once logged in.

The problem was that Grab didn’t have a single, unified way of handling this. 

Different apps were using different systems. For example, some used Google’s OAuth2.0 in a custom way, while others relied on outside tools like Databricks or Datadog that had their own login methods.

This led to a messy situation:

  • Employees had to juggle multiple accounts and passwords.

  • The user experience was clunky and inconsistent.

  • It created extra work for administrators who had to manage all these separate identities.

  • Security wasn’t uniform across the organisation.

So, Grab decided they needed one central system that would make authentication and authorization the same across all applications. This would simplify the process, improve security, and make things smoother both for users and for administrators.

Existing Solution - Concedo

Before moving to a new approach, Grab had already built its own internal system called Concedo. 

The idea behind Concedo was to make life easier for developers so they didn’t have to build authentication and authorization features from scratch every time they created a new service.

Concedo worked using something called a Role-to-Permission Matrix (R2PM). This basically means every role in the company (like “engineer,” “manager,” or “analyst”) is mapped to specific permissions (what actions they can or cannot do). For example, a manager might have permission to view financial reports, while an analyst might not.

This setup allowed services at Grab to plug into Concedo quickly and enforce access rules consistently. Developers could focus on building their actual applications instead of worrying about identity and access management.

However, there was a catch: Concedo’s authentication was built on Google OAuth2.0, but with custom changes. These tweaks worked fine inside Grab but made it difficult to integrate with external platforms like Databricks or Datadog, which expected the standard version. That meant users still had a fragmented sign-in experience whenever they needed to use third-party tools.

Evaluation of Industry Standards

When Grab looked for a standard way to fix its messy authentication setup, the team compared a few industry protocols that are widely used for login and access control.

  • SAML (Security Assertion Markup Language): This is an older standard that works mainly with web apps. It relies on browser redirects and session cookies to keep track of your login. It handles authentication (proving who you are), but not much beyond that.

  • OAuth 2.0: This one is more about authorization, granting apps limited access to your data without giving them your password. For example, when you log in to a site using your Google account and let it pull your profile picture, that’s OAuth 2.0 in action. But it doesn’t really confirm your full identity on its own.

  • OpenID Connect (OIDC): This builds on top of OAuth 2.0 and adds identity verification. In other words, it doesn’t just say “this app can access your data,” it also says “this really is Alice, logged in through Google.” With OIDC, a person can log in once and get into many apps (known as Single Sign-On or SSO). It also works smoothly across mobile apps, APIs, and web apps.

Because of these advantages, Grab chose OIDC as its standard.

However, OIDC isn’t perfect. It depends on external identity providers (like Google or Microsoft). If one of them goes down, users might lose access. Also, if someone’s credentials are stolen, attackers could use them to access multiple systems at once. So, Grab had to think about mitigation strategies like fallback options and scoping tokens carefully to reduce the risks.

Adoption of Dex (Open Source)

Once Grab decided to adopt OpenID Connect (OIDC) as its standard, the next question was around implementation.

Instead of building a whole new identity system from scratch, Grab looked into existing open-source projects. That’s when the team found Dex, a project under the Cloud Native Computing Foundation (CNCF).

Dex is a federated OpenID Connect (OIDC) provider. That means it acts like a bridge or middleman between Grab’s applications and various identity providers (IdPs) such as Google, Microsoft, or any other login system.

Here’s how it works in practice:

  • When a user or a machine tries to access an application, the request is redirected to Dex.

  • Dex then talks to the configured IdP (say Google).

  • Once the IdP verifies the user’s identity, Dex issues a standard OIDC token that the application can understand and trust.

  • This token contains all the necessary identity and access information, so the app doesn’t need to worry about custom login logic.

See the diagram below:

Dex was the right choice for Grab for the following reasons:

  • Single Sign-On (SSO): Users only need to log in once, and Dex lets them access multiple applications without re-entering credentials.

  • Standardised tokens: Dex issues OIDC-compliant tokens, which avoid the integration problems caused by Grab’s earlier custom OAuth tweaks.

  • Easy to add new IdPs: If Grab wants to integrate a new identity provider in the future, Dex can handle it without requiring every single application to be changed.

  • Scalability: Dex is built for cloud-native environments, making it suitable for Grab’s large-scale deployments across both internal and third-party apps.

  • Open-source contribution: By using Dex, Grab benefits from community-driven improvements and can also contribute back to strengthen the ecosystem.

In short, Dex gave Grab exactly what they needed: a flexible, standardised, and scalable identity layer that sits neatly between their apps and multiple login systems, while also ensuring developers don’t need to reinvent authentication every time.

Key Features in Grab’s Dex Implementation

The main features are as follows:

Token Delegation & Exchange

In Grab’s systems, it’s common for one service (say Service A) to call another service (Service B). However, Service B needs to know who exactly is making the request to decide what’s allowed.

Traditionally, companies solve this using service accounts (sometimes called robot accounts). These are special accounts with high-level permissions that a service uses to “log in” to another service. 

However, there were a couple of problems with this:

  • If a service account is compromised, attackers might gain broad access because these accounts usually have wide privileges.

  • If Service A tries to pass along the user’s identity to Service B, it gets complicated and breaks the clean separation between authentication and business logic.

Dex introduces a smarter way called token exchange. Instead of using fixed service accounts, the following approach is used:

  • When a user (such as Alice) logs into Service A, Dex gives Service A a valid token.

  • Because Service A is a trusted peer, it’s authorised to mint a new token that is valid for both Service A and Service B. This is done using the token’s “aud” (audience) field, which now lists both services ("aud": "serviceA serviceB").

  • Service B is configured to trust Service A as a token issuer.

  • Service B accepts the minted token and processes Alice’s request.
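To make the audience scoping concrete, here is a small illustration using PyJWT. It signs with a shared HMAC secret purely for demonstration; Dex issues OIDC tokens signed with its own keys, and the claim values shown are hypothetical.

```python
import time
import jwt   # PyJWT; shared HMAC secret used here purely for illustration

SECRET = "demo-signing-key"

# Service A, as a trusted peer, mints a token scoped to both services.
delegated_token = jwt.encode(
    {
        "sub": "alice",                      # the original user identity
        "iss": "serviceA",                   # Service A acts as the trusted issuer
        "aud": ["serviceA", "serviceB"],     # audience now covers both services
        "exp": int(time.time()) + 300,       # short-lived on purpose
    },
    SECRET,
    algorithm="HS256",
)

# Service B only accepts tokens whose audience includes it and whose issuer it trusts.
claims = jwt.decode(
    delegated_token,
    SECRET,
    algorithms=["HS256"],
    audience="serviceB",
    issuer="serviceA",
)
print("Request processed on behalf of:", claims["sub"])
```

The important part is the aud claim: Service B rejects any token whose audience does not include it, so a token minted for one downstream call cannot be replayed against unrelated services.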

See the diagram below:

The benefits are as follows:

  • Tokens are scoped tightly to specific services, reducing risk.

  • There’s a clear audit trail showing which service minted which token.

  • No need to hand out all-powerful service accounts.

  • Cleaner separation of concerns: Service A doesn’t have to directly manage user-level permissions for Service B.

Kill Switch (IdP Failover)

Another big issue with OIDC is its reliance on external identity providers (IdPs) like Google or Microsoft. If an IdP goes down, every service relying on it might be blocked from authenticating users, which could cause massive downtime.

Dex’s solution was a multi-IdP failover (a “kill switch”) that works as follows:

  • Dex can be configured with multiple IdPs at once.

  • If one IdP (say Google) has an outage, Dex can automatically switch to another IdP (say Microsoft).

  • From the perspective of Grab’s applications, nothing changes — they continue to get valid OIDC tokens from Dex.
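Conceptually (this is an illustrative sketch, not Dex's actual configuration or code), the kill switch amounts to trying configured identity providers in priority order and returning the first successful token:

```python
class IdentityProviderError(Exception):
    pass

def authenticate_with(idp_name: str, credentials: dict) -> dict:
    # Placeholder for the real redirect/token exchange with an upstream IdP.
    raise IdentityProviderError(f"{idp_name} is unavailable")

def authenticate(credentials: dict, idps=("google", "microsoft")) -> dict:
    # Try each configured IdP in order; applications only ever see the final OIDC token.
    for idp in idps:
        try:
            return authenticate_with(idp, credentials)
        except IdentityProviderError:
            continue   # fail over to the next provider
    raise IdentityProviderError("all configured identity providers are down")
```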

See the diagram below:

This matters because of the following reasons:

  • Ensures high availability of authentication across the company.

  • Minimises disruption for users during third-party outages.

  • Applications don’t need to be rewritten or updated to handle IdP failures — Dex handles it centrally.

This design provides Grab with resilience and reliability guarantees, which are crucial when authentication underpins everything from ride-hailing to financial services.

Conclusion

Grab has already achieved a big milestone by unifying authentication through Dex, but the journey isn’t finished yet. 

While users now enjoy a consistent and secure way of proving who they are, the question of what they are allowed to do (authorization) remains scattered and complicated across different systems. Today, each service may have its own way of defining permissions, managing roles, and enforcing rules, which leads to inefficiencies and even security gaps.

The next step is to build a unified authorization model on top of the solid identity foundation provided by Dex. This means bringing all policies together under a central framework, standardising how access control is handled across applications, and making it much simpler to manage user permissions. By consolidating the rules, Grab can reduce complexity for developers, provide a smoother experience for users, and strengthen overall security. The long-term vision is a seamless combination of identity and authorization, where logging in and accessing resources across Grab’s ecosystem feels effortless, reliable, and consistent.

In conclusion, the adoption of Dex has already transformed authentication at Grab, turning a fragmented experience into a unified one that works across both internal and external applications. By extending this approach to authorization, Grab is setting the stage for an even stronger and more secure access management system. This evolution will not only improve developer productivity and reduce administrative burden but also deliver a smoother, safer experience for every user across Grab’s wide range of services.

References:



EP179: Kubernetes Explained

2025-09-06 23:30:23

The Real Benefits of End-to-End Observability (Sponsored)

How does full-stack observability impact engineering speed, incident response, and cost control? In this ebook from Datadog, you'll learn how real teams across industries are using observability to:

  • Reduce mean time to resolution (MTTR)

  • Cut tooling costs and improve team efficiency

  • Align business and engineering KPIs

See how unifying your stack leads to faster troubleshooting and long-term operational gains.

Download the ebook


This week’s system design refresher:

  • System Design: Design YouTube (Youtube video)

  • 9 Docker Best Practices You Should Know

  • Kubernetes Explained

  • N8N versus LangGraph

  • Where Do We Cache Data?

  • ByteByteGo Technical Interview Prep Kit

  • SPONSOR US


System Design: Design YouTube


9 Docker Best Practices You Should Know

  1. Use official images
    This ensures security, reliability, and regular updates.

  2. Use a specific image version
    The default latest tag is unpredictable and causes unexpected behavior.

  3. Multi-Stage builds
    Reduces final image size by excluding build tools and dependencies.

  4. Use .dockerignore
    Excludes unnecessary files, speeds up builds, and reduces image size.

  5. Use the least privileged user
    Enhances security by limiting container privileges.

  6. Use environment variables
    Increases flexibility and portability across different environments.

  7. Order matters for caching
    Order your steps from least to most frequently changing to optimize caching.

  8. Label your images
    It improves organization and helps with image management.

  9. Scan images
    Find security vulnerabilities before they become bigger problems.

Over to you: Which other Docker best practices will you add to the list?


Kubernetes Explained

Kubernetes is the de facto standard for container orchestration. It automates the deployment, scaling, and management of containerized applications.

Control Plane:

  • API Server: Acts as the communication hub between users, the control plane, and worker nodes.

  • Scheduler: Decides which Pod runs on which Node.

  • Controller Manager: Keeps the cluster state in sync.

  • etcd: A distributed key-value store that holds the cluster’s state.

Worker Nodes:

  • Pods: The smallest deployable unit in Kubernetes, representing one or more containers.

  • Container Runtime: The engine (like Docker or containerd) that runs the containers.

  • kubelet: Ensures containers are running as defined in Pods.

  • kube-proxy: Maintains network rules on each node so traffic can reach the right Pods and Services.

Over to you: What’s the toughest part of running Kubernetes in production?


N8N versus LangGraph

N8N is an open-source automation tool that lets you visually build workflows by connecting different services, APIs, and AI tools in a sequence. Here’s how it works:

  1. Starts with Input from the user.

  2. Passes it to an AI Agent for processing.

  3. The AI Agent can either make a Tool Call or access Memory.

  4. A Decision node chooses the next action and produces the final LLM output for the user.

LangGraph is a Python framework for building AI Agent workflows using a flexible graph structure that supports branching, looping, and multi-agent collaboration. Here’s how it works:

  1. Starts with a shared State containing workflow context.

  2. Can route tasks to different agents.

  3. Agents interact with a Tool Node to perform tasks.

  4. A Conditional node decides whether to retry or mark the process done.
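To show the shape of such a graph without depending on either tool's API (this is framework-free pseudologic, not actual LangGraph or N8N code), a conditional retry loop over a shared state can be written as:

```python
MAX_ATTEMPTS = 3   # assumed retry budget

def agent_node(state):
    # An agent would call an LLM here; this stub just records an attempt.
    state["attempts"] += 1
    state["draft"] = f"answer attempt {state['attempts']}"
    return state

def tool_node(state):
    # A tool call would fetch data or run code; here it just validates the draft.
    state["is_good_enough"] = state["attempts"] >= 2
    return state

def conditional_node(state):
    # Decide whether to loop back to the agent or finish.
    if state["is_good_enough"] or state["attempts"] >= MAX_ATTEMPTS:
        return "done"
    return "retry"

def run_graph(user_input):
    state = {"input": user_input, "attempts": 0}   # shared state with workflow context
    while True:
        state = agent_node(state)
        state = tool_node(state)
        if conditional_node(state) == "done":
            return state["draft"]

print(run_graph("Summarize this week's incidents"))
```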

Over to you: Have you used N8N or LangGraph?


Where Do We Cache Data?

Data is cached everywhere, from the front end to the back end!

This diagram illustrates where we cache data in a typical architecture.

There are multiple layers along the flow.

  1. Client apps: HTTP responses can be cached by the browser. We request data over HTTP for the first time, and it is returned with an expiry policy in the HTTP header; we request data again, and the client app tries to retrieve the data from the browser cache first.

  2. CDN: CDN caches static web resources. The clients can retrieve data from a CDN node nearby.

  3. Load Balancer: The load balancer can cache resources as well.

  4. Messaging infra: Message brokers store messages on disk first, and then consumers retrieve them at their own pace. Depending on the retention policy, the data is cached in Kafka clusters for a period of time.

  5. Services: There are multiple layers of cache in a service. If the data is not cached in the CPU cache, the service will try to retrieve the data from memory. Sometimes the service has a second-level cache to store data on disk.

  6. Distributed Cache: Distributed caches like Redis hold key-value pairs for multiple services in memory. They provide much better read/write performance than the database (see the cache-aside sketch after this list).

  7. Full-text Search: We sometimes need full-text search engines like Elasticsearch for document search or log search. A copy of the data is indexed in the search engine as well.

  8. Database: Even in the database, we have different levels of caches:

    • WAL (Write-ahead Log): Data is written to the WAL before the B-tree index is updated

    • Buffer pool: A memory area that caches frequently accessed data pages

    • Materialized View: Pre-computed query results stored in database tables for better query performance

    • Transaction log: Records all transactions and database updates

    • Replication Log: used to record the replication state in a database cluster
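For the distributed-cache layer in particular, the usual access pattern is cache-aside: check Redis first, fall back to the database on a miss, then populate the cache with a TTL. A minimal Python sketch (the connection details, key format, and query function are assumptions):

```python
import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL_SECONDS = 300

def query_database(user_id: str) -> dict:
    # Placeholder for the real database call.
    return {"id": user_id, "name": "Alice"}

def get_user_profile(user_id: str) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)                                     # 1. try the distributed cache
    if cached is not None:
        return json.loads(cached)
    profile = query_database(user_id)                           # 2. miss: read from the database
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(profile))    # 3. populate the cache with a TTL
    return profile
```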

Over to you: With the data cached at so many levels, how can we guarantee the sensitive user data is completely erased from the systems?

